StreamSTGS: Real-Time FVV Compression
- StreamSTGS is a compression-oriented neural representation for free-viewpoint video that decouples scenes into static (canonical 3D Gaussians) and dynamic (temporal features and deformation) components.
- It employs a hybrid image/video coding scheme, using lossless JPEG-XL for canonical parameters and H.264/HEVC for temporal features to enable adaptive bitrate streaming.
- Experimental results show improved PSNR, reduced frame sizes, and faster decoding compared to prior 3D Gaussian splatting methods.
StreamSTGS is a compression-oriented neural representation designed for real-time streaming of free-viewpoint video (FVV), built on the principles of 3D Gaussian Splatting (3DGS). It addresses the storage and transmission bottlenecks of prior 3DGS-based FVV approaches, delivering high-quality video at adaptive bitrates up to 50× lower than earlier methods. StreamSTGS achieves this by decoupling static and dynamic scene elements, structuring scene content as canonical 3D Gaussians, temporal feature grids, and a compact learnable deformation field, and leveraging a hybrid image/video coding scheme suited to efficient streaming and decoding.
1. Scene Representation and Parametrization
StreamSTGS models a dynamic scene over a Group-of-Pictures (GOP) as three decoupled components:
- Canonical 3D Gaussian Set: each Gaussian is defined by center $X \in \mathbb{R}^3$, diagonal scale $S \in \mathbb{R}^3$, rotation quaternion $Q \in \mathbb{R}^4$ (together defining the covariance $\Sigma$), opacity $O$, and canonical color $C$ (rectified by ReLU).
- Temporal Feature Grids: each grid $e_i$ encodes the scene's frame-dependent motion and appearance features across time.
- Deformation Field (MLPs): for frame time $t_i$, a window of consecutive temporal feature grids ($e_{i-1}, e_i, e_{i+1}$) is input to a temporal MLP $D_t$ to produce a motion feature $f_i$ for each Gaussian. This motion feature conditions all attribute deltas $(\Delta X, \Delta S, \Delta Q, \Delta O, \Delta C)$.
Per frame, the deformed Gaussians are $X_i = X + \Delta X$, $S_i = S + \Delta S$, $Q_i = Q + \Delta Q$, $O_i = O + \Delta O$, and $C_i = \mathrm{ReLU}(C) + \Delta C$. Rendering is performed via analytic projection and splatting of each anisotropic Gaussian, composited by alpha blending in depth order.
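A minimal sketch of this decomposition as data structures (names and shapes are illustrative, not the authors' code):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CanonicalGaussians:
    """Static, per-GOP canonical parameters for N Gaussians."""
    X: np.ndarray  # (N, 3) centers
    S: np.ndarray  # (N, 3) diagonal scales
    Q: np.ndarray  # (N, 4) rotation quaternions
    O: np.ndarray  # (N, 1) opacities
    C: np.ndarray  # (N, 3) canonical colors (ReLU-rectified)

@dataclass
class StreamSTGSGop:
    """One Group-of-Pictures: static Gaussians plus dynamic features."""
    canonical: CanonicalGaussians
    temporal_grids: list       # one feature grid e_i per frame
    deformation_mlps: object   # temporal MLP D_t plus per-attribute delta heads
```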
2. Mathematical Principles and Splatting Pipeline
Each anisotropic Gaussian is parametrized as:
- Mean: $X \in \mathbb{R}^3$
- Covariance: $\Sigma = R\,\mathrm{diag}(S)^2 R^\top$, with $R$ the rotation matrix of quaternion $Q$
- Opacity: $O \in [0, 1]$
- Color: $C \in \mathbb{R}^3$
Under camera intrinsics $K$ and world-to-camera pose $W$, the mean projects to the image plane as $X' = \pi(KWX)$ and the covariance as $\Sigma' = J W \Sigma W^\top J^\top$, where $J$ is the Jacobian of the projective mapping $\pi$. The rendered color at pixel $p$ is
$$C(p) = \sum_i C_i\,\alpha_i \prod_{j<i} (1 - \alpha_j),$$
where the sum runs over Gaussians sorted front to back and $\alpha_i$ is proportional to the Gaussian's projected density at $p$: $\alpha_i = O_i \exp\!\big(-\tfrac{1}{2}(p - X_i')^\top \Sigma_i'^{-1}(p - X_i')\big)$.
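For intuition, a minimal NumPy sketch of the depth-ordered alpha blending above (illustrative; real 3DGS renderers implement this in CUDA):

```python
import numpy as np

def composite_pixel(colors, alphas):
    """Front-to-back alpha blending of depth-sorted Gaussians at one pixel.

    colors: (K, 3) colors of the K Gaussians overlapping this pixel,
    sorted front to back. alphas: (K,) their projected densities O_i * g_i.
    """
    out = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        out += transmittance * a * c
        transmittance *= 1.0 - a
        if transmittance < 1e-4:  # early termination, as in 3DGS rasterizers
            break
    return out
```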
3. Hybrid Compression and Streaming Pipeline
To enable low-bitrate, real-time streaming and adaptive bitrate support, StreamSTGS encodes:
- Canonical Gaussian Parameters as Images:
- The Gaussians are sorted into a 2D grid by the PLAS algorithm, so neighboring pixels hold similar attribute values.
- Five images are produced: positions (3 ch.), scales (3 ch.), rotations (4 ch.), opacities (1 ch.), colors (3 ch.).
- Rotations, scales, and opacities are quantized to fixed bit depths; positions are stored in float32.
- All five images are packed losslessly with JPEG-XL.
- A spatial smoothness regularizer over neighboring pixels in each attribute image keeps the images compressible.
- Temporal Features as Video:
- Each is "unfolded" to to match video codec constraints.
- Temporal feature sequences are encoded with H.264/HEVC (e.g., libx265, QP=20).
- A temporal smoothness regularizer penalizes differences between consecutive feature frames, improving inter-frame compressibility.
- Adaptive Bitrate Streaming: Since static attributes are images and temporal features a video, standard ABR protocols (DASH, HLS) can be used without retraining.
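A minimal NumPy sketch of the packing and unfolding steps, assuming features lie in [0, 1] and a horizontal channel-tiling layout (both assumptions, not the paper's exact scheme):

```python
import numpy as np

def pack_attribute_image(attr, grid_hw):
    """Pack PLAS-sorted per-Gaussian attributes into an image.

    attr: (N, C) attribute matrix with N == H * W after sorting.
    Returns an (H, W, C) array; StreamSTGS stores five such images
    (positions, scales, rotations, opacities, colors) via JPEG-XL.
    """
    H, W = grid_hw
    return attr.reshape(H, W, -1)

def unfold_feature_grid(e):
    """Unfold a temporal feature grid e_i into one 2D video frame.

    e: (h, w, C) feature grid. Channels are tiled horizontally so a
    standard H.264/HEVC encoder can treat the result as a single-plane
    frame; the exact tiling layout here is an assumption.
    """
    h, w, C = e.shape
    frame = np.concatenate([e[..., c] for c in range(C)], axis=1)  # (h, w*C)
    # Quantize to 8-bit before handing to the encoder (e.g., x265 at QP=20).
    return np.clip(frame * 255.0, 0, 255).astype(np.uint8)
```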
4. Temporal Dynamics: Sliding-Window and Transformer Training
To capture both local and global motion:
- Sliding-Window Motion:
- A window of size $w$ (typically $w = 3$) over the temporal grids forms the concatenated feature $[e_{i-1}, e_i, e_{i+1}]$.
- The temporal MLP $D_t$ consumes this window together with a positional encoding of the frame time $t_i$.
- Pseudocode:

```python
# Per-frame deformation within one GOP (pseudocode).
for i in range(G):
    # Sliding window of temporal feature grids (clamped at GOP borders).
    fe = concatenate(e[max(i - 1, 0)], e[i], e[min(i + 1, G - 1)])
    # Motion feature from the temporal MLP, conditioned on frame time.
    f_i = D_t(fe, positional_encode(t_i))
    # Per-attribute deltas decoded from the motion feature.
    dX = D_v(f_i)         # position offset
    dS, dQ = D_cov(f_i)   # scale and rotation offsets
    dO = tanh(D_o(f_i))   # opacity offset
    dC = D_c(f_i)         # color offset
    # Deform the canonical Gaussians and splat.
    X_i, S_i, Q_i = X + dX, S + dS, Q + dQ
    O_i, C_i = O + dO, relu(C) + dC
    I_i = render_splat(X_i, S_i, Q_i, O_i, C_i)
```
- Transformer-Guided Auxiliary Branch:
- To model global (long-range) motion, a "TimeFormer" (a 2-layer Transformer encoder with 2 heads and hidden size 64) processes the GOP's temporal features together with temporal and spatial positional encodings.
- Two-pass training:
  1. Gaussian pass: the original sliding-window pipeline produces motion features and rendered frames.
  2. Auxiliary pass: the TimeFormer output is fed through the same delta decoders.
- Distillation and image-quality losses align the two branches, transferring the TimeFormer's global motion cues into the lightweight deformation MLPs.
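A minimal PyTorch sketch of such an encoder, assuming the stated configuration (2 layers, 2 heads, hidden width 64); the input shapes and the additive fusion of positional encodings are assumptions:

```python
import torch
import torch.nn as nn

class TimeFormer(nn.Module):
    """Auxiliary global-motion encoder sketch (training-time branch)."""

    def __init__(self, d_model=64, nhead=2, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, feats, pos_enc):
        # feats: (B, T, 64) per-frame temporal features for one GOP;
        # pos_enc: (B, T, 64) temporal/spatial positional encodings.
        return self.encoder(feats + pos_enc)  # (B, T, 64) global features
```

For example, `TimeFormer()(feats, pos)` over a `(1, 60, 64)` tensor would cover one 60-frame GOP in a single pass.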
5. Implementation, Hyperparameters, and Streaming Operation
Pretraining: Begin with a "coarse" static 3DGS across all views (≈3,000 iters, batch=2).
GOP-wise Refinement: GOP size=60; 12k iters for N3DV, 7k for MeetRoom.
Hyperparameters:
- Regularization weights are set for the simulated feature quantization, spatial smoothness, and temporal smoothness losses.
- Dynamic-aware density control is activated after 5k iterations (N3DV) or 3k (MeetRoom).
- Network Details: the deformation MLPs are shallow, using one or two Linear(64) layers with Tanh activations (the two-layer heads omit biases). Learning rates are scheduled per MLP (e.g., decaying from 0.005); the Transformer learning rate decays from 2e-3 to 1e-5, and frame times are normalized to [0, 1] within each GOP.
- Streaming Procedure: the five attribute images for the upcoming GOP are pre-decoded; the temporal feature video is decoded in real time (8 ms/frame), as sketched below.
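A schematic of the client-side loop described above; the `client` object and its decode/display methods are hypothetical placeholders wrapping the JPEG-XL and H.264/HEVC decoders and the deformation/rendering steps from earlier sections:

```python
def stream_gop(client, gop_index, gop_size=60):
    """Client-side handling of one GOP (illustrative sketch)."""
    # 1. Pre-decode the five attribute images for the upcoming GOP
    #    (positions, scales, rotations, opacities, colors) via JPEG-XL.
    canonical = client.decode_attribute_images(gop_index)

    # 2. Decode temporal feature frames on the fly (~8 ms/frame) and render.
    for t in range(gop_size):
        e_t = client.decode_feature_frame(gop_index, t)   # H.264/HEVC frame
        gaussians_t = client.deform(canonical, e_t, t)    # MLP deformation
        frame = client.render_splat(gaussians_t)          # ~10 ms on N3DV
        client.display(frame)
```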
6. Experimental Results and Comparative Performance
Benchmarks on N3DV (6 scenes, ≈60 fps @ 1352×1014) and MeetRoom (12 cams, 1280×720):
| Metric | N3DV (StreamSTGS) | N3DV (4DGC) | N3DV (HiCoM) | MeetRoom (StreamSTGS) | MeetRoom (4DGC) |
|---|---|---|---|---|---|
| PSNR (dB) | 32.30 | 31.52 | 31.32 | 27.41 | 27.11 |
| SSIM | 0.943 | (n/a) | (n/a) | (n/a) | (n/a) |
| LPIPS (↓) | 0.147 | (n/a) | (n/a) | (n/a) | (n/a) |
| Avg. frame size | 174 KB | 784 KB | (n/a) | 142 KB | 1.2 MB |
| Key-frame size | ≈3.9 MB | (n/a) | (n/a) | (n/a) | (n/a) |
| Decode time (ms) | 8 | (n/a) | (n/a) | 6 | (n/a) |
| Render time (ms) | 10 (100 FPS) | (n/a) | (n/a) | 7.9 (126 FPS) | (n/a) |
| Training (per GOP) | 67 s | (n/a) | (n/a) | (n/a) | (n/a) |
StreamSTGS achieves a nearly 1 dB PSNR gain over 4DGC on N3DV (32.30 vs 31.52 dB) and cuts the average frame size to 174 KB (N3DV) and 142 KB (MeetRoom), roughly 4× and 8× smaller than 4DGC and more than 50× smaller than the earlier 3DGStream.
7. Limitations and Future Directions
In each GOP, many Gaussians may be effectively static yet still receive a full GOP's worth of temporal feature payload, wasting bandwidth. Future directions include classifying Gaussians as static or dynamic and streaming temporal features only for the dynamic subset, which is expected to further reduce storage and computation overhead while increasing attainable frame rates.
8. Significance and Context Within Gaussian Splatting Research
StreamSTGS targets the fundamental limitation of 3DGS in streaming settings—high per-frame storage (~10 MB in previous works)—by designing an efficient hybrid image/video compression, structured representation of static and dynamic content, and an MLP-based deformation field. Compared to previous 3DGS-based FVV systems, it offers real-time streaming and low-bitrate adaptation without retraining, maintaining competitive or better perceptual metrics. This architecture positions StreamSTGS as a practical solution for real-world FVV deployments over bandwidth-constrained networks, providing adaptability for AR/VR, telepresence, and immersive media scenarios (Ke et al., 8 Nov 2025).