
StreamSTGS: Real-Time FVV Compression

Updated 12 November 2025
  • StreamSTGS is a compression-oriented neural representation for free-viewpoint video that decouples scenes into static (canonical 3D Gaussians) and dynamic (temporal features and deformation) components.
  • It employs a hybrid image/video coding scheme, using lossless JPEG-XL for canonical parameters and H.264/HEVC for temporal features to enable adaptive bitrate streaming.
  • Experimental results show improved PSNR, reduced frame sizes, and faster decoding compared to prior 3D Gaussian splatting methods.

StreamSTGS is a compression-oriented neural representation designed for real-time streaming of free-viewpoint video (FVV) using the principles of 3D Gaussian Splatting (3DGS). It addresses the storage and transmission bottlenecks of prior 3DGS-based FVV approaches, enabling high-quality video at adaptive bitrates orders of magnitude lower than those of previous methods. StreamSTGS achieves this by decoupling static and dynamic scene elements, structuring scene content as canonical 3D Gaussians, temporal feature grids, and a compact learnable deformation field, and by leveraging a hybrid image/video coding scheme suited to efficient streaming and decoding.

1. Scene Representation and Parametrization

StreamSTGS models a dynamic scene over a Group-of-Pictures (GOP) as three decoupled components:

  • Canonical 3D Gaussian Set $\mathcal{G}$: Each Gaussian $g_m$ is defined by center $X \in \mathbb{R}^3$, diagonal scale $S \in \mathbb{R}^3$, rotation quaternion $Q \in \mathbb{R}^4$ (defining covariance $\Sigma = R S S^\top R^\top$), opacity $O \in [0,1]$, and canonical color $C$ (rectified by ReLU).
  • Temporal Feature Grids $\mathcal{E} = \{e_0, \dots, e_{G+W-1}\}$: Each $e_i \in \mathbb{R}^{h \times w \times C}$ encodes the scene's frame-dependent motion and appearance features across time.
  • Deformation Field (MLPs): For time $t_i$, a window of $W$ consecutive temporal feature grids ($fe_i = \operatorname{concat}(e_{i-\lfloor (W-1)/2 \rfloor}, \dots, e_{i+\lfloor (W-1)/2 \rfloor})$) is fed to a temporal MLP $D_t$ to produce a motion feature $f_i$ for each Gaussian. This motion feature conditions all attribute deltas:
    • $\Delta X, \Delta S, \Delta Q \leftarrow D_v(f_i), D_{\mathrm{cov}}(f_i)$
    • $\Delta O \leftarrow \tanh(D_o(f_i, \mathit{view}))$
    • $\Delta C \leftarrow D_c(f_i, \mathit{view})$

Per frame, the deformed Gaussians are $X_i = X + \Delta X$, $S_i = S + \Delta S$, $Q_i = Q + \Delta Q$, $O_i = O + \Delta O$, $C_i = \mathrm{ReLU}(C) + \Delta C$. Rendering is performed via analytic projection and splatting of each anisotropic Gaussian, composited by alpha blending in depth order.
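As a mental model of this decoupling, the following sketch (PyTorch-flavored and purely illustrative; the class and field names are assumptions, not the authors' code) lays out the per-GOP state as tensors:

from dataclasses import dataclass
import torch

@dataclass
class CanonicalGaussians:
    X: torch.Tensor   # (M, 3) centers
    S: torch.Tensor   # (M, 3) diagonal scales
    Q: torch.Tensor   # (M, 4) rotation quaternions
    O: torch.Tensor   # (M, 1) opacities in [0, 1]
    C: torch.Tensor   # (M, 3) canonical colors (ReLU-rectified at render time)

@dataclass
class GOPRepresentation:
    gaussians: CanonicalGaussians
    temporal_grids: torch.Tensor  # grids e_0 .. e_{G+W-1}, each (h, w, C_feat)

def window_features(rep: GOPRepresentation, i: int, W: int = 3) -> torch.Tensor:
    """fe_i: W consecutive temporal grids concatenated along channels.
    How frame indices map to grid indices at the GOP boundaries is an assumption here."""
    half = (W - 1) // 2
    grids = [rep.temporal_grids[i + half + k] for k in range(-half, half + 1)]
    return torch.cat(grids, dim=-1)  # (h, w, W * C_feat)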

2. Mathematical Principles and Splatting Pipeline

Each anisotropic Gaussian $m$ is parametrized as:

  • Mean: $X_m \in \mathbb{R}^3$
  • Covariance: $\Sigma_m = R_m S_m S_m^\top R_m^\top$
  • Opacity: $O_m$
  • Color: $c_m$

Projection to the image plane under camera intrinsics $K$ and pose $W$ is given by $X' = K \frac{WX}{(WX)_z}$ and $\Sigma' = J W \Sigma W^\top J^\top$, where $J$ is the Jacobian of the affine approximation of the projective transformation. The rendered color for each pixel $x$ is $\operatorname{color}(x) = \sum_{m \in N(x)} c_m \alpha_m \prod_{j < m} (1 - \alpha_j)$

with $\alpha_m = 1 - \exp(-O_m \cdot d)$,

where $d$ is proportional to the Gaussian's projected density.
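To make the splatting equations concrete, here is a small NumPy sketch (a toy under assumptions, not the paper's renderer): it projects a single Gaussian with the standard local-affine Jacobian of the perspective projection and composites contributions front to back using the $\alpha_m = 1 - \exp(-O_m \cdot d)$ form above.

import numpy as np

def project_gaussian(X, Sigma, W, K):
    """Project mean X (3,) and covariance Sigma (3,3) with pose W (3x4) and intrinsics K (3x3)."""
    Xc = W[:, :3] @ X + W[:, 3]                  # camera-space center
    x, y, z = Xc
    uv = (K @ (Xc / z))[:2]                      # pixel coordinates
    fx, fy = K[0, 0], K[1, 1]
    # Jacobian of the perspective projection evaluated at Xc (local affine approximation)
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    Sigma2d = J @ W[:, :3] @ Sigma @ W[:, :3].T @ J.T   # 2x2 image-plane covariance
    return uv, Sigma2d

def composite(pixel, gaussians):
    """gaussians: list of (uv, Sigma2d, opacity, color) tuples sorted front to back."""
    color, transmittance = np.zeros(3), 1.0
    for uv, Sigma2d, O, c in gaussians:
        delta = pixel - uv
        d = np.exp(-0.5 * delta @ np.linalg.inv(Sigma2d) @ delta)  # projected density at the pixel
        alpha = 1.0 - np.exp(-O * d)                                # document's alpha form
        color += transmittance * alpha * np.asarray(c)
        transmittance *= 1.0 - alpha
    return color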

3. Hybrid Compression and Streaming Pipeline

To enable low-bitrate, real-time streaming and adaptive bitrate support, StreamSTGS encodes:

  • Canonical Gaussian Parameters as Images (a quantization sketch follows this list):
    • The $M$ Gaussians are sorted into an $H \times W$ grid by the PLAS algorithm.
    • Five images are produced: positions (3 ch.), scales (3 ch.), rotations (4 ch.), opacities (1 ch.), colors (3 ch.).
    • Rotations are quantized with $q_r = 2^7$ levels in $[-1, 2]$, scales and opacities with $q_s = q_o = 2^6$ levels in $[-4, 4]$; positions are stored as float32 in $[0, 1]$.
    • All five images are losslessly packed (JPEG-XL).
    • Spatial regularization ensures compressibility: $L_\mathrm{spatial} = \|\mathcal{G}_\sigma(I_\mathrm{attr}) - I_\mathrm{attr}\|_2$.
  • Temporal Features as Video:
    • Each $e_i$ is "unfolded" to $4h \times 4w$ to match video codec constraints.
    • Temporal feature sequences are encoded with H.264/HEVC (e.g., libx265, QP=20).
    • Temporal smoothness is regularized: $L_\mathrm{temp} = \mathrm{mean}(\mathrm{Huber}(e_{i-1} - e_i), \mathrm{Huber}(e_i - e_{i+1}))$.
  • Adaptive Bitrate Streaming: Since the static attributes are images and the temporal features are a video, standard ABR protocols (DASH, HLS) can be used without retraining.
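The sketch below illustrates how the canonical attributes could be quantized and packed into attribute images, and how the spatial regularizer is evaluated. The quantization helper, the blur width, the float handling of positions and colors, and all function names are assumptions built around the ranges quoted above; PLAS sorting and the lossless JPEG-XL packing step are taken as given and omitted.

import numpy as np
from scipy.ndimage import gaussian_filter

def quantize(x, lo, hi, levels):
    """Uniformly quantize x into `levels` bins over [lo, hi] and map back to floats."""
    q = np.round((np.clip(x, lo, hi) - lo) / (hi - lo) * (levels - 1))
    return q / (levels - 1) * (hi - lo) + lo

def pack_attribute_images(gauss, H, W):
    """gauss: dict of per-Gaussian arrays already in PLAS order; returns five H x W attribute images."""
    def to_image(a):                      # (H*W, ch) -> (H, W, ch)
        return a.reshape(H, W, -1)
    return {
        "position": to_image(gauss["X"].astype(np.float32)),          # kept in [0, 1] float32
        "scale":    to_image(quantize(gauss["S"], -4.0, 4.0, 2**6)),
        "rotation": to_image(quantize(gauss["Q"], -1.0, 2.0, 2**7)),
        "opacity":  to_image(quantize(gauss["O"], -4.0, 4.0, 2**6)),
        "color":    to_image(gauss["C"].astype(np.float32)),          # color quantization not specified in the text
    }

def spatial_smoothness_loss(img, sigma=1.0):
    """L_spatial: L2 distance between an attribute image and its Gaussian-blurred version (sigma assumed)."""
    blurred = gaussian_filter(img, sigma=(sigma, sigma, 0))
    return float(np.sqrt(((blurred - img) ** 2).sum()))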

4. Temporal Dynamics: Sliding-Window and Transformer Training

To capture both local and global motion:

  • Sliding-Window Motion:
    • A window of size $W$ (typically $W=3$) over the temporal grids forms $fe_i$.
    • $f_i = D_t(fe_i, \gamma(t_i))$ with positional encoding $\gamma(t_i)$.
    • Pseudocode:

for i in range(G):                                  # frames 0 .. G-1 of the GOP
    fe_i = concatenate(e[i-1], e[i], e[i+1])        # sliding window of temporal grids (W = 3)
    f_i = D_t(fe_i, positional_encode(t_i))         # per-Gaussian motion feature
    ΔX = D_v(f_i)                                   # position delta
    ΔS, ΔQ = D_cov(f_i)                             # scale and rotation deltas
    ΔO = tanh(D_o(f_i, view))                       # view-conditioned opacity delta
    ΔC = D_c(f_i, view)                             # view-conditioned color delta
    X_i, S_i, Q_i = X + ΔX, S + ΔS, Q + ΔQ
    O_i, C_i = O + ΔO, ReLU(C) + ΔC
    I_i = render_splat(X_i, S_i, Q_i, O_i, C_i)     # differentiable splat rendering

  • Transformer-Guided Auxiliary Branch:
    • To model global (long-range) motion, a "TimeFormer" (2-layer Transformer encoder, 2 heads, hidden=64) processes $f_i$ together with temporal and spatial positional encodings:

    $f_i' = \mathcal{F}(f_i, \gamma(t_i), \gamma(X))$
    • Two-pass training:
      1. Gaussian pass: the original pipeline renders $I_i$.
      2. Auxiliary pass: $f_i'$ goes through the same decoders to render $I_i'$.
    • Distillation and image-quality losses (a rough architecture sketch follows this block):

    $L_{\mathrm{sd}} = \|f_i - f_i'\|_1, \quad L_t = \mathrm{SSIM}(I_i', I_i^{\mathrm{gt}})$
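A rough sketch of the TimeFormer branch and the distillation loss is given below; the way the positional encodings are fused, the feed-forward width, and the batch layout are assumptions, so this should be read as an illustration of the described 2-layer, 2-head, width-64 encoder rather than the released architecture.

import torch
import torch.nn as nn

class TimeFormer(nn.Module):
    def __init__(self, in_dim, d_model=64, n_heads=2, n_layers=2):
        super().__init__()
        # fuse f_i with gamma(t_i) and gamma(X) before the encoder (fusion scheme assumed)
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, f_i, pe_t, pe_x):
        # f_i: (M, F) per-Gaussian motion features; pe_t, pe_x: positional encodings, also (M, ·)
        tokens = self.proj(torch.cat([f_i, pe_t, pe_x], dim=-1)).unsqueeze(0)  # (1, M, d_model)
        return self.encoder(tokens).squeeze(0)  # refined features f_i'

def distillation_loss(f_i, f_i_prime):
    """L_sd = ||f_i - f_i'||_1 (the reduction over Gaussians is an implementation choice)."""
    return (f_i - f_i_prime).abs().sum()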

5. Implementation, Hyperparameters, and Streaming Operation

  • Pretraining: Begin with a "coarse" static 3DGS across all views (≈3,000 iters, batch=2).

  • GOP-wise Refinement: GOP size=60; 12k iters for N3DV, 7k for MeetRoom.

  • Hyperparameters:

    • Regularization scales: $\lambda = 0.001$ (sim. feature quantization), $\alpha_{\mathrm{temp}} = 1.0$, $\alpha_o = 0.01$, $\alpha_{\mathrm{sd}} = 0.005$
    • Dynamic-aware density is activated after 5k iters (N3DV), 3k (MeetRoom).
  • Network Details: MLPs for $D_t$ (1×Linear(64)+Tanh) and for $D_v, D_{\mathrm{cov}}, D_o, D_c$ (2×Linear(64)+Tanh, no bias); a sketch of these MLPs follows this list. Learning rates are set per MLP, e.g., $D_v$ decays from $0.005$ to $5 \times 10^{-5}$. The Transformer learning rate decays from $2 \times 10^{-3}$ to $1 \times 10^{-5}$, and $t_i$ is normalized to $[0,1]$ per GOP.
  • Streaming Procedure: The five attribute images for the upcoming GOP are pre-decoded; the temporal feature video is decoded in real time (≈8 ms/frame).
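Reading the network details literally, the deformation MLPs can be sketched as below; the input and output dimensions and the placement of the activations are assumptions, while the layer counts, width, and bias-free decoders follow the description above.

import torch.nn as nn

def make_D_t(in_dim, width=64):
    # D_t: a single Linear(64) followed by Tanh
    return nn.Sequential(nn.Linear(in_dim, width), nn.Tanh())

def make_decoder(out_dim, width=64):
    # D_v / D_cov / D_o / D_c: two bias-free Linear(64) layers with a Tanh in between (placement assumed)
    return nn.Sequential(nn.Linear(width, width, bias=False), nn.Tanh(),
                         nn.Linear(width, out_dim, bias=False))

D_v = make_decoder(3)        # position delta (assumed output size)
D_cov = make_decoder(3 + 4)  # scale + rotation deltas (assumed split)
D_o = make_decoder(1)        # opacity delta (view conditioning omitted here)
D_c = make_decoder(3)        # color delta (view conditioning omitted here)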

6. Experimental Results and Comparative Performance

Benchmarks on N3DV (6 scenes, ≈60 fps @ 1352×1014) and MeetRoom (12 cams, 1280×720):

| Metric | N3DV (StreamSTGS) | N3DV (4DGC) | N3DV (HiCoM) | MeetRoom (StreamSTGS) | MeetRoom (4DGC) |
| --- | --- | --- | --- | --- | --- |
| PSNR (dB) | 32.30 | 31.52 | 31.32 | 27.41 | 27.11 |
| SSIM | 0.943 | n/a | n/a | n/a | n/a |
| LPIPS (↓) | 0.147 | n/a | n/a | n/a | n/a |
| Avg. frame size | 174 KB | 784 KB | n/a | 142 KB | 1.2 MB |
| Key-frame size | ≈3.9 MB | n/a | n/a | n/a | n/a |
| Decode time (ms) | 8 | n/a | n/a | 6 | n/a |
| Render time (ms) | 10 (100 FPS) | n/a | n/a | 7.9 (126 FPS) | n/a |
| Training (per GOP) | 67 s | n/a | n/a | n/a | n/a |

StreamSTGS achieves a ≈1 dB PSNR gain over 4DGC and reduces the average frame size to ≈174 KB (N3DV) and ≈142 KB (MeetRoom), roughly a 4× and 8× reduction over 4DGC, respectively, and more than 50× lower than prior 3DGStream.
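These reduction factors follow directly from the average frame sizes in the table (taking 1.2 MB as roughly 1200 KB):

\[
\frac{784\ \text{KB}}{174\ \text{KB}} \approx 4.5 \quad (\text{N3DV}), \qquad
\frac{1200\ \text{KB}}{142\ \text{KB}} \approx 8.5 \quad (\text{MeetRoom}).
\]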

7. Limitations and Future Directions

In each GOP, many Gaussians may be static yet still ingest a full WW-frame temporal feature payload, leading to bandwidth waste. Future directions include classifying Gaussians as static or dynamic and only streaming temporal features for the dynamic subset, which is expected to further reduce storage and computation overhead while increasing attainable frame rates.

8. Significance and Context Within Gaussian Splatting Research

StreamSTGS targets the fundamental limitation of 3DGS in streaming settings—high per-frame storage (~10 MB in previous works)—by designing an efficient hybrid image/video compression, structured representation of static and dynamic content, and an MLP-based deformation field. Compared to previous 3DGS-based FVV systems, it offers real-time streaming and low-bitrate adaptation without retraining, maintaining competitive or better perceptual metrics. This architecture positions StreamSTGS as a practical solution for real-world FVV deployments over bandwidth-constrained networks, providing adaptability for AR/VR, telepresence, and immersive media scenarios (Ke et al., 8 Nov 2025).
