
StreamSTGS: Real-Time FVV Compression

Updated 12 November 2025
  • StreamSTGS is a compression-oriented neural representation for free-viewpoint video that decouples scenes into static (canonical 3D Gaussians) and dynamic (temporal features and deformation) components.
  • It employs a hybrid image/video coding scheme, using lossless JPEG-XL for canonical parameters and H.264/HEVC for temporal features to enable adaptive bitrate streaming.
  • Experimental results show improved PSNR, reduced frame sizes, and faster decoding compared to prior 3D Gaussian splatting methods.

StreamSTGS is a compression-oriented neural representation designed for real-time streaming of free-viewpoint video (FVV) using the principles of 3D Gaussian Splatting (3DGS). It addresses the storage and transmission bottlenecks of prior 3DGS-based FVV approaches, enabling high-quality video at adaptive bitrates orders of magnitude lower than those of previous methods. StreamSTGS achieves this by decoupling static and dynamic scene elements, structuring scene content as canonical 3D Gaussians, temporal feature grids, and a compact learnable deformation field, and by leveraging a hybrid image/video coding scheme suited to efficient streaming and decoding.

1. Scene Representation and Parametrization

StreamSTGS models a dynamic scene over a Group-of-Pictures (GOP) as three decoupled components:

  • Canonical 3D Gaussian Set $\mathcal{G}$: Each Gaussian $g_m$ is defined by center $X \in \mathbb{R}^3$, diagonal scale $S \in \mathbb{R}^3$, rotation quaternion $Q \in \mathbb{R}^4$ (defining covariance $\Sigma = R S S^\top R^\top$), opacity $O \in [0,1]$, and canonical color $C$ (rectified by ReLU).
  • Temporal Feature Grids $\mathcal{E} = \{e_0, \dots, e_{G+W-1}\}$: Each $e_i \in \mathbb{R}^{h \times w \times C}$ encodes the scene's frame-dependent motion and appearance features across time.
  • Deformation Field (MLPs): For time $t_i$, a window of $W$ consecutive temporal feature grids ($fe_i = \operatorname{concat}(e_{i-\lfloor (W-1)/2 \rfloor}, \dots, e_{i+\lfloor (W-1)/2 \rfloor})$) is fed to a temporal MLP $D_t$ to produce a motion feature $f_i$ for each Gaussian. This motion feature conditions all attribute deltas:
    • $\Delta X, \Delta S, \Delta Q \leftarrow D_v(f_i), D_{\mathrm{cov}}(f_i)$
    • $\Delta O \leftarrow \tanh(D_o(f_i, \mathit{view}))$
    • $\Delta C \leftarrow D_c(f_i, \mathit{view})$

Per frame, the deformed Gaussians are $X_i = X + \Delta X$, $S_i = S + \Delta S$, $Q_i = Q + \Delta Q$, $O_i = O + \Delta O$, $C_i = \mathrm{ReLU}(C) + \Delta C$. Rendering is performed via analytic projection and splatting of each anisotropic Gaussian, composited by alpha blending in depth order.
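As a mental model of this decoupling, the following sketch (PyTorch-flavored and purely illustrative; the class and field names are assumptions, not the authors' code) lays out the per-GOP state as tensors:

from dataclasses import dataclass
import torch

@dataclass
class CanonicalGaussians:
    X: torch.Tensor   # (M, 3) centers
    S: torch.Tensor   # (M, 3) diagonal scales
    Q: torch.Tensor   # (M, 4) rotation quaternions
    O: torch.Tensor   # (M, 1) opacities in [0, 1]
    C: torch.Tensor   # (M, 3) canonical colors (ReLU-rectified at render time)

@dataclass
class GOPRepresentation:
    gaussians: CanonicalGaussians
    temporal_grids: torch.Tensor  # grids e_0 .. e_{G+W-1}, each (h, w, C_feat)

def window_features(rep: GOPRepresentation, i: int, W: int = 3) -> torch.Tensor:
    """fe_i: W consecutive temporal grids concatenated along channels.
    How frame indices map to grid indices at the GOP boundaries is an assumption here."""
    half = (W - 1) // 2
    grids = [rep.temporal_grids[i + half + k] for k in range(-half, half + 1)]
    return torch.cat(grids, dim=-1)  # (h, w, W * C_feat)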

2. Mathematical Principles and Splatting Pipeline

Each anisotropic Gaussian $m$ is parametrized as:

  • Mean: $X_m \in \mathbb{R}^3$
  • Covariance: $\Sigma_m = R_m S_m S_m^\top R_m^\top$
  • Opacity: $O_m$
  • Color: $c_m$

Projection to the image plane under camera intrinsics $K$ and pose $W$ is given by $X' = K \frac{WX}{(WX)_z}$ and $\Sigma' = J W \Sigma W^\top J^\top$, where $J$ is the Jacobian of the affine approximation of the projective transformation. The rendered color for each pixel $x$ is $\operatorname{color}(x) = \sum_{m \in N(x)} c_m \alpha_m \prod_{j < m} (1 - \alpha_j)$

with $\alpha_m = 1 - \exp(-O_m \cdot d)$,

where $d$ is proportional to the Gaussian's projected density.
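To make the splatting equations concrete, here is a small NumPy sketch (a toy under assumptions, not the paper's renderer): it projects a single Gaussian with the standard local-affine Jacobian of the perspective projection and composites contributions front to back using the $\alpha_m = 1 - \exp(-O_m \cdot d)$ form above.

import numpy as np

def project_gaussian(X, Sigma, W, K):
    """Project mean X (3,) and covariance Sigma (3,3) with pose W (3x4) and intrinsics K (3x3)."""
    Xc = W[:, :3] @ X + W[:, 3]                  # camera-space center
    x, y, z = Xc
    uv = (K @ (Xc / z))[:2]                      # pixel coordinates
    fx, fy = K[0, 0], K[1, 1]
    # Jacobian of the perspective projection evaluated at Xc (local affine approximation)
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    Sigma2d = J @ W[:, :3] @ Sigma @ W[:, :3].T @ J.T   # 2x2 image-plane covariance
    return uv, Sigma2d

def composite(pixel, gaussians):
    """gaussians: list of (uv, Sigma2d, opacity, color) tuples sorted front to back."""
    color, transmittance = np.zeros(3), 1.0
    for uv, Sigma2d, O, c in gaussians:
        delta = pixel - uv
        d = np.exp(-0.5 * delta @ np.linalg.inv(Sigma2d) @ delta)  # projected density at the pixel
        alpha = 1.0 - np.exp(-O * d)                                # document's alpha form
        color += transmittance * alpha * np.asarray(c)
        transmittance *= 1.0 - alpha
    return color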

3. Hybrid Compression and Streaming Pipeline

To enable low-bitrate, real-time streaming and adaptive bitrate support, StreamSTGS encodes:

  • Canonical Gaussian Parameters as Images (a quantization sketch follows this list):
    • The $M$ Gaussians are sorted into an $H \times W$ grid by the PLAS algorithm.
    • Five images are produced: positions (3 ch.), scales (3 ch.), rotations (4 ch.), opacities (1 ch.), colors (3 ch.).
    • Rotations are quantized with $q_r = 2^7$ levels in $[-1, 2]$, scales and opacities with $q_s = q_o = 2^6$ levels in $[-4, 4]$; positions are stored as float32 in $[0, 1]$.
    • All five images are losslessly packed (JPEG-XL).
    • Spatial regularization ensures compressibility: $L_\mathrm{spatial} = \|\mathcal{G}_\sigma(I_\mathrm{attr}) - I_\mathrm{attr}\|_2$.
  • Temporal Features as Video:
    • Each $e_i$ is "unfolded" to $4h \times 4w$ to match video codec constraints.
    • Temporal feature sequences are encoded with H.264/HEVC (e.g., libx265, QP=20).
    • Temporal smoothness is regularized: $L_\mathrm{temp} = \mathrm{mean}(\mathrm{Huber}(e_{i-1} - e_i), \mathrm{Huber}(e_i - e_{i+1}))$.
  • Adaptive Bitrate Streaming: Since the static attributes are images and the temporal features are a video, standard ABR protocols (DASH, HLS) can be used without retraining.
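The sketch below illustrates how the canonical attributes could be quantized and packed into attribute images, and how the spatial regularizer is evaluated. The quantization helper, the blur width, the float handling of positions and colors, and all function names are assumptions built around the ranges quoted above; PLAS sorting and the lossless JPEG-XL packing step are taken as given and omitted.

import numpy as np
from scipy.ndimage import gaussian_filter

def quantize(x, lo, hi, levels):
    """Uniformly quantize x into `levels` bins over [lo, hi] and map back to floats."""
    q = np.round((np.clip(x, lo, hi) - lo) / (hi - lo) * (levels - 1))
    return q / (levels - 1) * (hi - lo) + lo

def pack_attribute_images(gauss, H, W):
    """gauss: dict of per-Gaussian arrays already in PLAS order; returns five H x W attribute images."""
    def to_image(a):                      # (H*W, ch) -> (H, W, ch)
        return a.reshape(H, W, -1)
    return {
        "position": to_image(gauss["X"].astype(np.float32)),          # kept in [0, 1] float32
        "scale":    to_image(quantize(gauss["S"], -4.0, 4.0, 2**6)),
        "rotation": to_image(quantize(gauss["Q"], -1.0, 2.0, 2**7)),
        "opacity":  to_image(quantize(gauss["O"], -4.0, 4.0, 2**6)),
        "color":    to_image(gauss["C"].astype(np.float32)),          # color quantization not specified in the text
    }

def spatial_smoothness_loss(img, sigma=1.0):
    """L_spatial: L2 distance between an attribute image and its Gaussian-blurred version (sigma assumed)."""
    blurred = gaussian_filter(img, sigma=(sigma, sigma, 0))
    return float(np.sqrt(((blurred - img) ** 2).sum()))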

4. Temporal Dynamics: Sliding-Window and Transformer Training

To capture both local and global motion:

  • Sliding-Window Motion:
    • A window of size $W$ (typically $W=3$) over the temporal grids forms $fe_i$.
    • $f_i = D_t(fe_i, \gamma(t_i))$ with positional encoding $\gamma(t_i)$.
    • Pseudocode:

for i in range(G):                                  # frames 0 .. G-1 of the GOP
    fe_i = concatenate(e[i-1], e[i], e[i+1])        # sliding window of temporal grids (W = 3)
    f_i = D_t(fe_i, positional_encode(t_i))         # per-Gaussian motion feature
    ΔX = D_v(f_i)                                   # position delta
    ΔS, ΔQ = D_cov(f_i)                             # scale and rotation deltas
    ΔO = tanh(D_o(f_i, view))                       # view-conditioned opacity delta
    ΔC = D_c(f_i, view)                             # view-conditioned color delta
    X_i, S_i, Q_i = X + ΔX, S + ΔS, Q + ΔQ
    O_i, C_i = O + ΔO, ReLU(C) + ΔC
    I_i = render_splat(X_i, S_i, Q_i, O_i, C_i)     # differentiable splat rendering

  • Transformer-Guided Auxiliary Branch:
    • To model global (long-range) motion, a "TimeFormer" (2-layer Transformer encoder, 2 heads, hidden=64) processes $f_i$ together with temporal and spatial positional encodings:

    $f_i' = \mathcal{F}(f_i, \gamma(t_i), \gamma(X))$
    • Two-pass training:
      1. Gaussian pass: the original pipeline renders $I_i$.
      2. Auxiliary pass: $f_i'$ goes through the same decoders to render $I_i'$.
    • Distillation and image-quality losses (a rough architecture sketch follows this block):

    $L_{\mathrm{sd}} = \|f_i - f_i'\|_1, \quad L_t = \mathrm{SSIM}(I_i', I_i^{\mathrm{gt}})$
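A rough sketch of the TimeFormer branch and the distillation loss is given below; the way the positional encodings are fused, the feed-forward width, and the batch layout are assumptions, so this should be read as an illustration of the described 2-layer, 2-head, width-64 encoder rather than the released architecture.

import torch
import torch.nn as nn

class TimeFormer(nn.Module):
    def __init__(self, in_dim, d_model=64, n_heads=2, n_layers=2):
        super().__init__()
        # fuse f_i with gamma(t_i) and gamma(X) before the encoder (fusion scheme assumed)
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, f_i, pe_t, pe_x):
        # f_i: (M, F) per-Gaussian motion features; pe_t, pe_x: positional encodings, also (M, ·)
        tokens = self.proj(torch.cat([f_i, pe_t, pe_x], dim=-1)).unsqueeze(0)  # (1, M, d_model)
        return self.encoder(tokens).squeeze(0)  # refined features f_i'

def distillation_loss(f_i, f_i_prime):
    """L_sd = ||f_i - f_i'||_1 (the reduction over Gaussians is an implementation choice)."""
    return (f_i - f_i_prime).abs().sum()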

5. Implementation, Hyperparameters, and Streaming Operation

  • Pretraining: Begin with a "coarse" static 3DGS across all views (≈3,000 iters, batch=2).

  • GOP-wise Refinement: GOP size=60; 12k iters for N3DV, 7k for MeetRoom.

  • Hyperparameters:

    • Regularization scales: $\lambda = 0.001$ (sim. feature quantization), $\alpha_{\mathrm{temp}} = 1.0$, $\alpha_o = 0.01$, $\alpha_{\mathrm{sd}} = 0.005$
    • Dynamic-aware density is activated after 5k iters (N3DV), 3k (MeetRoom).
  • Network Details: MLPs for $D_t$ (1×Linear(64)+Tanh) and for $D_v, D_{\mathrm{cov}}, D_o, D_c$ (2×Linear(64)+Tanh, no bias); a sketch of these MLPs follows this list. Learning rates are set per MLP, e.g., $D_v$ decays from $0.005$ to $5 \times 10^{-5}$. The Transformer learning rate decays from $2 \times 10^{-3}$ to $1 \times 10^{-5}$, and $t_i$ is normalized to $[0,1]$ per GOP.
  • Streaming Procedure: The five attribute images for the upcoming GOP are pre-decoded; the temporal feature video is decoded in real time (≈8 ms/frame).
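Reading the network details literally, the deformation MLPs can be sketched as below; the input and output dimensions and the placement of the activations are assumptions, while the layer counts, width, and bias-free decoders follow the description above.

import torch.nn as nn

def make_D_t(in_dim, width=64):
    # D_t: a single Linear(64) followed by Tanh
    return nn.Sequential(nn.Linear(in_dim, width), nn.Tanh())

def make_decoder(out_dim, width=64):
    # D_v / D_cov / D_o / D_c: two bias-free Linear(64) layers with a Tanh in between (placement assumed)
    return nn.Sequential(nn.Linear(width, width, bias=False), nn.Tanh(),
                         nn.Linear(width, out_dim, bias=False))

D_v = make_decoder(3)        # position delta (assumed output size)
D_cov = make_decoder(3 + 4)  # scale + rotation deltas (assumed split)
D_o = make_decoder(1)        # opacity delta (view conditioning omitted here)
D_c = make_decoder(3)        # color delta (view conditioning omitted here)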

6. Experimental Results and Comparative Performance

Benchmarks on N3DV (6 scenes, ≈60 fps @ 1352×1014) and MeetRoom (12 cams, 1280×720):

| Metric | N3DV (StreamSTGS) | N3DV (4DGC) | N3DV (HiCoM) | MeetRoom (StreamSTGS) | MeetRoom (4DGC) |
| --- | --- | --- | --- | --- | --- |
| PSNR (dB) | 32.30 | 31.52 | 31.32 | 27.41 | 27.11 |
| SSIM | 0.943 | n/a | n/a | n/a | n/a |
| LPIPS (↓) | 0.147 | n/a | n/a | n/a | n/a |
| Avg. frame size | 174 KB | 784 KB | n/a | 142 KB | 1.2 MB |
| Key-frame size | ≈3.9 MB | n/a | n/a | n/a | n/a |
| Decode time (ms) | 8 | n/a | n/a | 6 | n/a |
| Render time (ms) | 10 (100 FPS) | n/a | n/a | 7.9 (126 FPS) | n/a |
| Training (per GOP) | 67 s | n/a | n/a | n/a | n/a |

StreamSTGS achieves a ≈1 dB PSNR gain over 4DGC and reduces the average frame size to ≈174 KB (N3DV) and ≈142 KB (MeetRoom), roughly a 4× and 8× reduction over 4DGC, respectively, and more than 50× lower than prior 3DGStream.
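These reduction factors follow directly from the average frame sizes in the table (taking 1.2 MB as roughly 1200 KB):

\[
\frac{784\ \text{KB}}{174\ \text{KB}} \approx 4.5 \quad (\text{N3DV}), \qquad
\frac{1200\ \text{KB}}{142\ \text{KB}} \approx 8.5 \quad (\text{MeetRoom}).
\]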

7. Limitations and Future Directions

In each GOP, many Gaussians may be static yet still ingest a full WW-frame temporal feature payload, leading to bandwidth waste. Future directions include classifying Gaussians as static or dynamic and only streaming temporal features for the dynamic subset, which is expected to further reduce storage and computation overhead while increasing attainable frame rates.

8. Significance and Context Within Gaussian Splatting Research

StreamSTGS targets the fundamental limitation of 3DGS in streaming settings—high per-frame storage (~10 MB in previous works)—by designing an efficient hybrid image/video compression, structured representation of static and dynamic content, and an MLP-based deformation field. Compared to previous 3DGS-based FVV systems, it offers real-time streaming and low-bitrate adaptation without retraining, maintaining competitive or better perceptual metrics. This architecture positions StreamSTGS as a practical solution for real-world FVV deployments over bandwidth-constrained networks, providing adaptability for AR/VR, telepresence, and immersive media scenarios (Ke et al., 8 Nov 2025).
