StreamSplat: Online 3D Gaussian Splatting

Updated 11 June 2026

StreamSplat is an online approach for 3D Gaussian Splatting that processes uncalibrated video streams frame-by-frame, enabling real-time dynamic scene reconstruction.
It employs probabilistic encoding and dynamic token fusion to maintain temporal coherence and robust geometry without relying on full video buffering or global optimization.
The method integrates memory-efficient redundancy compression and selective fusion strategies to achieve bounded GPU memory usage and near real-time performance.

StreamSplat refers to a class of online, real-time 3D Gaussian Splatting (3DGS) methodologies capable of incrementally reconstructing dynamic or static 3D scene representations from long, potentially uncalibrated video streams or image sequences. These approaches are distinguished by their streaming architectures, probabilistic formulations, and memory/pruning mechanisms, which avoid full video buffering and global optimization and thereby allow efficient processing of videos of theoretically unbounded length (Wu et al., 10 Jun 2025, Huang et al., 22 Jul 2025).

1. StreamSplat: Definition and Core Problem

StreamSplat designates the first fully feed-forward family of methods that process uncalibrated monocular (or multi-view) video streams of arbitrary length, producing dynamic 3DGS representations suitable for rendering or further analysis at any time. These frameworks operate online—processing one frame at a time and updating a global or locally maintained set of Gaussians—without offline bundle adjustment, per-scene optimization, or full video storage (Wu et al., 10 Jun 2025, Huang et al., 22 Jul 2025). The broad goal is robust, temporally consistent, and computationally efficient 3D reconstruction of evolving scenes, supporting both novel view synthesis and geometry recovery under streaming constraints.

2. Streaming 3DGS Pipeline: Principal Architectures

Canonical StreamSplat pipelines follow a feed-forward, incremental pattern. At time $t$:

For each input frame $I_t$ (potentially with unknown camera parameters), a pseudo-depth estimate $D_t$ is predicted.
A static encoder fuses $I_t$ and $D_t$ to generate a set of 3D Gaussian tokens, transformed into a canonical space.
For modeling temporal dynamics, a dynamic decoder combines tokens from $I_t$ and cached representations (often including per-frame deep features such as DINOv2), predicting a deformation field that aligns scene content across time intervals.
The resulting Gaussians are adaptively blended or fused with overlapping past Gaussians using a soft, opacity-based rule, updating a current scene representation.
Only local token embeddings and features are cached, so memory scales with frame-wise data rather than full sequence length.
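The steps above can be sketched as a small streaming loop. This is a toy illustration under stated assumptions, not the published architecture: `predict_pseudo_depth`, `encode_static`, `decode_dynamics`, and `fuse` are hypothetical placeholders for the learned networks, the "canonical space" is just normalized pixel coordinates plus depth, and the opacity-weighted average is one plausible reading of the soft fusion rule.

```python
from collections import deque

import numpy as np


def predict_pseudo_depth(frame):
    """Hypothetical stand-in for a monocular depth network; returns an (H, W) map."""
    return np.ones(frame.shape[:2], dtype=np.float32)


def encode_static(frame, depth, stride=4):
    """Hypothetical static encoder: one Gaussian token per sampled pixel."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h:stride, 0:w:stride]
    # Toy "canonical space": normalized pixel coordinates plus pseudo-depth.
    pos = np.stack([xs / w, ys / h, depth[ys, xs]], axis=-1).reshape(-1, 3)
    return {
        "pos": pos.astype(np.float32),
        "color": frame[ys, xs].reshape(-1, 3).astype(np.float32),
        "opacity": np.full((pos.shape[0], 1), 0.5, dtype=np.float32),
    }


def decode_dynamics(tokens, cached):
    """Hypothetical dynamic decoder: per-token displacement from cached to current."""
    return tokens["pos"] - cached["pos"]


def fuse(tokens, cached, displacement):
    """Soft, opacity-weighted blend of warped past Gaussians with new ones."""
    warped = cached["pos"] + displacement
    w_new, w_old = tokens["opacity"], cached["opacity"]
    total = w_new + w_old + 1e-8
    return {
        "pos": (w_new * tokens["pos"] + w_old * warped) / total,
        "color": (w_new * tokens["color"] + w_old * cached["color"]) / total,
        "opacity": np.maximum(w_new, w_old),  # assumed rule for the fused opacity
    }


def stream(frames, cache_size=1):
    """Yield an updated scene after every frame, keeping only a small local cache."""
    cache = deque(maxlen=cache_size)
    for frame in frames:
        depth = predict_pseudo_depth(frame)
        tokens = encode_static(frame, depth)
        if cache:
            displacement = decode_dynamics(tokens, cache[-1])
            scene = fuse(tokens, cache[-1], displacement)
        else:
            scene = tokens  # first frame: nothing to fuse with yet
        cache.append(scene)
        yield scene
```

Because the cache is a fixed-length `deque`, the state carried between frames does not grow with the number of frames processed, which is the property the last step of the pipeline describes.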

An alternative approach, as realized in LongSplat, maintains a global set of Gaussians $\mathcal{G}^g_t$ that is repeatedly updated using both new image-derived proposals and projections of historical Gaussians (cf. Gaussian-Image Representation, below). Historical elements are pruned if marked redundant, maintaining bounded memory (Huang et al., 22 Jul 2025).

3. Key Innovations: Probabilistic Encoding and Temporal Coherence

A central property is probabilistic encoding of 3D Gaussian parameters, especially for handling inherent scene/observation ambiguities:

Static encoders analyze each frame using transformer-based architectures to regress Gaussian parameters (position, orientation, scale, opacity, color) not as deterministic values but as draws from a truncated normal distribution $o \sim \mathcal{N}_{(-1,1)}(\mu_p, \Sigma_p)$. This formulation allows the model to explore multiple plausible depth/layout explanations early in training, reducing the risk of getting stuck in poor local minima and improving final spatial assignments (Wu et al., 10 Jun 2025).
Dynamics are preserved by bidirectional deformation fields: forward and backward alignment of Gaussian parameters, parameterized as per-token velocity vectors and opacity coefficients.
Opacity blending across frames enables soft fusion without hard point-matching or explicit correspondence, mitigating blur, drift, and over-aggregation typical in prior frame-wise or global streaming optimization schemes.
For even higher scalability, LongSplat adopts an explicit 2D Gaussian-Image Representation (GIR), which encodes all 3DGS parameters in an image-aligned 2D grid for the current view, facilitating identity tracking, efficient 2D fusion, and redundancy-aware pruning (Huang et al., 22 Jul 2025).
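To make the probabilistic encoding and the velocity-based deformation concrete, here is a minimal sketch. It assumes a diagonal covariance (a per-dimension `sigma` instead of a full $\Sigma_p$), draws from the truncated normal on $(-1, 1)$ by simple rejection sampling, and models motion as linear in time. The function names and these simplifications are illustrative assumptions, not the cited papers' implementation.

```python
import numpy as np


def sample_truncated_normal(mu, sigma, low=-1.0, high=1.0, rng=None, max_iter=1000):
    """Draw one sample per element from N(mu, sigma^2) restricted to (low, high).

    Uses rejection sampling: redraw only the elements that fell outside the interval.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = np.atleast_1d(np.asarray(mu, dtype=np.float64))
    sigma = np.broadcast_to(np.asarray(sigma, dtype=np.float64), mu.shape)
    out = np.empty_like(mu)
    pending = np.ones(mu.shape, dtype=bool)
    for _ in range(max_iter):
        if not pending.any():
            break
        idx = np.flatnonzero(pending)            # flat indices still to be filled
        draw = rng.normal(mu[pending], sigma[pending])
        ok = (draw > low) & (draw < high)
        out.flat[idx[ok]] = draw[ok]
        pending.flat[idx[ok]] = False
    # Fallback for the (unlikely) leftovers: use the clipped mean.
    out[pending] = np.clip(mu[pending], low, high)
    return out


def advect(positions, velocities, dt):
    """Move Gaussian centres along per-token velocities (assumed linear motion).

    A positive dt deforms forward in time, a negative dt backward.
    """
    return positions + dt * velocities


rng = np.random.default_rng(0)
offsets = sample_truncated_normal(np.zeros((5, 3)), 0.3, rng=rng)
forward = advect(offsets, np.ones((5, 3)), dt=0.5)
backward = advect(offsets, np.ones((5, 3)), dt=-0.5)
```

Sampling during training and taking the mean at inference is a common pattern for this kind of stochastic parameterization; whether the cited methods do exactly that is not stated in this article.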

4. Redundancy Compression and Memory Management

Online 3DGS can cause the Gaussian set to grow without bound as frames accumulate. StreamSplat models keep memory usage bounded through two mechanisms:

Identity-Aware Redundancy Compression: Each frame, historical Gaussians are rendered/projected into the current view; fusion networks predict per-pixel update masks and weights (via sigmoid activations and thresholds). A hard or soft mask prunes historical Gaussians redundant with current predictions, using criteria such as 3D IoU computed via Oriented Bounding Boxes.
Selective Fusion/Replacement: At the mask level, new proposals are merged with survivors from the previous states. This supports both soft updates—using per-pixel weights to blend old and new parameters—and hard replacements for entire Gaussians.
For further compression without quality loss, supervision can be borrowed from highly compressed static Gaussian sets (e.g., LightGaussian), guiding network training with photometric, geometric, and mask-driven losses (Huang et al., 22 Jul 2025).
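The mask-and-blend logic described above can be sketched on image-aligned parameter grids. This is a simplified illustration: `mask_logits` and `weight_logits` stand in for the outputs of a hypothetical fusion network, the grids are plain `(H, W, C)` arrays, and the OBB-based 3D IoU criterion is not modelled here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fuse_gaussian_images(history, current, mask_logits, weight_logits, threshold=0.5):
    """Fuse projected historical Gaussians with current-frame proposals.

    history, current: (H, W, C) image-aligned Gaussian parameter grids.
    mask_logits, weight_logits: (H, W) outputs of a hypothetical fusion network.
    Returns the fused grid and a boolean (H, W) mask of pruned historical entries.
    """
    # Hard mask: where the history is judged redundant, replace it outright.
    pruned = sigmoid(mask_logits) > threshold
    # Soft update: elsewhere, blend old and new parameters with a per-pixel weight.
    w = sigmoid(weight_logits)[..., None]
    blended = w * current + (1.0 - w) * history
    fused = np.where(pruned[..., None], current, blended)
    return fused, pruned


history = np.zeros((2, 4, 3))
current = np.ones((2, 4, 3))
mask_logits = np.array([[10.0, 10.0, -10.0, -10.0]] * 2)  # left half is redundant
weight_logits = np.zeros((2, 4))                           # equal blend elsewhere
fused, pruned = fuse_gaussian_images(history, current, mask_logits, weight_logits)
print(int(pruned.sum()), "of", pruned.size, "historical entries pruned")
```

Keeping both paths in a single `np.where` makes the trade-off explicit: a pruned entry is dropped and never revisited, while a blended entry keeps a trace of the past state.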

This design keeps the cost of each per-frame update $O(1)$ with respect to sequence length; in practice, reported runtime per frame typically falls in the 30–60 ms range on modern GPUs, supporting real-time or interactive rates (Huang et al., 22 Jul 2025).

5. Quantitative Performance and Experimental Findings

StreamSplat and its variants have demonstrated state-of-the-art results on both static and dynamic benchmarks. Several key metrics and results follow:

On RealEstate10K (static views), StreamSplat achieves PSNR = 29.51, SSIM = 0.839, LPIPS = 0.122, outperforming all dynamic baselines and matching or exceeding static methods while relying only on uncalibrated video (Wu et al., 10 Jun 2025).
On DAVIS/YouTube-VOS (dynamic), StreamSplat runs at ~1.48 s/frame, with PSNR = 37.83, SSIM = 0.982, LPIPS = 0.016, yielding sharp geometry and temporally coherent motion superior to prior methods that require much slower per-scene optimization (from ≥30 s to minutes per frame) (Wu et al., 10 Jun 2025).
For online memory efficiency, LongSplat reduces active Gaussian counts by 44% (compression ratio) with negligible LPIPS degradation (~0.17) and maintains constant GPU memory as sequence length increases (Huang et al., 22 Jul 2025).
Key metrics: Compression Ratio (percentage of Gaussians pruned), runtime per frame, memory usage (typically sublinear in video length due to adaptive pruning).
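For reference, the metrics named above reduce to short formulas. The helpers below are a minimal sketch using the standard definitions of PSNR and of throughput as the inverse of latency. The function names are illustrative, and the Gaussian counts in the usage lines are made-up inputs chosen only to reproduce a 44% ratio.

```python
import numpy as np


def compression_ratio(n_before, n_after):
    """Fraction of Gaussians pruned, e.g. 0.44 means 44% were removed."""
    if n_before <= 0:
        raise ValueError("n_before must be positive")
    return (n_before - n_after) / n_before


def frames_per_second(seconds_per_frame):
    """Throughput implied by a per-frame latency."""
    if seconds_per_frame <= 0:
        raise ValueError("seconds_per_frame must be positive")
    return 1.0 / seconds_per_frame


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    reference = np.asarray(reference, dtype=np.float64)
    rendered = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((reference - rendered) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value**2 / mse))


# Hypothetical counts: 100,000 Gaussians before pruning, 56,000 after.
print(f"compression ratio: {compression_ratio(100_000, 56_000):.2f}")
# Latencies quoted in this article, converted to frame rates.
for latency in (0.030, 0.060, 1.48):
    print(f"{latency:.3f} s/frame -> {frames_per_second(latency):.2f} FPS")
```

Converting latencies this way puts the two systems on one scale: 30–60 ms per frame corresponds to roughly 17–33 FPS, while 1.48 s per frame is below 1 FPS.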

6. Limitations and Future Directions

Notable limitations remain:

All current StreamSplat/LongSplat versions depend on accurate pseudo-depth or external pose predictions. Depth errors around thin structures or at large discontinuities can introduce representation artifacts, partially mitigated by adaptive loss weighting (Wu et al., 10 Jun 2025). A plausible implication is that learning monocular depth as a native network component could further reduce such errors.
Temporal context is currently strictly local (e.g., two-frame windows or masked-update history): extended occlusions or very dynamic scenes might cause premature fading or memory loss of important Gaussians.
Maximum spatial resolution and runtime: present runtimes (1–1.5 s per frame for StreamSplat, >20 FPS for LongSplat) approach real-time but fall short of high-resolution, full-frame rates on commodity hardware. Ongoing improvements in network efficiency and token/patch sampling could bridge this gap.
For strictly online pipelines, persistent drift due to iterative small errors (pose, depth, mask) or long-term scene changes may require richer memory (e.g., a compressed global key-set) or periodic re-alignment.

StreamSplat's feed-forward architecture contrasts with previous frame-level (Frame-Stream) or batch-based (Clip) 3DGS optimization, which suffer from either temporal instability (Frame-Stream) or poor scalability (Clip). Hybrid methods such as ClipGStream (Liang et al., 15 Apr 2026) offer a related but distinct paradigm: they split sequences into short, partially optimized clips and inherit core scene anchors and decoders across them. This achieves bounded memory, high temporal stability, and scalability on extremely long videos, but it typically relies on known camera poses and is optimized offline rather than in a fully online, per-frame manner.

A plausible implication is that future research may pursue:

Tight integration of monocular depth/pseudo-pose refinement within the online feed-forward network.
Dynamic memory and active compression for persistent, lifelong 3D representation without degradation on arbitrarily long video streams.
Unification of clip-level and frame-level streaming for optimal stability/efficiency trade-offs.

Method	Optimization granularity	Temporal context	Scalability	Real-time capability
StreamSplat	Frame	Bidirectional (2 frames)	Arbitrary video	Near real-time
LongSplat	Frame	Markov + global redundancy	Arbitrary video	Real-time
ClipGStream	Clip (~50–300 frames)	Clip + inter-clip inheritance	Long video	Real-time per-clip

StreamSplat systems fundamentally advance practical dynamic 3D reconstruction and streaming view synthesis, jointly achieving robust performance, scalability, and computational frugality through probabilistic scene encoding and principled online pruning (Wu et al., 10 Jun 2025, Huang et al., 22 Jul 2025, Liang et al., 15 Apr 2026).