LongSplat: Scalable 3D Gaussian Splatting
- LongSplat is an online framework for scalable 3D Gaussian splatting that incrementally integrates long image sequences with joint camera pose and geometry optimization.
- It employs a novel Gaussian-Image Representation and adaptive redundancy compression via a streaming update mechanism to ensure high rendering fidelity and efficiency.
- The framework effectively handles unposed videos and large-scale scenes, enabling robust real-time novel view synthesis and dynamic scene reconstruction.
LongSplat is a framework for online, efficient, and scalable 3D Gaussian Splatting from long image sequences, addressing both real-time novel view synthesis and robust reconstruction from casual, unposed videos. It overcomes major limitations in prior methods by supporting incremental updates, adaptive redundancy compression, and joint camera pose/geometry optimization, making it suitable for dynamic and large-scale scene modeling. At its core, LongSplat combines a streaming update mechanism and a novel Gaussian-Image Representation to maintain high-quality reconstructions with low computational and memory overhead, achieving state-of-the-art results in rendering fidelity, pose accuracy, and efficiency.
1. Streaming Update Mechanism and Incremental Optimization
LongSplat’s operational paradigm is based on streaming updates, designed for long image sequences where traditional per-scene optimization is computationally prohibitive. Instead of reoptimizing the entire scene representation for each new frame, LongSplat incrementally integrates current observations into the global Gaussian model $\mathcal{G}$. For each time step $t$:
- A multi-view spatial feature map $F_t^{\text{cur}}$ is computed from the current frame and its neighbors using the DepthSplat pipeline, capturing new geometric and appearance information.
- A historical feature map $F_t^{\text{hist}}$ is generated by projecting previously accumulated Gaussians via a differentiable operator to obtain the Gaussian-Image Representation (GIR).
- These are fused (e.g., with a transformer module) into an enriched feature representation $F_t$.
- From $F_t$, the model predicts an update mask $M_t$, controlling adaptive compression (pruning redundant Gaussians) and online integration (fusing new Gaussian predictions with history).
The global update is formulated as:

$$\mathcal{G}_t = \mathcal{F}\left(\mathcal{G}_{t-1},\ \hat{\mathcal{G}}_t;\ M_t\right),$$

where $\hat{\mathcal{G}}_t$ denotes the Gaussians predicted from the current observation and $\mathcal{F}$ is a fusion function implemented with a transformer. This mechanism ensures that only high-confidence updates replace historical representations, enabling efficient online reconstruction and continuous scene adaptation (Huang et al., 22 Jul 2025).
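A minimal PyTorch sketch of this masked integration follows, assuming per-pixel Gaussian parameters stored as GIR-style channel maps; `fuse` and `mask_head` are illustrative stand-ins for the paper's transformer fusion and mask predictor, not its actual modules:

```python
import torch
import torch.nn as nn

def streaming_update(g_hist, g_new, f_hist, f_new, fuse, mask_head, thresh=0.5):
    """One streaming step: fuse current and historical features, predict the
    update mask M_t, and overwrite only high-confidence pixels of history."""
    f = fuse(torch.cat([f_new, f_hist], dim=1))          # fused feature F_t
    m = (torch.sigmoid(mask_head(f)) > thresh).float()   # binary mask M_t
    return m * g_new + (1.0 - m) * g_hist                # masked integration

# Stand-ins for the paper's transformer fusion and mask predictor.
fuse = nn.Conv2d(2 * 32, 32, kernel_size=3, padding=1)
mask_head = nn.Conv2d(32, 1, kernel_size=1)

f_new, f_hist = torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64)
g_new, g_hist = torch.randn(1, 11, 64, 64), torch.randn(1, 11, 64, 64)  # GIR-style channels
g_t = streaming_update(g_hist, g_new, f_hist, f_new, fuse, mask_head)
```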
For unposed scenarios (casual long videos), the framework incorporates incremental joint optimization: new frames are incorporated by estimating their poses via learned 3D priors (from foundation models such as MASt3R), then updating the 3DGS in a visibility-adapted local window. Periodic global optimizations refine both camera poses and Gaussian parameters to maintain global consistency and avoid local minima, with a loss integrating photometric fidelity, depth alignment, and reprojection constraints:

$$\mathcal{L} = \lambda_{\text{photo}}\,\mathcal{L}_{\text{photo}} + \lambda_{\text{depth}}\,\mathcal{L}_{\text{depth}} + \lambda_{\text{reproj}}\,\mathcal{L}_{\text{reproj}}.$$
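The combined objective can be sketched as follows; the loss weights and the least-squares scale-and-shift depth fit are assumptions for illustration, not the papers' exact formulation:

```python
import torch

def joint_loss(render_rgb, gt_rgb, render_depth, prior_depth,
               proj_pts, matched_pts, w_photo=1.0, w_depth=0.1, w_reproj=0.1):
    """Hedged sketch of the joint objective (weights are illustrative):
    photometric fidelity + depth alignment to priors + reprojection error."""
    l_photo = (render_rgb - gt_rgb).abs().mean()

    # Scale-and-shift fit of the prior depth to the rendered depth, one common
    # way to resolve the scale ambiguity of monocular depth priors.
    d, p = render_depth.flatten(), prior_depth.flatten()
    A = torch.stack([p, torch.ones_like(p)], dim=1)            # (N, 2)
    sol = torch.linalg.solve(A.T @ A, A.T @ d.unsqueeze(1))    # normal equations
    s, t = sol[0, 0], sol[1, 0]
    l_depth = (d - (s * p + t)).abs().mean()

    # Distance between correspondences projected with the current pose and
    # their matched keypoints in the target frame.
    l_reproj = (proj_pts - matched_pts).norm(dim=-1).mean()

    return w_photo * l_photo + w_depth * l_depth + w_reproj * l_reproj
```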
2. Gaussian-Image Representation (GIR)
A key innovation of LongSplat is the Gaussian-Image Representation, which encodes the parameters of 3D Gaussians into a 2D image-like grid. Each pixel $p$ in the GIR stores:

$$\mathrm{GIR}(p) = \left(\mu_p,\ \mathrm{vec}(\Sigma_p),\ \alpha_p,\ id_p\right),$$

where $\mu_p$ is the projected position, $\mathrm{vec}(\Sigma_p)$ is the vectorized upper-triangle of the covariance, $\alpha_p$ is the opacity, and $id_p$ is a unique mapping to the source Gaussian.
GIR enables efficient fusion and computation:
- Grid-aligned 2D representation allows localized processing via 2D convolutions and transformers.
- Temporal consistency is maintained, as pixel-level unique IDs support tracking and compression of Gaussians across views.
- Identity-aware redundancy compression is supported, enabling selective pruning of “floating” or outdated Gaussians.
Rendering strategies include:
- Nearest Rendering: the first visible Gaussian along the ray, i.e., the first index $i$ with $\alpha_i > \tau$ for an opacity threshold $\tau$.
- Most-Contributive Rendering: the Gaussian with maximal transmittance-weighted opacity, $w_i = \alpha_i \prod_{j<i}(1-\alpha_j)$.
This projection of 3D attributes to a structured 2D format simplifies both data fusion and historical compression (Huang et al., 22 Jul 2025).
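A concrete sketch of the GIR layout and the most-contributive selection rule, assuming a per-pixel list of K depth-sorted candidate Gaussians (field names and shapes are illustrative, not the paper's exact layout):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GIR:
    """Per-pixel Gaussian-Image Representation (channel layout illustrative)."""
    mu: np.ndarray     # (H, W, 3) projected Gaussian positions
    cov: np.ndarray    # (H, W, 6) vectorized upper-triangle covariances
    alpha: np.ndarray  # (H, W)    opacities
    gid: np.ndarray    # (H, W)    unique ids of the source Gaussians

def most_contributive_ids(alphas, ids):
    """Given depth-sorted per-pixel opacities (H, W, K) and Gaussian ids,
    select the Gaussian with maximal transmittance-weighted opacity
    w_i = alpha_i * prod_{j<i}(1 - alpha_j)."""
    T = np.cumprod(1.0 - alphas, axis=-1)
    T = np.concatenate([np.ones_like(T[..., :1]), T[..., :-1]], axis=-1)
    w = alphas * T                                   # blending weights
    best = w.argmax(axis=-1)                         # (H, W)
    return np.take_along_axis(ids, best[..., None], axis=-1).squeeze(-1)
```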
3. Robust Pose Estimation and Optimization for Unposed Videos
For videos with unknown camera poses, LongSplat utilizes a robust Pose Estimation Module:
- Correspondences are formed between keypoints of consecutive frames and back-projected into 3D using learned depth priors: $X_i = D(p_i)\, K^{-1}\, \tilde{p}_i$, where $K$ is the camera intrinsic matrix, $\tilde{p}_i$ the homogeneous pixel coordinate, and $D(p_i)$ the prior depth at keypoint $p_i$.
- Perspective-n-Point (PnP) methods, often robustified with RANSAC, are employed for initial pose estimates, which are then refined via photometric loss minimization between the actual frame and its rendered proxy.
- Depth scale correction aligns rendered depth with the priors. Newly visible regions are detected using occlusion masks and incorporated into the anchor scheme.
This two-pronged approach—correspondence-based initialization and photometric refinement—yields robust pose estimation even under irregular camera motion, mitigating pose drift and initialization inaccuracies (Lin et al., 19 Aug 2025).
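The initialization step maps directly onto OpenCV's PnP + RANSAC solver; a minimal sketch follows (the 2-pixel inlier threshold is an assumed value, and photometric refinement would run on the returned pose):

```python
import cv2
import numpy as np

def init_pose_pnp(pts3d, pts2d, K):
    """Correspondence-based pose initialization via PnP + RANSAC.
    pts3d: (N, 3) keypoints back-projected with the depth prior;
    pts2d: (N, 2) matched keypoints in the new frame; K: (3, 3) intrinsics."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64),
        K.astype(np.float64), None,       # no lens distortion assumed
        reprojectionError=2.0,            # RANSAC inlier threshold (pixels)
        flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP failed: too few inlier correspondences")
    R, _ = cv2.Rodrigues(rvec)            # axis-angle -> rotation matrix
    return R, tvec, inliers               # photometric refinement follows
```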
4. Adaptive Octree Anchor Formation and Memory Efficiency
LongSplat employs an adaptive octree anchor formation to seed Gaussians in large-scale scenes:
- An initial dense point cloud is voxelized; high-density voxels are recursively split if their density exceeds a threshold ($\rho(v) > \tau_{\text{split}}$), halving voxel resolution up to a set depth.
- Low-density voxels ($\rho(v) < \tau_{\text{prune}}$) are discarded.
- Surviving voxels become anchors for Gaussians, with scale proportional to voxel size and position determined by offsets from the voxel center.
This spatial adaptation allows for high geometric fidelity in dense regions and reduced memory footprint in sparse areas, enabling scalable modeling without significant resource demands (Lin et al., 19 Aug 2025).
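A hedged sketch of the recursive split, using point counts as a stand-in for the paper's density measure (thresholds and maximum depth are illustrative):

```python
import numpy as np

def build_anchors(points, center, size, depth, tau_split=64, tau_prune=4):
    """Recursive octree anchor formation: split dense voxels, discard sparse
    ones, keep the rest as anchors with scale proportional to voxel size."""
    if len(points) < tau_prune:
        return []                                  # low-density voxel: discard
    if len(points) > tau_split and depth > 0:      # high density: split 8 ways
        anchors, half = [], size / 2.0
        for dx in (-1, 1):
            for dy in (-1, 1):
                for dz in (-1, 1):
                    c = center + 0.25 * size * np.array([dx, dy, dz])
                    inside = np.all(np.abs(points - c) <= half / 2.0, axis=1)
                    anchors += build_anchors(points[inside], c, half, depth - 1,
                                             tau_split, tau_prune)
        return anchors
    return [(center, size)]  # surviving voxel -> anchor; Gaussian scale ~ size

# Example: anchors for a unit cube of random points, up to 3 split levels.
pts = np.random.rand(10000, 3)
anchors = build_anchors(pts, center=np.full(3, 0.5), size=1.0, depth=3)
```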
5. Efficiency, Quality, and Comparative Evaluation
LongSplat achieves a favorable trade-off between efficiency and quality via incremental updates, redundancy pruning, and GIR-enabled fusion. Quantitative results include:
- On DL3DV benchmarks, LongSplat reaches a PSNR of 22.68 dB (12 views) and 23.71 dB (50 views) without compression, surpassing methods such as DepthSplat.
- In compression mode (LongSplat-c), Gaussian counts are reduced by approximately 44% with only minor declines in PSNR (21.34 dB at high view counts), maintaining superior quality over baseline methods on long sequences.
- Qualitative outputs show fewer artifacts (e.g., "floating" Gaussians, blurred surfaces) and sharper textures (Huang et al., 22 Jul 2025).
In unposed scenarios, LongSplat leads in PSNR, SSIM, and LPIPS for rendering quality, and in ATE and RPE for pose accuracy. Experiments on the Tanks and Temples, Free, and Hike datasets confirm improved sharpness, trajectory consistency, and faster training and rendering FPS compared to scaffolded or SfM-based solutions (Lin et al., 19 Aug 2025).
6. Applications and Implications
LongSplat’s integration of online updates, joint pose/geometry optimization, and memory-efficient anchor formation opens applications in:
- Robotics, SLAM, Embodied AI: Real-time environmental modeling and continual updating are crucial for autonomous navigation and scene understanding.
- AR/VR: Low-latency, high-fidelity 3D reconstruction enables immersive and interactive experiences.
- Photorealistic Novel View Synthesis: Rapid, artifact-free scene rendering for film, gaming, or VFX workloads.
- Large-Scale Mapping: Supports long, dense trajectories without memory bottlenecks, foundational for urban mapping and environment reconstruction.
A plausible implication is that such frameworks may eventually supplant traditional SfM and fixed-grid approaches in real-time, unconstrained scene modeling, especially under challenging acquisition conditions.
7. Mathematical Foundations
LongSplat’s essential mathematical tools include:
- 3D Gaussian representation:

$$G(x) = \exp\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right),$$

with $\mu$ as the center and $\Sigma$ the covariance (shape/orientation).
- Rendering via alpha-blending:

$$C = \sum_{i=1}^{N} c_i\, \alpha_i \prod_{j=1}^{i-1} (1-\alpha_j).$$

- Optimization objectives:
  - Photometric: $\mathcal{L}_{\text{photo}} = (1-\lambda)\,\mathcal{L}_1(\hat{I}, I) + \lambda\,\mathcal{L}_{\text{D-SSIM}}(\hat{I}, I)$
  - Depth alignment: $\mathcal{L}_{\text{depth}} = \left\| \hat{D} - (s\,D_{\text{prior}} + t) \right\|_1$, with scale $s$ and shift $t$ fitted per frame
  - Reprojection: $\mathcal{L}_{\text{reproj}} = \sum_{(p,\,p')} \left\| \pi\!\left(T\, X_p\right) - p' \right\|$, penalizing projected 3D correspondences against their matched keypoints
- Octree splitting criterion: split voxel $v$ if $\rho(v) > \tau_{\text{split}}$ (up to a maximum depth); discard if $\rho(v) < \tau_{\text{prune}}$.
This mathematical foundation underpins LongSplat’s scalability, robustness, and rendering fidelity.
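As a quick numeric check of the alpha-blending formula, consider a single ray with three depth-sorted Gaussians:

```python
import numpy as np

# Worked example of front-to-back alpha-blending along one ray with
# three depth-sorted Gaussians (colors c_i, opacities alpha_i).
colors = np.array([[1.0, 0.0, 0.0],   # red, nearest
                   [0.0, 1.0, 0.0],   # green
                   [0.0, 0.0, 1.0]])  # blue, farthest
alphas = np.array([0.6, 0.5, 0.9])

T = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])  # transmittance
w = alphas * T                       # blending weights: [0.6, 0.2, 0.18]
C = (w[:, None] * colors).sum(axis=0)
print(C)                             # blended color: [0.6, 0.2, 0.18]
```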
By advancing streaming update mechanisms, introducing structured representations for efficient fusion and compression, and coupling pose recovery with adaptive anchor formation, LongSplat establishes a new technical standard for online novel view synthesis in both posed and unposed long video settings.