UniSplat: Unified 3D Scene Reconstruction

Updated 31 May 2026

UniSplat is a unified framework for 3D scene reconstruction using Gaussian splatting that delivers high-quality novel view synthesis for both dynamic driving scenes and unposed multi-view images.
It employs innovative techniques including 3D latent scaffolds, robust spatio-temporal fusion, and dual-branch decoding to generate consistent and coherent scene renderings.
The framework integrates semantic enrichment, dynamic-aware decoding, and static memory mechanisms to ensure spatial and temporal consistency across diverse sensor inputs.

UniSplat is a unified, feed-forward 3D scene reconstruction and representation learning framework centered on 3D Gaussian splatting, designed to handle both dynamic driving scenes with known camera poses and unposed, multi-view scenarios. It achieves state-of-the-art performance in novel view synthesis and delivers spatially and temporally consistent scene completions through innovations in 3D latent scaffolds, robust spatio-temporal fusion, dynamic-aware decoding, and multimodal consistency objectives (Shi et al., 6 Nov 2025, Zhou et al., 12 Apr 2026).

1. Architectural Overview

UniSplat is instantiated in two principal settings: (1) feed-forward 3D reconstruction for autonomous driving from synchronized multi-camera streams with pose information (Shi et al., 6 Nov 2025), and (2) generalizable 3D spatial intelligence from unposed multi-view images (Zhou et al., 12 Apr 2026). In both, the framework produces a 3D Gaussian field that can be rendered into novel views, with explicit outputs for appearance, semantics, per-view geometry (point clouds), and camera parameters as required. The core workflow integrates a vision encoder, progressive fusion, and a multi-branch decoder—augmented by a persistent memory for dynamic/static disentanglement or a pose-conditioned recalibration for cross-modal consistency.

2. 3D Latent Scaffold and Representation Modalities

For driving scene reconstruction (Shi et al., 6 Nov 2025), UniSplat constructs a 3D latent scaffold $\mathbf{S}_t$ by aggregating:

Dense Multi-View Geometry: At each time $t$, $N_{\mathrm{cam}}$ synchronized images $\{I_t^k\}_{k=1}^{N_{\mathrm{cam}}}$ are processed by a pretrained multi-view geometry network (e.g., $\pi^3$) that predicts a dense 3D point map, one point per pixel. Metric scale is recovered by an MLP supervised against LiDAR-derived scales.
Semantic Feature Enrichment: A frozen backbone (e.g., DINOv2) extracts per-pixel semantic features, which are concatenated with geometry features to build $\mathbf{F}_t$.
Voxelization: The resulting point cloud $\mathbf{P}_t$ is voxelized into a sparse, ego-centric grid spanning $[-72,72]^2\times[-4,12]$ m with voxel size $(0.1,0.1,0.2)$ m. Each voxel $i$ carries a feature vector encoding coarse (pooled) geometry and view-projected semantics.
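
The voxelization step can be illustrated with a minimal sketch in plain Python. It uses the grid range and voxel size quoted above; the function names `voxel_index` and `voxelize` are hypothetical, and mean pooling is an assumption standing in for whatever pooling the authors actually use.

```python
from collections import defaultdict
from math import floor

# Grid bounds and voxel size quoted in the text (metres, ego-centric frame).
BOUNDS_MIN = (-72.0, -72.0, -4.0)
BOUNDS_MAX = (72.0, 72.0, 12.0)
VOXEL_SIZE = (0.1, 0.1, 0.2)


def voxel_index(point):
    """Return the integer (i, j, k) voxel index of a point, or None if outside the grid."""
    idx = []
    for x, lo, hi, size in zip(point, BOUNDS_MIN, BOUNDS_MAX, VOXEL_SIZE):
        if not (lo <= x < hi):
            return None
        idx.append(int(floor((x - lo) / size)))
    return tuple(idx)


def voxelize(points, features):
    """Mean-pool per-point features into a sparse dict keyed by voxel index.

    Mean pooling is an assumption made for this sketch.
    """
    sums = {}
    counts = defaultdict(int)
    for p, f in zip(points, features):
        key = voxel_index(p)
        if key is None:
            continue  # drop points outside the ego-centric range
        if key not in sums:
            sums[key] = [0.0] * len(f)
        sums[key] = [s + v for s, v in zip(sums[key], f)]
        counts[key] += 1
    return {k: [s / counts[k] for s in v] for k, v in sums.items()}
```

Only occupied voxels appear as keys, which is what keeps the representation sparse: memory grows with the number of observed surface points rather than with the full grid volume.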

In the unposed setting (Zhou et al., 12 Apr 2026), an analogous semantic-rich scaffold emerges through a feed-forward dual-masked vision transformer: random masking in the encoder and geometry-focused masking in the decoder enforce strong geometric and semantic priors in the latent tokens.

3. Unified Spatio-Temporal Fusion and Architectural Efficiency

In scenarios with known pose (e.g., autonomous vehicles), spatio-temporal fusion operates directly on the 3D latent scaffold:

Spatial Fusion: Sparse 3D U-Nets refine the scaffold’s features using 3D convolutions, leveraging the voxel-grid structure to naturally align multi-view evidence without costly image-domain attention.
Temporal Fusion: The previous timestep’s scaffold $\mathbf{S}_{t-1}$ is warped into the current frame using the vehicle’s ego-pose transform, then combined with the current scaffold via voxel-wise addition or concatenation, and further refined by sparse convolutions. This 3D-centric fusion scales with the number of occupied voxels and maintains alignment across time, even with sparse or disjoint camera coverage.
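
As a rough illustration of the temporal fusion step, the sketch below warps a sparse voxel dictionary by a rigid transform and merges it with the current one by voxel-wise addition. Several simplifications are assumptions of this example, not details from the paper: the transform is planar (yaw plus translation), the voxel size is a coarse 1 m, colliding voxels are summed, and the sparse-convolution refinement is omitted. All function names are hypothetical.

```python
from math import cos, sin, floor

VOXEL = 1.0  # illustrative voxel edge in metres (hypothetical, coarser than the paper's grid)


def warp_point(p, yaw, translation):
    """Apply a planar rigid transform (yaw about z, then translation) to a 3D point."""
    x, y, z = p
    c, s = cos(yaw), sin(yaw)
    tx, ty, tz = translation
    return (c * x - s * y + tx, s * x + c * y + ty, z + tz)


def warp_scaffold(prev, yaw, translation):
    """Move each previous-frame voxel centre into the current frame and re-bin it."""
    warped = {}
    for (i, j, k), feat in prev.items():
        centre = ((i + 0.5) * VOXEL, (j + 0.5) * VOXEL, (k + 0.5) * VOXEL)
        q = warp_point(centre, yaw, translation)
        key = tuple(int(floor(coord / VOXEL)) for coord in q)
        if key in warped:
            # Several old voxels may land in one new voxel; sum their features.
            warped[key] = [a + b for a, b in zip(warped[key], feat)]
        else:
            warped[key] = list(feat)
    return warped


def fuse(current, warped_prev):
    """Voxel-wise addition over the union of occupied voxels."""
    fused = {k: list(v) for k, v in current.items()}
    for k, v in warped_prev.items():
        if k in fused:
            fused[k] = [a + b for a, b in zip(fused[k], v)]
        else:
            fused[k] = list(v)
    return fused
```

Because both scaffolds live in the same metric grid after warping, fusion reduces to a dictionary merge; no image-space correspondence search is needed.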

In the absence of pose, a pose-conditioned recalibration module supervises cross-modality agreement. Here, predicted 3D points and semantics are reprojected into the image plane using estimated camera parameters for geometric and semantic alignment losses (Zhou et al., 12 Apr 2026).
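
The geometric half of such a reprojection check can be sketched as follows. This is a minimal example under stated assumptions: a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, points already expressed in the camera frame (extrinsics applied), and an L1 pixel distance as the penalty. The function names and the choice of L1 are illustrative; the paper's actual loss formulation is not reproduced here.

```python
def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        return None  # behind the camera, no valid projection
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_l1(points_cam, pixels, fx, fy, cx, cy):
    """Mean L1 distance between projected points and the pixels they came from."""
    total, n = 0.0, 0
    for p, (u, v) in zip(points_cam, pixels):
        uv = project(p, fx, fy, cx, cy)
        if uv is None:
            continue  # skip points with no valid projection
        total += abs(uv[0] - u) + abs(uv[1] - v)
        n += 1
    return total / n if n else 0.0
```

If the predicted points and the estimated camera parameters agree, every point projects back onto its source pixel and the penalty is zero; any disagreement between the two predictions shows up as a nonzero residual.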

4. Dynamic-Aware Dual-Branch Decoding via Gaussian Splatting

UniSplat decodes the fused scaffold into a multi-scale set of 3D Gaussian primitives for scene rendering:

Point-Anchored Branch: For each reconstructed metric point, features are retrieved from the fused scaffold and concatenated with 2D-view features, then processed by an MLP to regress Gaussian parameters: center offset, opacity, covariance, color, and a dynamic score.
Voxel-Anchored Branch: For each occupied voxel, a small MLP predicts a set of Gaussians. The union of point-anchored and voxel-anchored Gaussians forms the full Gaussian set.

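Rendering is performed via differentiable alpha-blending:

$C(p) = \sum_{i \in \mathcal{N}(p)} c_i \, \alpha_i \prod_{j \in \mathcal{N}(p),\, j < i} (1 - \alpha_j)$

where $\mathcal{N}(p)$ is the depth-sorted set of Gaussians contributing to pixel $p$, indexed front to back, $c_i$ is the color of Gaussian $i$, and $\alpha_i$ is its opacity at that pixel (Shi et al., 6 Nov 2025).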
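
Front-to-back alpha blending at a single pixel can be written in a few lines. The sketch below assumes each contribution is already reduced to a `(depth, alpha, rgb)` triple, where `alpha` stands for the Gaussian's opacity evaluated at that pixel; the function name `composite` is hypothetical, and a real renderer would do this per tile on the GPU.

```python
def composite(gaussians):
    """Front-to-back alpha blending of (depth, alpha, rgb) contributions at one pixel.

    Returns the blended color and the remaining transmittance.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for depth, alpha, rgb in sorted(gaussians, key=lambda g: g[0]):
        weight = alpha * transmittance
        color = [c + weight * ch for c, ch in zip(color, rgb)]
        transmittance *= 1.0 - alpha
    return color, transmittance
```

The running transmittance is the product of $(1 - \alpha_j)$ over all nearer Gaussians, so each contribution is attenuated by everything in front of it. Every operation is differentiable in the colors and opacities, which is what allows rendering losses to train the decoder.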

In the unposed version, a coarse-to-fine Gaussian hierarchy is built: anchor Gaussians $\rightarrow$ semantic Gaussians $\rightarrow$ appearance Gaussians, each expanded by its own transformer head, with explicit rendering of semantic and appearance fields at each stage (Zhou et al., 12 Apr 2026).

5. Static Memory and Streaming Scene Completion

To address occlusions and sensor coverage gaps in dynamic scenes (Shi et al., 6 Nov 2025), UniSplat maintains a persistent memory of static Gaussians:

At time $t$, the memory from the previous frame is warped to the current frame, and Gaussians that fall inside the current camera FOVs are filtered out.
The completed representation is the union of the current frame's decoded Gaussians and the warped, filtered memory.

The memory is updated by accumulating previously unseen Gaussians with low dynamic scores (those below a threshold), suppressing artifacts from moving objects while enabling temporally coherent completions beyond instantaneous sensor coverage.

This mechanism enables robust streaming reconstruction, producing high-fidelity renderings for both observed and extrapolated viewpoints.
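
One streaming step of this memory mechanism might look like the sketch below. It is an interpretation of the prose description, not the authors' code: Gaussians are plain dicts with an `xyz` position and a `dyn` dynamic score, while `warp`, `in_fov`, and the threshold `tau` are hypothetical inputs supplied by the caller.

```python
def complete_and_update(current, memory, warp, in_fov, tau):
    """Return (completed Gaussians, updated memory) for one streaming step.

    current: list of dicts with 'xyz' and 'dyn' (dynamic score) for this frame
    memory:  list of dicts from earlier frames, in the previous ego frame
    warp:    function mapping a previous-frame xyz to the current frame
    in_fov:  function returning True if an xyz is visible to a current camera
    tau:     dynamic-score threshold (hypothetical, chosen by the caller)
    """
    # Bring stored Gaussians into the current ego frame.
    warped = [dict(g, xyz=warp(g["xyz"])) for g in memory]
    # Keep only memory outside current coverage; observed regions use fresh Gaussians.
    kept = [g for g in warped if not in_fov(g["xyz"])]
    completed = current + kept
    # Store only likely-static Gaussians so moving objects do not leave trails.
    static_now = [g for g in current if g["dyn"] < tau]
    return completed, kept + static_now
```

Filtering the warped memory by field of view means current observations always take precedence where they exist, while the dynamic-score threshold keeps moving objects out of the stored set.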

6. Training Objectives and Loss Design

Both settings employ composite objectives to enforce geometric fidelity, appearance realism, semantic consistency, and temporal stability.

Driving Scene Setting (Shi et al., 6 Nov 2025): training minimizes a single composite loss.

Unposed Setting (Zhou et al., 12 Apr 2026): the total loss combines photometric, perceptual, semantic distillation, geometric prior, and pose-conditioned calibration sub-losses. The recalibration losses specifically enforce alignment between rendered and reprojected RGB and semantic features.

7. Performance Benchmarks and Ablation Analyses

Empirical evaluations demonstrate that UniSplat achieves leading results on challenging real-world datasets:

Dataset (Method)	Input-View PSNR / SSIM / LPIPS	Novel-View PSNR / SSIM / LPIPS	Source
Waymo (UniSplat)	28.56 / 0.83 / 0.20	25.12 / 0.74 / 0.27	(Shi et al., 6 Nov 2025)
Waymo (MVSplat)	24.94 / 0.80 / 0.23	22.04 / 0.68 / 0.34	(Shi et al., 6 Nov 2025)
Waymo (DepthSplat)	25.38 / 0.76 / 0.26	23.86 / 0.70 / 0.31	(Shi et al., 6 Nov 2025)
Waymo, LiDAR-aligned	29.58 / 0.86 / 0.17	25.98 / 0.76 / 0.24	(Shi et al., 6 Nov 2025)
nuScenes (UniSplat)	25.37 / 0.765 / 0.246	—	(Shi et al., 6 Nov 2025)
nuScenes (Omni-Scene)	24.27 / 0.736 / 0.237	—	(Shi et al., 6 Nov 2025)
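
For readers less familiar with the headline metric, PSNR follows directly from the mean squared error between a rendered and a reference image. The sketch below assumes flattened pixel lists with intensities in $[0, 1]$; the function name `psnr` and the `max_val` parameter are illustrative.

```python
from math import log10


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * log10(max_val ** 2 / mse)
```

Because PSNR is logarithmic in the error, the 3.08 dB novel-view gap between UniSplat (25.12 dB) and MVSplat (22.04 dB) in the table corresponds to roughly half the mean squared error, since $10^{-0.308} \approx 0.49$.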

Ablation studies highlight:

Adding semantic context to the geometry scaffold lowers LPIPS by 0.30 (lower is better).
3D spatial fusion raises PSNR by 0.36 dB, and temporal fusion raises it by 0.58 dB.
Adding the voxel-anchored branch to a point-anchored-only decoder yields +0.46 dB PSNR and −0.08 LPIPS (Shi et al., 6 Nov 2025).

8. Integration into Broader Research Landscape

UniSplat builds on advances in 3D Gaussian splatting, learned multi-view geometry, and joint semantic-appearance modeling. By operating in a unified 3D latent space, enforcing spatio-temporal consistency, and learning its fusion and decoding stages, it addresses persistent challenges in sparse-view and dynamic scene reconstruction. Its dual-masking, pose-conditioned calibration, and memory-augmented decoding mechanisms connect it to ongoing research in spatial intelligence, embodied AI, and neural scene representations (Shi et al., 6 Nov 2025, Zhou et al., 12 Apr 2026).

A plausible implication is that UniSplat’s design principles—particularly task-adaptive 3D latent fusion and geometric-semantic consistency enforcement—are generalizable across various scene understanding and manipulation tasks, including those in robotics and interactive spatial intelligence.

References:

(Shi et al., 6 Nov 2025) "UniSplat: Unified Spatio-Temporal Fusion via 3D Latent Scaffolds for Dynamic Driving Scene Reconstruction"
(Zhou et al., 12 Apr 2026) "Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images"
