
Self-Supervised Street Gaussian

Updated 26 February 2026
  • Self-Supervised Street Gaussian ($S^3$Gaussian) is a framework for photorealistic 3D street scene reconstruction using self-supervised, annotation-free modeling with explicit 3D Gaussian primitives.
  • It leverages spatio-temporal deformation networks and HexPlane encoders to effectively separate static geometry from dynamic object motions in urban environments.
  • Real-time rendering and robust mapping are achieved through efficient 3D splatting techniques, providing closed-loop simulation capabilities for autonomous driving applications.

Self-Supervised Street Gaussian ($S^3$Gaussian) approaches constitute a family of methods for photorealistic 3D scene reconstruction in large-scale urban environments, emphasizing the explicit, real-time, and annotation-free modeling of both static and dynamic elements. These methods harness 3D Gaussian Splatting (3DGS) as their core primitive, extending it to handle the spatiotemporal complexity of autonomous driving scenes via self-supervised training on monocular images, raw LiDAR, and other sensor signals, without depending on 3D bounding boxes or segmentation labels. Notable instantiations include the original $S^3$Gaussian formulation (Huang et al., 2024), 3D Gaussian Mapping for environment-object decomposition from multitraverse RGB (Li et al., 2024), and advanced variants with dynamic-instance or motion-curve modeling (Su et al., 10 Nov 2025, Peng et al., 2024, Xu et al., 16 Jul 2025). These techniques have become integral to closed-loop simulator development, enabling state-of-the-art free-viewpoint rendering and mapping in annotation-scarce urban driving contexts.

1. Problem Statement and Motivation

$S^3$Gaussian methods address the need for scalable, photorealistic 3D reconstruction of street scenes, particularly to support real-time closed-loop autonomous driving simulation. Conventional techniques based on Neural Radiance Fields (NeRF) yield high visual fidelity, but are typically hampered by prohibitive training and rendering latency, as well as implicit scene representations that complicate editing and instance management. 3D Gaussian Splatting enables faster rendering through explicit, anisotropic Gaussians and direct splatting-based rasterization. However, prior 3DGS approaches for dynamic scenes have often required expensive supervision via 3D object bounding boxes or external object tracklets to disentangle static and dynamic components, constraining their utility on in-the-wild data with unlabeled or ambiguous object motion. The $S^3$Gaussian paradigm overcomes these limitations by leveraging only unlabeled multi-view imagery, temporal metadata, and (optionally) LiDAR priors—eschewing any manual track annotation or mask input for decomposing static structure and dynamic actors (Huang et al., 2024).

2. 4D Gaussian Primitive and Representation

At the core of $S^3$Gaussian is the representation of a street scene as a set $\mathcal{G}$ of 3D Gaussian primitives, each parameterized by:

  • Mean (position) $\mathcal{X}_i \in \mathbb{R}^3$
  • Covariance $\Sigma_i = R_i S_i S_i^\top R_i^\top \in \mathbb{R}^{3 \times 3}$, with $R_i \in SO(3)$ (rotation) and $S_i$ a diagonal scale matrix
  • Opacity $\alpha_i \in \mathbb{R}$
  • SH color coefficients $C_i \in \mathbb{R}^{3(k+1)^2}$ (spherical harmonics up to degree $k$ for view-dependent color)

$S^3$Gaussian generalizes 3DGS by modeling temporal dynamics. Each Gaussian's parameters are deformed over time $t$ by a learned spatio-temporal field, yielding $\mathcal{G}'(t)$ (the deformed Gaussian set at time $t$). This deformation includes translation offsets $\Delta\mathcal{X}$, SH color deltas $\Delta\text{SH}$, and latent semantic codes for static/dynamic separation (Huang et al., 2024, Su et al., 10 Nov 2025, Peng et al., 2024). Advanced formulations introduce further instance-level identifiers and individual lifecycle variables (e.g., velocity vectors, appearance codes) to enable dynamic-instance tracking and object-level editing (Su et al., 10 Nov 2025).
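The parameterization above can be sketched as a small container with a covariance assembly and a deformation step. This is an illustrative skeleton, not the authors' code; the names `GaussianSet` and `deform` are assumptions:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GaussianSet:
    """Canonical (time-independent) Gaussian parameters."""
    means: np.ndarray      # (N, 3) positions X_i
    rotations: np.ndarray  # (N, 3, 3) rotation matrices R_i
    scales: np.ndarray     # (N, 3) diagonals of the scale matrices S_i
    opacities: np.ndarray  # (N,) opacities alpha_i
    sh_coeffs: np.ndarray  # (N, 3 * (k+1)**2) SH color coefficients

    def covariances(self) -> np.ndarray:
        """Sigma_i = R_i S_i S_i^T R_i^T for every Gaussian."""
        RS = self.rotations * self.scales[:, None, :]  # R_i @ diag(S_i)
        return RS @ RS.transpose(0, 2, 1)

def deform(g: GaussianSet, dx: np.ndarray, dsh: np.ndarray) -> GaussianSet:
    """Apply a time-t deformation-field output (position and SH deltas),
    producing the deformed set G'(t)."""
    return GaussianSet(g.means + dx, g.rotations, g.scales,
                       g.opacities, g.sh_coeffs + dsh)
```

In the full method the deltas `dx` and `dsh` come from the learned spatio-temporal field evaluated at each Gaussian's canonical position and the query time.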

Rendering proceeds by projecting each Gaussian to the image plane. The projected covariance is

$$\Sigma_i' = J W \Sigma_i W^\top J^\top$$

where $W$ is the world-to-camera transformation and $J$ is the Jacobian of the perspective projection. Splatting and alpha compositing along each camera ray yield the per-pixel color as

$$C = \sum_{i=1}^N c_i \alpha_i \prod_{j < i} (1 - \alpha_j)$$

with colors $c_i$ and opacities $\alpha_i$ as defined above.
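The compositing formula corresponds to standard front-to-back alpha blending over depth-sorted Gaussians. A minimal numpy sketch for a single ray (illustrative; the real renderer rasterizes all pixels with a tiled CUDA kernel):

```python
import numpy as np

def composite_ray(colors: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """Front-to-back alpha compositing over depth-sorted Gaussians:
    C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j).
    colors: (N, 3) per-Gaussian colors, alphas: (N,) opacities in [0, 1]."""
    # Transmittance T_i = prod_{j<i} (1 - alpha_j), with T_0 = 1.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = alphas * transmittance           # per-Gaussian blending weight
    return (weights[:, None] * colors).sum(axis=0)
```

An opaque front Gaussian (alpha = 1) fully occludes everything behind it, since the transmittance of all later Gaussians drops to zero.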

3. Spatio-Temporal Modeling and Decomposition Mechanisms

$S^3$Gaussian employs a spatio-temporal deformation network to capture scene dynamics in self-supervised fashion. A canonical approach is the HexPlane structure encoder, a multi-resolution tiling of three spatial planes $(xy, xz, yz)$ and three spatio-temporal planes $(xt, yt, zt)$ at various resolutions. For a given 4D coordinate $(x, y, z, t)$, the encoder projects onto all six planes via bilinear interpolation, fuses the results with a small MLP, and decodes per-Gaussian offsets and semantic features (Huang et al., 2024). The spatial planes primarily encode static geometry, while the space-time planes highlight dynamic changes, thus facilitating learned decomposition of moving and stationary regions.
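The HexPlane lookup can be sketched as follows. This is a simplified single-resolution version under stated assumptions: coordinates are pre-normalized to grid units, and the six per-plane features are concatenated here, whereas the full model fuses them (and a multi-resolution stack) through a small MLP:

```python
import numpy as np

def bilinear(grid: np.ndarray, u: float, v: float) -> np.ndarray:
    """Bilinearly interpolate an (R, R, F) feature plane at continuous (u, v)."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1 = min(u0 + 1, grid.shape[0] - 1)
    v1 = min(v0 + 1, grid.shape[1] - 1)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * grid[u0, v0] + du * (1 - dv) * grid[u1, v0]
            + (1 - du) * dv * grid[u0, v1] + du * dv * grid[u1, v1])

def hexplane_query(planes: dict, x: float, y: float, z: float, t: float) -> np.ndarray:
    """Gather features from the six HexPlane grids at a 4D point.
    `planes` maps axis pairs ('xy', ..., 'zt') to (R, R, F) arrays."""
    coords = {'xy': (x, y), 'xz': (x, z), 'yz': (y, z),
              'xt': (x, t), 'yt': (y, t), 'zt': (z, t)}
    feats = [bilinear(planes[k], *coords[k]) for k in coords]
    return np.concatenate(feats)  # fused by a small MLP in the full model
```

The factorization is what keeps the encoder compact: six $R^2$ planes replace a dense $R^4$ spatio-temporal grid.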

Self-supervised dynamic decomposition proceeds by leveraging 4D consistency and multi-term objectives:

  • Static-only warmup: Optimize for pure static geometry to initialize the representation.
  • Dynamic induction: Enable deformation parameters and semantic codes, training on sequences (typically 50-frame clips) and using deformation regularization to avoid spurious dynamics.
  • Temporal field continuity: Sequentially initialize each clip's spatio-temporal field with the parameters learned on the preceding clip, maintaining scene consistency over long sequences (Huang et al., 2024, Su et al., 10 Nov 2025).
  • Instance-level discovery: More recent variants detect dynamic objects by measuring appearance–position inconsistencies under temporal warping, as introduced in DIAL-GS (Su et al., 10 Nov 2025), and then instantiate dynamic-identity and consistency losses to reinforce instance-aware Gaussians.
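The clip-based continuity scheme above can be sketched as a simple schedule generator. The function `clip_schedule` is illustrative (not from the papers) and only expresses which frame ranges are trained and whether each clip's field is warm-started from its predecessor:

```python
def clip_schedule(n_frames: int, clip_len: int = 50) -> list:
    """Return (start, end, init_from_previous) tuples for clip-based training:
    every clip after the first warm-starts its spatio-temporal field from the
    parameters learned on the preceding clip."""
    clips = []
    for start in range(0, n_frames, clip_len):
        end = min(start + clip_len, n_frames)
        clips.append((start, end, start > 0))
    return clips
```

For a 120-frame trajectory with 50-frame clips this yields three clips, the last two initialized from their predecessors.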

4. Self-Supervised Loss Functions

The self-supervised training framework optimizes multi-term objectives that jointly enforce photometric fidelity, depth accuracy, and spatio-temporal consistency, with deformation regularization to minimize unnecessary motion:

$$\mathcal{L} = \lambda_{rgb} L_{rgb} + \lambda_{ssim} L_{ssim} + \lambda_{depth} L_{depth} + \lambda_{feat} L_{feat} + \lambda_{tv} L_{tv} + \lambda_{reg}^x L_{reg}^x + \lambda_{reg}^c L_{reg}^c$$

where representative terms include:

  • $L_{rgb}$: $\ell_1$ color reconstruction loss
  • $L_{ssim}$: Structural Similarity Index penalty
  • $L_{depth}$: depth supervision from LiDAR when available
  • $L_{feat}$: latent semantic feature or per-Gaussian semantic-code consistency loss
  • $L_{tv}$: total variation on HexPlane parameter grids
  • $L_{reg}^x$, $L_{reg}^c$: pointwise regularization on Gaussian translation and color deformation to enforce parsimony

Loss terms are selected and weighted to balance accurate reconstruction, temporal coherence, and minimization of deformation in static regions. In advanced $S^3$Gaussian methods, additional objectives can include sky-mask regularization, velocity and lifespan priors, instance-level ID clustering (KL divergence), and dynamic-consistency penalties (matching velocity magnitude/direction within instances) (Su et al., 10 Nov 2025, Peng et al., 2024, Xu et al., 16 Jul 2025).
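A stripped-down version of the composite objective, assuming per-pixel renders and LiDAR depth are available; the SSIM, feature, and TV terms are omitted for brevity, and the weight keys are illustrative:

```python
import numpy as np

def total_loss(render: np.ndarray, target: np.ndarray,
               depth_pred: np.ndarray, depth_lidar: np.ndarray,
               dx: np.ndarray, dsh: np.ndarray, weights: dict) -> float:
    """Weighted sum of the main self-supervised objectives:
    L1 photometric + LiDAR depth + deformation-parsimony regularizers."""
    l_rgb = np.abs(render - target).mean()              # L_rgb (ell_1)
    l_depth = np.abs(depth_pred - depth_lidar).mean()   # L_depth
    l_reg_x = np.linalg.norm(dx, axis=-1).mean()        # L_reg^x on translations
    l_reg_c = np.abs(dsh).mean()                        # L_reg^c on SH deltas
    return (weights['rgb'] * l_rgb + weights['depth'] * l_depth
            + weights['reg_x'] * l_reg_x + weights['reg_c'] * l_reg_c)
```

The regularizers are what push the deformation field toward zero in static regions, so that motion is only "spent" where the photometric terms demand it.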

5. Training Protocols and Data Pipeline

$S^3$Gaussian methods typically operate as follows (Huang et al., 2024):

  • Initialization: Voxel-downsample raw LiDAR point clouds; seed one Gaussian per LiDAR point (mean, small isotropic scale, random color/SH). COLMAP or SfM-based point cloud initialization is also common (Li et al., 2024).
  • Static Geometry Warmup: Optimize static 3DGS for a number of iterations (e.g., 5k-30k) to stabilize scene geometry before enabling dynamic deformation.
  • Clip-Based Temporal Training: Process long trajectories as sequence clips (e.g., 50-frame batches), training in a staged manner with the spatio-temporal field reused across successive clips, preserving continuity.
  • Network Architecture: HexPlane encoder resolution typically starts at $r = 64$, with multi-scale upsampling as in Instant-NGP.
  • Optimization: Adam optimizer, initial learning rates $\sim 1.6 \times 10^{-3}$, decayed as needed. Batch size and ray sampling schemes follow standard 3DGS defaults.
  • Self-supervision: No bounding boxes, masks, or external object annotations are required at any stage.
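The LiDAR seeding step in the pipeline above relies on voxel downsampling. A minimal sketch (the function name and centroid-per-voxel choice are assumptions; implementations may instead keep the first point per voxel):

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel: float) -> np.ndarray:
    """Collapse an (N, 3) point cloud to one centroid per occupied voxel.
    Each surviving point then seeds one Gaussian (mean = point, small
    isotropic scale, random color/SH)."""
    keys = np.floor(points / voxel).astype(np.int64)   # integer voxel indices
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    n_voxels = inverse.max() + 1
    sums = np.zeros((n_voxels, 3))
    np.add.at(sums, inverse, points)                   # accumulate per voxel
    counts = np.bincount(inverse, minlength=n_voxels).astype(float)
    return sums / counts[:, None]                      # per-voxel centroids
```

The voxel size trades off initial Gaussian density against memory: coarser voxels mean fewer seeds, with densification during optimization filling in detail.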

In 3D Gaussian Mapping for multitraverse mapping (Li et al., 2024), repeated drives enable feature-residual mining to automatically generate per-pixel ephemerality masks, which drive decomposition of permanent environment Gaussians versus ephemeral (dynamic) objects.

6. Empirical Results and Comparative Performance

$S^3$Gaussian approaches have been extensively benchmarked on automotive datasets including Waymo Open, Mapverse-Ithaca365, Mapverse-nuPlan, and others, with the following representative results (Huang et al., 2024, Li et al., 2024, Su et al., 10 Nov 2025, Peng et al., 2024):

Waymo-Open (Dynamic32 subset, scene reconstruction):

Method          PSNR    SSIM    PSNR* (dynamic)
3DGS            28.47   0.876   —
EmerNeRF        28.16   —       24.32
$S^3$Gaussian   31.35   0.911   26.02

Mapverse (ephemerality segmentation):

Method         Mean IoU (%)
EmerSeg        45.1
STEGO/CAUSE    19–24

Novel-View Synthesis (SSIM, PSNR):

  • EnvGS with self-supervised mask matches supervised baselines (SSIM $\sim 0.806$, PSNR $\sim 22.78$ on Mapverse (Li et al., 2024)).
  • Rendering speed is real-time ($\sim 30$ ms/frame), compared to hundreds of ms for NeRF.

Qualitative benefits: $S^3$Gaussian suppresses ghosting artifacts on dynamic objects, yields sharper backgrounds and skies, and supports faithful view synthesis even in the presence of complex street dynamics.

Advanced Methods:

  • DIAL-GS achieves 36.88 PSNR, 0.948 SSIM, 0.113 LPIPS on Waymo Open (Su et al., 10 Nov 2025), outperforming DeSiRe-GS, PVG, and baselines; enables instance-level editing under full self-supervision.

7. Strengths, Limitations, and Extensions

Strengths

  • Fully self-supervised static/dynamic decomposition, requiring no 3D bounding boxes or semantic masks.
  • Real-time rendering enabled by explicit 3D splatting and efficient HexPlane latent grids.
  • Superior photorealism and geometric accuracy on large-scale urban driving datasets.
  • Robustness to out-of-distribution dynamics via spatio-temporal regularization and ephemerality mining from repeated traversals or temporal consistency.

Limitations

  • Modeling of high-velocity or extremely non-rigid dynamics (e.g., articulated pedestrians) is limited; the deformation fields in $S^3$Gaussian assume predominantly rigid motion and may underfit sparse or highly variable observations.
  • Semantic supervision is limited to latent codes; explicit class labels are not used except in some foundation-model-aligned or instance-aware variants.
  • Current pipelines operate in sliding temporal windows (e.g., 50-frame clips), with full long-horizon temporal regularization remaining an open challenge.

Potential Extensions

  • Incorporation of unsupervised optical flow or more advanced spatio-temporal segmentation cues to handle non-rigid, overtly articulated, or rare dynamics (Huang et al., 2024, Xu et al., 16 Jul 2025).
  • Introduction of hierarchical or instance-aware grouping for object-level manipulation and simulation (Su et al., 10 Nov 2025).
  • Integration of differentiable physics or kinematic priors for even more plausible trajectory modeling.
  • Extension to foundation model alignment and open-vocabulary semantic occupancy prediction, as demonstrated in GaussTR (Jiang et al., 2024).

These approaches collectively position $S^3$Gaussian and its successors as a unifying framework for fully self-supervised, efficient, and high-fidelity 3D street scene reconstruction, enabling scalable data-driven simulation, robust mapping, and open-world perception in autonomous driving and related robotics applications (Huang et al., 2024, Li et al., 2024, Su et al., 10 Nov 2025, Peng et al., 2024, Xu et al., 16 Jul 2025).
