
Self-Supervised Street Gaussian

Updated 26 February 2026
  • Self-Supervised Street Gaussian ($S^3$Gaussian) is a framework for photorealistic 3D street scene reconstruction using self-supervised, annotation-free modeling with explicit 3D Gaussian primitives.
  • It leverages spatio-temporal deformation networks and HexPlane encoders to effectively separate static geometry from dynamic object motions in urban environments.
  • Real-time rendering and robust mapping are achieved through efficient 3D splatting techniques, providing closed-loop simulation capabilities for autonomous driving applications.

Self-Supervised Street Gaussian ($S^3$Gaussian) approaches constitute a family of methods for photorealistic 3D scene reconstruction in large-scale urban environments, emphasizing the explicit, real-time, and annotation-free modeling of both static and dynamic elements. These methods harness 3D Gaussian Splatting (3DGS) as their core primitive, extending it to handle the spatiotemporal complexity of autonomous driving scenes via self-supervised training on monocular images, raw LiDAR, and other sensor signals, without depending on 3D bounding boxes or segmentation labels. Notable instantiations include the original $S^3$Gaussian formulation (Huang et al., 2024), 3D Gaussian Mapping for environment-object decomposition from multitraverse RGB (Li et al., 2024), and advanced variants with dynamic-instance or motion-curve modeling (Su et al., 10 Nov 2025, Peng et al., 2024, Xu et al., 16 Jul 2025). These techniques have become integral to closed-loop simulator development, enabling state-of-the-art free-viewpoint rendering and mapping in annotation-scarce urban driving contexts.

1. Problem Statement and Motivation

$S^3$Gaussian methods address the need for scalable, photorealistic 3D reconstruction of street scenes, particularly to support real-time closed-loop autonomous driving simulation. Conventional techniques based on Neural Radiance Fields (NeRF) yield high visual fidelity, but are typically hampered by prohibitive training and rendering latency, as well as implicit scene representations that complicate editing and instance management. 3D Gaussian Splatting enables faster rendering through explicit, anisotropic Gaussians and direct splatting-based rasterization. However, prior 3DGS approaches for dynamic scenes have often required expensive supervision via 3D object bounding boxes or external object tracklets to disentangle static and dynamic components, constraining their utility on in-the-wild data with unlabeled or ambiguous object motion. The $S^3$Gaussian paradigm overcomes these limitations by leveraging only unlabeled multi-view imagery, temporal metadata, and (optionally) LiDAR priors—eschewing any manual track annotation or mask input for decomposing static structure and dynamic actors (Huang et al., 2024).

2. 4D Gaussian Primitive and Representation

At the core of $S^3$Gaussian is the representation of a street scene as a set $\mathcal{G}$ of 3D Gaussian primitives, each parameterized by:

  • Mean (position) $\mathcal{X}_i \in \mathbb{R}^3$
  • Covariance $\Sigma_i = R_i S_i S_i^\top R_i^\top \in \mathbb{R}^{3 \times 3}$, with $R_i \in SO(3)$ (rotation) and $S_i$ a diagonal scale matrix
  • Opacity $\alpha_i \in \mathbb{R}$
  • SH color coefficients $C_i \in \mathbb{R}^{3(k+1)^2}$ (spherical harmonics up to degree $k$ for view-dependent color)

$S^3$Gaussian generalizes 3DGS by modeling temporal dynamics. Each Gaussian's parameters are deformed over time $t$ by a learned spatio-temporal field, yielding $\mathcal{G}'(t)$ (the deformed Gaussian set at time $t$). This deformation includes translation offsets $\Delta\mathcal{X}$, SH color deltas $\Delta\text{SH}$, and latent semantic codes for static/dynamic separation (Huang et al., 2024, Su et al., 10 Nov 2025, Peng et al., 2024). Advanced formulations introduce further instance-level identifiers and individual lifecycle variables (e.g., velocity vectors, appearance codes) to enable dynamic-instance tracking and object-level editing (Su et al., 10 Nov 2025).
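The parameterization above can be sketched as a small container with a covariance assembly and a deformation step. This is an illustrative skeleton, not the authors' code; the names `GaussianSet` and `deform` are assumptions:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GaussianSet:
    """Canonical (time-independent) Gaussian parameters."""
    means: np.ndarray      # (N, 3) positions X_i
    rotations: np.ndarray  # (N, 3, 3) rotation matrices R_i
    scales: np.ndarray     # (N, 3) diagonals of the scale matrices S_i
    opacities: np.ndarray  # (N,) opacities alpha_i
    sh_coeffs: np.ndarray  # (N, 3 * (k+1)**2) SH color coefficients

    def covariances(self) -> np.ndarray:
        """Sigma_i = R_i S_i S_i^T R_i^T for every Gaussian."""
        RS = self.rotations * self.scales[:, None, :]  # R_i @ diag(S_i)
        return RS @ RS.transpose(0, 2, 1)

def deform(g: GaussianSet, dx: np.ndarray, dsh: np.ndarray) -> GaussianSet:
    """Apply a time-t deformation-field output (position and SH deltas),
    producing the deformed set G'(t)."""
    return GaussianSet(g.means + dx, g.rotations, g.scales,
                       g.opacities, g.sh_coeffs + dsh)
```

In the full method the deltas `dx` and `dsh` come from the learned spatio-temporal field evaluated at each Gaussian's canonical position and the query time.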

Rendering proceeds by projecting each Gaussian to the image plane. The projected covariance is

$$\Sigma_i' = J W \Sigma_i W^\top J^\top$$

where $W$ is the world-to-camera transformation and $J$ is the Jacobian of the perspective projection. Splatting and alpha compositing along each camera ray yield the per-pixel color as

$$C = \sum_{i=1}^N c_i \alpha_i \prod_{j < i} (1 - \alpha_j)$$

with colors $c_i$ and opacities $\alpha_i$ as defined above.
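The compositing formula corresponds to standard front-to-back alpha blending over depth-sorted Gaussians. A minimal numpy sketch for a single ray (illustrative; the real renderer rasterizes all pixels with a tiled CUDA kernel):

```python
import numpy as np

def composite_ray(colors: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """Front-to-back alpha compositing over depth-sorted Gaussians:
    C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j).
    colors: (N, 3) per-Gaussian colors, alphas: (N,) opacities in [0, 1]."""
    # Transmittance T_i = prod_{j<i} (1 - alpha_j), with T_0 = 1.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = alphas * transmittance           # per-Gaussian blending weight
    return (weights[:, None] * colors).sum(axis=0)
```

An opaque front Gaussian (alpha = 1) fully occludes everything behind it, since the transmittance of all later Gaussians drops to zero.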

3. Spatio-Temporal Modeling and Decomposition Mechanisms

$S^3$Gaussian employs a spatio-temporal deformation network to capture scene dynamics in self-supervised fashion. A canonical approach is the HexPlane structure encoder, a multi-resolution tiling of three spatial planes $(xy, xz, yz)$ and three spatio-temporal planes $(xt, yt, zt)$ at various resolutions. For a given 4D coordinate $(x, y, z, t)$, the encoder projects onto all six planes via bilinear interpolation, fuses the results with a small MLP, and decodes per-Gaussian offsets and semantic features (Huang et al., 2024). The spatial planes primarily encode static geometry, while the space-time planes highlight dynamic changes, thus facilitating learned decomposition of moving and stationary regions.
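The HexPlane lookup can be sketched as follows. This is a simplified single-resolution version under stated assumptions: coordinates are pre-normalized to grid units, and the six per-plane features are concatenated here, whereas the full model fuses them (and a multi-resolution stack) through a small MLP:

```python
import numpy as np

def bilinear(grid: np.ndarray, u: float, v: float) -> np.ndarray:
    """Bilinearly interpolate an (R, R, F) feature plane at continuous (u, v)."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1 = min(u0 + 1, grid.shape[0] - 1)
    v1 = min(v0 + 1, grid.shape[1] - 1)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * grid[u0, v0] + du * (1 - dv) * grid[u1, v0]
            + (1 - du) * dv * grid[u0, v1] + du * dv * grid[u1, v1])

def hexplane_query(planes: dict, x: float, y: float, z: float, t: float) -> np.ndarray:
    """Gather features from the six HexPlane grids at a 4D point.
    `planes` maps axis pairs ('xy', ..., 'zt') to (R, R, F) arrays."""
    coords = {'xy': (x, y), 'xz': (x, z), 'yz': (y, z),
              'xt': (x, t), 'yt': (y, t), 'zt': (z, t)}
    feats = [bilinear(planes[k], *coords[k]) for k in coords]
    return np.concatenate(feats)  # fused by a small MLP in the full model
```

The factorization is what keeps the encoder compact: six $R^2$ planes replace a dense $R^4$ spatio-temporal grid.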

Self-supervised dynamic decomposition proceeds by leveraging 4D consistency and multi-term objectives:

  • Static-only warmup: Optimize for pure static geometry to initialize the representation.
  • Dynamic induction: Enable deformation parameters and semantic codes, training on sequences (typically 50-frame clips) and using deformation regularization to avoid spurious dynamics.
  • Temporal field continuity: Sequentially initialize each clip's spatio-temporal field with the parameters learned on the preceding clip, maintaining scene consistency over long sequences (Huang et al., 2024, Su et al., 10 Nov 2025).
  • Instance-level discovery: More recent variants detect dynamic objects by measuring appearance–position inconsistencies under temporal warping, as introduced in DIAL-GS (Su et al., 10 Nov 2025), and then instantiate dynamic-identity and consistency losses to reinforce instance-aware Gaussians.
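The clip-based continuity scheme above can be sketched as a simple schedule generator. The function `clip_schedule` is illustrative (not from the papers) and only expresses which frame ranges are trained and whether each clip's field is warm-started from its predecessor:

```python
def clip_schedule(n_frames: int, clip_len: int = 50) -> list:
    """Return (start, end, init_from_previous) tuples for clip-based training:
    every clip after the first warm-starts its spatio-temporal field from the
    parameters learned on the preceding clip."""
    clips = []
    for start in range(0, n_frames, clip_len):
        end = min(start + clip_len, n_frames)
        clips.append((start, end, start > 0))
    return clips
```

For a 120-frame trajectory with 50-frame clips this yields three clips, the last two initialized from their predecessors.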

4. Self-Supervised Loss Functions

The self-supervised training framework optimizes multi-term objectives that jointly enforce photometric fidelity, depth accuracy, and spatio-temporal consistency, with deformation regularization to minimize unnecessary motion:

$$\mathcal{L} = \lambda_{rgb} L_{rgb} + \lambda_{ssim} L_{ssim} + \lambda_{depth} L_{depth} + \lambda_{feat} L_{feat} + \lambda_{tv} L_{tv} + \lambda_{reg}^x L_{reg}^x + \lambda_{reg}^c L_{reg}^c$$

where representative terms include:

  • $L_{rgb}$: $\ell_1$ color reconstruction loss
  • $L_{ssim}$: Structural Similarity Index penalty
  • $L_{depth}$: depth supervision from LiDAR when available
  • $L_{feat}$: latent semantic feature or per-Gaussian semantic-code consistency loss
  • $L_{tv}$: total variation on HexPlane parameter grids
  • $L_{reg}^x$, $L_{reg}^c$: pointwise regularization on Gaussian translation and color deformation to enforce parsimony

Loss terms are selected and weighted to balance accurate reconstruction, temporal coherence, and minimization of deformation in static regions. In advanced $S^3$Gaussian methods, additional objectives can include sky-mask regularization, velocity and lifespan priors, instance-level ID clustering (KL divergence), and dynamic-consistency penalties (matching velocity magnitude/direction within instances) (Su et al., 10 Nov 2025, Peng et al., 2024, Xu et al., 16 Jul 2025).
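A stripped-down version of the composite objective, assuming per-pixel renders and LiDAR depth are available; the SSIM, feature, and TV terms are omitted for brevity, and the weight keys are illustrative:

```python
import numpy as np

def total_loss(render: np.ndarray, target: np.ndarray,
               depth_pred: np.ndarray, depth_lidar: np.ndarray,
               dx: np.ndarray, dsh: np.ndarray, weights: dict) -> float:
    """Weighted sum of the main self-supervised objectives:
    L1 photometric + LiDAR depth + deformation-parsimony regularizers."""
    l_rgb = np.abs(render - target).mean()              # L_rgb (ell_1)
    l_depth = np.abs(depth_pred - depth_lidar).mean()   # L_depth
    l_reg_x = np.linalg.norm(dx, axis=-1).mean()        # L_reg^x on translations
    l_reg_c = np.abs(dsh).mean()                        # L_reg^c on SH deltas
    return (weights['rgb'] * l_rgb + weights['depth'] * l_depth
            + weights['reg_x'] * l_reg_x + weights['reg_c'] * l_reg_c)
```

The regularizers are what push the deformation field toward zero in static regions, so that motion is only "spent" where the photometric terms demand it.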

5. Training Protocols and Data Pipeline

$S^3$Gaussian methods typically operate as follows (Huang et al., 2024):

  • Initialization: Voxel-downsample raw LiDAR point clouds; seed one Gaussian per LiDAR point (mean, small isotropic scale, random color/SH). COLMAP or SfM-based point cloud initialization is also common (Li et al., 2024).
  • Static Geometry Warmup: Optimize static 3DGS for a number of iterations (e.g., 5k-30k) to stabilize scene geometry before enabling dynamic deformation.
  • Clip-Based Temporal Training: Process long trajectories as sequence clips (e.g., 50-frame batches), training in a staged manner with the spatio-temporal field reused across successive clips, preserving continuity.
  • Network Architecture: HexPlane encoder resolution typically starts at $r = 64$, with multi-scale upsampling as in Instant-NGP.
  • Optimization: Adam optimizer, initial learning rates $\sim 1.6 \times 10^{-3}$, decayed as needed. Batch size and ray sampling schemes follow standard 3DGS defaults.
  • Self-supervision: No bounding boxes, masks, or external object annotations are required at any stage.
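The LiDAR seeding step in the pipeline above relies on voxel downsampling. A minimal sketch (the function name and centroid-per-voxel choice are assumptions; implementations may instead keep the first point per voxel):

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel: float) -> np.ndarray:
    """Collapse an (N, 3) point cloud to one centroid per occupied voxel.
    Each surviving point then seeds one Gaussian (mean = point, small
    isotropic scale, random color/SH)."""
    keys = np.floor(points / voxel).astype(np.int64)   # integer voxel indices
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    n_voxels = inverse.max() + 1
    sums = np.zeros((n_voxels, 3))
    np.add.at(sums, inverse, points)                   # accumulate per voxel
    counts = np.bincount(inverse, minlength=n_voxels).astype(float)
    return sums / counts[:, None]                      # per-voxel centroids
```

The voxel size trades off initial Gaussian density against memory: coarser voxels mean fewer seeds, with densification during optimization filling in detail.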

In 3D Gaussian Mapping for multitraverse mapping (Li et al., 2024), repeated drives enable feature-residual mining to automatically generate per-pixel ephemerality masks, which drive decomposition of permanent environment Gaussians versus ephemeral (dynamic) objects.

6. Empirical Results and Comparative Performance

$S^3$Gaussian approaches have been extensively benchmarked on automotive datasets including Waymo Open, Mapverse-Ithaca365, Mapverse-nuPlan, and others, with the following representative results (Huang et al., 2024, Li et al., 2024, Su et al., 10 Nov 2025, Peng et al., 2024):

Waymo-Open (Dynamic32 subset, scene reconstruction):

Method          PSNR    SSIM    PSNR* (dynamic)
3DGS            28.47   0.876   —
EmerNeRF        28.16   —       24.32
$S^3$Gaussian   31.35   0.911   26.02

Mapverse (ephemerality segmentation):

Method         Mean IoU (%)
EmerSeg        45.1
STEGO/CAUSE    19–24

Novel-View Synthesis (SSIM, PSNR):

  • EnvGS with self-supervised mask matches supervised baselines (SSIM $\sim 0.806$, PSNR $\sim 22.78$ on Mapverse (Li et al., 2024)).
  • Rendering speed is real-time ($\sim 30$ ms/frame), compared to hundreds of ms for NeRF.

Qualitative benefits: $S^3$Gaussian suppresses ghosting artifacts on dynamic objects, yields sharper backgrounds and skies, and supports faithful view synthesis even in the presence of complex street dynamics.

Advanced Methods:

  • DIAL-GS achieves 36.88 PSNR, 0.948 SSIM, 0.113 LPIPS on Waymo Open (Su et al., 10 Nov 2025), outperforming DeSiRe-GS, PVG, and baselines; enables instance-level editing under full self-supervision.

7. Strengths, Limitations, and Extensions

Strengths

  • Fully self-supervised static/dynamic decomposition, requiring no 3D bounding boxes or semantic masks.
  • Real-time rendering enabled by explicit 3D splatting and efficient HexPlane latent grids.
  • Superior photorealism and geometric accuracy on large-scale urban driving datasets.
  • Robustness to out-of-distribution dynamics via spatio-temporal regularization and ephemerality mining from repeated traversals or temporal consistency.

Limitations

  • Modeling of high-velocity or extremely non-rigid dynamics (e.g., articulated pedestrians) is limited; the deformation fields in $S^3$Gaussian assume predominantly rigid motion and may underfit sparse or highly variable observations.
  • Semantic supervision is limited to latent codes; explicit class labels are not used except in some foundation-model-aligned or instance-aware variants.
  • Current pipelines operate in sliding temporal windows (e.g., 50-frame clips), with full long-horizon temporal regularization remaining an open challenge.

Potential Extensions

  • Incorporation of unsupervised optical flow or more advanced spatio-temporal segmentation cues to handle non-rigid, overtly articulated, or rare dynamics (Huang et al., 2024, Xu et al., 16 Jul 2025).
  • Introduction of hierarchical or instance-aware grouping for object-level manipulation and simulation (Su et al., 10 Nov 2025).
  • Integration of differentiable physics or kinematic priors for even more plausible trajectory modeling.
  • Extension to foundation model alignment and open-vocabulary semantic occupancy prediction, as demonstrated in GaussTR (Jiang et al., 2024).

These approaches collectively position $S^3$Gaussian and its successors as a unifying framework for fully self-supervised, efficient, and high-fidelity 3D street scene reconstruction, enabling scalable data-driven simulation, robust mapping, and open-world perception in autonomous driving and related robotics applications (Huang et al., 2024, Li et al., 2024, Su et al., 10 Nov 2025, Peng et al., 2024, Xu et al., 16 Jul 2025).
