IDSplat: Self-Supervised 3D Dynamic Scene Reconstruction

Updated 26 November 2025
  • IDSplat is a self-supervised 3D Gaussian Splatting framework that reconstructs dynamic driving scenes by modeling static and dynamic objects as coherent instances using rigid transformations.
  • It integrates zero-shot language-grounded video tracking with LiDAR anchoring and coordinated-turn smoothing to achieve physically consistent motion estimation and instance decomposition.
  • The framework delivers competitive reconstruction quality and enables practical downstream applications such as object mask extraction and trajectory editing in autonomous driving scenarios.

IDSplat is a self-supervised 3D Gaussian Splatting framework designed for dynamic scene reconstruction with explicit instance-level decomposition and learnable motion trajectories, targeting driving scenarios. It explicitly models dynamic objects as coherent instances undergoing rigid transformations through time, achieving efficient, annotation-free instance separation and scene rendering. The approach integrates zero-shot, language-grounded 2D video tracking with LiDAR anchoring in 3D and coordinated-turn smoothing for physically consistent object motion estimation. IDSplat achieves competitive reconstruction quality, robust generalization across diverse sequences and camera densities, and enables downstream manipulation such as object mask extraction and trajectory editing (Lindström et al., 24 Nov 2025).

1. Core Scene Representation

IDSplat represents a scene as a set of 3D Gaussian primitives, each associated with either the static background or a dynamic object instance. Each Gaussian, indexed by $i$, is parameterized as follows:

  • Occupancy $o_i \in [0,1]$
  • World-frame 3D mean $\mu_i \in \mathbb{R}^3$
  • 3D covariance $\Sigma_i$, typically factored into scale and orientation
  • RGB base-color feature $f_i^{rgb} \in \mathbb{R}^3$
  • View-dependent feature $f_i \in \mathbb{R}^{D_f}$
  • Discrete instance label $z_i$; $z_i = 0$ for background, $z_i \in \{1, \ldots, N_{dynamic}\}$ for dynamic instances

Each instance $z$ has a temporally varying rigid transformation $T_z(t) = (R_z(t), t_z(t)) \in SE(3)$ at frame $t$, mapping canonical primitive centers to the world frame via

$\mu_i^{world}(t) = R_{z_i}(t)\,\mu_i + t_{z_i}(t).$
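The parameterization and per-instance transform above can be sketched as follows; the field names and the `to_world` helper are illustrative assumptions, not identifiers from the IDSplat codebase.

```python
# Sketch of the per-Gaussian parameterization; names are illustrative only.
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian:
    opacity: float          # o_i in [0, 1]
    mean: np.ndarray        # mu_i in R^3 (canonical frame for dynamic instances)
    cov: np.ndarray         # Sigma_i, 3x3 (scale + orientation)
    rgb: np.ndarray         # base color f_i^rgb in R^3
    feat: np.ndarray        # view-dependent feature f_i in R^{D_f}
    instance: int           # z_i; 0 = static background

def to_world(g: Gaussian, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Map a dynamic Gaussian's canonical center into the world frame using
    its instance's per-frame rigid transform (R, t)."""
    if g.instance == 0:     # background Gaussians are already in the world frame
        return g.mean
    return R @ g.mean + t
```

Background Gaussians (instance 0) stay fixed in the world frame; only dynamic instances are re-posed each frame.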

Rendering is performed via “splatting” following SplatAD’s differentiable point-based rasterizer. Each Gaussian projects to a 2D “splat” on the image plane; at pixel $u$, the weight is

$w_i(u) = o_i \exp\!\left(-\tfrac{1}{2}(u - \hat{\mu}_i)^\top \hat{\Sigma}_i^{-1} (u - \hat{\mu}_i)\right),$

where $\hat{\mu}_i = \pi(\mu_i)$ is the camera projection and $\hat{\Sigma}_i$ is $\Sigma_i$ projected into image space. The per-pixel color,

$C(u) = \sum_i c_i\, w_i(u) \prod_{j < i} \bigl(1 - w_j(u)\bigr),$

is alpha-composited in front-to-back order; the accumulated opacity is $\alpha(u) = \sum_i w_i(u) \prod_{j < i} \bigl(1 - w_j(u)\bigr)$.

The same process admits an equivalent volumetric view,

$C(u) = \int T(s)\,\sigma(s)\,c(s)\,ds, \quad T(s) = \exp\!\left(-\int_0^s \sigma(r)\,dr\right),$

with the projected 2D splat response replaced by the ray-integrated 3D Gaussian density for each Gaussian.
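The splat weighting and front-to-back compositing can be sketched in a few lines. This is a generic 3DGS-style reference implementation, not SplatAD's actual rasterization kernel, and it assumes depth-sorted inputs.

```python
# Minimal sketch of 2D splat weighting and front-to-back alpha compositing.
import numpy as np

def splat_weight(u, o, mu2d, cov2d):
    """w_i(u) = o_i * exp(-0.5 (u - mu)^T cov^{-1} (u - mu)) for one 2D splat."""
    d = np.asarray(u, float) - mu2d
    return o * np.exp(-0.5 * d @ np.linalg.inv(cov2d) @ d)

def composite(weights, colors):
    """Front-to-back alpha compositing over depth-sorted splats.
    Returns (pixel color, accumulated opacity)."""
    color = np.zeros(3)
    transmittance = 1.0
    for w, c in zip(weights, colors):
        color += transmittance * w * np.asarray(c, float)
        transmittance *= 1.0 - w
    return color, 1.0 - transmittance
```

The accumulated opacity equals one minus the remaining transmittance after all splats are composited.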

2. Instance Decomposition Pipeline

Instance-level decomposition occurs in a multi-stage, self-supervised pipeline integrating vision and LiDAR data:

Zero-shot Masking and 3D Association

  • Grounded-SAM-2, prompted with semantic labels (e.g., “car,” “truck”), generates 2D object masks per video frame.
  • Masks are eroded by a few pixels to reduce boundary misalignments.
  • Temporally nearest LiDAR points are projected into each image; each point is assigned a mask ID for candidate-instance association.
  • DBSCAN clustering (minPts = 10, $\varepsilon$ = 0.5 m) filters outliers, and the largest cluster per instance is retained.
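The LiDAR-point-to-mask association step can be sketched as below; the pinhole intrinsics `K` and the integer mask layout are illustrative assumptions, not IDSplat's actual interfaces.

```python
# Project LiDAR points into an image and read off instance-mask IDs.
import numpy as np

def assign_mask_ids(points_cam, K, mask):
    """points_cam: (N, 3) LiDAR points in camera frame (z forward);
    mask: (H, W) int array of instance IDs (0 = no object).
    Returns per-point IDs, -1 for points behind the camera or off-image."""
    H, W = mask.shape
    ids = np.full(len(points_cam), -1, dtype=int)
    front = points_cam[:, 2] > 1e-6              # keep points in front of the camera
    proj = (K @ points_cam[front].T).T           # homogeneous pixel coordinates
    uv = proj[:, :2] / proj[:, 2:3]              # perspective divide
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    ids[np.flatnonzero(front)[inside]] = mask[v[inside], u[inside]]
    return ids
```

The per-point IDs can then be clustered per candidate instance (e.g. DBSCAN with minPts = 10, ε = 0.5 m) to discard outliers, as described above.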

Feature Correspondence and Rigid Pose Initialization

  • For each instance $z$ and each pair of source/target frames $(t_s, t_t)$:
    • DINOv3 features are extracted per image and projected onto the instance’s 3D points.
    • Features are matched across frames by a cosine-similarity threshold.
    • RANSAC with the Umeyama algorithm estimates a rigid $SE(3)$ transformation: sample 3 correspondences, estimate $(R, t)$, count inliers (point distance < 0.1 m), and accept the largest-inlier hypothesis (inlier ratio at least 0.5).
    • On success, the estimated transforms are composed; transformed points are merged back into the instance’s canonical point cloud.
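The per-pair rigid fit can be sketched as a Kabsch/Umeyama-style SVD solve inside a minimal RANSAC loop. Thresholds mirror the text (0.1 m inlier distance, 0.5 inlier ratio); the loop itself is simplified relative to a production implementation.

```python
# Rigid SE(3) fit via SVD plus a minimal 3-point RANSAC loop.
import numpy as np

def fit_rigid(src, dst):
    """Least-squares R, t with dst ≈ R @ src + t (both (N, 3), N >= 3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ S @ U.T
    return R, mu_d - R @ mu_s

def ransac_rigid(src, dst, iters=200, thresh=0.1, min_inlier_ratio=0.5, seed=0):
    """Sample 3 correspondences per iteration; keep the largest-inlier hypothesis."""
    rng = np.random.default_rng(seed)
    best = (None, None, 0)
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)
        R, t = fit_rigid(src[idx], dst[idx])
        d = np.linalg.norm(src @ R.T + t - dst, axis=1)
        n = int((d < thresh).sum())
        if n > best[2]:
            best = (R, t, n)
    R, t, n = best
    return (R, t) if n >= min_inlier_ratio * len(src) else (None, None)
```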

This staged approach yields temporally consistent 3D instance tracks from unannotated videos and LiDAR.

3. Motion Modeling and Temporal Smoothing

IDSplat models motion trajectories as per-frame rigid-body transforms $T_z(t) \in SE(3)$ for each instance $z$. Primitive centers are transformed at each timestep as described above.

A factor graph, implemented in GTSAM, imposes coordinated-turn (CT) smoothing:

  • At every frame $t$, the state comprises the pose $T_t \in SE(3)$, the forward speed $v_t$, and the curvature $\kappa_t$.
  • Measurement factors tie each RANSAC pose estimate $\hat{T}_t$ to the state,

    $r_t = \log\bigl(\hat{T}_t^{-1} T_t\bigr),$

    with fixed rotation (rad) and translation (m) standard deviations.

  • CT motion-model factors relate consecutive states $x_t$ and $x_{t+1}$: assuming constant speed and curvature over the interval, the heading advances by $v_t \kappa_t \Delta t$ and the position follows the predicted arc. The residual constrains $x_{t+1}$ to the motion predicted from $x_t$.
  • Priors include random walks on $v_t$ and $\kappa_t$, small roll/pitch priors (in rad), and a moderate curvature prior.
  • Outlier rejection: measurement factors with large whitened error are dropped iteratively (up to 10 Levenberg–Marquardt solves).

This scheme mitigates pose misalignment and tracking errors, yielding physically plausible, temporally smooth trajectories.
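A coordinated-turn prediction step, the motion model the inter-frame factors encode, can be sketched in the plane. The state layout (position, heading, speed, curvature) is an assumption consistent with the text; the full system operates on $SE(3)$ poses.

```python
# Planar coordinated-turn (constant speed and curvature) prediction over dt.
import numpy as np

def ct_predict(x, y, heading, v, kappa, dt):
    """Propagate a planar CT state: heading rate = v * kappa."""
    if abs(kappa) < 1e-9:                       # straight-line limit
        return x + v * dt * np.cos(heading), y + v * dt * np.sin(heading), heading
    h_new = heading + v * kappa * dt
    r = 1.0 / kappa                             # turn radius
    x_new = x + r * (np.sin(h_new) - np.sin(heading))
    y_new = y - r * (np.cos(h_new) - np.cos(heading))
    return x_new, y_new, h_new
```

The smoothing residual then compares the predicted next state against the estimated one, down-weighting RANSAC poses that disagree with physically plausible motion.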

4. Loss Functions and Joint Optimization

IDSplat employs joint optimization of all representation and motion parameters, driven by compositional loss terms:

$\mathcal{L} = \mathcal{L}_{rgb} + \mathcal{L}_{lidar} + \mathcal{L}_{reg},$

with

  • $\mathcal{L}_{lidar}$ combining $\ell_1$ losses on rendered range and intensity, a penalty on opacity accumulated before the true range, and a cross-entropy loss for LiDAR “ray drops.”
  • $\mathcal{L}_{reg}$ regularizing occupancy and covariance scale.

Optimization variables encompass all Gaussian parameters, decoder weights, sensor embeddings, and dynamic instance poses $\{T_z(t)\}$. Gaussians and decoders are optimized with Adam for 30k steps, with multi-scale image sampling and scheduled learning rates. Instance poses are initialized with CT smoothing and further refined via the full loss gradients, integrating both photometric and LiDAR signals.
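The LiDAR portion of the loss can be sketched as below; the term grouping, shapes, and absence of weighting coefficients are simplifications, not IDSplat's exact formulation.

```python
# Toy sketch of the LiDAR loss terms: l1 on range/intensity plus a penalty
# on opacity placed in front of the measured return.
import numpy as np

def lidar_loss(pred_range, gt_range, pred_intensity, gt_intensity,
               opacity_along_ray, sample_depths):
    """Shapes: per-ray scalars (N,), per-ray samples (N, S)."""
    l_range = np.abs(pred_range - gt_range).mean()
    l_intensity = np.abs(pred_intensity - gt_intensity).mean()
    before = sample_depths < gt_range[..., None]      # samples before the true return
    l_early = (opacity_along_ray * before).sum(-1).mean()
    return l_range + l_intensity + l_early
```

A ray-drop cross-entropy term would be added analogously for beams with no return.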

5. Experimental Results and Evaluation

IDSplat achieves competitive performance in self-supervised, instance-decomposed scene reconstruction on major autonomous driving benchmarks:

Novel View Synthesis (Waymo Open Dataset)

Comparison metrics across different settings:

Method                        PSNR   SSIM   LPIPS  DPSNR
IDSplat (DeSiRe-GS protocol)  30.61  0.897  0.163  28.49
DeSiRe-GS (self-supervised)   28.76  0.873  0.193  26.26
SplatAD (supervised)          30.80  0.900  0.160  28.97

In the challenging AD-GS protocol (front camera), IDSplat matches or slightly lags supervised SplatAD while outperforming self-supervised AD-GS.

LiDAR Reconstruction

  • On dynamic Waymo scenes, IDSplat attains depth RMSE, intensity error, ray-drop accuracy, and Chamfer distance close to those of supervised SplatAD, indicating near-supervised LiDAR reconstruction quality.

Generalization and Robustness

IDSplat remains robust under sparse-view regimes, maintaining DPSNR at only 25% of input views where other self-supervised methods degrade substantially. On PandaSet (6 cameras, 50% of frames for training, annotation-free), IDSplat nearly matches the best supervised baseline on all metrics.

Qualitative Characteristics

IDSplat produces stable per-instance masks suitable for object editing, supporting test-time applications such as object removal and trajectory rearrangement.

6. Context, Significance, and Implications

IDSplat advances instance-aware dynamic scene reconstruction without reliance on object trajectory or mask annotations, aligning with the growing need for scalable, sensor-realistic simulation in autonomous driving. Its explicit instance modeling via rigid transformations improves separation of static and dynamic components relative to undifferentiated or purely time-varying representations. The integration of physically consistent coordinated-turn smoothing addresses typical pose drift and tracking errors, enhancing motion realism for simulation or editing.

A plausible implication is the practical deployment of IDSplat in large-scale, self-supervised pipelines, where annotation costs and re-training must be minimized. Its robustness to changing view densities and capacity to generalize to new driving environments further broaden its applicability. The instance-control afforded by the explicit decomposition offers new avenues in interactive simulation, dynamic object behavior research, and virtual scene manipulation.

For implementation and further technical details, refer to the IDSplat primary source (Lindström et al., 24 Nov 2025).
