IDSplat: Self-Supervised 3D Dynamic Scene Reconstruction

Updated 26 November 2025

IDSplat is a self-supervised 3D Gaussian Splatting framework that reconstructs dynamic driving scenes by modeling static and dynamic objects as coherent instances using rigid transformations.
It integrates zero-shot language-grounded video tracking with LiDAR anchoring and coordinated-turn smoothing to achieve physically consistent motion estimation and instance decomposition.
The framework delivers competitive reconstruction quality and enables practical downstream applications such as object mask extraction and trajectory editing in autonomous driving scenarios.

IDSplat is a self-supervised 3D Gaussian Splatting framework designed for dynamic scene reconstruction with explicit instance-level decomposition and learnable motion trajectories, targeted for driving scenarios. It explicitly models dynamic objects as coherent instances undergoing rigid transformations through time, achieving efficient, annotation-free instance separation and scene rendering. The approach integrates zero-shot, language-grounded 2D video tracking with LiDAR anchoring in 3D and coordinate-turn smoothing for physically consistent object motion estimation. IDSplat achieves competitive reconstruction quality, robust generalization across diverse sequences and camera densities, and enables downstream manipulation such as object mask extraction and trajectory editing (Lindström et al., 24 Nov 2025).

1. Core Scene Representation

IDSplat represents a scene as a set of 3D Gaussian primitives, each associated with either the static background or a dynamic object instance. Each Gaussian, indexed by $i$ , is parameterized as follows:

Occupancy ( $o_i \in [0,1]$ )
World-frame 3D mean ( $\mu_i \in \mathbb{R}^3$ )
3D covariance ( $\Sigma_i$ , typically factored into scale and orientation)
RGB base-color feature ( $f_i^{rgb} \in \mathbb{R}^3$ )
View-dependent feature ( $f_i \in \mathbb{R}^{D_f}$ )
Discrete instance label ( $z_i$ ; $z_i=0$ for background, $z_i\in\{1,\ldots,N_{dynamic}\}$ for instances)

Each instance $z$ has a temporally-varying rigid transformation $T_{z,t} \in SE(3)$ at frame $t$ , mapping canonical primitive centers to the world frame via

$\mu_{i,t} = T_{z_i, t}\, \mu_i.$

Rendering is performed via “splatting” following SplatAD’s differentiable point-based rasterizer. Each Gaussian projects to a 2D “splat” on the image plane; at pixel $p$ , the weight is

$w_i(p) = o_i \cdot \exp\left[-\frac{1}{2} (\pi(\mu_{i, t}) - p)^T \Sigma_{i,t}^{-1} (\pi(\mu_{i, t}) - p)\right],$

where $\pi(\cdot)$ is the camera projection and $\Sigma_{i,t}$ projected into image space. The per-pixel color,

$C(p) = \sum_{i\in \text{active}(p)} w_i(p)\; c_i,\quad c_i = \mathrm{Decode}(f_i^{rgb}, f_i, \text{view\_dir}),$

is alpha-composited in front-to-back order; accumulated opacity is $\alpha(p) = 1 - \prod_i (1 - w_i(p))$ .

The same process admits an equivalent volumetric view:

$C(r) = \int T(t) \cdot \sigma(r(t)) \cdot c(r(t), d)\, dt,$

with $\sigma(r(t))$ replaced by $o_i N(r(t);\mu_i,\Sigma_i)$ for each Gaussian.

2. Instance Decomposition Pipeline

Instance-level decomposition occurs in a multi-stage, self-supervised pipeline integrating vision and LiDAR data:

Zero-shot Masking and 3D Association

Grounded-SAM-2, prompted with semantic labels (e.g., “car,” “truck”), generates 2D object masks per video frame.
Masks are eroded (~few pixels) to reduce misalignments.
Temporally nearest LiDAR points are projected into each image; each point is assigned a mask ID for candidate-instance association.
DBSCAN clustering (minPts=10, $\epsilon\approx$ 0.5~m) filters outliers, and the largest cluster per instance is retained.

Feature Correspondence and Rigid Pose Initialization

For each instance $z$ $z$ and paired source/target frames $(t_{src}, t_{tgt})$ $(t_{src}, t_{t g t})$ :
- DINOv3 features are extracted per image and projected onto 3D points.
- Features are matched across frames by cosine similarity threshold $>0.8$ .
- RANSAC+Umeyama estimates rigid $SE(3)$ transformation: sample 3 correspondences, estimate $T^{(k)}$ , count inliers (distance $\leq$ 0.1~m), accept largest-inlier hypothesis (ratio $>$ 0.5).
- On success, compose $T_{z, t_{tgt}} = T_{z, t_{tgt} \leftarrow t_{src}} T_{z, t_{src}}$ ; transformed points are merged back into the instance’s canonical point cloud set.

This staged approach yields temporally consistent 3D instance tracks from unannotated videos and LiDAR.

3. Motion Modeling and Temporal Smoothing

IDSplat models motion trajectories as per-frame rigid-body transforms $T_{z,t} = [R_{z,t}\ |\ d_{z,t}] \in SE(3)$ for each instance $z$ . Primitive centers are transformed at each timestep as described above.

A factor-graph, implemented in GTSAM, imposes coordinated-turn (CT) smoothing:

At every frame $t$ , the state vector is $x_t = (T_t \in SE(3), v_t \in \mathbb{R}, \kappa_t \in \mathbb{R})$ , with $v_t$ (forward speed) and $\kappa_t$ (curvature).
Measurement factors for each RANSAC pose $\hat{T}_t$ :

$\phi_{\mathrm{meas}}(T_t, \hat{T}_t) = \rho_{\mathrm{huber}} \left( \| \mathrm{Log}(T_t^{-1} \hat{T}_t) \|_{\Sigma} \right)$

with rotation std $\sim 0.1$ ~rad and translation std $\sim 0.2$ ~m.
CT motion-model factors relate $x_t$ and $x_{t+1}$ :

$\theta = \kappa_t v_t \Delta t \quad \Delta x = \sin \theta / \kappa_t \quad \Delta y = (1 - \cos \theta)/\kappa_t \quad \Delta z = 0\ \Delta R = R_z(\theta)$

The residual constrains $T_{t+1}$ to the predicted motion from $T_t$ .
Priors include random walks for $v_t, \kappa_t$ , small roll/pitch priors ( $\sigma=0.4$ ~rad), and moderate curvature ( $\sigma_\kappa=0.01$ ).
Outlier rejection: measurement factors with large whitened error ( $>1.345$ ) are dropped iteratively (up to 10 L–M solves).

This scheme mitigates pose misalignment and tracking errors, yielding physically plausible, temporally smooth trajectories.

4. Loss Functions and Joint Optimization

IDSplat employs joint optimization of all representation and motion parameters, driven by compositional loss terms:

$L = \lambda_r L_1 + (1 - \lambda_r) L_{\mathrm{SSIM}} + L_{\mathrm{lidar}} + \lambda_{\mathrm{MCMC}} L_{\mathrm{MCMC}}$

with

$L_{\mathrm{lidar}} = \lambda_{\mathrm{depth}} L_{\mathrm{depth}} + \lambda_{\mathrm{los}} L_{\mathrm{los}} + \lambda_{\mathrm{inten}} L_{\mathrm{inten}} + \lambda_{\mathrm{raydrop}} L_{\mathrm{BCE}}$ , where $L_{\mathrm{depth}}$ and $L_{\mathrm{inten}}$ are $L_2$ losses on rendered range/intensity, $L_{\mathrm{los}}$ penalizes opacity before true range, $L_{\mathrm{BCE}}$ is cross-entropy for LiDAR “ray drops.”
$L_{\mathrm{MCMC}} = \lambda_o \sum_i |o_i| + \lambda_\Sigma \sum_{i, j} |\sqrt{\mathrm{eig}_j(\Sigma_i)}|$ , regularizing occupancy and covariance scale.

Optimization variables encompass all Gaussian parameters, decoder weights, sensor embeddings, and dynamic instance poses $T_{z,t}$ . Gaussians and decoders are optimized using Adam for 30k steps, with multi-scale image sampling and scheduled learning rates. Instance poses are initialized with CT-smoothing and further refined via the full loss gradients, integrating both photometric and LiDAR signals.

5. Experimental Results and Evaluation

IDSplat achieves state-of-the-art performance in self-supervised, instance-decomposed scene reconstruction on major autonomous driving benchmarks:

Novel View Synthesis (Waymo Open Dataset)

Comparison metrics across different settings:

Method	PSNR	SSIM	LPIPS	DPSNR
IDSplat (DeSiRe-GS)	30.61	0.897	0.163	28.49
DeSiRe-GS (self-sup.)	28.76	0.873	0.193	26.26
Supervised SplatAD	30.80	0.900	0.160	28.97

In the challenging AD-GS protocol (front camera), IDSplat matches or slightly lags supervised SplatAD while outperforming self-supervised AD-GS.

LiDAR Reconstruction

On dynamic Waymo scenes, IDSplat attains depth RMSE $\approx 0.01$ ~m, intensity error $0.056$ (SplatAD: $0.055$), ray-drop accuracy $87.5\%$ ( $87.3\%$ for SplatAD), and Chamfer distance $1.16$ ($0.98$ for SplatAD), indicating close correspondence to supervised performance.

Generalization and Robustness

IDSplat remains robust under sparse view regimes, maintaining DPSNR $>26$ ~dB at only 25% input views, where other self-supervised methods degrade by $\gg 5$ ~dB. On PandaSet (6 cameras, 50% train, annotation-free), IDSplat nearly matches the best supervised baseline in all metrics.

Qualitative Characteristics

IDSplat produces stable, per-instance masks suitable for object editing. It supports downstream applications such as object removal and trajectory rearrangement at test-time, demonstrated by stable masking and editing results.

6. Context, Significance, and Implications

IDSplat advances instance-aware dynamic scene reconstruction without reliance on object trajectory or mask annotations, aligning with the growing need for scalable, sensor-realistic simulation in autonomous driving. Its explicit instance modeling via rigid transformations improves separation of static and dynamic components relative to undifferentiated or purely time-varying representations. The integration of physically consistent coordinated-turn smoothing addresses typical pose drift and tracking errors, enhancing motion realism for simulation or editing.

A plausible implication is the practical deployment of IDSplat in large-scale, self-supervised pipelines, where annotation costs and re-training must be minimized. Its robustness to changing view densities and capacity to generalize to new driving environments further broaden its applicability. The instance-control afforded by the explicit decomposition offers new avenues in interactive simulation, dynamic object behavior research, and virtual scene manipulation.

For implementation and further technical details, refer to the IDSplat primary source (Lindström et al., 24 Nov 2025).

PDF Markdown Chat (Pro)

References (1)

IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Topic to Video (Beta)

Generate a video overview of this topic.

Follow Topic

Get notified by email when new papers are published related to IDSplat.