
IDSplat: Self-Supervised 3D Dynamic Scene Reconstruction

Updated 26 November 2025
  • IDSplat is a self-supervised 3D Gaussian Splatting framework that reconstructs dynamic driving scenes by modeling static and dynamic objects as coherent instances using rigid transformations.
  • It integrates zero-shot language-grounded video tracking with LiDAR anchoring and coordinated-turn smoothing to achieve physically consistent motion estimation and instance decomposition.
  • The framework delivers competitive reconstruction quality and enables practical downstream applications such as object mask extraction and trajectory editing in autonomous driving scenarios.

IDSplat is a self-supervised 3D Gaussian Splatting framework for dynamic scene reconstruction with explicit instance-level decomposition and learnable motion trajectories, targeted at driving scenarios. It models dynamic objects as coherent instances undergoing rigid transformations through time, achieving efficient, annotation-free instance separation and scene rendering. The approach integrates zero-shot, language-grounded 2D video tracking with LiDAR anchoring in 3D and coordinated-turn smoothing for physically consistent object motion estimation. IDSplat achieves competitive reconstruction quality and robust generalization across diverse sequences and camera densities, and it enables downstream manipulation such as object mask extraction and trajectory editing (Lindström et al., 24 Nov 2025).

1. Core Scene Representation

IDSplat represents a scene as a set of 3D Gaussian primitives, each associated with either the static background or a dynamic object instance. Each Gaussian, indexed by $i$, is parameterized as follows:

  • Occupancy ($o_i \in [0,1]$)
  • World-frame 3D mean ($\mu_i \in \mathbb{R}^3$)
  • 3D covariance ($\Sigma_i$, typically factored into scale and orientation)
  • RGB base-color feature ($f_i^{rgb} \in \mathbb{R}^3$)
  • View-dependent feature ($f_i \in \mathbb{R}^{D_f}$)
  • Discrete instance label ($z_i$; $z_i = 0$ for background, $z_i \in \{1,\ldots,N_{dynamic}\}$ for instances)

Each instance $z$ has a temporally-varying rigid transformation $T_{z,t} \in SE(3)$ at frame $t$, mapping canonical primitive centers to the world frame via

$$\mu_{i,t} = T_{z_i, t}\, \mu_i.$$
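As a minimal sketch of this parameterization, the snippet below moves a canonical center into the world frame for a given frame. The names `RigidTransform`, `Gaussian`, and `world_center` are hypothetical, and the rotation is restricted to yaw purely to keep the example short; a full implementation would use a general $SE(3)$ element and would also rotate the covariance.

```python
import math
from dataclasses import dataclass

@dataclass
class RigidTransform:
    """Yaw-only rigid transform (illustrative simplification of SE(3))."""
    yaw: float          # rotation about the z axis, in radians
    translation: tuple  # (tx, ty, tz) in meters

    def apply(self, p):
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        x, y, z = p
        tx, ty, tz = self.translation
        return (c * x - s * y + tx, s * x + c * y + ty, z + tz)

@dataclass
class Gaussian:
    opacity: float  # o_i in [0, 1]
    mean: tuple     # canonical 3D center mu_i
    instance: int   # z_i: 0 = background, >= 1 = dynamic instance

def world_center(g, poses, t):
    """Return mu_{i,t}; poses maps (instance, frame) -> RigidTransform."""
    if g.instance == 0:
        return g.mean  # background primitives do not move
    return poses[(g.instance, t)].apply(g.mean)

# Hypothetical example: one car instance that turns 90 degrees and moves 2 m.
poses = {(1, 0): RigidTransform(0.0, (0.0, 0.0, 0.0)),
         (1, 1): RigidTransform(math.pi / 2, (2.0, 0.0, 0.0))}
car = Gaussian(opacity=0.9, mean=(1.0, 0.0, 0.5), instance=1)
road = Gaussian(opacity=1.0, mean=(5.0, 5.0, 0.0), instance=0)
print(world_center(car, poses, 1))   # approximately (2.0, 1.0, 0.5)
print(world_center(road, poses, 1))  # unchanged background center
```

Because every primitive of an instance shares one transform per frame, editing a trajectory amounts to replacing entries in `poses` without touching the Gaussians themselves.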

Rendering is performed via “splatting” following SplatAD’s differentiable point-based rasterizer. Each Gaussian projects to a 2D “splat” on the image plane; at pixel $p$, the weight is

$$w_i(p) = o_i \cdot \exp\left[-\frac{1}{2} (\pi(\mu_{i,t}) - p)^T \Sigma_{i,t}^{-1} (\pi(\mu_{i,t}) - p)\right],$$

where $\pi(\cdot)$ is the camera projection and $\Sigma_{i,t}$ is the covariance projected into image space. The per-pixel color,

$$C(p) = \sum_{i \in \text{active}(p)} w_i(p)\, c_i, \quad c_i = \mathrm{Decode}(f_i^{rgb}, f_i, \text{view\_dir}),$$

is alpha-composited in front-to-back order; accumulated opacity is $\alpha(p) = 1 - \prod_i (1 - w_i(p))$.

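The same process admits an equivalent volumetric view:

$$C(r) = \int T(t) \cdot \sigma(r(t)) \cdot c(r(t), d)\, dt,$$

with $\sigma(r(t))$ replaced by $o_i N(r(t); \mu_i, \Sigma_i)$ for each Gaussian. Here $t$ parameterizes distance along the ray $r$ with direction $d$, and $T(t)$ is the accumulated transmittance, not the frame index or rigid transform used above.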
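The front-to-back compositing of splat weights at a single pixel can be sketched as follows. This is the standard alpha-compositing recursion used by Gaussian-splatting rasterizers, in which each splat's weight is attenuated by the transmittance of the splats in front of it; whether SplatAD's rasterizer matches it in every detail is an assumption. The helper `composite` is hypothetical, and colors are taken as already decoded.

```python
def composite(weights, colors):
    """Front-to-back alpha compositing for one pixel.

    weights: per-splat weights w_i(p) in [0, 1], sorted near to far.
    colors:  per-splat RGB tuples c_i, in the same order.
    Returns (rgb, accumulated_opacity).
    """
    rgb = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed by nearer splats
    for w, c in zip(weights, colors):
        contribution = transmittance * w
        for k in range(3):
            rgb[k] += contribution * c[k]
        transmittance *= 1.0 - w
    # 1 - prod_i (1 - w_i) is exactly the accumulated opacity alpha(p).
    return tuple(rgb), 1.0 - transmittance

# A half-opaque red splat in front of a half-opaque blue splat.
rgb, alpha = composite([0.5, 0.5], [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)])
print(rgb, alpha)  # (0.5, 0.0, 0.25) 0.75
```

The returned opacity equals $1 - \prod_i (1 - w_i(p))$, which is the discrete counterpart of one minus the transmittance in the volumetric integral.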

2. Instance Decomposition Pipeline

Instance-level decomposition occurs in a multi-stage, self-supervised pipeline integrating vision and LiDAR data:

Zero-shot Masking and 3D Association

  • Grounded-SAM-2, prompted with semantic labels (e.g., “car,” “truck”), generates 2D object masks per video frame.
  • Masks are eroded by a few pixels to reduce misalignments.
  • Temporally nearest LiDAR points are projected into each image; each point is assigned a mask ID for candidate-instance association.
  • DBSCAN clustering (minPts = 10, $\epsilon \approx 0.5$ m) filters outliers, and the largest cluster per instance is retained.
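The outlier-filtering step can be illustrated with a small, dependency-free DBSCAN. This is a naive $O(n^2)$ sketch for illustration, not the implementation used by the authors; `dbscan` and `largest_cluster` are hypothetical helpers, and only the parameter values (minPts = 10, $\epsilon = 0.5$ m) come from the text.

```python
import math

def dbscan(points, eps=0.5, min_pts=10):
    """Naive O(n^2) DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # only core points expand the cluster
                frontier.extend(neighbors[j])
    return labels

def largest_cluster(points, eps=0.5, min_pts=10):
    """Keep only the points of the most populated cluster (empty if all noise)."""
    labels = dbscan(points, eps, min_pts)
    kept = [l for l in labels if l != -1]
    if not kept:
        return []
    best = max(set(kept), key=kept.count)
    return [p for p, l in zip(points, labels) if l == best]

# Hypothetical LiDAR points labeled with one mask ID: a dense patch plus two strays.
points = [(0.1 * i, 0.1 * j, 0.0) for i in range(4) for j in range(3)]
points += [(10.0, 10.0, 0.0), (20.0, 0.0, 0.0)]
obj = largest_cluster(points)
print(len(obj))  # 12
```

Stray points typically come from mask pixels that bleed onto the background, so they land far from the object in 3D even though they share its mask ID in 2D.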

Feature Correspondence and Rigid Pose Initialization

  • For each instance $z$ and paired source/target frames $(t_{src}, t_{tgt})$:
    • DINOv3 features are extracted per image and projected onto 3D points.
    • Features are matched across frames by cosine similarity, with threshold $> 0.8$.
    • RANSAC+Umeyama estimates a rigid $SE(3)$ transformation: sample 3 correspondences, estimate $T^{(k)}$, count inliers (distance $\leq 0.1$ m), accept the largest-inlier hypothesis (ratio $> 0.5$).
    • On success, compose $T_{z, t_{tgt}} = T_{z, t_{tgt} \leftarrow t_{src}}\, T_{z, t_{src}}$; transformed points are merged back into the instance’s canonical point cloud set.

This staged approach yields temporally consistent 3D instance tracks from unannotated videos and LiDAR.
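The matching, scoring, and composition steps can be sketched in plain Python. Hypothesis generation (the 3-point Umeyama fit inside RANSAC) is deliberately omitted, so the candidate pose is simply given. All function names and the toy data are hypothetical; only the thresholds (0.8, 0.1 m, 0.5) come from the text.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def match_features(src_feats, tgt_feats, threshold=0.8):
    """For each source feature, keep its best target match if similarity > threshold."""
    if not tgt_feats:
        return []
    matches = []
    for i, f in enumerate(src_feats):
        sims = [cosine(f, g) for g in tgt_feats]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] > threshold:
            matches.append((i, j))
    return matches

def apply(T, p):
    """Apply a rigid transform T = (R, d), with R a 3x3 nested list, to point p."""
    R, d = T
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + d[r] for r in range(3))

def inlier_ratio(T, src_pts, tgt_pts, matches, max_dist=0.1):
    """Fraction of matches whose transformed source point lies within max_dist."""
    hits = sum(math.dist(apply(T, src_pts[i]), tgt_pts[j]) <= max_dist
               for i, j in matches)
    return hits / len(matches) if matches else 0.0

def compose(A, B):
    """Return the transform that applies B first, then A."""
    (Ra, da), (Rb, db) = A, B
    R = [[sum(Ra[r][k] * Rb[k][c] for k in range(3)) for c in range(3)]
         for r in range(3)]
    d = tuple(sum(Ra[r][k] * db[k] for k in range(3)) + da[r] for r in range(3))
    return (R, d)

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
src_feats = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
tgt_feats = [(0.9, 0.1, 0), (0.1, 0.9, 0), (0.6, 0.6, 0.5)]
src_pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
tgt_pts = [(1.0, 0.0, 0.0), (2.05, 0.0, 0.0), (9.0, 9.0, 9.0)]

matches = match_features(src_feats, tgt_feats)  # third feature is rejected
T_rel = (I3, (1.0, 0.0, 0.0))                   # candidate pose, assumed given
ratio = inlier_ratio(T_rel, src_pts, tgt_pts, matches)
T_src = (I3, (3.0, 0.0, 0.0))
T_tgt = compose(T_rel, T_src) if ratio > 0.5 else None
print(matches, ratio, T_tgt[1])
```

Chaining poses this way means an error in one frame pair propagates to later frames, which is one reason a subsequent smoothing stage is useful.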

3. Motion Modeling and Temporal Smoothing

IDSplat models motion trajectories as per-frame rigid-body transforms $T_{z,t} = [R_{z,t} \mid d_{z,t}] \in SE(3)$ for each instance $z$. Primitive centers are transformed at each timestep as described above.

A factor graph implemented in GTSAM imposes coordinated-turn (CT) smoothing:

  • At every frame $t$, the state vector is $x_t = (T_t \in SE(3),\ v_t \in \mathbb{R},\ \kappa_t \in \mathbb{R})$, with $v_t$ the forward speed and $\kappa_t$ the curvature.
  • Measurement factors for each RANSAC pose $\hat{T}_t$:

    $$\phi_{\mathrm{meas}}(T_t, \hat{T}_t) = \rho_{\mathrm{huber}} \left( \| \mathrm{Log}(T_t^{-1} \hat{T}_t) \|_{\Sigma} \right)$$

    with rotation std $\sim 0.1$ rad and translation std $\sim 0.2$ m.

  • CT motion-model factors relate $x_t$ and $x_{t+1}$:

    $$\theta = \kappa_t v_t \Delta t, \quad \Delta x = \frac{\sin\theta}{\kappa_t}, \quad \Delta y = \frac{1 - \cos\theta}{\kappa_t}, \quad \Delta z = 0, \quad \Delta R = R_z(\theta)$$

    The residual constrains $T_{t+1}$ to the predicted motion from $T_t$.

  • Priors include random walks for $v_t$ and $\kappa_t$, small roll/pitch priors ($\sigma = 0.4$ rad), and a moderate curvature prior ($\sigma_\kappa = 0.01$).
  • Outlier rejection: measurement factors with large whitened error ($> 1.345$) are dropped iteratively (up to 10 Levenberg-Marquardt solves).

This scheme mitigates pose misalignment and tracking errors, yielding physically plausible, temporally smooth trajectories.
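The coordinated-turn prediction used by the motion-model factors can be sketched as a planar pose update. This is a simplified illustration that tracks only $(x, y, \text{heading})$ rather than a full $SE(3)$ pose; `ct_step` is a hypothetical helper, and the straight-line fallback for near-zero curvature is an assumption added to avoid division by zero.

```python
import math

def ct_step(x, y, heading, v, kappa, dt):
    """Propagate a planar pose one step under the coordinated-turn model.

    theta = kappa * v * dt is the heading change, and (dx, dy) is the
    displacement along the arc expressed in the body frame.
    """
    theta = kappa * v * dt
    if abs(kappa) < 1e-9:
        # Limit kappa -> 0: sin(theta)/kappa -> v*dt and (1-cos(theta))/kappa -> 0.
        dx, dy = v * dt, 0.0
    else:
        dx = math.sin(theta) / kappa
        dy = (1.0 - math.cos(theta)) / kappa
    # Rotate the body-frame displacement into the world frame.
    c, s = math.cos(heading), math.sin(heading)
    return (x + c * dx - s * dy, y + s * dx + c * dy, heading + theta)

# Quarter turn on a 10 m radius circle at 10 m/s (hypothetical values).
dt = (math.pi / 2) / (0.1 * 10.0)
print(ct_step(0.0, 0.0, 0.0, v=10.0, kappa=0.1, dt=dt))  # approx (10, 10, pi/2)
# Straight-line motion when curvature vanishes.
print(ct_step(0.0, 0.0, 0.0, v=10.0, kappa=0.0, dt=0.1))
```

In a factor graph, the residual for a motion-model factor would compare this prediction with the next pose estimate; here only the prediction itself is shown.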

4. Loss Functions and Joint Optimization

IDSplat employs joint optimization of all representation and motion parameters, driven by compositional loss terms:

$$L = \lambda_r L_1 + (1 - \lambda_r) L_{\mathrm{SSIM}} + L_{\mathrm{lidar}} + \lambda_{\mathrm{MCMC}} L_{\mathrm{MCMC}}$$

with

  • $L_{\mathrm{lidar}} = \lambda_{\mathrm{depth}} L_{\mathrm{depth}} + \lambda_{\mathrm{los}} L_{\mathrm{los}} + \lambda_{\mathrm{inten}} L_{\mathrm{inten}} + \lambda_{\mathrm{raydrop}} L_{\mathrm{BCE}}$, where $L_{\mathrm{depth}}$ and $L_{\mathrm{inten}}$ are $L_2$ losses on rendered range and intensity, $L_{\mathrm{los}}$ penalizes opacity before the true range, and $L_{\mathrm{BCE}}$ is the cross-entropy for LiDAR “ray drops.”
  • $L_{\mathrm{MCMC}} = \lambda_o \sum_i |o_i| + \lambda_\Sigma \sum_{i, j} \left|\sqrt{\mathrm{eig}_j(\Sigma_i)}\right|$, regularizing occupancy and covariance scale.
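How the terms combine can be shown with a few lines of arithmetic. The sketch below takes the individual loss values as already computed; the function names and every numeric weight are hypothetical placeholders, since the text does not state the $\lambda$ values.

```python
def mcmc_regularizer(opacities, scales, lambda_o, lambda_sigma):
    """L_MCMC: L1 penalty on occupancies plus L1 penalty on per-axis scales.

    scales[i] holds the square roots of the eigenvalues of Sigma_i.
    """
    return (lambda_o * sum(abs(o) for o in opacities)
            + lambda_sigma * sum(abs(s) for axes in scales for s in axes))

def total_loss(l1, ssim_loss, lidar_terms, lidar_weights, l_mcmc,
               lambda_r, lambda_mcmc):
    """Combine image, LiDAR, and regularization terms into the full objective."""
    l_lidar = sum(lidar_weights[k] * lidar_terms[k] for k in lidar_terms)
    return lambda_r * l1 + (1.0 - lambda_r) * ssim_loss + l_lidar + lambda_mcmc * l_mcmc

# All numbers below are made up for illustration.
l_mcmc = mcmc_regularizer(opacities=[0.5, 1.0],
                          scales=[(1.0, 2.0, 3.0), (0.5, 0.5, 0.5)],
                          lambda_o=0.01, lambda_sigma=0.01)
lidar_terms = {"depth": 0.5, "los": 0.2, "inten": 0.1, "raydrop": 0.3}
lidar_weights = {"depth": 0.1, "los": 0.1, "inten": 1.0, "raydrop": 0.1}
loss = total_loss(l1=0.1, ssim_loss=0.2, lidar_terms=lidar_terms,
                  lidar_weights=lidar_weights, l_mcmc=l_mcmc,
                  lambda_r=0.8, lambda_mcmc=1.0)
print(round(l_mcmc, 6), round(loss, 6))
```

The single weight $\lambda_r$ trades the $L_1$ term against the SSIM term, while the LiDAR terms and the regularizer enter additively with their own weights.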

Optimization variables encompass all Gaussian parameters, decoder weights, sensor embeddings, and dynamic instance poses $T_{z,t}$. Gaussians and decoders are optimized using Adam for 30k steps, with multi-scale image sampling and scheduled learning rates. Instance poses are initialized with CT-smoothing and further refined via the full loss gradients, integrating both photometric and LiDAR signals.

5. Experimental Results and Evaluation

IDSplat achieves state-of-the-art performance in self-supervised, instance-decomposed scene reconstruction on major autonomous driving benchmarks:

Novel View Synthesis (Waymo Open Dataset)

Comparison metrics across different settings:

| Method | PSNR (dB) | SSIM | LPIPS | DPSNR (dB) |
|---|---|---|---|---|
| IDSplat (DeSiRe-GS) | 30.61 | 0.897 | 0.163 | 28.49 |
| DeSiRe-GS (self-sup.) | 28.76 | 0.873 | 0.193 | 26.26 |
| Supervised SplatAD | 30.80 | 0.900 | 0.160 | 28.97 |

In the challenging AD-GS protocol (front camera), IDSplat matches or slightly lags supervised SplatAD while outperforming self-supervised AD-GS.

LiDAR Reconstruction

  • On dynamic Waymo scenes, IDSplat attains depth RMSE $\approx 0.01$ m, intensity error $0.056$ (SplatAD: $0.055$), ray-drop accuracy $87.5\%$ ($87.3\%$ for SplatAD), and Chamfer distance $1.16$ ($0.98$ for SplatAD), indicating close correspondence to supervised performance.

Generalization and Robustness

IDSplat remains robust under sparse view regimes, maintaining DPSNR $> 26$ dB at only 25% input views, where other self-supervised methods degrade by $\gg 5$ dB. On PandaSet (6 cameras, 50% train, annotation-free), IDSplat nearly matches the best supervised baseline in all metrics.

Qualitative Characteristics

IDSplat produces stable per-instance masks suitable for object editing, and it supports test-time applications such as object removal and trajectory rearrangement.

6. Context, Significance, and Implications

IDSplat advances instance-aware dynamic scene reconstruction without reliance on object trajectory or mask annotations, aligning with the growing need for scalable, sensor-realistic simulation in autonomous driving. Its explicit instance modeling via rigid transformations improves separation of static and dynamic components relative to undifferentiated or purely time-varying representations. The integration of physically consistent coordinated-turn smoothing addresses typical pose drift and tracking errors, enhancing motion realism for simulation or editing.

A plausible implication is the practical deployment of IDSplat in large-scale, self-supervised pipelines, where annotation costs and re-training must be minimized. Its robustness to changing view densities and capacity to generalize to new driving environments further broaden its applicability. The instance-control afforded by the explicit decomposition offers new avenues in interactive simulation, dynamic object behavior research, and virtual scene manipulation.

For implementation and further technical details, refer to the IDSplat primary source (Lindström et al., 24 Nov 2025).
