AutoSplat: 3D Gaussian Splatting for Autonomous Driving
- AutoSplat is a framework that leverages 3D Gaussian splatting for real-time scene reconstruction and multi-view consistent view synthesis in autonomous driving contexts.
- It employs a two-phase background reconstruction with semantic segmentation and geometric flattening, alongside symmetry-regularized foreground object modeling.
- The method achieves state-of-the-art performance on benchmark datasets, delivering real-time FPS and enhanced fidelity for dynamic urban scenes.
AutoSplat is a scene reconstruction and view synthesis framework for autonomous driving environments, leveraging 3D Gaussian splatting under geometric and appearance constraints to address challenges such as complex backgrounds, dynamic objects, and sparse viewpoints. It produces an explicit 3D representation that enables multi-view consistent, real-time simulation and novel trajectory creation (e.g., virtual lane changes) in urban driving scenarios, and achieves state-of-the-art performance on benchmark datasets (Khan et al., 2024).
1. Pipeline Architecture
AutoSplat ingests sequences of calibrated RGB images, synchronized LiDAR sweeps, and tracked 3D bounding boxes for dynamic foreground objects. The processing pipeline consists of:
- Background Reconstruction: Conducted in two phases.
- Phase 1 segments each image into “road,” “sky,” and “other” with Mask2Former. Gaussians are class-labeled by backprojecting LiDAR points, with geometric flattening constraints imposed on road/sky Gaussians.
- Phase 2 jointly optimizes all background Gaussians, masking foreground regions.
- Foreground (Object) Reconstruction:
- Each object is initialized from a 3D template mesh (e.g., NFI-inverted shapes), placed along tracked trajectories.
- Reflected-Gaussian consistency supervises both observed and unobserved sides by leveraging object symmetry.
- Appearance dynamics are captured via per-Gaussian, per-timestep residuals in spherical harmonics.
- Scene-Level Fusion: All Gaussians are fine-tuned together, including object trajectory corrections, on the full image.
The final representation supports high-fidelity rendering from arbitrary views and trajectory edits (such as simulated lane changes).
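The class-labeling step in Phase 1 can be illustrated with a minimal sketch: LiDAR points are projected into a segmentation mask and inherit the class of the pixel they land on. The function name, the class ids, and the pinhole camera model without distortion are assumptions made for illustration; the source does not specify these details.

```python
import numpy as np

def label_points_from_mask(points_world, world_to_cam, K, seg_mask, unlabeled=-1):
    """Assign each 3D point the class id of the pixel it projects to.

    points_world: (N, 3) LiDAR points in world coordinates.
    world_to_cam: (4, 4) rigid transform from world to camera frame.
    K:            (3, 3) pinhole intrinsics (no lens distortion assumed).
    seg_mask:     (H, W) integer class ids, e.g. from a segmentation network.
    """
    n = points_world.shape[0]
    homo = np.hstack([points_world, np.ones((n, 1))])
    cam = (world_to_cam @ homo.T).T[:, :3]

    labels = np.full(n, unlabeled, dtype=np.int64)
    in_front = cam[:, 2] > 1e-6          # discard points behind the camera
    uvw = (K @ cam[in_front].T).T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)

    h, w = seg_mask.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.flatnonzero(in_front)[inside]
    labels[idx] = seg_mask[v[inside], u[inside]]
    return labels
```

Points that fall outside the image or behind the camera keep the `unlabeled` value. In a multi-frame setting they could be labeled from another view, but that policy is not described in the source.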
2. Mathematical Structure of 3D Gaussian Splatting
The core of AutoSplat’s representation is a set of parameterized 3D Gaussians, each described by:
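- Center: $\mu \in \mathbb{R}^3$
- Covariance: $\Sigma \in \mathbb{R}^{3 \times 3}$, factorized as $\Sigma = R S S^\top R^\top$, where $R$ is a rotation matrix and $S$ a diagonal scaling matrix.
- Opacity: $\alpha \in [0, 1]$
- Color: Encoded in spherical harmonic coefficients $f$
A Gaussian’s spatial density at a point $x$ is:
$$G(x) = \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$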
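As a minimal sketch of this parameterization, the code below builds a covariance from a rotation and per-axis scales and evaluates the unnormalized Gaussian density. Storing the rotation as a unit quaternion in (w, x, y, z) order is an assumption here, and the function names are illustrative.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance(q, scales):
    """Sigma = R S S^T R^T, symmetric positive semi-definite by construction."""
    R = quat_to_rotmat(np.asarray(q, dtype=float))
    S = np.diag(scales)
    return R @ S @ S.T @ R.T

def density(x, mu, cov):
    """Unnormalized density exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))."""
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)))
```

Factorizing the covariance this way keeps it valid under gradient updates, since any rotation and any set of scales yield a symmetric positive semi-definite matrix.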
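For rasterization, the 3D Gaussians are projected to 2D Gaussians $G'$ in the image plane. The pixel color $C$ is composited by front-to-back alpha blending over the $N$ depth-sorted Gaussians that overlap the pixel:
$$C = \sum_{i=1}^{N} c_i \, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$$
where $c_i$ is the color of the $i$-th Gaussian decoded from its spherical harmonics, and $\alpha_i$ is its opacity weighted by the projected 2D Gaussian at the pixel.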
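The blending step can be sketched for a single pixel as follows, assuming the contributing Gaussians are already sorted front to back and that each per-pixel alpha already includes the projected 2D Gaussian falloff. The function name is illustrative.

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing for one pixel.

    colors: (N, 3) per-Gaussian RGB, sorted nearest first.
    alphas: (N,) effective per-pixel opacities in [0, 1].
    Returns the blended color and the per-Gaussian blending weights.
    """
    colors = np.asarray(colors, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Transmittance reaching Gaussian i is the product of (1 - alpha_j) for j < i.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = alphas * transmittance
    return weights @ colors, weights
```

For two Gaussians with alpha 0.5 each, the weights are 0.5 and 0.25: the second Gaussian only receives the half of the light that the first one lets through.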
3. Constraints for Road and Sky Representation
To ensure rendering consistency and realism across multi-view and trajectory perturbations, dedicated geometric constraints are imposed for Gaussians assigned to “road” and “sky”:
- Flatness Constraint: Each road and sky Gaussian is penalized for nonzero roll, nonzero pitch, and vertical extent, which drives it toward a thin, horizontally aligned shape.
This penalty is added to the combined loss of each semantic region during background training.
This prevents floating artifacts and preserves plausible parallax under camera or vehicle motion.
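One plausible form of such a flatness term is sketched below. The absolute-value (L1) penalty, the Euler-angle inputs, the vertical axis convention, and the weight `beta` are all assumptions for illustration; the exact formulation used by AutoSplat is not reproduced here.

```python
import numpy as np

def flatness_penalty(roll, pitch, scale_z):
    """Mean absolute roll, pitch, and vertical scale over a set of Gaussians.

    The penalty is zero exactly when every Gaussian has zero roll, zero pitch,
    and zero vertical extent (assumed L1 form, not taken from the source).
    """
    roll, pitch, scale_z = (np.asarray(a, dtype=float) for a in (roll, pitch, scale_z))
    return float(np.mean(np.abs(roll) + np.abs(pitch) + np.abs(scale_z)))

def region_loss(photometric_loss, roll, pitch, scale_z, beta=0.1):
    """Photometric loss of one semantic region plus the weighted flatness term.

    beta is a hypothetical weight chosen only for this example.
    """
    return photometric_loss + beta * flatness_penalty(roll, pitch, scale_z)
```

In this sketch, yaw is left unconstrained, so road Gaussians can still rotate freely within the ground plane.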
4. Reflected-Gaussian Consistency for Object Supervision
AutoSplat enforces appearance and geometric consistency for foreground objects’ unobserved sides by reflecting each Gaussian across the canonical plane of symmetry:
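- The symmetry plane has unit normal $n$, with the reflection matrix $M = I - 2 n n^\top$.
- Each Gaussian’s center $\mu$, rotation $R$, and spherical harmonic coefficients $f$ are reflected as:
$$\tilde{\mu} = M \mu, \qquad \tilde{R} = M R, \qquad \tilde{f} = D_M f$$
where $D_M$ is the Wigner-D matrix for spherical harmonics.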
Rendered images of both the original and the reflected Gaussians are jointly supervised against the ground-truth image.
This approach eliminates geometric “boxiness” and color artifacts on previously unobserved surfaces.
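The geometric part of the reflection can be sketched as below, assuming the object is expressed in a canonical frame whose symmetry plane passes through the origin. The spherical harmonic transform is omitted: the degree-0 coefficient is unchanged by a reflection, while higher degrees need the corresponding transform. Function names are illustrative.

```python
import numpy as np

def reflection_matrix(normal):
    """Householder reflection across the plane through the origin with the given normal."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    return np.eye(3) - 2.0 * np.outer(n, n)

def reflect_gaussian(mu, R, scales, normal):
    """Reflect a Gaussian's center and orientation; return the reflected covariance too.

    mu: (3,) center, R: (3, 3) rotation, scales: (3,) per-axis scales.
    """
    M = reflection_matrix(normal)
    mu_ref = M @ mu
    R_ref = M @ R                      # determinant -1: an improper rotation
    cov_ref = R_ref @ np.diag(scales ** 2) @ R_ref.T
    return mu_ref, R_ref, cov_ref
```

Although the reflected orientation is not a proper rotation, the resulting covariance is still symmetric positive semi-definite, so the reflected Gaussian can be rasterized like any other.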
5. Dynamic Appearance via Residual Spherical Harmonics
To accurately reconstruct transient appearance changes (e.g., brake lights, indicators, time-varying shadows), per-Gaussian, per-timestep residuals are learned through a small MLP operating on a temporal embedding:
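- Given time-step embedding $e_t$, position $\mu$, and static spherical harmonic coefficients $f$, the MLP predicts a residual that is added to the static coefficients:
$$\Delta f_t = \mathrm{MLP}(e_t, \mu, f)$$
$$f_t = f + \Delta f_t$$
A sparsity regularization on the residuals $\Delta f_t$, scaled by a weighting coefficient, is applied to suppress nonphysical flicker, enabling efficient modeling of appearance dynamics.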
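The residual mechanism can be sketched as a forward pass only, with no training loop. The sinusoidal time embedding, the layer sizes, the zero-initialized output layer, and the L1 form of the sparsity term are assumptions for illustration, and `ResidualSH` is a hypothetical name.

```python
import numpy as np

def time_embedding(t, num_freqs=4):
    """Sinusoidal embedding of a scalar time (an assumed choice of embedding)."""
    freqs = 2.0 ** np.arange(num_freqs)
    return np.concatenate([np.sin(freqs * t), np.cos(freqs * t)])

class ResidualSH:
    """Tiny two-layer MLP that predicts per-Gaussian SH residuals at time t."""

    def __init__(self, sh_dim, num_freqs=4, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = 2 * num_freqs + 3 + sh_dim
        self.num_freqs = num_freqs
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        # Zero-initialized output layer: the model starts as a purely static scene.
        self.W2 = np.zeros((hidden, sh_dim))
        self.b2 = np.zeros(sh_dim)

    def residual(self, t, mu, f_static):
        n = mu.shape[0]
        e_t = np.broadcast_to(time_embedding(t, self.num_freqs), (n, 2 * self.num_freqs))
        x = np.concatenate([e_t, mu, f_static], axis=1)
        h = np.maximum(x @ self.W1 + self.b1, 0.0)   # ReLU
        return h @ self.W2 + self.b2

    def dynamic_sh(self, t, mu, f_static):
        """Return time-dependent SH coefficients and an L1 sparsity term."""
        delta = self.residual(t, mu, f_static)
        return f_static + delta, float(np.abs(delta).mean())
```

Penalizing the mean absolute residual pushes most Gaussians back to their static appearance, so only regions that truly change over time, such as brake lights, retain nonzero residuals.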
6. Training Objective and Optimization
The full learning objective consists of:
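- Background loss: the per-region combined loss with the flatness penalty (Section 3), optimized for 15K iterations in each of the two background phases.
- Foreground loss: the joint supervision of original and reflected renderings (Section 4), together with the sparsity regularization on the spherical harmonic residuals (Section 5).
- Fusion: The final objective jointly optimizes both sets of Gaussians.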
Training schedules use 30K iterations for background (in two phases), 5K for foreground, and 10K for joint fusion, typically on a single NVIDIA V100 GPU.
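The iteration budget above can be written down as a small configuration sketch. Running the stages strictly one after another, as well as the stage names, are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    iterations: int

# Iteration counts as stated in the text; stage names are illustrative.
SCHEDULE = (
    Stage("background_phase_1", 15_000),
    Stage("background_phase_2", 15_000),
    Stage("foreground", 5_000),
    Stage("fusion", 10_000),
)

def stage_at(step):
    """Return the name of the stage active at a 0-indexed global step."""
    for stage in SCHEDULE:
        if step < stage.iterations:
            return stage.name
        step -= stage.iterations
    raise ValueError("step is beyond the end of the schedule")

total_iterations = sum(s.iterations for s in SCHEDULE)
```

Under this sequential reading, the whole pipeline amounts to 45K optimization steps.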
7. Experimental Protocol and Benchmarking
Datasets
- Pandaset: Ten challenging urban sequences (80 frames each), spanning day/night conditions with multiple dynamic vehicles.
- KITTI: Standard driving sequences, with varying camera coverage for novel view evaluation (25%, 50%, 75% frames held out).
Metrics
- Novel view synthesis: PSNR (↑), SSIM (↑), LPIPS (↓) on held-out, nearby novel views.
- Lateral-shift simulation: FID (↓) for 1–3 m simulated lane changes.
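For reference, PSNR between a rendered image and its ground truth can be computed as in the sketch below, assuming pixel intensities normalized to [0, 1]. SSIM, LPIPS, and FID require dedicated implementations and are not shown.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

A uniform error of 0.1 on every pixel gives an MSE of 0.01 and therefore a PSNR of 20 dB, which helps calibrate the values in the 22 to 28 dB range reported in the results table.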
Results Table
| Dataset/Task | AutoSplat | EmerNeRF | SUDS | MARS | NSG | NeRF |
|---|---|---|---|---|---|---|
| Pandaset, test-view PSNR | 27.84 | 27.73 | 25.13 | 23.66 | 22.79 | — |
| Pandaset, test-view SSIM | 0.906 | 0.801 | 0.843 | 0.832 | 0.802 | — |
| Pandaset, test-view LPIPS | 0.291 | 0.394 | 0.426 | 0.502 | 0.578 | — |
| Pandaset, FPS | 26 | — | — | — | — | — |
| Lateral FID 1/2/3 m | 54.7/68.7/83.0 | 68.2/90.4/102.8 | 95.4/122.7/150.8 | — | — | — |
| KITTI, 75% held PSNR | 26.59 | — | 22.77 | 24.23 | 21.53 | 18.56 |
| KITTI, 75% held SSIM | 0.913 | — | 0.797 | 0.845 | 0.673 | 0.557 |
| KITTI, 75% held LPIPS | 0.204 | — | 0.171 | 0.160 | 0.254 | 0.554 |
AutoSplat matches or exceeds the previous best reported PSNR and SSIM on both datasets, although SUDS and MARS obtain lower LPIPS in the KITTI 75% setting. It uniquely combines real-time performance (26 FPS on Pandaset), low FID under simulated trajectory shifts, and improved geometric and appearance fidelity, particularly for previously unobserved object sides and dynamic lighting phenomena.
8. Position Within Scene Reconstruction Research
AutoSplat extends conventional 3D Gaussian splatting by: (i) imposing geometric flatness for road/sky, (ii) template-based, symmetry-regularized foreground object initialization, and (iii) efficient dynamic appearance modeling via residual spherical harmonics (Khan et al., 2024). Comparative experiments on view synthesis, trajectory perturbation, and dynamic rendering demonstrate improved accuracy and realism compared to NeRF-based and prior dynamic-scene representations, notably EmerNeRF, SUDS, MARS, NSG, and baseline NeRF.
The framework’s explicit, fast, and physically informed representation contributes new capabilities for multi-view-consistent, real-time simulation environments relevant to the development and evaluation of autonomous vehicles.