- The paper introduces a novel feed-forward framework that disentangles appearance from geometry to enable controllable relighting in driving scenes.
- It employs a global appearance embedding from DINOv2 and factored color prediction to generate both invariant base colors and conditionally adapted colors.
- Empirical results on Waymo and nuScenes demonstrate superior reconstruction fidelity and over 100× faster inference compared to optimization-based methods.
SpectralSplat: Appearance-Disentangled Feed-Forward Gaussian Splatting for Driving Scenes
Motivation and Problem Context
Feed-forward 3D Gaussian Splatting (3DGS) methods have advanced real-time 3D scene reconstruction for autonomous driving by directly regressing Gaussian primitives from multi-view video, achieving high-fidelity novel view synthesis without per-scene optimization. However, standard 3DGS pipelines encode scene geometry and appearance in a tightly coupled manner, with colors "baked into" the input lighting, weather, and exposure conditions. This entanglement prevents controllable relighting, impedes appearance transfer, and disrupts temporal consistency across accumulated frames. Moreover, multi-traversal datasets, in which the same scenes are captured repeatedly under varying conditions, cannot be effectively leveraged, limiting both coverage and robustness.
SpectralSplat Architecture and Methodological Innovations
SpectralSplat introduces a feed-forward framework that explicitly disentangles appearance from geometry, enabling consistent relighting and appearance transfer while maintaining geometric integrity. The method builds on the UniSplat backbone (a 3D latent scaffold with a dual-branch Gaussian decoder), augmenting it with three key modules; a minimal code sketch follows the list:
- Global Appearance Embedding: A lightweight encoder ϕ compresses DINOv2 patch tokens from all cameras into a single latent vector a ∈ ℝ^64, capturing scene-level appearance properties (lighting mood, weather tone) in a globally shared code that prevents per-point drift.
- Factored Color Prediction: A color MLP is evaluated twice for each Gaussian: once with a zero embedding (appearance-agnostic base color), and once with the actual appearance embedding (adapted color). This produces both canonical and conditionally adapted colors, which allows explicit supervision of disentanglement via paired losses.
- Appearance-Adaptable Temporal History: Gaussians accumulated across frames cache appearance-agnostic features so that at recall time, colors are recomputed with the current appearance embedding, eliminating inconsistent coloring from multi-appearance history buffers.
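A minimal PyTorch sketch of the first two modules, as a reading aid rather than the authors' implementation: the module names, hidden sizes, pooling scheme, and the DINOv2 token dimension (768 here) are illustrative assumptions; only the 64-d code and the two-call color evaluation come from the description above.

```python
import torch
import torch.nn as nn

class AppearanceEncoder(nn.Module):
    """Pools DINOv2 patch tokens from all cameras into one global 64-d code.

    Hypothetical module: the paper specifies a lightweight encoder phi and a
    64-d latent; the pooling and layer sizes here are assumptions.
    """
    def __init__(self, token_dim: int = 768, embed_dim: int = 64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(token_dim, 256), nn.GELU(), nn.Linear(256, embed_dim)
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (num_cams, num_patches, token_dim)
        pooled = patch_tokens.mean(dim=(0, 1))   # one shared code per scene
        return self.proj(pooled)                 # (embed_dim,)

class ColorMLP(nn.Module):
    """Predicts per-Gaussian RGB from geometry features plus the code."""
    def __init__(self, feat_dim: int = 128, embed_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + embed_dim, 128), nn.GELU(), nn.Linear(128, 3)
        )

    def forward(self, feats: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # feats: (N, feat_dim); the global code is broadcast to all Gaussians
        a_exp = a.expand(feats.shape[0], -1)
        return torch.sigmoid(self.mlp(torch.cat([feats, a_exp], dim=-1)))

def factored_colors(color_mlp: ColorMLP, feats: torch.Tensor, a: torch.Tensor):
    """Evaluate the color MLP twice, as described above: once with a zero
    embedding (appearance-agnostic base color) and once with the actual
    embedding (adapted color)."""
    base = color_mlp(feats, torch.zeros_like(a))
    adapted = color_mlp(feats, a)
    return base, adapted
```

The same two-call pattern serves the temporal history: because cached Gaussians store appearance-agnostic features rather than final colors, recomputing `factored_colors` with the current frame's code recolors the accumulated history consistently.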
Figure 1: Appearance-disentangled Gaussian reconstruction; base colors remain consistent across relighting conditions, adapted colors match the target appearance, and swapped embeddings enable robust appearance transfer.
Hybrid Relighting Pipeline for Paired Supervision
Training requires paired images of identical geometry under different appearances. SpectralSplat generates these pairs with a hybrid relighting pipeline; a hypothetical stand-in is sketched below.
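The stages of the pipeline are not enumerated in this summary, so the sketch below is a loudly hypothetical stand-in: it derives an appearance-varied copy of a frame from global photometric edits (white balance, exposure, gamma), which by construction yields pixel-aligned pairs with identical geometry.

```python
import torch

def relight_pair(img: torch.Tensor, seed: int = 0):
    """img: (3, H, W) in [0, 1]. Returns the original and a relit copy that
    share geometry exactly. Parameter ranges are arbitrary choices, not the
    paper's pipeline."""
    g = torch.Generator().manual_seed(seed)
    wb = 0.7 + 0.6 * torch.rand(3, 1, 1, generator=g)  # per-channel tint
    gain = 0.6 + 0.8 * torch.rand(1, generator=g)      # exposure shift
    gamma = 0.8 + 0.4 * torch.rand(1, generator=g)     # tone-curve bend
    relit = (img * wb * gain).clamp(0.0, 1.0) ** gamma
    return img, relit
```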
Supervision Framework and Training Objective
SpectralSplat's supervision comprises four complementary losses:
- Base Invariance (L_inv): Appearance-agnostic base-color renders must be invariant to input conditions.
- Augmented Reconstruction (L_aug): Adapted renders must reproduce the augmented targets.
- Appearance-Swap Consistency (L_swap): Swapping embeddings between paired samples supervises cross-appearance transfer.
- Base Color Alignment (L_base): Base colors are regularized toward physics-based pseudo-ground-truth.
These terms are optimized jointly with the standard feed-forward reconstruction losses (MSE, LPIPS, depth, dynamics). The base stream receives gradients only from the disentanglement losses, preventing appearance leakage; a minimal sketch of the four terms follows.
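A minimal sketch of the four losses, assuming plain MSE distances, equal weights, and a hypothetical differentiable `render(gaussians, code)` returning an image; the actual distances and weights are not given in this summary.

```python
import torch
import torch.nn.functional as F

def disentanglement_losses(render, gauss_a, gauss_b, a_code, b_code,
                           target_a, target_b, base_pseudo_gt):
    """gauss_a/gauss_b share geometry but were regressed under appearance
    codes a_code/b_code (e.g. an original frame and its relit counterpart);
    target_a/target_b are the corresponding ground-truth images."""
    base_a = render(gauss_a, torch.zeros_like(a_code))
    base_b = render(gauss_b, torch.zeros_like(b_code))
    L_inv = F.mse_loss(base_a, base_b)                     # base invariance
    L_aug = F.mse_loss(render(gauss_b, b_code), target_b)  # augmented recon
    L_swap = 0.5 * (F.mse_loss(render(gauss_a, b_code), target_b)
                    + F.mse_loss(render(gauss_b, a_code), target_a))
    L_base = F.mse_loss(base_a, base_pseudo_gt)            # physics anchor
    # In the full objective, the standard reconstruction losses (MSE, LPIPS,
    # depth, dynamics) act on the adapted stream only; detaching the base
    # stream there keeps appearance gradients from leaking into it.
    return L_inv + L_aug + L_swap + L_base
```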
Quantitative and Qualitative Results
SpectralSplat achieves superior reconstruction quality over prior feed-forward methods (UniSplat, MVSplat) on Waymo and nuScenes datasets, measured by PSNR, SSIM, and LPIPS. Notably, disentanglement does not degrade reconstruction metrics.
Cross-Appearance Evaluation: Appearance can be transferred between geometry instances by swapping the appearance embedding, yielding coherent renders whose PSNR differences quantify the strength of the separation. The learned embeddings cluster by illumination type, confirming a semantically meaningful encoding; a usage sketch follows.
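As a usage illustration, with the hypothetical renderer from the sketches above, transfer reduces to rendering one scene's Gaussians under another scene's 64-d code:

```python
def transfer_appearance(render, gaussians_src, code_src, code_ref):
    """render: hypothetical differentiable renderer from the earlier sketch;
    gaussians_src: Gaussians regressed from the source scene;
    code_src/code_ref: appearance codes of the source and reference scenes."""
    recon = render(gaussians_src, code_src)     # faithful reconstruction
    transfer = render(gaussians_src, code_ref)  # same geometry, new appearance
    return recon, transfer
```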
Figure 3: Cross-appearance results; base colors are invariant, adapted colors match lighting, embedding swaps transfer appearance without changing geometry.
Comparison with Optimization-Based Methods: Against WildGaussians, SpectralSplat achieves competitive perceptual quality at >100× faster inference speed, with reduced diffusion artifacts.
Figure 4: WildGaussians exhibits artifacts under appearance transfer, while SpectralSplat maintains structural and perceptual fidelity.
Appearance Transfer Grid: Explicit embedding control enables flexible relighting where geometry is held fixed and only appearance is varied.
Figure 5: Appearance transfer across scenes; embedding from a reference image modulates global lighting and tone without compromising geometric features.
Ablation and Analysis
Loss ablation studies reveal that L_swap is critical for effective disentanglement. Removing the base-color anchor (L_base) degrades transfer and impedes temporal accumulation, confirming that base-stream regularization is indispensable.
Theoretical and Practical Implications
SpectralSplat pioneers feed-forward disentanglement in large-scale driving scenes, unlocking controllable relighting, consistent rendering across traversals, and efficient leveraging of diverse environmental data. Practically, this benefits simulation, planning, and robust modeling for autonomous vehicles. Theoretically, it advances disentanglement paradigms by demonstrating that global appearance codes and paired supervision can be sufficient for realistic, controllable 3D scene rendering at scale.
Future Prospects
Potential extensions include spatially-varying appearance embeddings for object-level or localized relighting, leveraging naturally diverse multi-traversal data to minimize dependence on synthetic augmentation, and integration with generative world models for fully controllable simulation environments.
Conclusion
SpectralSplat addresses a key limitation of traditional feed-forward 3DGS pipelines by disentangling appearance from geometry using a global embedding, factored color prediction, hybrid relighting supervision, and appearance-adaptable temporal history. It enables consistent, controllable rendering in driving scenes, with robust empirical and perceptual results. The method’s scalability and modularity suggest applicability to broader domains, and its disentanglement framework paves the way for further innovations in dynamic scene generation and simulation.