- The paper introduces a novel feed-forward framework that disentangles appearance from geometry to enable controllable relighting in driving scenes.
- It employs a global appearance embedding from DINOv2 and factored color prediction to generate both invariant base colors and conditionally adapted colors.
- Empirical results on Waymo and nuScenes demonstrate superior reconstruction fidelity and over 100× faster inference compared to optimization-based methods.
SpectralSplat: Appearance-Disentangled Feed-Forward Gaussian Splatting for Driving Scenes
Motivation and Problem Context
Feed-forward 3D Gaussian Splatting (3DGS) methods have advanced real-time 3D scene reconstruction for autonomous driving by directly regressing Gaussian primitives from multi-view video, achieving high-fidelity novel view synthesis without per-scene optimization. However, standard 3DGS pipelines encode scene geometry and appearance in a tightly coupled manner, with colors "baked into" the input lighting, weather, and exposure conditions. This entanglement prevents controllable relighting, impedes appearance transfer, and disrupts temporal consistency across accumulated frames. Moreover, multi-traversal datasets, in which the same scenes are captured repeatedly under varying conditions, cannot be effectively leveraged, limiting both coverage and robustness.
SpectralSplat Architecture and Methodological Innovations
SpectralSplat introduces a feed-forward framework that explicitly disentangles appearance from geometry, enabling consistent relighting and appearance transfer while maintaining geometric integrity. The method builds on the UniSplat backbone (a 3D latent scaffold with a dual-branch Gaussian decoder), augmenting it with three key modules; a minimal code sketch follows the list:
- Global Appearance Embedding: A lightweight encoder ϕ compresses DINOv2 patch tokens from all cameras into a single latent vector a ∈ ℝ^64, capturing scene-level appearance properties (lighting mood, weather tone) in a globally shared code that prevents per-point drift.
- Factored Color Prediction: A color MLP is evaluated twice for each Gaussian: once with a zero embedding (appearance-agnostic base color), and once with the actual appearance embedding (adapted color). This produces both canonical and conditionally adapted colors, which allows explicit supervision of disentanglement via paired losses.
- Appearance-Adaptable Temporal History: Gaussians accumulated across frames cache appearance-agnostic features so that at recall time, colors are recomputed with the current appearance embedding, eliminating inconsistent coloring from multi-appearance history buffers.
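A minimal PyTorch sketch of the first two modules, as a reading aid rather than the authors' implementation: the module names, hidden sizes, pooling scheme, and the DINOv2 token dimension (768 here) are illustrative assumptions; only the 64-d code and the two-call color evaluation come from the description above.

```python
import torch
import torch.nn as nn

class AppearanceEncoder(nn.Module):
    """Pools DINOv2 patch tokens from all cameras into one global 64-d code.

    Hypothetical module: the paper specifies a lightweight encoder phi and a
    64-d latent; the pooling and layer sizes here are assumptions.
    """
    def __init__(self, token_dim: int = 768, embed_dim: int = 64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(token_dim, 256), nn.GELU(), nn.Linear(256, embed_dim)
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (num_cams, num_patches, token_dim)
        pooled = patch_tokens.mean(dim=(0, 1))   # one shared code per scene
        return self.proj(pooled)                 # (embed_dim,)

class ColorMLP(nn.Module):
    """Predicts per-Gaussian RGB from geometry features plus the code."""
    def __init__(self, feat_dim: int = 128, embed_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + embed_dim, 128), nn.GELU(), nn.Linear(128, 3)
        )

    def forward(self, feats: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # feats: (N, feat_dim); the global code is broadcast to all Gaussians
        a_exp = a.expand(feats.shape[0], -1)
        return torch.sigmoid(self.mlp(torch.cat([feats, a_exp], dim=-1)))

def factored_colors(color_mlp: ColorMLP, feats: torch.Tensor, a: torch.Tensor):
    """Evaluate the color MLP twice, as described above: once with a zero
    embedding (appearance-agnostic base color) and once with the actual
    embedding (adapted color)."""
    base = color_mlp(feats, torch.zeros_like(a))
    adapted = color_mlp(feats, a)
    return base, adapted
```

The same two-call pattern serves the temporal history: because cached Gaussians store appearance-agnostic features rather than final colors, recomputing `factored_colors` with the current frame's code recolors the accumulated history consistently.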
Figure 1: Appearance-disentangled Gaussian reconstruction; base colors remain consistent across relighting conditions, adapted colors match the target appearance, and swapped embeddings enable robust appearance transfer.
Hybrid Relighting Pipeline for Paired Supervision
Training requires paired images of identical geometry under different appearances. SpectralSplat generates these pairs with a hybrid relighting pipeline; a hypothetical stand-in is sketched below.
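The stages of the pipeline are not enumerated in this summary, so the sketch below is a loudly hypothetical stand-in: it derives an appearance-varied copy of a frame from global photometric edits (white balance, exposure, gamma), which by construction yields pixel-aligned pairs with identical geometry.

```python
import torch

def relight_pair(img: torch.Tensor, seed: int = 0):
    """img: (3, H, W) in [0, 1]. Returns the original and a relit copy that
    share geometry exactly. Parameter ranges are arbitrary choices, not the
    paper's pipeline."""
    g = torch.Generator().manual_seed(seed)
    wb = 0.7 + 0.6 * torch.rand(3, 1, 1, generator=g)  # per-channel tint
    gain = 0.6 + 0.8 * torch.rand(1, generator=g)      # exposure shift
    gamma = 0.8 + 0.4 * torch.rand(1, generator=g)     # tone-curve bend
    relit = (img * wb * gain).clamp(0.0, 1.0) ** gamma
    return img, relit
```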
Supervision Framework and Training Objective
SpectralSplat's supervision comprises four complementary losses:
- Base Invariance (L_inv): Appearance-agnostic base-color renders must be invariant to input conditions.
- Augmented Reconstruction (L_aug): Adapted renders must reproduce the augmented targets.
- Appearance-Swap Consistency (L_swap): Swapping embeddings between paired samples supervises cross-appearance transfer.
- Base Color Alignment (L_base): Base colors are regularized toward physics-based pseudo-ground-truth.
These terms are optimized jointly with the standard feed-forward reconstruction losses (MSE, LPIPS, depth, dynamics). The base stream receives gradients only from the disentanglement losses, preventing appearance leakage; a minimal sketch of the four terms follows.
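A minimal sketch of the four losses, assuming plain MSE distances, equal weights, and a hypothetical differentiable `render(gaussians, code)` returning an image; the actual distances and weights are not given in this summary.

```python
import torch
import torch.nn.functional as F

def disentanglement_losses(render, gauss_a, gauss_b, a_code, b_code,
                           target_a, target_b, base_pseudo_gt):
    """gauss_a/gauss_b share geometry but were regressed under appearance
    codes a_code/b_code (e.g. an original frame and its relit counterpart);
    target_a/target_b are the corresponding ground-truth images."""
    base_a = render(gauss_a, torch.zeros_like(a_code))
    base_b = render(gauss_b, torch.zeros_like(b_code))
    L_inv = F.mse_loss(base_a, base_b)                     # base invariance
    L_aug = F.mse_loss(render(gauss_b, b_code), target_b)  # augmented recon
    L_swap = 0.5 * (F.mse_loss(render(gauss_a, b_code), target_b)
                    + F.mse_loss(render(gauss_b, a_code), target_a))
    L_base = F.mse_loss(base_a, base_pseudo_gt)            # physics anchor
    # In the full objective, the standard reconstruction losses (MSE, LPIPS,
    # depth, dynamics) act on the adapted stream only; detaching the base
    # stream there keeps appearance gradients from leaking into it.
    return L_inv + L_aug + L_swap + L_base
```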
Quantitative and Qualitative Results
SpectralSplat achieves superior reconstruction quality over prior feed-forward methods (UniSplat, MVSplat) on Waymo and nuScenes datasets, measured by PSNR, SSIM, and LPIPS. Notably, disentanglement does not degrade reconstruction metrics.
Cross-Appearance Evaluation: Appearance can be transferred between geometry instances by swapping the appearance embedding, yielding coherent renders whose PSNR differences quantify the strength of the separation. The learned embeddings cluster by illumination type, confirming a semantically meaningful encoding; a usage sketch follows.
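As a usage illustration, with the hypothetical renderer from the sketches above, transfer reduces to rendering one scene's Gaussians under another scene's 64-d code:

```python
def transfer_appearance(render, gaussians_src, code_src, code_ref):
    """render: hypothetical differentiable renderer from the earlier sketch;
    gaussians_src: Gaussians regressed from the source scene;
    code_src/code_ref: appearance codes of the source and reference scenes."""
    recon = render(gaussians_src, code_src)     # faithful reconstruction
    transfer = render(gaussians_src, code_ref)  # same geometry, new appearance
    return recon, transfer
```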
Figure 3: Cross-appearance results; base colors are invariant, adapted colors match lighting, embedding swaps transfer appearance without changing geometry.
Comparison with Optimization-Based Methods: Against WildGaussians, SpectralSplat achieves competitive perceptual quality at >100× faster inference speed, with reduced diffusion artifacts.
Figure 4: WildGaussians exhibits artifacts under appearance transfer, while SpectralSplat maintains structural and perceptual fidelity.
Appearance Transfer Grid: Explicit embedding control enables flexible relighting where geometry is held fixed and only appearance is varied.
Figure 5: Appearance transfer across scenes; embedding from a reference image modulates global lighting and tone without compromising geometric features.
Ablation and Analysis
Loss ablation studies reveal that L_swap is critical for effective disentanglement. Removing the base-color anchor (L_base) degrades transfer and impedes temporal accumulation, confirming that base-stream regularization is indispensable.
Theoretical and Practical Implications
SpectralSplat pioneers feed-forward disentanglement in large-scale driving scenes, unlocking controllable relighting, consistent rendering across traversals, and efficient leveraging of diverse environmental data. Practically, this benefits simulation, planning, and robust modeling for autonomous vehicles. Theoretically, it advances disentanglement paradigms by demonstrating that global appearance codes and paired supervision can be sufficient for realistic, controllable 3D scene rendering at scale.
Future Prospects
Potential extensions include spatially-varying appearance embeddings for object-level or localized relighting, leveraging naturally diverse multi-traversal data to minimize dependence on synthetic augmentation, and integration with generative world models for fully controllable simulation environments.
Conclusion
SpectralSplat addresses a key limitation of traditional feed-forward 3DGS pipelines by disentangling appearance from geometry using a global embedding, factored color prediction, hybrid relighting supervision, and appearance-adaptable temporal history. It enables consistent, controllable rendering in driving scenes, with robust empirical and perceptual results. The method’s scalability and modularity suggest applicability to broader domains, and its disentanglement framework paves the way for further innovations in dynamic scene generation and simulation.