- The paper introduces spectral moment supervision to overcome vanishing gradients in 3D Gaussian Splatting, ensuring robust tracking from zero-overlap initializations.
- It shifts optimization to the frequency domain using principled frequency annealing, which provides a global basin of attraction during tracking.
- Experimental results show improved PSNR, SSIM, and LPIPS metrics over traditional pixel-based losses on both synthetic and real-world datasets.
SpectralSplats: Robust Differentiable Tracking via Spectral Moment Supervision
Introduction and Motivation
3D Gaussian Splatting (3DGS) has established itself as a potent framework for real-time, photorealistic novel view synthesis and is increasingly adopted in model-based video tracking scenarios. Despite its differentiable rendering property, deploying 3DGS for robust, analysis-by-synthesis tracking remains precarious due to the local support characteristics of Gaussian primitives. Conventional spatial losses (e.g., pixel-wise L2 or deep-feature-based objectives like LPIPS) require spatial overlap between rendered and target structures for effective gradient propagation. In the absence of overlap—for instance, under substantial initialization misalignment—the spatial objective's gradient vanishes, stranding the optimizer and leading to degenerate solutions.
The paper introduces SpectralSplats, a framework that circumvents the intrinsic locality of photometric losses by transitioning optimization to the frequency domain. Through Spectral Moment supervision, the authors address the vanishing gradient pathology, establishing a global basin of attraction and enabling successful model-based tracking even from zero-overlap initializations.
Figure 1: SpectralSplats enables robust tracking from zero-overlap initializations by shifting supervision from pixels to frequency-based moments, generating global gradient signals where spatial losses fail.
Methodology
The Vanishing Gradient Problem in 3D Gaussian Splatting
The differentiable 3DGS tracking pipeline operates by minimizing the photometric distance between rendered assets, parameterized by transformations Θ, and target observations. Due to the compact and local support of the Gaussian splats, the gradients with respect to the parameters Θ vanish when there is no spatial overlap. The mathematical decomposition of the spatial photometric loss reveals that the “self-term” contributes zero gradient under in-plane translations, and the “target supervision” term vanishes entirely under disjoint support, fundamentally precluding optimization progress. Standard workarounds, such as reliance on strong priors or manual initializations, break down for class-agnostic, in-the-wild applications.
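The vanishing gradient can be reproduced in a 1D toy setting. The sketch below is not the paper's pipeline: it assumes a single Gaussian "splat" on a 256-pixel line, a pixel-wise MSE loss, and a finite-difference gradient with respect to the splat center. The names `gaussian_1d`, `mse`, and `mse_grad_wrt_center` are hypothetical helpers introduced only for illustration.

```python
import math

def gaussian_1d(center, sigma=2.0, n=256):
    """Render a 1D Gaussian 'splat' on an n-pixel grid (toy stand-in for a 3DGS render)."""
    return [math.exp(-0.5 * ((x - center) / sigma) ** 2) for x in range(n)]

def mse(a, b):
    """Pixel-wise mean squared error between two equally sized signals."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def mse_grad_wrt_center(center, target, eps=1e-3):
    """Central finite difference of the pixel MSE with respect to the splat center."""
    hi = mse(gaussian_1d(center + eps), target)
    lo = mse(gaussian_1d(center - eps), target)
    return (hi - lo) / (2 * eps)

target = gaussian_1d(200.0)

# Splats overlap (3 px apart): the gradient is clearly non-zero and negative,
# so gradient descent moves the center toward the target.
grad_overlap = mse_grad_wrt_center(197.0, target)

# Splats are disjoint (100 px apart): the gradient is numerically zero,
# so the optimizer receives no signal.
grad_disjoint = mse_grad_wrt_center(100.0, target)

print(grad_overlap, grad_disjoint)
```

With overlap the loss decreases as the center moves toward the target; with disjoint support the loss is flat in the center, which is the stranded-optimizer behavior described above.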
Spectral Moment Supervision: Replacing Locality with Globality
To bypass the spatial locality trap, SpectralSplats redefines the objective as a moment-matching task in the frequency domain. Moments (both polynomial and spectral) naturally integrate image content against global fields. In particular, by projecting both rendered and target images onto a set of global complex sinusoids—Fourier basis functions—the optimizer receives valid gradients everywhere, regardless of overlap. Under translations, displacements correspond to phase shifts in spectral features, yielding strong, non-vanishing gradients.
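A minimal 1D sketch of this idea, assuming the same toy setting of a single Gaussian splat on a 256-pixel line. The moment definition used here (a single discrete Fourier coefficient) and the helper names `spectral_moment`, `spectral_loss`, and `grad_wrt_center` are illustrative assumptions, not the authors' implementation.

```python
import cmath
import math

N = 256  # number of pixels in the 1D toy "image"

def gaussian_1d(center, sigma=2.0, n=N):
    """Render a 1D Gaussian 'splat' on an n-pixel grid."""
    return [math.exp(-0.5 * ((x - center) / sigma) ** 2) for x in range(n)]

def spectral_moment(image, k):
    """Project the image onto the global complex sinusoid with integer frequency index k."""
    n = len(image)
    return sum(v * cmath.exp(-2j * math.pi * k * x / n) for x, v in enumerate(image))

def spectral_loss(rendered, target, freqs):
    """Sum of squared differences between spectral moments over the active frequencies."""
    return sum(
        abs(spectral_moment(rendered, k) - spectral_moment(target, k)) ** 2
        for k in freqs
    )

def grad_wrt_center(center, target, freqs, eps=1e-3):
    """Central finite difference of the spectral loss with respect to the splat center."""
    hi = spectral_loss(gaussian_1d(center + eps), target, freqs)
    lo = spectral_loss(gaussian_1d(center - eps), target, freqs)
    return (hi - lo) / (2 * eps)

target = gaussian_1d(200.0)

# Rendered splat is 100 px from the target (no overlap), yet the lowest
# frequency still yields a clearly non-zero gradient. It is negative,
# so gradient descent increases the center, i.e. moves toward the target.
grad = grad_wrt_center(100.0, target, freqs=[1])
print(grad)
```

The displacement only changes the phase of each moment, so the loss stays sensitive to the splat position even when the two supports never touch.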
This duality is rooted in Parseval’s theorem, which shows that an L2 loss over the full spectral basis is equivalent to the pixel-wise L2 loss up to a change of basis. However, naively activating all frequencies at once reconstructs the original loss landscape and reinstates local minima (e.g., from phase wrapping), especially at high frequencies.
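The Parseval equivalence can be checked numerically. The sketch below assumes an unnormalized discrete Fourier transform on a 64-sample 1D signal, for which the full-spectrum squared error equals the pixel squared error multiplied by the signal length. The helper names `dft` and `gaussian_1d` are hypothetical.

```python
import cmath
import math

def dft(signal):
    """Unnormalized discrete Fourier transform (direct O(n^2) evaluation)."""
    n = len(signal)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * x / n) for x, v in enumerate(signal))
        for k in range(n)
    ]

def gaussian_1d(center, sigma=2.0, n=64):
    """Render a 1D Gaussian 'splat' on an n-pixel grid."""
    return [math.exp(-0.5 * ((x - center) / sigma) ** 2) for x in range(n)]

rendered = gaussian_1d(20.0)
target = gaussian_1d(40.0)

# Pixel-domain sum of squared errors.
pixel_sse = sum((r - t) ** 2 for r, t in zip(rendered, target))

# Same quantity computed over the full spectrum; dividing by n undoes
# the scaling of the unnormalized DFT.
R, T = dft(rendered), dft(target)
spectral_sse = sum(abs(a - b) ** 2 for a, b in zip(R, T)) / len(rendered)

print(pixel_sse, spectral_sse)
```

Because the two numbers agree, using every frequency at once cannot change the loss landscape; the benefit has to come from restricting which frequencies are active.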
Figure 2: Analysis of optimization landscapes under large initial displacement, demonstrating that pixel-based and high-frequency spectral losses admit local minima and flat gradients, while annealed spectral supervision maintains a global basin of attraction.
Principled Frequency Annealing
To exploit the convexity of low-frequency moments and the precision of high-frequency components without incurring phase-wrapping pitfalls, SpectralSplats dynamically anneals the active frequency bandwidth during optimization. The authors formally derive that for a translation vector d, safe inclusion of frequency ω requires |ωᵀd| < π, ensuring global attraction to the correct minimum.
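The condition can be illustrated on a simplified model. The sketch assumes a pure 2D translation and a unit-amplitude spectrum, in which case the loss contributed by one frequency is proportional to 1 − cos(ωᵀd). The functions `is_safe` and `per_frequency_grad`, the 256-pixel image width, and the chosen frequencies are all illustrative assumptions.

```python
import math

def phase(omega, d):
    """Phase shift omega^T d for a frequency (rad/pixel) and displacement (pixels)."""
    return omega[0] * d[0] + omega[1] * d[1]

def is_safe(omega, d):
    """Check the condition |omega^T d| < pi."""
    return abs(phase(omega, d)) < math.pi

def per_frequency_grad(omega, d):
    """Gradient with respect to d of the toy per-frequency loss 1 - cos(omega^T d)."""
    s = math.sin(phase(omega, d))
    return (omega[0] * s, omega[1] * s)

d = (100.0, 0.0)               # current misalignment in pixels
base = 2 * math.pi / 256       # lowest frequency of a 256-pixel-wide image
omega_low = (base, 0.0)        # phase is about 2.45 rad, inside (-pi, pi)
omega_high = (2 * base, 0.0)   # phase is about 4.91 rad, wrapped past pi

for omega in (omega_low, omega_high):
    g = per_frequency_grad(omega, d)
    # A positive dot product means a descent step (-g) shrinks the displacement.
    shrinks = g[0] * d[0] + g[1] * d[1] > 0
    print(is_safe(omega, d), shrinks)
```

In this toy model the safe frequency pulls the displacement toward zero, while the wrapped frequency pushes it away toward a spurious minimum one period off.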
As optimization reduces ∥d∥, higher frequencies can safely be added. The schedule for frequency annealing is analytically shown to be linear in logarithmic frequency space, matching earlier empirical insights from neural field literature, but now with rigorous justification. This process results in exponential contraction of spatial error, transitioning smoothly from global to local supervision as alignment is restored.
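One way to realize a schedule that is linear in log-frequency is to interpolate the logarithm of the cutoff between a starting and a final frequency. The sketch below is a guess at the general shape only: the function names, the step-based parameterization, and the endpoint values are assumptions, and the paper's exact schedule is not reproduced here.

```python
import math

def max_active_frequency(step, total_steps, omega_start, omega_end):
    """Cutoff whose logarithm interpolates linearly from omega_start to omega_end."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return math.exp((1.0 - t) * math.log(omega_start) + t * math.log(omega_end))

def active_frequencies(freqs, step, total_steps, omega_start, omega_end):
    """Frequencies at or below the current cutoff (small tolerance for float rounding)."""
    cutoff = max_active_frequency(step, total_steps, omega_start, omega_end)
    return [w for w in freqs if w <= cutoff * (1.0 + 1e-9)]

# Hypothetical octave-spaced frequency bank and a 1000-step annealing window.
freqs = [1, 2, 4, 8, 16, 32, 64]
for step in (0, 500, 1000):
    print(step, active_frequencies(freqs, step, 1000, omega_start=1.0, omega_end=64.0))
```

Under this parameterization the cutoff grows geometrically with the step count, which mirrors the exponential contraction of the spatial error: each time the residual displacement halves, the safe frequency bound doubles.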
Figure 3: 2D optimization demonstration: pixel MSE supervision (top) cannot escape initialization, whereas spectral supervision (bottom) drives convergence to the correct target via coherent global motion.
Experimental Evaluation
Synthetic and Real-World Datasets
The framework is benchmarked on two task distributions: synthetic dynamic sequences from SC4D, and the challenging, real-world GART Dog dataset comprising in-the-wild captures with significant misalignments and illumination mismatch.
Figure 4: Mean PSNR, SSIM, and LPIPS versus initial shift on GART and SC4D; pixel-only supervision rapidly worsens with larger shifts, while SpectralSplats maintains stable and superior results.
Across both MLP-based and sparse control point deformation models, SpectralSplats is implemented as a drop-in replacement for initial loss objectives. Upon achieving spatial proximity, training transitions to high-frequency spatial (or perceptual) objectives for final refinement. Metrics such as PSNR, SSIM, and LPIPS are evaluated as functions of controlled initial displacement.
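For reference, PSNR follows the standard definition 10·log10(MAX² / MSE). The sketch below assumes flat lists of intensities in [0, 1]; the function name `psnr` and the `max_value` parameter are illustrative, and SSIM and LPIPS are omitted because they require windowed statistics and a pretrained network, respectively.

```python
import math

def psnr(rendered, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized flat intensity lists."""
    mse = sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01, i.e. about 20 dB.
print(psnr([0.0, 0.0, 0.0], [0.1, 0.1, 0.1]))
```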
Robustness to Initialization and Comparative Results
Under increasing initialization shift, standard pixel- and LPIPS-based tracking fails catastrophically, with rendered assets either drifting outside the frame or stabilizing at incorrect structures. SpectralSplats consistently reconstructs target pose and structure for both training and novel views, with quantitative improvements in PSNR and perceptual metrics.
Figure 5: Qualitative SC4D results under moderate spatial shift: SpectralSplats recovers correct pose and fine structure, while pixel-only baselines often collapse or diverge from the target.
On GART, per-asset analysis over large spatial perturbations confirms that SpectralSplats provides significant improvements in PSNR, SSIM, and LPIPS, especially as the alignment deteriorates for baseline methods.
Figure 6: GART qualitative comparison under strong spatial misalignment: pixel-only optimization yields blur and misalignment, while SpectralSplats achieves sharper reconstruction and reliable pose recovery.
Ablations and Limitations
Ablation studies show that SpectralSplats matches or outperforms spatial supervision even in the well-initialized setting, confirming that its improved basin of attraction does not degrade performance in the conventional regime. The framework also demonstrates stability across various regularization choices and deformation parameterizations.
Implications and Future Work
By transitioning supervision from the spatial to the frequency domain, SpectralSplats makes 3DGS tracking initialization-agnostic, rendering global optimization tractable and removing dependence on manual alignment or strong class priors. This result expands the deployability of model-based differentiable tracking in unconstrained or rapidly varying acquisition settings, notably benefitting dynamic scene understanding, video-driven animation, and markerless motion capture.
The method's current scope is restricted to tracking on pre-initialized canonical assets. Extending frequency-based, global supervision to full dynamic scene reconstruction—i.e., joint optimization of canonical geometry and temporal deformation from raw videos—would provide a fully end-to-end paradigm free from initialization bottlenecks. Exploration of alternate moment bases for handling category-specific articulation and extreme nonrigidity also remains a promising research direction.
Figure 7: Illustration of the real-world GART setting, showing substantial pose, outline, and appearance differences between initial assets and supervision frames, highlighting the challenge addressed by SpectralSplats.
Conclusion
SpectralSplats presents a globally robust, model-agnostic supervisory scheme for 3DGS-based tracking. Its Spectral Moment loss with principled frequency annealing substantially widens the basin of attraction for differentiable tracking, overcoming the inherent limitations of local supervision and unlocking new possibilities for flexible, initialization-free vision systems (2603.24036).