SpectralSplats: Robust Differentiable Tracking via Spectral Moment Supervision

Published 25 Mar 2026 in cs.CV | (2603.24036v1)

Abstract: 3D Gaussian Splatting (3DGS) enables real-time, photorealistic novel view synthesis, making it a highly attractive representation for model-based video tracking. However, leveraging the differentiability of the 3DGS renderer "in the wild" remains notoriously fragile. A fundamental bottleneck lies in the compact, local support of the Gaussian primitives. Standard photometric objectives implicitly rely on spatial overlap; if severe camera misalignment places the rendered object outside the target's local footprint, gradients strictly vanish, leaving the optimizer stranded. We introduce SpectralSplats, a robust tracking framework that resolves this "vanishing gradient" problem by shifting the optimization objective from the spatial to the frequency domain. By supervising the rendered image via a set of global complex sinusoidal features (Spectral Moments), we construct a global basin of attraction, ensuring that a valid, directional gradient toward the target exists across the entire image domain, even when pixel overlap is completely nonexistent. To harness this global basin without introducing periodic local minima associated with high frequencies, we derive a principled Frequency Annealing schedule from first principles, gracefully transitioning the optimizer from global convexity to precise spatial alignment. We demonstrate that SpectralSplats acts as a seamless, drop-in replacement for spatial losses across diverse deformation parameterizations (from MLPs to sparse control points), successfully recovering complex deformations even from severely misaligned initializations where standard appearance-based tracking catastrophically fails.


Summary

The paper introduces spectral moment supervision to overcome vanishing gradients in 3D Gaussian Splatting, ensuring robust tracking from zero-overlap initializations.
It shifts optimization to the frequency domain, where low-frequency spectral moments provide a global basin of attraction, and uses a principled frequency annealing schedule to avoid the periodic local minima of high frequencies.
Experimental results show improved PSNR, SSIM, and LPIPS metrics over traditional pixel-based losses on both synthetic and real-world datasets.

Introduction and Motivation

3D Gaussian Splatting (3DGS) has established itself as a potent framework for real-time, photorealistic novel view synthesis and is increasingly adopted in model-based video tracking scenarios. Despite its differentiable rendering property, deploying 3DGS for robust, analysis-by-synthesis tracking remains precarious due to the local support characteristics of Gaussian primitives. Conventional spatial losses (e.g., pixel-wise $L_2$ or deep-feature-based objectives like LPIPS) require spatial overlap between rendered and target structures for effective gradient propagation. In the absence of overlap—for instance, under substantial initialization misalignment—the spatial objective's gradient vanishes, stranding the optimizer and leading to degenerate solutions.

The paper introduces SpectralSplats, a framework that circumvents the intrinsic locality of photometric losses by transitioning optimization to the frequency domain. Through Spectral Moment supervision, the authors address the vanishing gradient pathology, establishing a global basin of attraction and enabling successful model-based tracking even from zero-overlap initializations.

Figure 1: SpectralSplats enables robust tracking from zero-overlap initializations by shifting supervision from pixels to frequency-based moments, generating global gradient signals where spatial losses fail.

Methodology

The Vanishing Gradient Problem in 3D Gaussian Splatting

The differentiable 3DGS tracking pipeline operates by minimizing the photometric distance between rendered assets, parameterized by transformations $\Theta$, and target observations. Due to the compact and local support of the Gaussian splats, the gradients with respect to parameters $\Theta$ strictly vanish when there is no spatial overlap. The mathematical decomposition of the spatial photometric loss reveals that the “self-term” is zero for in-plane translations, and the “target supervision” term vanishes globally under disjoint support, fundamentally precluding optimization progress. Standard workarounds, such as reliance on strong priors or manual initializations, break down for class-agnostic, in-the-wild applications.
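
The vanishing gradient can be reproduced in a toy setting. The sketch below is not the paper's pipeline: it assumes a single 1D Gaussian "splat" whose only parameter is its centre, and the names `render` and `l2_loss_grad` are hypothetical helpers introduced for illustration. It evaluates the analytic gradient of a pixel-wise $L_2$ loss once with overlap and once without.

```python
import numpy as np

def render(center, x, sigma=1.0):
    """Render a 1D Gaussian 'splat' centred at `center` on the pixel grid `x`."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

def l2_loss_grad(center, target_center, x, sigma=1.0):
    """Analytic gradient of sum((render - target)^2) with respect to `center`."""
    r = render(center, x, sigma)
    t = render(target_center, x, sigma)
    dr_dc = r * (x - center) / sigma**2   # derivative of the render w.r.t. its centre
    return float(np.sum(2.0 * (r - t) * dr_dc))

x = np.linspace(-50.0, 50.0, 2001)
near = l2_loss_grad(1.0, 0.0, x)    # splat overlaps the target
far = l2_loss_grad(30.0, 0.0, x)    # splat and target are effectively disjoint
print(f"overlapping: {near:.3e}, disjoint: {far:.3e}")
```

With overlap the gradient is clearly non-zero and points away from the target, so descent moves the splat toward it. At a 30-pixel offset the self-term cancels by symmetry and the cross-term is numerically zero, so the optimizer receives no usable signal.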

Spectral Moment Supervision: Replacing Locality with Globality

To bypass the spatial locality trap, SpectralSplats redefines the objective as a moment-matching task in the frequency domain. Moments (both polynomial and spectral) naturally integrate image content against global fields. In particular, by projecting both rendered and target images onto a set of global complex sinusoids—Fourier basis functions—the optimizer receives valid gradients everywhere, regardless of overlap. Under translations, displacements correspond to phase shifts in spectral features, yielding strong, non-vanishing gradients.
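
A minimal sketch of this idea, again on an assumed 1D Gaussian toy rather than the authors' renderer, projects both images onto a few low-frequency complex sinusoids and differentiates the squared moment difference. The helper names, the chosen frequencies, and the finite-difference gradient are all illustrative assumptions.

```python
import numpy as np

def render(center, x, sigma=1.0):
    """Render a 1D Gaussian 'splat' centred at `center` on the pixel grid `x`."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

def spectral_moments(image, x, omegas):
    """Project a 1D image onto the global complex sinusoids exp(-i * omega * x)."""
    basis = np.exp(-1j * np.outer(omegas, x))  # shape: (num_freqs, num_pixels)
    return basis @ image

def spectral_loss(center, target_center, x, omegas):
    """Sum of squared moment differences between the rendered and target images."""
    diff = (spectral_moments(render(center, x), x, omegas)
            - spectral_moments(render(target_center, x), x, omegas))
    return float(np.sum(np.abs(diff) ** 2))

def spectral_grad(center, target_center, x, omegas, eps=1e-4):
    """Central finite-difference gradient of the spectral loss w.r.t. `center`."""
    hi = spectral_loss(center + eps, target_center, x, omegas)
    lo = spectral_loss(center - eps, target_center, x, omegas)
    return (hi - lo) / (2.0 * eps)

x = np.linspace(-50.0, 50.0, 2001)
omegas = np.array([0.02, 0.05])                    # low frequencies: |omega * 30| < pi
grad_right = spectral_grad(30.0, 0.0, x, omegas)   # splat far to the right of the target
grad_left = spectral_grad(-30.0, 0.0, x, omegas)   # splat far to the left of the target
print(grad_right, grad_left)
```

Even with no pixel overlap, the gradient is positive when the splat sits to the right of the target and negative when it sits to the left, so a descent step moves the splat toward the target in both cases.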

This duality is mathematically rooted in Parseval’s theorem, which implies that a loss over the full spectral basis is equivalent to pixel-wise $L_2$ up to a change of basis. Consequently, naively activating all frequencies at once reconstructs the original loss landscape and reinstates local minima (e.g., phase-wrapping), especially for high frequencies.
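
The equivalence is easy to check numerically. The snippet below is an illustrative check on random arrays standing in for rendered and target images (the array sizes and seed are arbitrary assumptions); it uses NumPy's orthonormal FFT normalization so that the transform is unitary.

```python
import numpy as np

rng = np.random.default_rng(0)
rendered = rng.random((32, 32))   # stand-in for a rendered image
target = rng.random((32, 32))     # stand-in for a target image

# Pixel-wise L2 loss.
pixel_l2 = np.sum((rendered - target) ** 2)

# The same residual measured over the full, orthonormal Fourier basis.
spectrum_diff = np.fft.fft2(rendered - target, norm="ortho")
spectral_l2 = np.sum(np.abs(spectrum_diff) ** 2)

print(np.isclose(pixel_l2, spectral_l2))
```

Because the two quantities agree, any benefit of spectral supervision has to come from which frequencies are active, not from the change of basis itself.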

Figure 2: Analysis of optimization landscapes under large initial displacement, demonstrating that pixel-based and high-frequency spectral losses admit local minima and flat gradients, while annealed spectral supervision maintains a global basin of attraction.

Principled Frequency Annealing

To exploit the convexity of low-frequency moments and the precision of high-frequency components without incurring phase-wrapping pitfalls, SpectralSplats dynamically anneals the active frequency bandwidth during optimization. The authors formally derive that for a translation vector $\mathbf{d}$, safe inclusion of frequency $\omega$ requires $|\omega^T \mathbf{d}| < \pi$, ensuring global attraction to the correct minimum.
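
A short worked derivation may help show where this bound comes from. It is a sketch under simplifying assumptions (pure translation, a single frequency, and the moment convention $M(\omega) = \int I(\mathbf{x})\, e^{-i \omega^T \mathbf{x}}\, d\mathbf{x}$, which may differ from the paper's), not a reproduction of the authors' proof. If the rendered image is the target shifted by $\mathbf{d}$, i.e. $I_r(\mathbf{x}) = I_t(\mathbf{x} - \mathbf{d})$, then

$$M_r(\omega) = e^{-i \omega^T \mathbf{d}}\, M_t(\omega)$$

$$L_\omega(\mathbf{d}) = |M_r(\omega) - M_t(\omega)|^2 = 2\, |M_t(\omega)|^2 \left(1 - \cos(\omega^T \mathbf{d})\right)$$

$$\nabla_{\mathbf{d}} L_\omega = 2\, |M_t(\omega)|^2 \sin(\omega^T \mathbf{d})\, \omega$$

Since $1 - \cos\theta$ is strictly increasing in $|\theta|$ on $(-\pi, \pi)$, the only minimum of $L_\omega$ within the band $|\omega^T \mathbf{d}| < \pi$ is at $\omega^T \mathbf{d} = 0$. Outside that band the loss decreases toward the nearest non-zero multiple of $2\pi$, which is a phase-wrapped local minimum. By the Cauchy-Schwarz inequality, $|\omega^T \mathbf{d}| \le \|\omega\| \|\mathbf{d}\|$, so every frequency with $\|\omega\| < \pi / \|\mathbf{d}\|$ satisfies the condition.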

As optimization reduces $\|\mathbf{d}\|$, higher frequencies can safely be added. The schedule for frequency annealing is analytically shown to be linear in logarithmic frequency space, matching earlier empirical insights from neural field literature, but now with rigorous justification. This process results in exponential contraction of spatial error, transitioning smoothly from global to local supervision as alignment is restored.
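
A schedule that is linear in log-frequency can be sketched as a geometric interpolation of the cutoff between a lowest and a highest frequency. The function names, the step-indexed form, and the example values of `omega_min`, `omega_max`, and `total_steps` below are assumptions made for illustration; the paper's actual constants and parameterization are not given in this summary.

```python
import numpy as np

def active_bandwidth(step, total_steps, omega_min, omega_max):
    """Frequency cutoff that grows linearly in log-frequency with the step index."""
    t = step / max(total_steps - 1, 1)             # progress in [0, 1]
    return omega_min * (omega_max / omega_min) ** t

def active_mask(omegas, step, total_steps, omega_min, omega_max):
    """Boolean mask over 2D frequency vectors whose norm is within the cutoff."""
    cutoff = active_bandwidth(step, total_steps, omega_min, omega_max)
    norms = np.linalg.norm(omegas, axis=-1)
    # Small relative tolerance so rounding does not drop a frequency on the boundary.
    return norms <= cutoff * (1.0 + 1e-12)

# Hypothetical frequency set: one low, one mid, one high frequency vector.
omegas = np.array([[0.05, 0.0], [0.2, 0.2], [2.0, 2.0]])
for step in (0, 3, 6):
    print(step, active_bandwidth(step, 7, 0.05, 3.2), active_mask(omegas, step, 7, 0.05, 3.2))
```

Only the lowest frequency is active at the first step, the mid frequency joins halfway through, and all three are active at the end. In a tracker, the mask would select which moment differences contribute to the loss at each iteration.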

Figure 3: 2D optimization demonstration: pixel MSE supervision (top) cannot escape initialization, whereas spectral supervision (bottom) drives convergence to the correct target via coherent global motion.

Experimental Evaluation

Synthetic and Real-World Datasets

The framework is benchmarked on two task distributions: synthetic dynamic sequences from SC4D, and the challenging, real-world GART Dog dataset comprising in-the-wild captures with significant misalignments and illumination mismatch.

Figure 4: Mean PSNR, SSIM, and LPIPS versus initial shift on GART and SC4D; pixel-only supervision rapidly worsens with larger shifts, while SpectralSplats maintains stable and superior results.

Across both MLP-based and sparse control point deformation models, SpectralSplats is implemented as a drop-in replacement for initial loss objectives. Upon achieving spatial proximity, training transitions to high-frequency spatial (or perceptual) objectives for final refinement. Metrics such as PSNR, SSIM, and LPIPS are evaluated as functions of controlled initial displacement.
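
Of the three metrics, PSNR is simple enough to state in a few lines. The sketch below uses the standard definition and assumes images scaled to $[0, 1]$; it is not taken from the authors' evaluation code, and the `psnr` helper and example arrays are illustrative.

```python
import numpy as np

def psnr(rendered, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_value]."""
    mse = np.mean((rendered - target) ** 2)
    if mse == 0.0:
        return float("inf")          # identical images
    return float(10.0 * np.log10(max_value**2 / mse))

target = np.zeros((8, 8))
rendered = np.full((8, 8), 0.1)      # uniform error of 0.1, so the MSE is 0.01
print(psnr(rendered, target))
```

A uniform error of 0.1 on a unit-range image gives an MSE of 0.01 and therefore a PSNR of about 20 dB.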

Robustness to Initialization and Comparative Results

Under increasing initialization shift, standard pixel-based and LPIPS-based tracking fails catastrophically, with rendered assets either drifting outside the frame or stabilizing at incorrect structures. SpectralSplats consistently reconstructs target pose and structure for both training and novel views, with quantitative improvements in PSNR and perceptual metrics.

Figure 5: Qualitative SC4D results under moderate spatial shift: SpectralSplats recovers correct pose and fine structure, while pixel-only baselines often collapse or diverge from the target.

On GART, per-asset analysis over large spatial perturbations confirms that SpectralSplats provides significant improvements in PSNR, SSIM, and LPIPS, especially as the alignment deteriorates for baseline methods.

Figure 6: GART qualitative comparison under strong spatial misalignment: pixel-only optimization yields blur and misalignment, while SpectralSplats achieves sharper reconstruction and reliable pose recovery.

Ablations and Limitations

Ablative studies show that SpectralSplats matches or outperforms spatial supervision even in the well-initialized setting, confirming that its improved basin of attraction does not degrade performance in the conventional regime. The framework also demonstrates stability across various regularization choices and deformation parameterizations.

Implications and Future Work

By transitioning supervision from the spatial to the frequency domain, SpectralSplats makes 3DGS tracking initialization-agnostic, rendering global optimization tractable and removing dependence on manual alignment or strong class priors. This result expands the deployability of model-based differentiable tracking in unconstrained or rapidly varying acquisition settings, notably benefitting dynamic scene understanding, video-driven animation, and markerless motion capture.

The method's current scope is restricted to tracking on pre-initialized canonical assets. Extending frequency-based, global supervision to full dynamic scene reconstruction—i.e., joint optimization of canonical geometry and temporal deformation from raw videos—would provide a fully end-to-end paradigm free from initialization bottlenecks. Exploration of alternate moment bases for handling category-specific articulation and extreme nonrigidity also remains a promising research direction.

Figure 7: Illustration of the real-world GART setting, showing substantial pose, outline, and appearance differences between initial assets and supervision frames, highlighting the challenge addressed by SpectralSplats.

Conclusion

SpectralSplats presents a globally robust, model-agnostic supervisory scheme for 3DGS-based tracking. Its Spectral Moment loss with principled frequency annealing substantially widens the basin of attraction for differentiable tracking, overcoming the inherent limitations of local supervision and unlocking new possibilities for flexible, initialization-free vision systems (2603.24036).
