Spectral Decomposition of Gradient Trajectories

Updated 16 May 2026

Spectral decomposition of gradient trajectories is a method that represents gradient flows via dominant eigenstructures, offering clear insights into optimization dynamics.
It integrates classical linear eigenspace analysis with nonlinear eigenfunctions and basis function decompositions to capture convergence and alignment properties across various models.
The framework informs MoE architectures by decoupling shared and unique gradient components, enhancing expert specialization and overall model performance.

Spectral decomposition of gradient trajectories, broadly construed, denotes the process of expressing the evolution of parameters or states under (continuous or discrete) gradient flows in terms of dominant directions or modes, often associated with eigenstructures of (possibly nonlinear or data-dependent) operators. This framework is fundamental for understanding optimization dynamics, generalization, expressivity, and specialization across diverse problem domains, including deep learning, inverse problems, variational image processing, and large-scale conditional computation such as Mixture-of-Experts (MoE) models. Modern developments have extended classical linear decompositions to nonlinear flows, scale-space representations, and task-aligned functional bases, together with rigorous analysis and algorithms for extracting or enforcing such structure during optimization.

1. Mathematical Foundations of Spectral Decomposition in Gradient Flows

The classical theory starts from linear gradient flows, where the dynamics can be fully described in the eigenspaces of the quadratic form (i.e., the Hessian of $f$), with each coordinate decaying independently at rates set by the corresponding eigenvalues. The trajectory $x(t)$ in a neighborhood of a nondegenerate critical point $x^*$ can be written as

$x(t)-x^* = \sum_{i=1}^d \alpha_i(t) e_i$

with $e_i$ orthonormal eigenvectors of $\nabla^2 f(x^*)$, $\lambda_i$ the corresponding eigenvalues, and $\dot\alpha_i(t) = -\lambda_i \alpha_i(t) + O(\|\alpha(t)\|^2)$, so that, to leading order, spectral decomposition reduces gradient evolution to coordinatewise exponential decay governed by the spectrum of the Hessian (Bégout et al., 13 Apr 2026).
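As a minimal numerical sketch of this decoupling, consider gradient descent on an exactly quadratic objective, where the $O(\|\alpha\|^2)$ remainder vanishes. The matrix `H`, step size `eta`, and starting point below are illustrative assumptions, not taken from the cited work. Each Hessian-eigenbasis coefficient then contracts by the factor $1-\eta\lambda_i$ per step, the discrete analogue of $\dot\alpha_i = -\lambda_i\alpha_i$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quadratic f(x) = 0.5 * (x - x_star)^T H (x - x_star) with a known spectrum.
lam = np.array([0.5, 2.0, 5.0])                    # Hessian eigenvalues
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # orthonormal eigenvectors (columns)
H = Q @ np.diag(lam) @ Q.T
x_star = np.array([1.0, -2.0, 0.5])

eta, steps = 0.1, 50
x = x_star + rng.standard_normal(3)
alpha0 = Q.T @ (x - x_star)                        # initial spectral coordinates

for _ in range(steps):
    x = x - eta * H @ (x - x_star)                 # gradient step on the quadratic

alpha = Q.T @ (x - x_star)                         # spectral coordinates after GD
predicted = (1.0 - eta * lam) ** steps * alpha0    # independent per-mode decay
max_err = np.max(np.abs(alpha - predicted))
print(max_err)
```

The printed discrepancy should be at the level of floating-point round-off, which is what the coordinatewise description predicts when the objective is purely quadratic.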

For nonlinear, possibly non-smooth, or data-dependent flows, the notion is generalized via:

Basis function decomposition: For gradient descent in function space, one considers expansions $f_\theta(x) = \sum_i \beta_i(\theta) \phi_i(x)$ over an orthonormal basis $\{\phi_i\}$ (e.g., conjugate kernel eigenvectors in DNNs), with the gradient dynamics of $\beta_i$ analyzed to study their possibly monotonic, decoupled evolution and convergence (Ma et al., 2022).
Nonlinear eigenfunctions: In flows generated by absolutely one-homogeneous convex functionals $J$, the decomposition involves nonlinear eigenvectors $u$ satisfying $\lambda u \in \partial J(u)$, with the time evolution of the minimal-norm subgradients $p(t) \in \partial J(u(t))$ producing a nonlinear spectral measure (Bungert et al., 2019).

Thus, spectral decomposition in the context of gradient trajectories is the reduction of the flow, evolution, or parameter update trajectory to its dominant modes, as defined by either linear, data-dependent, or nonlinear operator-induced eigenstructures.
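The nonlinear case can be illustrated with a toy example that is an assumption of this sketch rather than an experiment from the cited papers: the functional $J(u) = \|u\|_1$ on $\mathbb{R}^n$, which is convex and absolutely one-homogeneous. Its gradient flow started at a datum $f$ is soft-thresholding, so the minimal-norm subgradient $p(t) = -\partial_t u(t)$ is piecewise constant. Every piece satisfies the eigenvector relation $\lambda p \in \partial J(p)$, and integrating $p(t)$ up to the extinction time reproduces $f$. The helper names `l1_flow_subgradients` and `is_l1_eigenvector` are hypothetical.

```python
import numpy as np

def l1_flow_subgradients(f):
    """Gradient flow of J(u) = ||u||_1 started at f.

    The exact solution is soft-thresholding, u_i(t) = sign(f_i) * max(|f_i| - t, 0),
    so the minimal-norm subgradient p(t) = -du/dt is piecewise constant in t.
    Returns a list of (duration, p) pairs, one per constant piece.
    """
    f = np.asarray(f, dtype=float)
    breakpoints = np.unique(np.abs(f[f != 0]))     # sorted times at which a coordinate dies
    pieces, t_prev = [], 0.0
    for t in breakpoints:
        active = np.abs(f) > t_prev                # coordinates still nonzero on [t_prev, t)
        pieces.append((t - t_prev, np.sign(f) * active))
        t_prev = t
    return pieces

def is_l1_eigenvector(p, tol=1e-12):
    """Check lam * p in dJ(p) for J = ||.||_1, using lam = J(p) / ||p||_2^2."""
    lam = np.abs(p).sum() / (p @ p)
    q = lam * p
    on_support = np.allclose(q[p != 0], np.sign(p[p != 0]), atol=tol)
    off_support = np.all(np.abs(q[p == 0]) <= 1 + tol)
    return bool(on_support and off_support)

f = np.array([3.0, -1.0, 2.0])
pieces = l1_flow_subgradients(f)
reconstruction = sum(dt * p for dt, p in pieces)   # f = integral of p(t) dt up to extinction
extinction_profile = pieces[-1][1]                 # last subgradient before u(t) reaches zero
print(reconstruction, extinction_profile)
```

In this polyhedral setting the decomposition is exact, and the pieces are generally not mutually orthogonal, which is the structural difference from the linear case.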

2. Spectral Decomposition in Mixture-of-Experts: SD-MoE Analysis

Recent investigations into Mixture-of-Experts (MoE) architectures for LLMs, particularly "SD-MoE: Spectral Decomposition for Effective Expert Specialization" (Huang et al., 13 Feb 2026), highlight the role of spectral decomposition in both the parameters and gradient trajectories of experts. Core findings include:

Gradient structure: For expert $i$ with weight matrix $W_i$, the gradient over a mini-batch routed to $i$ is $G_i = \nabla_{W_i}\mathcal{L}$. Singular value decomposition (SVD) of $G_i$ reveals a highly low-rank structure, with the top 1% of singular values capturing the majority of energy.
Cross-expert alignment: The dominant gradient subspaces $V_i$ (top-$k$ right singular vectors for each expert) exhibit high alignment across experts in the "head" directions, while the "tail" directions are nearly orthogonal.
Source of low-rank structure: This arises from a persistent low-rank subspace $S$ in the activations $X$; consequently, a shared gradient component $G^{\mathrm{shared}}$ dominates all experts, impairing specialization.
Gating mechanism: Router weights also align with these dominant directions, compounding the lack of expert diversification.

The SD-MoE approach addresses this by explicitly decomposing both parameter and gradient spaces into shared (dominated by $S$) and unique (orthogonal complement) subspaces. Parameters are initialized and updated separately in these subspaces, enforcing specialization, lowering cross-expert alignment, and producing empirical gains in both accuracy and optimization stability.
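The shared/unique split can be sketched on synthetic data. The following is an illustration under stated assumptions, not the SD-MoE implementation. Experts are taken to be single linear layers, their activations are simulated with one strong common direction `s`, the backpropagated errors are random stand-ins, and the shared subspace is estimated from the stacked gradients. The function names `expert_gradient`, `top_right_vectors`, and `alignment` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, n_tokens, d_in, d_out, k = 4, 200, 32, 16, 1

# Assumption: every expert sees activations with one strong shared direction s.
s = rng.standard_normal(d_in)
s /= np.linalg.norm(s)

def expert_gradient():
    """Gradient of a linear expert y = W x, i.e. G = delta^T X / n (shape d_out x d_in)."""
    X = 30.0 * rng.standard_normal((n_tokens, 1)) * s + rng.standard_normal((n_tokens, d_in))
    delta = rng.standard_normal((n_tokens, d_out))   # stand-in for backpropagated errors
    return delta.T @ X / n_tokens

def top_right_vectors(G, k):
    """Top-k right singular vectors of G as columns (d_in x k)."""
    return np.linalg.svd(G, full_matrices=False)[2][:k].T

def alignment(A, B):
    """Largest cosine of a principal angle between the column spaces of A and B."""
    return np.linalg.svd(A.T @ B, compute_uv=False)[0]

grads = [expert_gradient() for _ in range(n_experts)]

# Shared subspace estimated from the stacked gradients of all experts.
S = top_right_vectors(np.vstack(grads), k)
shared = [G @ S @ S.T for G in grads]               # component inside the shared subspace
unique = [G - Gs for G, Gs in zip(grads, shared)]   # component in its orthogonal complement

raw_align = alignment(top_right_vectors(grads[0], k), top_right_vectors(grads[1], k))
unique_align = alignment(top_right_vectors(unique[0], k), top_right_vectors(unique[1], k))
print(f"alignment before: {raw_align:.3f}, after removing shared part: {unique_align:.3f}")
```

In this synthetic setting the leading gradient directions of two experts nearly coincide before the projection and become far less aligned afterwards, mirroring the head/tail picture described above.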

3. Spectral Decomposition Frameworks: Nonlinear and Basis Function Approaches

Spectral decompositions for gradient flows have been extended into nonlinear and function-space domains via several rigorous frameworks:

Nonlinear spectral flows based on one-homogeneous functionals: The theory developed in (Bungert et al., 2019) and (Bungert et al., 2019) shows that, for $J$ convex, absolutely one-homogeneous and under mild conditions, gradient flows $\partial_t u(t) = -p(t)$, $p(t) \in \partial J(u(t))$, $u(0) = f$ decompose any datum into a (possibly discrete) set of nonlinear eigenfunctions. The flow's time-derivative $p(t) = -\partial_t u(t)$ converges, as $t \to T$ (where $T$ denotes the extinction time of the flow), to an extinction profile $p^*$ solving $\lambda p^* \in \partial J(p^*)$. Iterated subtraction of such extinction profiles yields a sparse nonlinear "Fourier" expansion:

$f = \sum_{k} c_k\, p_k^*, \qquad \lambda_k p_k^* \in \partial J(p_k^*)$

A key structural difference is the frequent absence of orthogonality and subspace closure, yet exact decompositions are provable for 1D total variation, polyhedral norms, and graph 1-Laplacian (Bungert et al., 2019, Bungert et al., 2019).

Basis function decomposition for gradient descent: In (Ma et al., 2022), the solution trajectory of GD on several tasks is projected onto a task-adapted orthonormal basis (e.g., eigenvectors of the learned conjugate kernel). Projected coefficients evolve nearly monotonically, and the bulk of learning progress is captured by the leading basis directions. Theoretical analysis covers both convex and nonconvex scenarios (symmetric matrix factorization, symmetric tensor decomposition), and empirical work demonstrates the incremental learning of dominant modes in deep nets.
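The incremental learning of dominant modes can be reproduced in a small experiment on symmetric matrix factorization. This is a sketch under assumptions chosen here (target spectrum `lam`, initialization scale, step size `eta`), not a reproduction of the cited results. Gradient descent on $\tfrac14\|UU^\top - M\|_F^2$ from a small random start is tracked through the coefficients $\beta_i = e_i^\top UU^\top e_i$ in the eigenbasis of $M$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
lam = np.array([4.0, 2.0, 1.0, 0.0, 0.0])          # spectrum of the target matrix M
E, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthonormal basis (columns e_i)
M = E @ np.diag(lam) @ E.T

U = 1e-3 * rng.standard_normal((d, d))             # small random initialization
eta, steps = 0.01, 3000
coeffs = np.empty((steps + 1, d))

for k in range(steps + 1):
    # beta_i = e_i^T (U U^T) e_i: projection of the iterate onto basis direction e_i
    coeffs[k] = np.sum((E.T @ U) ** 2, axis=1)
    U = U - eta * (U @ U.T - M) @ U                # GD step on 0.25 * ||U U^T - M||_F^2

# First step at which each of the three nonzero modes reaches half of its target value.
half_time = [int(np.argmax(coeffs[:, i] >= 0.5 * lam[i])) for i in range(3)]
final_error = np.linalg.norm(U @ U.T - M)
print(half_time, final_error)
```

With a small initialization, the mode with the largest eigenvalue is picked up first and the remaining modes follow in order of decreasing eigenvalue, so `half_time` is increasing.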

4. Local and Global Behavior: Alignment, Spectral Gaps, and Nonlinearity

Fine-grained analysis relates the spectral decomposition of gradient trajectories to convergence rates and alignment phenomena:

Spectral coordinates near critical points: In the local vicinity of $x^*$, expressing $x(t)-x^*$ in the Hessian eigenbasis provides a coordinatewise ODE $\dot\alpha_i(t) = -\lambda_i \alpha_i(t) + O(\|\alpha(t)\|^2)$, enforcing exponential contraction at rates set by the spectrum (Bégout et al., 13 Apr 2026). In the discrete case, step size and nonlinearity can degrade the rate determined by the spectral gap, often reducing convergence to that governed by the smallest eigenvalue $\lambda_{\min}$.
Directional selection and volume concentration: Almost all trajectories, when projected in spectral coordinates, asymptotically align with the slowest (most weakly contracting) direction. Volume arguments (concentration in valleys/talwegs) quantify the dominance of these directions in high dimensions (Bégout et al., 13 Apr 2026).

In strongly nonlinear regimes, "pure" mode separation can break, but the geometric and variational framework still supports a spectral (though possibly less orthogonal) description for the late-time/steady-state behavior.
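Directional selection is easy to observe in the linear regime. The sketch below uses an assumed quadratic with eigenvalues `lam` and an assumed step size `eta`. It runs gradient descent from several random starting points and measures the cosine between the final displacement $x - x^*$ and the slowest eigenvector, together with the observed per-step contraction.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = np.array([1.0, 3.0, 10.0])                   # Hessian eigenvalues, slowest first
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # eigenvectors as columns
H = Q @ np.diag(lam) @ Q.T
e_slow = Q[:, 0]                                   # most weakly contracting direction

eta, steps = 0.05, 200
cosines = []
for trial in range(20):                            # several random starting points
    z = rng.standard_normal(3)                     # z plays the role of x - x_star
    for _ in range(steps):
        z = z - eta * H @ z                        # gradient step on 0.5 * z^T H z
    cosines.append(abs(z @ e_slow) / np.linalg.norm(z))

min_cosine = min(cosines)
# Observed late-time contraction of the last trajectory vs. the slowest-mode factor.
rate = np.linalg.norm(z - eta * H @ z) / np.linalg.norm(z)
print(min_cosine, rate, 1 - eta * lam[0])
```

All sampled trajectories end up aligned with `e_slow`, and the late-time contraction factor matches $1-\eta\lambda_{\min}$, consistent with convergence being governed by the smallest eigenvalue.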

5. Algorithmic Extraction and Regularization via Spectral Decomposition

Spectral decomposition is operationalized in a variety of algorithmic frameworks:

Dynamic Mode Decomposition (DMD)/OrthoNS: Modes of nonlinear homogeneous flows can be extracted via time-series analysis (SVD-fitting), adaptive time sampling, and (if the operator is homogeneous) matching of extracted components to nonlinear eigenfunctions (Orthogonal Nonlinear Spectral decomposition, OrthoNS) (Cohen et al., 2020). Under proper sampling, DMD can recover the leading modes with zero error for flows generated by homogeneous operators, extending the reach of classical linear spectral methods to nonlinear PDEs, image gradient flows, and structured data.
Spectral decoupling in MoE: As in SD-MoE (Huang et al., 13 Feb 2026), explicit SVD-based decomposition (parameter initialization and per-gradient update) regulates the capacity allocation among experts, reduces cross-expert interference, and improves specialization and downstream generalization.
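For orientation, the linear special case of mode extraction can be written in a few lines. The sketch below applies plain exact DMD to uniformly sampled snapshots of a linear gradient flow $\dot x = -Hx$. It does not implement OrthoNS or the adaptive time sampling used for nonlinear homogeneous flows. The spectrum `lam`, sampling step `dt`, and snapshot count are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = np.array([0.5, 1.0, 2.0])                    # decay rates of the underlying flow
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # eigenvectors of H as columns
dt, n_snap = 0.2, 12

# Snapshots of the linear flow x'(t) = -H x(t), H = Q diag(lam) Q^T, sampled every dt.
x0 = Q @ np.ones(3)                                # excite every mode equally
times = dt * np.arange(n_snap)
snapshots = Q @ (np.exp(-np.outer(lam, times)) * (Q.T @ x0)[:, None])

# Exact DMD: fit the linear map A with X2 = A X1 through the SVD of X1.
X1, X2 = snapshots[:, :-1], snapshots[:, 1:]
Uo, sv, Vh = np.linalg.svd(X1, full_matrices=False)
A_tilde = Uo.T @ X2 @ Vh.T @ np.diag(1.0 / sv)
mu, W = np.linalg.eig(A_tilde)                     # DMD eigenvalues and reduced eigenvectors

order = np.argsort(-mu.real)                       # slowest-decaying mode first
rates = -np.log(mu.real[order]) / dt               # continuous-time decay rates
modes = (Uo @ W.real)[:, order]                    # DMD modes in the original coordinates
print(rates)
```

Because the snapshots come from an exactly linear flow with all modes excited, the recovered rates and modes agree with `lam` and the columns of `Q` up to round-off and sign.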

The following table summarizes several contexts and their principal spectral decomposition approaches:

Context/Model	Decomposition Basis	Spectral Modes
Linear GD (Hessian)	Quadratic eigenspace	Eigenvectors/eigenvalues
Deep Nets (post-Kernel)	Kernel eigenvectors (empirical)	Basis projections
Nonlinear/TV flows	Nonlinear eigenfunctions of the functional $J$	Extinction profiles
MoE (SD-MoE)	SVD of gradient/parameter matrices	Shared vs. unique subspaces

6. Implications, Applications, and Empirical Consequences

Spectral decomposition of gradient trajectories impacts both theoretical understanding and practical performance:

In MoE architectures, failure to decouple dominant gradient directions leads to unused capacity, shared specialized subspaces, and suboptimal routing. Enforcing explicit subspace decomposition improves accuracy (3% average gain across benchmarks), learning-rate stability (4x higher tolerable LR), and effective learning speed (Huang et al., 13 Feb 2026).
In DNNs, the incremental, monotonic excitation of principal axis coefficients explains phase transitions in optimization, alignment, and expressivity, and links to double descent and generalization behavior (Ma et al., 2022).
For variational problems (e.g., imaging), $L^1$-based nonlinear spectral decompositions yield contrast-invariant, size-structured decompositions with applications to segmentation and multi-scale feature extraction (Zeune et al., 2017).
In graph-based clustering, extinction profile-based decompositions enable direct extraction of combinatorial indicators of communities, generalizing classical spectral clustering to nonlinear structures (Bungert et al., 2019).

A plausible implication is that interpretability, robustness, and hardware utilization in large models may systematically benefit from spectral analysis and control of optimization trajectories in both parameter and gradient spaces. In domains with highly structured data (e.g., language, vision), universality of low-rank phenomena leads to strong cross-expert alignment, demanding explicit spectral regulation for full model effectiveness.

7. Connections, Challenges, and Future Directions

The spectral decomposition of gradient trajectories unifies linear, nonlinear, continuous, and discrete optimization analysis, but notable challenges remain:

Ensuring efficient extraction and enforcement of spectral decompositions under realistic computational constraints (as in SD-MoE's ≈5% overhead).
Extending basis function decompositions to settings without explicit eigenstructures, e.g., highly nonstationary or data-augmented learning protocols.
Characterizing the full spectrum and its dynamics in stochastic, nonconvex, or adversarial training regimes.
Establishing rigorous equivalences between nonlinear flows, variational regularization, and inverse scale spaces in more complex, high-dimensional function spaces (Bungert et al., 2019).

Continued analysis, both theoretical and empirical, promises deeper insight into optimization geometry, model specialization, and the feasibility and optimality of modular architectures operating under massive data scale and nonconvexity.