Spectral Policy Optimization (SPO)

Updated 3 September 2025
  • Spectral Policy Optimization (SPO) is a class of reinforcement learning algorithms that uses spectral methods, including tensor and matrix decompositions, optimal transport, and spectral risk measures.
  • SPO algorithms improve latent variable estimation in POMDPs, enable efficient off-policy evaluation, and drive trust region policy updates via regularized optimal transport.
  • These approaches yield robust performance with theoretical guarantees in risk-sensitive RL, structured environments, and large-scale language model applications.

Spectral Policy Optimization (SPO) is a class of policy optimization algorithms in reinforcement learning that utilize spectral methods—specifically tensor and matrix decompositions, regularized optimal transport, or spectral risk measures—to achieve principled and efficient learning, evaluation, or risk control. SPO algorithms appear in diverse subdomains, including partially observable Markov decision processes (POMDPs), trust region policy improvement, off-policy evaluation, risk-constrained reinforcement learning, and structured RL for LLMs.

1. Foundations of Spectral Policy Optimization

SPO fundamentally exploits the algebraic structure in reinforcement learning problems by representing critical quantities (model parameters, value functions, risk measures) through spectral decompositions. In latent-variable domains such as POMDPs, spectral tensor and matrix methods disentangle observable data into constituent latent parameters (transition, reward, observation distributions) via higher-order moment tensor decompositions (Azizzadenesheli et al., 2016). For trust region methods, entropic optimal transport regularization induces policy updates that redistribute probability mass spectrally across the action space (Song et al., 2023). In risk-sensitive RL, spectral risk measures generalize classic risk quantification by integrating weighted quantiles over return distributions, inducing dual forms that allow tractable optimization (Kim et al., 29 May 2024). In off-policy evaluation, spectral decomposition yields linear representations of both Q-functions and stationary distribution correction ratios, bypassing nonconvex saddle-point problems (Hu et al., 23 Oct 2024).

2. Spectral Methods for Latent Structure Estimation

Spectral estimation in partially observable RL is exemplified by constructing multi-view representations—triplets built from consecutive action–observation–reward tuples—that become conditionally independent given the hidden state and a fixed policy. Empirical moment tensors $M_2^{(l)}$ and $M_3^{(l)}$ are computed as:

M_2^{(l)} = \frac{1}{N(l)} \sum_{t: a_t = l} \vec{v}_{1,t}^{(l)} \otimes \vec{v}_{2,t}^{(l)}, \qquad M_3^{(l)} = \frac{1}{N(l)} \sum_{t: a_t = l} \vec{v}_{1,t}^{(l)} \otimes \vec{v}_{2,t}^{(l)} \otimes \vec{v}_{3,t}^{(l)}.

Given nondegeneracy (full column-rank, invertible transitions), $M_3^{(l)}$ admits a decomposition revealing state-dependent parameters. Whitening and de-whitening procedures recover observation matrices, reward distributions, and transitions linearly, often via pseudoinverse computation. These spectral algorithms yield consistent parameter estimates despite high-dimensional observation spaces and serve as input for optimistic planning oracles (Azizzadenesheli et al., 2016).
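
As a concrete illustration, the sketch below forms the per-action empirical moments $M_2^{(l)}$ and $M_3^{(l)}$ from stacked multi-view feature vectors using NumPy tensor contractions. The array layout, function name, and toy data are assumptions for the example; the subsequent whitening and decomposition steps from (Azizzadenesheli et al., 2016) are omitted.

```python
import numpy as np

def empirical_moments(v1, v2, v3, actions, action):
    """Per-action empirical moment tensors M_2^{(l)} and M_3^{(l)}.

    v1, v2, v3 : (T, d) arrays of multi-view feature vectors, one row per step.
    actions    : (T,) array of executed actions; l = `action` selects the slice.
    """
    mask = actions == action
    n = mask.sum()
    V1, V2, V3 = v1[mask], v2[mask], v3[mask]
    M2 = np.einsum("ti,tj->ij", V1, V2) / n          # (d, d)    average of v1 ⊗ v2
    M3 = np.einsum("ti,tj,tk->ijk", V1, V2, V3) / n  # (d, d, d) average of v1 ⊗ v2 ⊗ v3
    return M2, M3

# Toy usage with random views (T = 500 steps, d = 4 features, 3 actions).
rng = np.random.default_rng(0)
T, d = 500, 4
views = [rng.normal(size=(T, d)) for _ in range(3)]
acts = rng.integers(0, 3, size=T)
M2, M3 = empirical_moments(*views, acts, action=0)
print(M2.shape, M3.shape)  # (4, 4) (4, 4, 4)
```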

3. Spectral Approaches to Off-Policy Evaluation

Spectral OPE methods, particularly in POMDPs, replace strict invertibility conditions required by proxy-based causal identification with weaker, full-rank joint moment assumptions:

\operatorname{rank}\bigl(P^b(Z_i, a_i, Z_{i-1})\bigr) = |\mathcal{U}|, \quad \operatorname{rank}\bigl(P^b(Z_i, a_i, \mathcal{H}^o_{i-1})\bigr) = |\mathcal{U}|.

Projections and pseudoinverses $M_{i,a_i}$, $M'_{i,a_i}$ reduce dimensionality and facilitate eigendecomposition to identify latent state factors. Importance sampling weights are derived as:

W_{e,b}(v,\tau^o) = \frac{1}{P^b(v \mid \tau^o)}\,\Pi_e(\tau^o)\,\Gamma_b(v,\tau^o), \qquad v_H(\pi_e) = \mathbb{E}\bigl[v\,W_{e,b}(v,\tau^o)\bigr].

This approach improves sample efficiency and prediction accuracy, extending applicability even when observable proxies are insufficient for strict identifiability (Nair et al., 2021).
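
The linear-algebra core of such estimators can be sketched as follows: a full-rank check on an empirical joint-moment matrix, followed by an eigendecomposition of a pseudoinverse product whose leading eigenvectors expose the latent-state factors. This is a minimal illustration on synthetic low-rank inputs, not the full estimator of (Nair et al., 2021); the function names and toy matrices are assumptions.

```python
import numpy as np

def rank_condition(P, num_latent, tol=1e-8):
    """Check the full-rank condition rank(P) >= |U| on an empirical joint-moment matrix."""
    return np.linalg.matrix_rank(P, tol=tol) >= num_latent

def latent_factors(P_front, P_back, num_latent):
    """Eigendecompose P_front @ pinv(P_back) and keep the leading |U| eigenpairs,
    whose eigenvectors span the latent-state subspace shared by the two moments."""
    M = P_front @ np.linalg.pinv(P_back)
    w, V = np.linalg.eig(M)
    order = np.argsort(-np.abs(w))[:num_latent]
    return w[order], V[:, order]

# Toy check: two 5x5 moment matrices sharing a rank-2 (|U| = 2) latent factorisation.
rng = np.random.default_rng(1)
A, B, C = rng.random((5, 2)), rng.random((2, 5)), rng.random((2, 5))
print(rank_condition(A @ B, num_latent=2))            # True
evals, evecs = latent_factors(A @ B, A @ C, num_latent=2)
print(evals.shape, evecs.shape)                       # (2,) (5, 2)
```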

4. Metric-aware Trust Region Policy Optimization: Sinkhorn and Wasserstein Variants

In metric-aware trust region approaches, SPO defines the trust region with the Sinkhorn divergence—a regularized version of the Wasserstein distance.

Sinkhorn regularization smooths the redistribution of probability mass, favoring local transfer between "neighboring" actions with respect to a cost matrix $d(a, a')$.

Policy updates are computed through Lagrangian duality, leading to closed-form, entropy-regularized updates:

\pi_{k+1}(a|s) \propto \pi_k(a|s) \cdot \exp\!\left(\frac{A^{\pi_k}(s,a) - \beta\, d(a, a^*)}{C}\right)

As the entropic parameter $\beta \to 0$, SPO converges to Wasserstein Policy Optimization. Monotonic improvement and global convergence are guaranteed under appropriate decay of regularization (Song et al., 2023).
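
A minimal per-state sketch of the closed-form update above, assuming a discrete action space, advantage estimates, and a transport-cost vector $d(a, a^*)$ toward a reference action $a^*$; the function name, the choice of $a^*$, and the toy numbers are illustrative.

```python
import numpy as np

def spo_update(pi_k, advantages, cost_to_ref, beta, C):
    """pi_{k+1}(a|s) ∝ pi_k(a|s) * exp((A^{pi_k}(s,a) - beta * d(a, a*)) / C)."""
    logits = np.log(pi_k) + (advantages - beta * cost_to_ref) / C
    logits -= logits.max()            # numerical stability before exponentiating
    pi_next = np.exp(logits)
    return pi_next / pi_next.sum()    # renormalise to a probability distribution

pi_k = np.array([0.25, 0.25, 0.25, 0.25])
adv = np.array([1.0, 0.2, -0.5, 0.0])
a_star = int(np.argmax(adv))                          # reference action a*
d_cost = np.abs(np.arange(4) - a_star).astype(float)  # toy metric on the action space
print(spo_update(pi_k, adv, d_cost, beta=0.1, C=1.0))
```

In the sketch, shrinking $\beta$ removes the cost penalty, so the update reduces to a purely advantage-weighted exponential tilt of $\pi_k$.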

5. Spectral Risk Measures and Bilevel Optimization in Safe RL

Spectral risk-constrained policy optimization (SRCPO) incorporates risk measures weighted by spectrum functions:

\mathcal{R}_\sigma(X) = \int_0^1 F_X^{-1}(u)\,\sigma(u)\,du

Optimization proceeds via a bilevel structure: the inner loop updates the policy via a risk-regularized gradient ($A_{i,g}^{\pi_\theta}$), whereas the outer loop optimizes dual variables (parameters $\beta$ controlling the spectrum) via a distributional update (sampler $\xi(\beta)$). The dual formulation,

\mathcal{R}_\sigma(X) = \inf_{g}\,\Bigl( \mathbb{E}[g(X)] + \int_0^1 g^*(\sigma(u))\,du \Bigr)

enables efficient, convergent policy search even in the presence of nonlinear constraints, with convergence guaranteed in tabular settings (Kim et al., 29 May 2024).
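
The spectral risk measure itself is straightforward to estimate from samples by integrating the empirical quantile function against the spectrum. The sketch below does this on a quantile grid, recovering CVaR as the special case of a step spectrum; the grid size, spectrum, and data are illustrative assumptions, and the bilevel SRCPO training loop is not shown.

```python
import numpy as np

def spectral_risk(samples, sigma, grid=10_000):
    """Monte-Carlo estimate of R_sigma(X) = ∫_0^1 F_X^{-1}(u) sigma(u) du.

    samples : (N,) draws of the return/cost random variable X.
    sigma   : spectrum function on (0, 1); non-decreasing and integrating to 1.
    """
    u = (np.arange(grid) + 0.5) / grid      # midpoint quantile grid on (0, 1)
    quantiles = np.quantile(samples, u)     # empirical F_X^{-1}(u)
    return np.mean(quantiles * sigma(u))    # ≈ ∫ F_X^{-1}(u) sigma(u) du

# CVaR_alpha is the special case sigma(u) = 1{u >= alpha} / (1 - alpha).
alpha = 0.9
cvar_spectrum = lambda u: (u >= alpha) / (1.0 - alpha)
x = np.random.default_rng(2).normal(size=100_000)
print(spectral_risk(x, cvar_spectrum))      # ≈ 1.75 for a standard normal
```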

6. Spectral Linearization for Efficient Off-policy Evaluation

SpectralDICE establishes that in MDPs with spectrally decomposable (low-rank) transition operators, both Q-functions and stationary correction ratios are linearly representable:

Q^\pi(s,a) = \langle \phi(s,a), \theta_Q^\pi \rangle, \qquad \zeta(s,a) = \langle \psi(s,a), \theta_d^\pi \rangle

This transforms the minimax DICE estimation into a convex-concave optimization, amenable to stochastic gradient descent–ascent, and yields sample complexity bounds of $\mathcal{O}(N^{-1/2})$ with explicit dependence on the feature dimension $d$ (Hu et al., 23 Oct 2024).
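
To make the linear representation concrete, the sketch below exploits $Q^\pi(s,a) = \langle \phi(s,a), \theta_Q^\pi \rangle$ by solving an LSTD-style normal equation for $\theta_Q$ on synthetic transition features. This illustrates why linearity makes evaluation tractable, but it is not the SpectralDICE saddle-point estimator itself; the feature and data construction are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, gamma = 2048, 8, 0.99

# Synthetic stand-ins for φ(s, a) and φ(s', a') on sampled transitions, plus rewards
# that are (approximately) linear in the features.
phi = rng.normal(size=(n, d))
phi_next = rng.normal(size=(n, d))
rewards = phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# With Q^π(s,a) = <φ(s,a), θ_Q>, the Bellman evaluation equation
# Q(s,a) = r + γ E[Q(s',a')] becomes a d-dimensional linear system A θ_Q = b.
A = phi.T @ (phi - gamma * phi_next) / n
b = phi.T @ rewards / n
theta_Q = np.linalg.solve(A, b)

q_values = phi @ theta_Q          # linear Q-function evaluation on the batch
print(theta_Q.round(3))
print(q_values[:5].round(3))
```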

7. Process Supervision and Spectral Reward in Structured RL

In LLM RL, SPO is interpreted as "coloring" previously binary negative samples via process supervision. AI feedback decomposes multi-step reasoning into fractions of correct substeps, yielding graded reward functions:

r_{\text{AIF}}(\cdot) = \begin{cases} 1, & \text{if the final answer is correct} \\ 1/\bigl(1 + \exp\bigl(\beta(\mathcal{R}\mathcal{T}\mathcal{S}(\cdot) - \gamma)\bigr)\bigr), & \text{otherwise} \end{cases}

This diversification resolves the key issue in Group Relative Policy Optimization, where groups of all-incorrect responses otherwise yield zero gradient. Both stylized theoretical analysis and empirical evaluation confirm improved learning dynamics and performance across model scales and benchmarks (Chen et al., 16 May 2025).
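
A minimal sketch of the graded reward above and its effect on group-relative advantages; $\beta$, $\gamma$, and the example RTS(·) scores are illustrative placeholders, not values from (Chen et al., 16 May 2025).

```python
import math

def aif_reward(final_correct: bool, rts_score: float, beta: float = 5.0, gamma: float = 0.5) -> float:
    """Piecewise reward: 1 if the final answer is correct, otherwise a sigmoid of the
    process-supervision score RTS(.) shaped by beta and gamma (illustrative defaults)."""
    if final_correct:
        return 1.0
    return 1.0 / (1.0 + math.exp(beta * (rts_score - gamma)))

# A group of all-incorrect responses now receives graded, non-identical rewards,
# so the group-relative advantages are no longer uniformly zero.
group = [aif_reward(False, s) for s in (0.1, 0.4, 0.7, 0.9)]
mean_r = sum(group) / len(group)
advantages = [round(r - mean_r, 3) for r in group]
print([round(r, 3) for r in group])
print(advantages)
```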

Summary Table: Spectral Policy Optimization Variants

| SPO Variant | Domain | Spectral Technique |
| --- | --- | --- |
| Tensor decomposition for POMDP | Latent RL | Moment tensor decomposition |
| Sinkhorn-based policy update | Trust region RL | Regularized optimal transport |
| Spectral risk measures (SRCPO) | Risk-constrained RL | Duality, bilevel optimization |
| Spectral off-policy evaluation | Off-policy RL | Linear representation via spectral decomposition |
| Spectral reward for LLM reasoning | Structured RL | Graded process-supervised reward |

Each SPO instance demonstrates how spectral methods—whether tensor, matrix, or functional—provide principled, scalable mechanisms for learning, evaluation, or risk-control, yielding strong theoretical guarantees and empirical performance across diverse subfields of reinforcement learning.