
ReLaX: Reasoning with Latent eXploration

Updated 16 December 2025
  • The paper introduces ReLaX, a novel paradigm that explores reasoning within the latent space of large models.
  • It employs Koopman operator theory to linearize hidden-state dynamics and uses dynamic spectral dispersion to quantify and regularize exploration.
  • Empirical results demonstrate significant gains in reasoning accuracy, efficiency, and robustness across a variety of text-only and multimodal benchmarks.

Reasoning with Latent eXploration (ReLaX) refers to a paradigm in which the exploration, optimization, and evaluation of reasoning processes in large models—especially LLMs and Large Reasoning Models (LRMs)—are conducted directly within the model’s latent (hidden-state) space rather than in the discrete output/token space. This approach exploits the computational and geometric structure of the latent space to drive diverse, interpretable, and more efficient solution discovery, leveraging dedicated metrics and algorithms that quantify and regularize latent dynamics. The ReLaX methodology includes recent advances in Koopman operator theory for latent dynamics analysis, policy optimization using latent-space dispersion as an exploration signal, instance-level test-time policy-gradient adaptation, latent diffusion over “thought blocks,” and the use of informative latent-trajectory signals for selection or early exit. Empirical studies demonstrate that these strategies yield substantial gains over token-space methods in reasoning accuracy, efficiency, and robustness across a suite of text-only and multimodal benchmarks (Zhang et al., 8 Dec 2025, Kang et al., 6 Oct 2025, Li et al., 19 May 2025, Vilas et al., 12 Oct 2025, Lee et al., 2019).

1. Latent-State Dynamics and Koopman Operator Linearization

ReLaX postulates that the sequence of hidden states $\{x_t\}$ within LRMs encodes a rich dynamic computation underlying explicit token generation. The framework models these hidden-state transitions as $x_{t+1} = f_t(x_t)$, with $f_t$ parameterized by token history, context, and model weights. To analyze and regularize these dynamics, ReLaX employs Koopman operator theory, lifting the nonlinear dynamics into a function space of observables $g:\mathbb{R}^d\to\mathbb{R}^m$ that evolve linearly under the Koopman operator $\mathcal{K}$, defined by $[\mathcal{K}g](x_t) = g(f_t(x_t))$.

Practically, given batches of latent trajectory pairs, Dynamic Mode Decomposition estimates $\mathcal{K}$ via least squares: $\mathcal{K} \approx V^+ V^\dagger$, where $V$ and $V^+$ are the batched observables at consecutive timesteps and $V^\dagger$ is the Moore-Penrose pseudoinverse of $V$. This operator enables spectral analysis of the latent computation and undergirds the use of dynamic spectral metrics for exploration quantification (Zhang et al., 8 Dec 2025).
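As a minimal sketch (not the paper's implementation), the least-squares estimate above can be reproduced with NumPy; the snapshot matrices `V` and `V_next`, their dimensions, and the toy linear map `A_true` used for the sanity check are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

m, T = 8, 64                      # observable dimension, trajectory length
V = rng.standard_normal((m, T))   # g(x_0) ... g(x_{T-1}) stacked as columns
A_true = 0.9 * np.eye(m)          # known linear map for this toy check
V_next = A_true @ V               # g(x_1) ... g(x_T) under that map

# Least-squares (DMD-style) Koopman estimate: K ≈ V_next @ pinv(V)
K = V_next @ np.linalg.pinv(V)

# For this toy linear system, the estimate recovers A_true exactly.
print(np.allclose(K, A_true, atol=1e-8))
```

Because `V` has full row rank here, `V @ pinv(V)` is the identity and the estimate is exact; with real hidden-state observables the fit is only approximate.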

2. Dynamic Spectral Dispersion: Quantifying Latent Exploration

Dynamic Spectral Dispersion (DSD) is a central metric in ReLaX for quantifying the heterogeneity of model latent trajectories. After estimating the Koopman operator $\mathcal{K}$, its spectral decomposition $\mathcal{K}\Phi = \Phi\Lambda$ yields eigenvalues $\{\lambda_i\}_{i=1}^m$. DSD is defined as the variance of the moduli of these eigenvalues:

$$\mathrm{DSD}(x_{0:T}) = \frac{1}{m}\sum_{i=1}^{m} \left(|\lambda_i| - \bar{r}\right)^2, \qquad \bar{r} = \frac{1}{m}\sum_{i=1}^m |\lambda_i|$$

A high DSD implies that the model is traversing diverse temporal regimes in latent space (decay/growth/oscillation modes), which empirically aligns with high exploration and reduced mode collapse. A low DSD indicates repetitive or collapsed trajectories, typically associated with premature policy convergence (entropy collapse) in RL-based reasoning (Zhang et al., 8 Dec 2025).
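A hedged sketch of this computation: DSD is the variance of the eigenvalue moduli of the Koopman estimate. The example matrices below are synthetic stand-ins for fitted operators, chosen only to contrast a dispersed spectrum with a collapsed one:

```python
import numpy as np

def dynamic_spectral_dispersion(K: np.ndarray) -> float:
    """Variance of |lambda_i| over the spectrum of the Koopman estimate K."""
    moduli = np.abs(np.linalg.eigvals(K))
    return float(np.var(moduli))   # np.var uses the 1/m normalization

# A spectrum spread across decay/growth regimes (high DSD) ...
K_diverse = np.diag([0.2, 0.5, 1.0, 1.5])
# ... versus one whose modes all share the same modulus (DSD = 0).
K_collapsed = 0.8 * np.eye(4)

print(dynamic_spectral_dispersion(K_diverse) >
      dynamic_spectral_dispersion(K_collapsed))
```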

DSD is used both as a diagnostic and as a regularizer in policy optimization. Its contribution to the training objective is weighted by an exploration coefficient $\alpha$, and trajectories whose DSD exceeds a threshold $\xi$ incur an additional KL regularization term that prevents excessive dispersion.

3. The ReLaX Algorithm: Policy Optimization in Latent Space

The ReLaX training loop incorporates DSD-aware exploration into Group Relative Policy Optimization (GRPO). Each update proceeds as follows:

  1. A model generates latent trajectories for a batch of prompts.
  2. The Koopman dictionary is fitted (only at initialization), and DSD is computed per trajectory.
  3. The clipped GRPO surrogate objective $J_{GRPO}(\theta)$ is combined with DSD regularization and an adaptive KL penalty over highly dispersed trajectories:

$$J(\theta) = J_{GRPO}(\theta) + \alpha L_{xp} + \beta \sum_{i\in I} D_{KL}(\pi_\theta \,\|\, \pi_{ref})$$

where $L_{xp}$ is the advantage-shaped DSD regularizer and $I$ indexes trajectories whose DSD exceeds the threshold.

  4. Model parameters $\theta$ are updated via gradient ascent.

Hyperparameters require tuning: $\alpha$ (exploration weight), $\beta$ (KL penalty weight), $\xi$ (DSD threshold), and the Koopman observable dimension $m$ (Zhang et al., 8 Dec 2025).
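The objective in step 3 can be sketched in plain Python. The scalar inputs, the default values for $\alpha$, $\beta$, and $\xi$, and the advantage-weighting of $L_{xp}$ are illustrative assumptions, not the paper's exact formulation:

```python
def relax_objective(j_grpo, dsd_per_traj, adv_per_traj, kl_per_traj,
                    alpha=0.1, beta=0.01, xi=0.5):
    """Combine the GRPO surrogate with DSD regularization and adaptive KL.

    Sign conventions follow the objective J(theta) = J_GRPO + alpha*L_xp
    + beta * sum_{i in I} KL_i as stated above; all inputs are toy scalars.
    """
    # Advantage-shaped DSD regularizer L_xp (assumed form: advantage-weighted
    # dispersion, so exploration is rewarded where it paid off).
    l_xp = sum(a * d for a, d in zip(adv_per_traj, dsd_per_traj))
    # Adaptive KL penalty over the index set I of over-dispersed trajectories.
    kl_term = sum(kl for d, kl in zip(dsd_per_traj, kl_per_traj) if d > xi)
    return j_grpo + alpha * l_xp + beta * kl_term

# Only the second trajectory (DSD 0.8 > xi 0.5) triggers the KL term.
J = relax_objective(1.0, [0.2, 0.8], [1.0, 2.0], [0.1, 0.3])
print(round(J, 6))  # 1.0 + 0.1*1.8 + 0.01*0.3 = 1.183
```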

4. Alternative ReLaX Instantiations: Diffusion and Policy Gradient Adaptation

Variants of ReLaX integrate latent exploration into reasoning in fundamentally different ways:

Latent Diffusion Reasoner (LaDiR): ReLaX is realized by first mapping chain-of-thought (CoT) reasoning steps into blocks of interpretable semantic embeddings using a $\beta$-VAE, then refining these blocks with a latent diffusion model $f_\psi$ via continuous-time flow-matching or DDPM objectives. Diffusion occurs within blocks (bidirectional attention) while maintaining autoregressive order across blocks. Diverse trajectories are generated via noise sampling and diversity guidance (repulsion), then decoded to text for answer aggregation (Kang et al., 6 Oct 2025).
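A toy sketch of diversity guidance via repulsion among parallel latent samples, in the spirit of LaDiR's diverse trajectory generation; the unit-vector repulsion rule and step strength are illustrative, not the paper's exact guidance scheme:

```python
import numpy as np

def repel(Z: np.ndarray, strength: float = 0.1) -> np.ndarray:
    """Push each latent sample Z[i] away from all other samples."""
    Z_new = Z.copy()
    for i in range(len(Z)):
        diffs = Z[i] - Z                                   # vectors from others to sample i
        dists = np.linalg.norm(diffs, axis=1, keepdims=True) + 1e-8
        Z_new[i] += strength * (diffs / dists).sum(axis=0)  # self-term is zero
    return Z_new

# Three nearly identical latents spread apart after one repulsion step.
Z = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
Z2 = repel(Z)
print(np.linalg.norm(Z2[0] - Z2[1]) > np.linalg.norm(Z[0] - Z[1]))
```

In a diffusion sampler this kind of term would be added to the denoising update at each step, nudging parallel samples toward distinct reasoning trajectories.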

Test-Time Policy Gradient Adaptation (LatentSeek): ReLaX is instantiated via instance-level adaptation of latent trajectories at test time. For each instance, a parametric policy $\pi_\theta(z' \mid z)$ proposes latent updates; policy gradients are computed relative to self-reward or perfect-signal grading. Latents are iteratively updated until a reward threshold is achieved, usually converging in a few steps and showing improved accuracy and compute efficiency over token-space sampling (Li et al., 19 May 2025).
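The iterate-until-reward-threshold loop can be sketched as follows. The negative-distance reward and the finite-difference ascent are toy stand-ins for LatentSeek's policy-gradient update; the target latent, learning rate, and threshold are all illustrative:

```python
import numpy as np

def adapt_latent(z, reward_fn, lr=0.25, threshold=-1e-3, max_steps=50):
    """Ascend reward_fn from latent z; stop once the reward threshold is met."""
    for step in range(max_steps):
        r = reward_fn(z)
        if r >= threshold:                     # reward threshold reached
            return z, step
        eps = 1e-4                             # forward finite differences
        grad = np.array([(reward_fn(z + eps * e) - r) / eps
                         for e in np.eye(len(z))])
        z = z + lr * grad
    return z, max_steps

# Toy reward peaking at a "good" latent; adaptation converges in a few steps.
target = np.array([1.0, -2.0, 0.5])
reward = lambda z: -float(np.sum((z - target) ** 2))
z_final, steps = adapt_latent(np.zeros(3), reward)
print(steps < 20 and reward(z_final) >= -1e-3)
```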

Latent-Trajectory Signal Aggregation: Without explicit training, latent trajectory metrics (net change, cumulative change, alignment) are used as inference-time probes to select promising reasoning paths or trigger early exit. These signals capture whether latent evolution is directed, steady, or overly meandering, and outperform output-confidence baselines in predicting correctness (Vilas et al., 12 Oct 2025).
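Under one plausible reading of these metric names (not the paper's exact definitions), the three signals can be computed from a stacked hidden-state trajectory:

```python
import numpy as np

def trajectory_signals(H: np.ndarray) -> dict:
    """H: (T, d) array of hidden states h_0 ... h_{T-1}."""
    steps = np.diff(H, axis=0)                        # per-step deltas h_{t+1} - h_t
    net = np.linalg.norm(H[-1] - H[0])                # net change, start to end
    cumulative = np.linalg.norm(steps, axis=1).sum()  # cumulative change (path length)
    # Alignment: how directed the path is (1 = perfectly straight,
    # near 0 = meandering back and forth).
    alignment = net / cumulative if cumulative > 0 else 0.0
    return {"net": net, "cumulative": cumulative, "alignment": alignment}

# A straight latent path is maximally aligned.
straight = np.outer(np.arange(5), np.ones(4))
print(trajectory_signals(straight)["alignment"])  # → 1.0
```

At inference time, such signals could score candidate reasoning paths (keep the most directed one) or trigger early exit when the trajectory has effectively stopped moving.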

A summary comparison of methods:

| Method | Exploration Mechanism | Regularizer/Metric |
|---|---|---|
| Koopman/DSD (ReLaX) | Koopman spectrum | Dynamic Spectral Dispersion |
| LaDiR | Latent diffusion, flow-matching | Diversity guidance |
| LatentSeek | Test-time policy gradients in latents | Self/PSRM reward |
| Latent-trajectory signal aggregation | Trajectory-based thresholding | Net/cumulative/aligned change |

5. Empirical Results and Benchmarks

ReLaX and its variants have achieved significant improvements across text-only and multimodal reasoning tasks:

  • On MathVista and related multimodal benchmarks, ReLaX (Qwen2.5-VL, 7B) achieves avg@1 of 53.2 (+5.3 over base GRPO, SOTA vs. prior 52.5).
  • On MATH500, Minerva, AMC2022/2023, AIME2024/2025, ReLaX with 7B Qwen2.5-Math-Base outperforms FR3E by +6.3 avg.
  • Latent diffusion methods (LaDiR) yield 41.8% Pass@1 vs. 40.4% (AR CoT SFT) and substantial diversity gains on math and planning benchmarks.
  • LatentSeek (instance-level latent PG) surpasses CoT by 8.9 pp on GSM8K and other benchmarks, with best-of-N sampling in latent space outperforming token space at comparable compute.
  • Latent-trajectory signals reduce token usage by up to 70% while improving or preserving accuracy by 2.6 pp on algorithmic, math, and science QA (Zhang et al., 8 Dec 2025, Kang et al., 6 Oct 2025, Li et al., 19 May 2025, Vilas et al., 12 Oct 2025).

Ablations confirm DSD’s necessity for stable and effective exploration; without it, mode collapse and early convergence occur.

6. Theoretical Foundations and Interpretability

The efficacy of ReLaX derives from the expressivity and analyzability of latent-state dynamics. Latent diffusion and policy-gradient steps exploit the continuity and smoothness of the latent manifold, enabling semantic-level corrections and efficient trajectory diversification. Koopman analysis formalizes the system’s exploration in spectral terms, offering principled design and monitoring of regularization objectives.

Decoding intermediate latents in diffusion-based ReLaX yields interpretable “thought tokens,” allowing mapping of abstract latent operations back onto the reasoning process, thus facilitating both human interpretability and debugging of reasoning failures (Kang et al., 6 Oct 2025, Vilas et al., 12 Oct 2025).

In formal reasoning (e.g., mathematics), latent-state deduction can predict multi-step rewrite-sequence outcomes purely from embeddings, demonstrating the feasibility of latent-space deduction on symbolic tasks (Lee et al., 2019).

7. Limitations and Future Directions

ReLaX methodologies introduce both algorithmic and computational challenges:

  • Koopman dictionary fitting and DSD computation incur computational overhead and are currently performed only once per training run.
  • Sensitivity to hyperparameters (e.g., DSD threshold, regularization coefficients) necessitates careful tuning.
  • Fixed dictionaries may lag evolving latent manifolds; jointly adaptive dictionaries could improve fidelity but risk instability.
  • In diffusion-based ReLaX, two-stage training (VAE then diffusion) complicates end-to-end optimization; block segmentation and diversity guidance schedules remain heuristic.
  • VAE compression in diffusion-based ReLaX trades off between semantic compactness and fine-grained reasoning fidelity.
  • Extensions to multi-modal, dialogue-based, or dynamically segmented reasoning are open research directions.

Overall, ReLaX and related latent-exploration paradigms shift the focus of exploration control from token-level entropy to the structure and dispersion of internal computations, offering measurable gains in accuracy, diversity, and computational efficiency across a wide spectrum of reasoning benchmarks (Zhang et al., 8 Dec 2025, Kang et al., 6 Oct 2025, Li et al., 19 May 2025, Vilas et al., 12 Oct 2025, Lee et al., 2019).
