LaDi-RL: Diffusion & RL for Advanced Reasoning
- The paper introduces a framework that decouples exploration in latent spaces from token-level actions, enhancing diversity and credit assignment.
- It leverages conditional denoising diffusion and value-guided Q-learning to achieve efficient multi-step planning and robust reasoning.
- Empirical results demonstrate significant gains in pass@k metrics and control performance over traditional discrete reinforcement learning methods.
Latent Diffusion Reasoning with Reinforcement Learning (LaDi-RL) is a class of frameworks that leverage diffusion models in continuous latent spaces, integrating them with reinforcement learning techniques to enhance sequential reasoning and decision-making. By decoupling exploration in latent (semantic or trajectory) space from surface-level action or token generation, LaDi-RL preserves diversity, improves credit assignment, and enables longer-horizon strategic planning in RL and reasoning domains, ranging from sequence modeling (e.g., code or math reasoning) to offline RL in control environments (Kang et al., 2 Feb 2026, Venkatraman et al., 2023, Kim et al., 2024). This paradigm extends diffusion-based generative modeling with explicit RL objectives, building on latent-trajectory compression, conditional denoising, and value-guided sampling.
1. Core Architecture and Modeling
In LaDi-RL, complex behaviors or solution trajectories are encoded into a continuous latent space, most commonly via a variational autoencoder (VAE). An H-step trajectory segment $\tau = (s_t, a_t, \ldots, s_{t+H-1}, a_{t+H-1})$ is mapped to a latent $z$ by an encoder $q_\phi(z \mid \tau)$ (Kim et al., 2024, Venkatraman et al., 2023). The low-level policy decoder $\pi_\theta(a \mid s, z)$ reconstructs the primitive actions from individual states and latents. Latent trajectory representations support temporal abstraction, allowing each $z$ to encode multi-step behaviors ("macro-actions" or compressed reasoning skills).
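The segment-to-latent interface can be sketched as follows. This is a minimal toy sketch: the mean-pooled "encoder", the scalar action decoder, and all dimensions are hand-written stand-ins for learned networks, not the cited papers' architectures.

```python
import math
import random

# Toy dimensions: H-step segments, 3-dim states, 4-dim latents (all illustrative).
H, STATE_DIM, LATENT_DIM = 5, 3, 4

def encode(tau):
    """Toy 'encoder' q_phi(z | tau): mean-pool the segment, then
    reparameterize z = mu + sigma * eps (a learned network in practice)."""
    flat = [x for (s, a) in tau for x in s + [a]]
    mu = [sum(flat) / len(flat)] * LATENT_DIM      # stand-in for a learned mean
    logvar = [-2.0] * LATENT_DIM                   # stand-in for a learned log-variance
    eps = [random.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]
    return z, mu, logvar

def decode(state, z):
    """Toy low-level policy pi_theta(a | s, z): one primitive action
    conditioned on the current state and the latent skill."""
    return sum(state) + sum(z)

random.seed(0)
tau = [([0.1 * t, 0.2, -0.1], 0.5) for t in range(H)]  # fake H-step segment
z, mu, logvar = encode(tau)
actions = [decode(s, z) for (s, _) in tau]             # one action per state
```

The point of the interface is temporal abstraction: a single latent `z` stands in for the whole H-step segment, and the decoder re-expands it state by state.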
A conditional latent diffusion model is learned, modeling the distribution of plausible latent skills achievable from a given state. The forward diffusion process applies progressively increasing Gaussian noise to latent encodings; the reverse (denoising) process reconstructs clean latents from noisy inputs, conditioned on state/context (Kim et al., 2024, Venkatraman et al., 2023, Kang et al., 2 Feb 2026).
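The forward and reverse processes above can be sketched in a few lines. The linear beta schedule is a common choice but an assumption here, and the "noise prediction" fed to the reverse step is the true noise used as an oracle stand-in for a learned denoiser $\epsilon_\theta$.

```python
import math
import random

T = 50  # number of diffusion steps (illustrative)
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # linear schedule
alphas = [1.0 - b for b in betas]
alpha_bar = []
prod = 1.0
for a in alphas:
    prod *= a
    alpha_bar.append(prod)  # cumulative product \bar{alpha}_t

def forward_noise(z0, t):
    """q(z_t | z_0): progressively add Gaussian noise to a clean latent."""
    ab = alpha_bar[t]
    eps = [random.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(ab) * zi + math.sqrt(1.0 - ab) * e for zi, e in zip(z0, eps)]
    return zt, eps

def reverse_step(zt, t, eps_hat):
    """One DDPM denoising step given a noise prediction eps_hat
    (a trained, state-conditioned eps_theta in a real system)."""
    a, ab = alphas[t], alpha_bar[t]
    mean = [(zi - betas[t] / math.sqrt(1.0 - ab) * e) / math.sqrt(a)
            for zi, e in zip(zt, eps_hat)]
    if t == 0:
        return mean
    return [m + math.sqrt(betas[t]) * random.gauss(0.0, 1.0) for m in mean]

random.seed(0)
z0 = [0.5, -0.3, 0.8, 0.1]             # a clean latent encoding
zt, eps = forward_noise(z0, T - 1)     # fully noised latent
z_prev = reverse_step(zt, T - 1, eps)  # one oracle denoising step
```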
For control or RL, a Q-network $Q_\psi(s, z)$ (or other value estimator) predicts the expected return for executing the latent $z$ from state $s$. This enables batch-constrained value-based reasoning in the latent space, mitigating extrapolation error and supporting efficient prioritization of long-horizon plans.
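A minimal sketch of value-guided latent selection, with hand-written stand-ins for both the Q-network and the diffusion prior (the real components are learned):

```python
import random

def q_value(state, z):
    """Stand-in Q-network: here, latents near the state score higher."""
    return -sum((zi - si) ** 2 for zi, si in zip(z, state))

def sample_latents(state, k):
    """Stand-in for the conditional diffusion prior p(z | s)."""
    return [[si + random.gauss(0.0, 0.5) for si in state] for _ in range(k)]

def select_latent(state, k=16):
    """Batch-constrained selection: maximize Q only over prior samples,
    so the value function is never queried far from the data support."""
    candidates = sample_latents(state, k)
    return max(candidates, key=lambda z: q_value(state, z))

random.seed(1)
state = [0.2, -0.4, 0.9]
z_star = select_latent(state)
```

Restricting the argmax to prior samples is what mitigates extrapolation error: the Q-network is only ever evaluated on latents the generative model considers plausible.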
In diffusion LLM alignment settings, the latent variable encodes a semantic pathway (e.g., a chain-of-thought plan), used as the conditioning context for an autoregressive or diffusive decoder (Kang et al., 2 Feb 2026). Both latent-space and token-level policies can be optimized jointly or independently, with the latent policy controlling high-level exploration and the text policy refining surface realization.
2. Mathematical Formulation and Training Objectives
Training in LaDi-RL involves three sequential or iterative components:
- VAE/Latent compression: Minimize the ELBO
$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_\phi(z \mid \tau)}\big[-\log p_\theta(\tau \mid z)\big] + \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid \tau)\,\|\,p(z)\big),$$
where $\beta$ controls the strength of KL regularization (Kim et al., 2024, Venkatraman et al., 2023).
- Latent diffusion model: Train a conditional denoising network $\epsilon_\theta$ using a weighted noise-prediction loss (e.g., Min-SNR-$\gamma$ reweighting):
$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{t,\, z_0,\, \epsilon}\Big[\, w(t)\,\big\| \epsilon - \epsilon_\theta(z_t, s, t) \big\|^2 \Big], \qquad w(t) = \frac{\min\{\mathrm{SNR}(t),\, \gamma\}}{\mathrm{SNR}(t)},$$
where $z_t$ is the noisy version of $z_0$ at diffusion step $t$ (Kim et al., 2024).
- Value-guided RL: Employ a batch-constrained Q-learning objective in latent space. For a transition $(s, z, r, s')$,
$$\mathcal{L}_Q = \mathbb{E}\Big[\big(r + \gamma_{\mathrm{RL}} \max_{z' \sim p_\theta(\cdot \mid s')} Q_{\bar\psi}(s', z') - Q_\psi(s, z)\big)^2\Big],$$
where the candidate latents $z'$ are drawn only from the diffusion prior, constraining the maximization to in-distribution latents (Kim et al., 2024, Venkatraman et al., 2023). In LaDi-RL for sequence modeling, one instead optimizes policy-gradient surrogates, group-relative advantages, or distribution-matching objectives for the latent or token policies (Kang et al., 2 Feb 2026).
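The three training components can be exercised numerically. The functions below are minimal sketches of the Min-SNR-$\gamma$ loss weight, the Gaussian KL regularizer of the ELBO, and the batch-constrained TD target; all names and constants are illustrative, not the papers' exact implementations.

```python
import math

def snr(alpha_bar_t):
    """Signal-to-noise ratio of the forward process at a given step."""
    return alpha_bar_t / (1.0 - alpha_bar_t)

def min_snr_weight(alpha_bar_t, gamma=5.0):
    """Min-SNR-gamma reweighting for the epsilon-prediction loss:
    down-weights low-noise steps where SNR exceeds gamma."""
    s = snr(alpha_bar_t)
    return min(s, gamma) / s

def kl_gauss(mu, logvar):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ): the ELBO's regularizer,
    summed over latent dimensions."""
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, logvar))

def td_target(r, discount, q_next_candidates):
    """Batch-constrained TD target: bootstrap only over Q-values of
    latents sampled from the diffusion prior at the next state."""
    return r + discount * max(q_next_candidates)
```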
Notably, decoupling exploration into latent and token spaces enables explicit control over semantic variability during RL. The overall RL objective can be a joint combination:
$$\mathcal{L}_{\text{RL}} = \mathcal{L}_{\text{token}} + \lambda\, \mathcal{L}_{\text{latent}},$$
where each term is a clipped surrogate based on importance ratios and relative advantages, and $\lambda$ tunes the text-latent balance (Kang et al., 2 Feb 2026).
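A sketch of the joint combination of clipped surrogates. The PPO-style clipping is standard, but the specific `joint_objective` form and the default value of the balance coefficient are illustrative assumptions, not the paper's exact estimator.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective term for one (ratio, advantage) sample."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

def joint_objective(token_terms, latent_terms, lam=0.5):
    """L_RL = L_token + lam * L_latent: each term averages clipped
    surrogates over (importance ratio, advantage) pairs; lam tunes
    the text/latent balance."""
    l_tok = sum(clipped_surrogate(r, a) for r, a in token_terms) / len(token_terms)
    l_lat = sum(clipped_surrogate(r, a) for r, a in latent_terms) / len(latent_terms)
    return l_tok + lam * l_lat
```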
3. Exploration, Diversity, and Decoupling
LaDi-RL frameworks introduce several exploration-enhancing mechanisms:
- Stochastic multi-step denoising: Each diffusion step adds fresh Gaussian noise, distributing exploration stochastically across the latent trajectory; this helps prevent mode collapse associated with token-level RL and supports discovery of multi-modal solutions (Kang et al., 2 Feb 2026).
- Diversity/repulsion guidance: Early denoising steps are guided by pairwise repulsion among latent trajectories, explicitly preserving diversity in the set of sampled solutions (Kang et al., 2 Feb 2026).
- Latent–token decoupling: The latent exploration policy is optimized with a group-level reward signal over semantic trajectories, while the token-level policy is optimized with fine-grained, per-sample or per-segment advantages. This decoupling allows semantic-level exploration to be separated from surface realization, improving sample efficiency and reasoning diversity (Kang et al., 2 Feb 2026, Venkatraman et al., 2023).
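The repulsion guidance in the second bullet can be sketched as a pairwise RBF force applied only during early denoising steps; the kernel choice, bandwidth, step cutoff, and step size here are all assumptions for illustration.

```python
import math

def repulsion_grad(latents, i, sigma=1.0):
    """Gradient of a pairwise RBF repulsion energy w.r.t. latent i:
    pushes latent i away from nearby latents to preserve diversity."""
    zi = latents[i]
    g = [0.0] * len(zi)
    for j, zj in enumerate(latents):
        if j == i:
            continue
        d2 = sum((a - b) ** 2 for a, b in zip(zi, zj))
        w = math.exp(-d2 / (2.0 * sigma ** 2))   # RBF kernel weight
        for k in range(len(zi)):
            g[k] += w * (zi[k] - zj[k]) / sigma ** 2
    return g

def guided_step(latents, step, early_steps=10, lr=0.1):
    """Add the repulsion force only during early denoising steps,
    where the coarse semantic structure of each latent is decided."""
    if step >= early_steps:
        return latents
    return [[z + lr * g for z, g in zip(latents[i], repulsion_grad(latents, i))]
            for i in range(len(latents))]

latents = [[0.0], [0.2]]               # two nearby latent trajectories
spread = guided_step(latents, step=0)  # early step: repulsion applied
```

After the guided step the two latents are farther apart than before; at late steps (`step >= early_steps`) the latents pass through unchanged.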
These mechanisms preserve multiple reasoning modes and avoid the policy entropy collapse induced by single-mode token-level RL. Results show that LaDi-RL maintains higher reward variance and outperforms discrete-RL baselines in pass@k diversity for code and math reasoning (Kang et al., 2 Feb 2026).
4. Algorithmic Realizations and Inference Dynamics
The canonical LaDi-RL pipeline consists of the following stages (Kim et al., 2024, Venkatraman et al., 2023, Kang et al., 2 Feb 2026):
- Latent trajectory compression: Obtain a latent $z$ for each H-step segment via the VAE encoder $q_\phi(z \mid \tau)$.
- Conditional diffusion modeling: Learn or fine-tune the conditional denoiser $\epsilon_\theta$ (equivalently, the latent prior $p_\theta(z \mid s)$) via denoising score matching or flow matching in the latent space.
- RL-based value estimation: For control tasks, utilize Q-learning with batch constraints in latent space; for reasoning, optimize policy surrogates with group-relative, step-aware, or distribution-matching rewards in both latent and surface spaces.
- Inference:
  - Sample a batch of latent plans $\{z^{(i)}\}$ via the trained diffusion model; select by maximizing the Q-value or a latent reward estimate.
  - For sequence modeling, generate multiple output sequences conditioned on each $z^{(i)}$ and select based on final or per-segment rewards.
  - Decoding from $z$ produces either a sequence of primitive actions (control) or a reasoning trace (language modeling).
This architecture supports efficient, one-step-per-latent planning, as each latent encapsulates an H-step lookahead and the diffusion prior samples only in high-support regions of the dataset (Kim et al., 2024, Venkatraman et al., 2023).
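Putting the stages together, a receding-horizon planning loop might look like the following sketch. The one-dimensional integrator dynamics, the stand-in prior, and the Q proxy are all toy assumptions chosen to keep the example self-contained.

```python
import random

H, K = 4, 8  # latent horizon and number of candidate plans (illustrative)

def sample_latent_plans(state, k):
    """Stand-in for the trained conditional diffusion prior p(z | s):
    candidates concentrated near the latent that would cancel the state."""
    return [[random.gauss(-state / H, 0.3)] for _ in range(k)]

def q_value(state, z):
    """Stand-in Q: negative predicted distance-to-goal after the H-step rollout."""
    return -abs(state + H * z[0])

def decode_action(state, z):
    """Toy low-level decoder: the latent directly parameterizes a step size."""
    return z[0]

def env_step(state, action):
    """Toy integrator dynamics: s' = s + a; the goal is s = 0."""
    return state + action

def plan_and_execute(state, rounds=3):
    """Receding-horizon planning: one latent = one H-step lookahead.
    Pick the best latent by Q, execute H primitive actions, replan."""
    for _ in range(rounds):
        z = max(sample_latent_plans(state, K), key=lambda zz: q_value(state, zz))
        for _ in range(H):
            state = env_step(state, decode_action(state, z))
    return state

random.seed(0)
final_state = plan_and_execute(2.0)  # drives the state toward the goal at 0
```

Note the efficiency property from the text: the planner calls the value function once per latent, not once per primitive action, since each latent already encapsulates an H-step lookahead.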
5. Empirical Performance and Applications
LaDi-RL-style models exhibit strong empirical gains across both RL and reasoning domains. Representative results include (Kang et al., 2 Feb 2026, Kim et al., 2024, Venkatraman et al., 2023):
| Task Domain | Baseline (Pass@1) | LaDi-RL Variant (Pass@1) | Gain (pts) |
|---|---|---|---|
| Code generation (MBPP) | GRPO: 72.3% | LaDi-RL: 84.2% | +11.9 |
| Math reasoning (GSM8K) | GRPO: 67.7% | LaDi-RL: 76.7% | +9.0 |
| Control (ARC benchmark) | VAE, no Q: ~10% | LDCQ: >90% reach, ~77% submit | +80 |
Qualitative analyses show that latent-diffusion RL agents can skip redundant actions, reliably identify solution states, and maintain high exploration diversity. In code/math benchmarks, LaDi-RL achieves consistent improvements over discrete token-RL in both final accuracy and pass@k metrics, attributed to its diversity-preserving latent sampling (Kang et al., 2 Feb 2026). In control, LaDi-RL agents demonstrate superior stitching of suboptimal trajectories and stronger long-term planning (Kim et al., 2024, Venkatraman et al., 2023).
6. Theoretical Foundations and Limitations
The key theoretical insight in LaDi-RL is that planning or exploration in continuous latent space via guided diffusion formalizes the search for policies close to the behavior distribution but biased toward high-reward (or high-Q) regions (Li, 2023, Venkatraman et al., 2023). The energy-guided sampling formulation,
$$\pi(z \mid s) \propto p_\theta(z \mid s)\, \exp\big(\beta\, Q(s, z)\big),$$
with $p_\theta(z \mid s)$ a learned diffusion prior, unifies latent planning and policy improvement.
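A small sketch of energy-guided sampling via self-normalized exponential tilting of prior samples. The discrete-resampling approximation is an assumption of this sketch; the cited works instead guide the denoising process itself.

```python
import math
import random

def energy_guided_sample(prior_samples, q_values, beta=2.0):
    """Draw one sample with probability proportional to exp(beta * Q),
    i.e., exponential tilting of equally weighted prior samples."""
    m = max(q_values)
    weights = [math.exp(beta * (q - m)) for q in q_values]  # subtract max for stability
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for sample, w in zip(prior_samples, weights):
        acc += w
        if acc >= r:
            return sample
    return prior_samples[-1]

random.seed(0)
chosen = energy_guided_sample(["plan_a", "plan_b"], [0.0, 10.0])
```

With a large Q gap, the tilted distribution concentrates almost entirely on the high-value plan; `beta` interpolates between sampling the prior (`beta = 0`) and greedy argmax-Q selection (`beta` large).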
Limitations of current LaDi-RL approaches include the computational cost of iterative diffusion sampling (DDPM/score-based models typically require hundreds of steps), challenges in scaling to variable-length or hierarchical reasoning, and sensitivity to the hyperparameters governing temporal abstraction and the repulsion penalty (Venkatraman et al., 2023, Kang et al., 2 Feb 2026). Open problems remain in sample-efficient off-policy correction, adaptation of latent-step horizons, and extension to hierarchical or multimodal latent spaces.
Planned extensions include:
- Adoption of fast samplers (e.g., DDIM, DPM-solver, consistency models) to reduce inference latency.
- Adaptive or hierarchical latent planning to span variable reasoning or action horizons (Venkatraman et al., 2023).
- Further theoretical analysis of latent policy geometry, energy landscapes, and off-policy RL adaptation (Kang et al., 2 Feb 2026, Li, 2023).
7. Impact and Broader Significance
LaDi-RL represents a shift from token-level, discrete RL to semantically structured continuous RL, providing a principled solution to diversity collapse, mode seeking, and temporal abstraction challenges. By decoupling semantic exploration from surface realization, LaDi-RL builds a foundation for more robust, generalizable, and interpretable reasoning agents—both for control (offline RL, strategic planning) and reasoning (LLMs, code generation, symbolic math) applications.
Empirical evidence suggests that diffusion-based latent RL is a principled and increasingly effective alternative to conventional policy gradient or value-based RL in reasoning-rich tasks, achieving state-of-the-art results and permitting a new class of algorithms that exploit temporal and semantic abstraction at scale (Kang et al., 2 Feb 2026, Kim et al., 2024, Venkatraman et al., 2023).