- The paper introduces Latent-GRPO, a novel RL framework that masks invalid latent trajectories, aligns exploration, and selects optimal tokens for stable latent reasoning.
- It demonstrates significant performance gains on benchmarks like GSM8K-Aug and Math500, with improvements of up to 14.77 points over prior methods.
- The approach reduces reasoning chain lengths by up to 4.7× while ensuring training stability through key innovations like one-sided noise sampling.
Latent-GRPO: Group Relative Policy Optimization for Stable and Efficient Latent Reasoning
Introduction
The increasing demand for efficient yet high-performing reasoning in LLMs has driven interest in “latent reasoning”: the compression of explicit reasoning chains into continuous latent representations. This paradigm promises drastic reductions in reasoning-chain length and therefore in inference cost. However, RL-based latent reasoning remains fundamentally unstable, suffering from invalid exploration trajectories and unfavorable optimization dynamics. The paper "Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning" (2604.27998) systematically identifies and addresses these challenges, presenting Latent-GRPO, a reinforcement learning framework tailored to latent token reasoning with substantial empirical gains in both performance and reasoning efficiency.
Background: Explicit versus Latent Reasoning and RL Instabilities
Explicit reasoning in LLMs relies on the autoregressive generation of token-level chains of thought (CoT). This provides interpretability and traces of arbitrary length, but it incurs significant computational redundancy, since performance improvements often depend on longer reasoning traces.
Latent reasoning compresses these explicit chains into sequences of continuous representations (“latent tokens”), usually constructed as soft mixtures of top-K token embeddings. This continuous action space enables shorter reasoning paths. Early supervised latent reasoning approaches (e.g., Latent-SFT [Deng et al., 2025a]) demonstrated promising results, but attempts to apply RL techniques (e.g., Soft-GRPO [Zheng and Lee, 2025]) to latent spaces were plagued by training instability, exploration that drifts off the valid manifold, and non-closure under mixture, resulting in poor or collapsed policies.
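The soft-mixture construction mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the mixture weights are a softmax over the top-K logits; the function name `latent_token` and the toy inputs are hypothetical and are not taken from the paper.

```python
import math

def latent_token(logits, embeddings, k):
    """Mix the embeddings of the k highest-scoring tokens into one vector.

    Assumption: weights are a softmax restricted to the top-k logits.
    Returns (top-k token ids, their mixture weights, the mixed embedding).
    """
    # Indices of the k largest logits (ties keep their original order).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Numerically stable softmax over the selected logits only.
    peak = max(logits[i] for i in top)
    raw = [math.exp(logits[i] - peak) for i in top]
    total = sum(raw)
    weights = [r / total for r in raw]
    # Weighted sum of the selected embedding rows, one dimension at a time.
    dim = len(embeddings[0])
    mixed = [sum(w * embeddings[i][d] for w, i in zip(weights, top))
             for d in range(dim)]
    return top, weights, mixed

# Toy vocabulary of three tokens with 2-dimensional embeddings.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
ids, weights, vector = latent_token([0.0, 0.0, -10.0], toy_embeddings, k=2)
```

With two equally likely top tokens, the resulting latent token sits halfway between their embeddings, which is the continuous "action" the policy feeds back to itself instead of a single discrete token.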
Diagnosing the Bottlenecks of Latent RL
The paper identifies three core technical challenges for RL in latent reasoning:
- Absence of Intrinsic Latent Manifolds: RL exploration in latent space, if not restricted, easily leaves the set of valid reasoning states, causing non-terminating or degenerate outputs. Initialization with latent-supervised fine-tuning (Latent-SFT) is essential, but exploration still threatens manifold integrity.
- Exploration-Optimization Misalignment: Gumbel-noise-based stochastic latent action selection leads to update directions that can be anti-correlated with the reward advantage, deteriorating the learning dynamics and destabilizing training.
- Latent Mixture Non-Closure: In explicit reasoning, positive rewards for several correct continuations naturally resolve via sampling a single branch; in latent reasoning, enforcing multiple correct latents in a shared state results in an averaged state that may be invalid or out-of-manifold (mode averaging).
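The mode-averaging problem in the last item can be illustrated with a toy example. The sketch below assumes, purely for illustration, that the "valid" latent states are the unit-norm vectors; this geometry is a hypothetical stand-in and not the paper's definition of the latent manifold.

```python
import math

def norm(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def average(a, b):
    """Element-wise mean of two equally sized vectors."""
    return [(x + y) / 2 for x, y in zip(a, b)]

# Toy assumption: a latent is "valid" only if it has unit norm.
latent_a = [1.0, 0.0]   # one correct continuation
latent_b = [0.0, 1.0]   # another correct continuation
averaged = average(latent_a, latent_b)

# Both inputs have norm 1, but their average has norm sqrt(0.5),
# so it falls off the toy "valid" set even though both parents are valid.
off_manifold = abs(norm(averaged) - 1.0) > 1e-6
```

Rewarding both continuations from the same state pulls the latent toward such an in-between point, which is why a single branch has to be chosen rather than averaged.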
Latent-GRPO: Methodological Innovations
Latent-GRPO introduces three interlocking algorithmic remedies to these problems, built on the GRPO framework (PPO with group-relative rewards).
1. Invalid Sample Advantage Masking
Invalid latent trajectories, identified as those failing to emit EOS/stop tokens before a hyperparameterized cutoff, are masked from group-wise advantage normalization. Only valid trajectories contribute to baseline and advantage computation, isolating optimization away from off-manifold rollouts and preserving latent reasoning stability.
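A minimal sketch of this masking step follows. It assumes a standard mean/standard-deviation group normalization and assigns a zero advantage to masked rollouts; both choices, along with the names `is_valid` and `masked_group_advantages`, are illustrative assumptions rather than details confirmed by the paper.

```python
import statistics

def is_valid(token_ids, eos_id, max_len):
    """A rollout counts as valid if EOS appears within the first max_len tokens."""
    return eos_id in token_ids[:max_len]

def masked_group_advantages(rewards, valid, eps=1e-6):
    """Group-relative advantages computed over valid rollouts only.

    Invalid rollouts are excluded from the mean and standard deviation
    and receive an advantage of 0.0 (assumption: they contribute no gradient).
    """
    kept = [r for r, v in zip(rewards, valid) if v]
    if len(kept) < 2:
        # Not enough valid samples to form a meaningful baseline.
        return [0.0] * len(rewards)
    mean = statistics.fmean(kept)
    std = statistics.pstdev(kept)
    return [(r - mean) / (std + eps) if v else 0.0
            for r, v in zip(rewards, valid)]

# Group of four rollouts; the third never emitted EOS (id 2) in time.
rollouts = [[5, 7, 2], [5, 9, 2], [5, 5, 5, 5], [8, 2]]
valid = [is_valid(t, eos_id=2, max_len=3) for t in rollouts]
advantages = masked_group_advantages([1.0, 0.0, 0.0, 1.0], valid)
```

Because the invalid rollout is dropped before normalization, its reward of zero does not drag the baseline down and inflate the advantages of the remaining samples.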
2. One-sided Noise Sampling
To resolve exploration-optimization misalignment, Gumbel perturbations applied to top-K log-probs are clipped and shifted to guarantee strictly positive support, ensuring Δ>0 at rollout for all components. If, during PPO updates, the margin is crossed (Δ<0), a conditional straight-through estimator inverts the gradient. This mechanism aligns the learning signal with the intended advantage, forcing reinforcement or suppression of components in direct accordance with trajectory-level rewards. As shown in the gradient analysis, this eliminates the anti-correlation between policy update direction and trajectory advantage present in prior latent RL approaches.
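The rollout side of this mechanism can be sketched as below. The sketch assumes that "clipped and shifted" means clipping a standard Gumbel sample to a symmetric range and then adding an offset so the result is strictly positive, and that Δ is the gap between the perturbed and unperturbed log-probability. The parameters `clip` and `floor` and both function names are hypothetical; the conditional straight-through estimator used during PPO updates is not shown.

```python
import math
import random

def one_sided_gumbel(rng, clip=2.0, floor=1e-3):
    """Draw Gumbel(0, 1) noise, clip it to [-clip, clip], then shift it positive.

    The returned value always lies in [floor, 2 * clip + floor].
    """
    # Keep u strictly inside (0, 1) so both logarithms are defined.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    gumbel = -math.log(-math.log(u))
    clipped = max(-clip, min(clip, gumbel))
    return clipped + clip + floor

def perturb_topk(log_probs, rng, clip=2.0, floor=1e-3):
    """Add one-sided noise to each top-K log-prob; every delta is positive."""
    noise = [one_sided_gumbel(rng, clip, floor) for _ in log_probs]
    perturbed = [lp + n for lp, n in zip(log_probs, noise)]
    return perturbed, noise

rng = random.Random(0)
topk_log_probs = [-0.4, -1.6, -2.3]
perturbed, deltas = perturb_topk(topk_log_probs, rng)
```

Under these assumptions every component is pushed in the same direction at rollout time, so the sign of the subsequent update is determined by the trajectory advantage rather than by the sign of the sampled noise.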
3. Optimal Correct Path First Token Selection
To mitigate latent mixture non-closure specifically at the first latent token, the method selects the correct trajectory with maximal average surrogate log-probability at the first step and restricts updates for the initial latent token to this path exclusively (other correct paths remain active for subsequent divergent prefixes). This conservative tie-breaking curtails harmful mode averaging at the critical initial action.
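The selection rule can be sketched as follows. It assumes each trajectory exposes the surrogate log-probabilities of its first latent token's components, and that incorrect trajectories are left untouched by this rule; the latter is an assumption, as are the function names `select_first_token_path` and `first_token_update_mask`.

```python
def select_first_token_path(correct, first_step_log_probs):
    """Index of the correct trajectory with the highest mean first-step log-prob.

    Returns None when no trajectory in the group is correct.
    """
    best_index, best_score = None, float("-inf")
    for i, (ok, log_probs) in enumerate(zip(correct, first_step_log_probs)):
        if not ok:
            continue
        score = sum(log_probs) / len(log_probs)
        if score > best_score:
            best_index, best_score = i, score
    return best_index

def first_token_update_mask(correct, first_step_log_probs):
    """True where a trajectory may update the first latent token.

    Correct trajectories other than the selected one are masked out.
    Assumption: incorrect trajectories are not affected by this rule.
    """
    chosen = select_first_token_path(correct, first_step_log_probs)
    return [(not ok) or (i == chosen) for i, ok in enumerate(correct)]

correct = [True, False, True, True]
first_step = [[-1.0, -2.0], [-0.1, -0.1], [-0.5, -0.7], [-3.0, -3.0]]
chosen = select_first_token_path(correct, first_step)
mask = first_token_update_mask(correct, first_step)
```

The mask applies only to the first latent position; later positions of every correct trajectory would still receive their usual updates, matching the description above.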
Empirical Evaluation
Latent-GRPO is evaluated on both low-difficulty and high-difficulty mathematical reasoning benchmarks, including GSM8K-Aug, SVAMP, MultiArith, Math500, AIME24, AIME25, and GPQA, using LLaMA-3.2-1B-Instruct and Qwen2.5-Math-7B as base models.
Low-Difficulty Regime
- Latent-GRPO achieves an average Pass@1 improvement of 7.86 points over Latent-SFT and outperforms Soft-GRPO by a similar margin, while maintaining 4.4× shorter reasoning chains than explicit GRPO.
- On GSM8K-Aug, Latent-GRPO surpasses explicit GRPO (66.29 vs. 62.26 Pass@1) while compressing average chain length by a factor of ~4.7.
High-Difficulty Regime
- On challenging tasks (Math500, AIME24/25), Latent-GRPO delivers a 14.77 point average gain over Latent-SFT and outperforms explicit GRPO by 4.27 Pass@1 points, compressing the reasoning chain by over 3×.
- Latent-GRPO achieves the highest Pass@1 on Math500 and the AIME benchmarks.
- Under Gumbel sampling, Latent-GRPO demonstrates strong Pass@k curves, reaching 50+ Pass@64 on AIME benchmarks, substantially higher than explicit GRPO.
Ablations
Removing One-sided Noise Sampling leads to catastrophic training collapse and uncontrolled length growth, highlighting its necessity for stable optimization. Omitting the First Token Selection mechanism results in less pronounced, but still significant, accuracy and stability degradation, especially on high-difficulty benchmarks.
Implications and Future Directions
The introduction of Latent-GRPO marks a substantial step toward sample-efficient, compact, and high-performing reasoning with LLMs in continuous latent spaces. Empirical results confirm that, with appropriate algorithmic controls, RL post-training can scale to latent token reasoning without sacrificing stability or performance while dramatically reducing the computational footprint.
This methodology opens several avenues for further exploration:
- Latent RL scaling: Extending to even larger models and more diverse domain-specific tasks.
- Analysis of latent manifold geometry: Understanding how latent action composition is shaped under RL and how robust the resulting representations are to distribution shift.
- Compositional and hierarchical latent reasoning: Developing methods for explicitly structured or modular latent reasoning systems.
- Application to multimodal and program synthesis tasks: Leveraging latent RL for domains where compressing reasoning traces is even more critical for efficiency.
Conclusion
Latent-GRPO delivers principled algorithmic improvements that enable stable and effective RL post-training for latent reasoning in LLMs, simultaneously improving sample efficiency, performance, and reasoning compactness over explicit token-level methods. Its three critical components (invalid sample masking, one-sided noise sampling, and first token selection) directly address pathologies of RL in continuous reasoning spaces. The framework lays a strong foundation for future research on efficient autonomous reasoning and further integration of continuous latent spaces into LLM architectures.
Reference:
- Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning (2604.27998)