Hybrid Latent Reasoning via Reinforcement Learning (2505.18454v1)

Published 24 May 2025 in cs.CL

Abstract: Recent advances in LLMs have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features than a sampled discrete chain-of-thought (CoT) path provides. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, HRPO's hybrid rollouts introduce stochasticity into latent reasoning via token sampling, enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.

Summary

Hybrid Latent Reasoning via Reinforcement Learning

The paper "Hybrid Latent Reasoning via Reinforcement Learning" presents a novel approach to incorporating latent reasoning in LLMs using reinforcement learning (RL). The authors advocate for latent reasoning as an alternative to traditional autoregressive reasoning methods, emphasizing its potential to leverage continuous hidden representations over discrete chain-of-thought (CoT) paths. Despite the promise of latent reasoning, existing methods face incompatibilities with LLMs due to their reliance on CoT traces, which constrain the exploitation of inherent reasoning patterns of LLMs.

In response to these challenges, the authors introduce Hybrid Reasoning Policy Optimization (HRPO), an RL-based framework that integrates prior hidden states into the reasoning process of LLMs through a learnable gating mechanism. HRPO initializes training with predominantly token embeddings and progressively incorporates latent features, preserving the generative capabilities of LLMs while gradually expanding into continuous representations. Because token sampling still injects stochasticity into the hybrid rollouts, HRPO can be optimized with RL directly, eliminating the need for CoT annotations and reducing the training costs commonly associated with latent reasoning models.
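
A minimal sketch of such a gating mechanism is shown below, assuming a per-dimension sigmoid gate over the concatenated token embedding and previous hidden state. The class name, gate parameterization, and bias-based warm start are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class HybridGate(nn.Module):
    """Illustrative gate mixing a sampled token's embedding with the
    previous step's hidden state (the paper's exact form may differ)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)
        # Bias the gate toward ~0 so early training uses mostly token
        # embeddings and latent features are admitted only gradually.
        nn.init.constant_(self.gate.bias, -4.0)  # sigmoid(-4) ~= 0.018

    def forward(self, token_emb: torch.Tensor, prev_hidden: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([token_emb, prev_hidden], dim=-1)))
        # Convex, per-dimension mix of discrete and continuous features.
        return (1.0 - g) * token_emb + g * prev_hidden
```

Initializing the gate bias strongly negative keeps the mix close to pure token embeddings at the start of training, mirroring HRPO's schedule of beginning discrete and progressively incorporating latent features.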

The experimental results reported in the paper highlight HRPO's advantages over existing latent reasoning approaches and established RL baselines such as PPO and GRPO. In knowledge-intensive settings, HRPO posts substantial gains on open-domain and multi-hop question answering benchmarks such as HotpotQA and TriviaQA. HRPO also excels on STEM reasoning tasks, including the GSM8K and MATH datasets, improving both accuracy and efficiency.

A key insight from the analysis is that HRPO combines token-level and latent representations while still producing readable trajectories, which aids interpretability. HRPO-trained models also exhibit intriguing reasoning patterns such as cross-lingual outputs, shorter completion lengths, and coherent integration of latent features with sampled discrete tokens. These properties position HRPO as a scalable and efficient answer to current latent reasoning limitations, paving the way for further exploration of latent-space optimization within LLM frameworks.

The implications of HRPO extend beyond immediate performance gains on diverse benchmarks. The framework underscores the potential of RL-driven hybrid architectures to elicit deeper reasoning abilities in LLMs while prioritizing interpretability and robustness. Looking ahead, this work invites further investigation into the dynamics of hybrid reasoning and a deeper examination of continuous latent spaces in LLMs, opening avenues for both theoretical development and practical enhancement. As hybrid reasoning becomes increasingly pivotal for reasoning-intensive tasks, HRPO is well positioned to influence both academic research and real-world applications.