
Reinforced Reasoning with LLMs

Updated 17 November 2025
  • The paper introduces a reinforced reasoning framework using policy-gradient methods (PPO, REINFORCE++, GRPO) to optimize both intermediate reasoning and final answer accuracy.
  • It utilizes conditional, rule-based reward shaping to enforce proper formatting and intermediate correctness, enhancing the overall reasoning process.
  • Experimental results show up to +19.3% improvement in Pass@1 and an 80% reduction in time-to-first-token across diverse datasets.

Reinforced reasoning with LLMs encompasses a suite of methodologies that combine reinforcement learning (RL) paradigms, rule-based intermediate rewards, and architecture-level innovations to boost multi-hop, step-wise reasoning. The central technical advance is the use of policy-gradient RL to optimize reasoning traces, not only for final correctness but also for intermediate reasoning quality, often via conditional, time-discounted rewards. This framework—in contrast to pure supervised fine-tuning (SFT)—supports interleaved thought–answer sequences, enables dense feedback for each intermediate step, and substantially improves both efficiency (measured by time-to-first-token, TTFT) and accuracy across a variety of complex problem domains.

1. Reinforcement Learning Objectives and Update Schemes

Reinforced reasoning is typically formalized as a sequential decision process, with the LLM acting as a token-level policy $\pi_\theta(a_t \mid s_t)$ at time $t$, emitting tokens $a_t$ in context $s_t = (x, y_{<t})$, where $x$ is the question and $y_{<t}$ the token history. Transition dynamics are autoregressive:

$$s_{t+1} = (x,\; y_{<t} \circ a_t)$$

The standard RL objective trades off expected reward against divergence from a reference policy:

$$J(\theta) = \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(\cdot\mid x)}\big[r(x, y)\big] - \beta\, D_{\mathrm{KL}}\big[\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big]$$
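
A minimal PyTorch sketch of this objective as a Monte-Carlo estimate over sampled completions is given below; the tensor layout, the per-token KL estimator, and the value of beta are assumptions of the sketch rather than a prescribed implementation.

```python
import torch

def kl_regularized_objective(rewards, policy_logprobs, ref_logprobs, beta=0.05):
    """Monte-Carlo estimate of J(theta) over a batch of sampled completions.

    rewards:         (B,)    scalar reward r(x, y) for each completion
    policy_logprobs: (B, T)  log pi_theta(a_t | s_t) of the generated tokens
    ref_logprobs:    (B, T)  log pi_ref(a_t | s_t) of the same tokens
    beta:            KL penalty coefficient (the default here is an assumption)
    """
    # Sequence-level KL estimate from the sampled tokens:
    # sum_t [log pi_theta(a_t | s_t) - log pi_ref(a_t | s_t)]
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    # Expected reward minus the beta-weighted divergence from the reference policy
    return (rewards - beta * kl).mean()
```

This only evaluates the objective; its gradient with respect to the policy is taken with one of the policy-gradient estimators described next.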

Policy-gradient updates are realized via three principal algorithms:

  • REINFORCE++: Monte-Carlo gradients over entire sequences, using returns minus baselines.
  • Proximal Policy Optimization (PPO): A clipped surrogate objective with per-token probability ratios and generalized advantage estimation (GAE).
  • Group Relative Policy Optimization (GRPO): Group-wise comparison of rewards across completions sampled for the same prompt, normalized by the group mean, which removes the need for an explicit critic network.

These provide sample-efficient, stable updates even as chain-of-thought trajectories grow in length and complexity.
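
To make the update schemes concrete, the sketch below shows a group-relative advantage in the GRPO style and the token-level clipped surrogate shared with PPO. The tensor shapes and the mean/standard-deviation normalization follow the commonly used GRPO formulation and are assumptions of this sketch, not the exact recipe of any particular implementation.

```python
import torch

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: each completion's reward is normalized
    against the other completions sampled for the same prompt, replacing
    a learned critic's value baseline.

    group_rewards: (num_prompts, group_size) tensor of scalar rewards.
    """
    mean = group_rewards.mean(dim=-1, keepdim=True)
    std = group_rewards.std(dim=-1, keepdim=True)
    return (group_rewards - mean) / (std + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Token-level clipped surrogate loss used by PPO (and reusable by GRPO,
    with each completion's advantage broadcast over its tokens)."""
    ratio = torch.exp(logp_new - logp_old)                         # per-token importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # negate to minimize
```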

2. Rule-Based, Conditional Reward Shaping

Reward shaping is critical for reinforced reasoning. The composite reward function is:

$$r(x, y) = r_{\rm format}(y) + r_{\rm final}(x, y) + r_{\rm inter}(x, y)$$
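
The sketch below turns this composite reward into code. It follows the component rules spelled out in the list that follows; the value of R_0, the tag parsing, and the one-to-one alignment of intermediate answer segments with their gold labels are illustrative assumptions.

```python
import re

def composite_reward(completion, final_gold, intermediate_gold,
                     batch_accuracy_improving, R0=0.5):
    """Rule-based composite reward r = r_format + r_final + r_inter."""
    # r_format: +1 for strict <think>/<answer> alternation, -1 otherwise.
    alternation = re.compile(r"^(?:\s*<think>.*?</think>\s*<answer>.*?</answer>)+\s*$", re.S)
    r_format = 1.0 if alternation.match(completion) else -1.0

    # r_final: +2 for an exact match, -1.5 if incorrect, -2 if unparseable.
    answers = [a.strip() for a in re.findall(r"<answer>(.*?)</answer>", completion, re.S)]
    if not answers:
        r_final = -2.0
    elif answers[-1] == final_gold.strip():
        r_final = 2.0
    else:
        r_final = -1.5

    # r_inter: applied only when the format is valid, the final answer is
    # correct, and batch accuracy is improving; partial credit is
    # time-discounted by 1/j over the correct intermediate steps.
    r_inter = 0.0
    if r_format > 0 and r_final == 2.0 and batch_accuracy_improving:
        inters = answers[:-1]
        correct = [j for j, (pred, gold) in enumerate(zip(inters, intermediate_gold), start=1)
                   if pred == gold.strip()]
        if inters and len(correct) == len(inters):
            r_inter = R0
        elif correct:
            r_inter = R0 * sum(1.0 / j for j in correct) / len(correct)

    return r_format + r_final + r_inter
```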

  • Format Reward $r_{\rm format}(y)$: Enforces strict alternation between <think> and <answer> tags, rewarding correct alternation and penalizing violations ($\pm 1$).
  • Final-Answer Accuracy $r_{\rm final}(x, y)$: Assigns a $+2$ reward for an exact match, $-1.5$ for an incorrect answer, and $-2$ if the answer is unparseable.
  • Conditional Intermediate-Step Reward $r_{\rm inter}(x, y)$: Applied only when (i) the format is valid, (ii) the final answer is correct, and (iii) batch accuracy is improving. For $N$ answer segments, intermediate correctness is time-discounted:

$$r_{\rm inter}(x, y) = \mathbf{1}(\text{conditions}) \times \begin{cases} R_0 & \text{all intermediates correct} \\ R_0\,\dfrac{1}{|\mathcal{C}|}\sum_{j\in\mathcal{C}} \dfrac{1}{j} & \text{otherwise} \end{cases}$$

where $\mathcal{C}$ is the set of correct intermediate answers.

Conditional application ensures dense intermediate guidance only when beneficial, preventing myopic reward hacking and stabilizing training.

3. Interleaved Think–Answer Trace and Latency Metrics

The interleaved mechanism modifies traditional think–answer traces by producing an intermediate answer after each reasoning segment. Formally:

$$y = y^{(1)}_{\rm think}\circ y^{(1)}_{\rm answer}\circ y^{(2)}_{\rm think}\circ y^{(2)}_{\rm answer}\circ\dots\circ y^{(N)}_{\rm answer}$$

With intermediate answers revealed as soon as they are generated, time-to-first-token (TTFT) is reduced:

$$\mathrm{TTFT} = \frac{t_{\rm first}}{T_{\rm total}}$$

Empirical results indicate an average TTFT reduction above 80% when using interleaved reasoning versus conventional think–then–answer approaches.
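
To make the trace format and the latency metric concrete, a minimal sketch is given below; treating token position as a proxy for wall-clock time, and the toy tokenization shown, are simplifying assumptions.

```python
def normalized_ttft(completion_tokens, answer_open="<answer>"):
    """Normalized time-to-first-token: index of the first user-visible answer
    token divided by the total number of generated tokens."""
    total = len(completion_tokens)
    for i, tok in enumerate(completion_tokens):
        if answer_open in tok:
            return (i + 1) / total   # first answer token follows the opening tag
    return 1.0                       # no answer surfaced before generation ended

# Interleaved trace: the first <answer> segment arrives early, so TTFT is small.
trace = ["<think>", "step", "1", "</think>", "<answer>", "A", "</answer>",
         "<think>", "step", "2", "</think>", "<answer>", "B", "</answer>"]
print(normalized_ttft(trace))  # 5/14, roughly 0.36; a think-then-answer trace would be near 1.0
```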

4. Experimental Validation and RL Algorithm Comparison

Evaluations were performed across five public datasets:

  • In-domain: Knights & Knaves, MuSiQue
  • Out-of-domain: MATH, GPQA, MMLU

Key metrics: Pass@1 accuracy and normalized TTFT.

Quantitative Results for Qwen2.5-1.5B (PPO-trained)

| Method                         | Pass@1 Accuracy | TTFT  |
|--------------------------------|-----------------|-------|
| Baseline think–answer          | 42.0%           | 0.875 |
| Interleave (no interm. reward) | 41.6%           | 0.172 |
| Interleave + interm. reward    | 50.1%           | 0.169 |

Relative improvements of up to +19.3% in Pass@1 and an 80% TTFT reduction were observed.

RL Algorithm Trade-offs (7B model)

  • PPO: Maximal accuracy and stability, slower initial convergence.
  • GRPO/REINFORCE++: Greater sample efficiency, larger training variance.

Across all settings, interleaved reasoning with intermediate reward outperformed baseline strategies on both accuracy and speed.

5. Generalization, Effectiveness of Conditional Intermediate Reward, and Training Insights

Models trained only on logical reasoning/QA datasets displayed strong generalization gains when evaluated on unseen complex reasoning tasks (MATH, GPQA, MMLU):

  • The interleaved format always preserved or improved Pass@1 accuracy, with dramatic latency improvements.
  • Effective RL relies on conditional intermediate rewards: always-on intermediate rewards drove myopic intermediate optimization, harming final-answer accuracy. Selective reward application, gated by overall batch performance and correctness, stabilized credit assignment.
  • The optimal reward schedule is time-discounted and decays over training as accuracy improves, implying dense intermediate guidance is only required during early or low-accuracy batches.
  • Structural adherence to the think/answer format is learned quickly; the remaining performance improvements arise from the quality of intermediate reasoning, not from the format itself.

6. Implementation Considerations and Limitations

  • The RL fine-tuning framework requires no external tools and leverages token-level policy models compatible with any left-to-right LLM.
  • The interleaved approach is agnostic to model scale and can be plugged into existing PPO, GRPO, or REINFORCE++ infrastructures.
  • Empirical findings generalize across model sizes and architectures.
  • Limitations include reliance on ground-truth intermediate-answer annotations (for reward calculation) and potential reward hacking if gating conditions are improperly tuned.
  • The approach is extensible to other step-wise reasoning paradigms given adjustment of reward schedules and gating mechanisms.

7. Impact and Broader Connections

Reinforced interleaved reasoning constitutes a distinct paradigm that optimizes not just final accuracy but also reasoning-trajectory structure, interpretability, and efficiency. The improvements in TTFT are particularly relevant for real-world applications requiring rapid inference and partial intermediate reporting, such as interactive agents, step-wise tutoring, and multi-hop QA systems. The observed generalization to out-of-domain tasks suggests reinforced reasoning is a promising foundation for scalable, robust, and adaptive LLM-based reasoners.

This method belongs to a larger trend toward RL-guided step-wise reasoning, as surveyed by (Xu et al., 16 Jan 2025), in which reinforcement learning endows models with the capacity to generate, critique, and revise reasoning traces at granular levels, a key milestone toward unified “Large Reasoning Models.”