
RPO: Partial Reasoning Optimization for LLMs

Updated 3 February 2026
  • The paper demonstrates that RPO reduces token generation by 95% while improving model performance compared to full-path RL methods.
  • RPO employs an experience cache and truncated rollouts to lower computational overhead and reduce gradient variance.
  • Integrated with GRPO and DAPO frameworks, RPO achieves up to +5.4% accuracy gains with significantly shorter training times.

Reinforcement Fine-Tuning with Partial Reasoning Optimization (RPO) is a class of reinforcement learning (RL) algorithms designed to enhance the efficiency and stability of LLM fine-tuning by selectively generating and optimizing only parts of the reasoning trajectory, rather than entire chain-of-thought (CoT) sequences. RPO enables substantial reductions in the computational burden typical of RL-based fine-tuning, maintains or improves model performance compared to full-path RL algorithms, and integrates seamlessly with existing policy optimization frameworks such as Group-Relative Policy Optimization (GRPO) and Divergence-Aware Policy Optimization (DAPO) (Yi et al., 27 Jan 2026).

1. Motivation and Conceptual Background

Traditional RL fine-tuning approaches for LLMs (e.g., PPO, GRPO, DAPO) require rolling out a full reasoning sequence for each query in every training step. This paradigm incurs:

  • Excessive token-generation overhead: rollouts may require thousands of tokens per step.
  • Compute under-utilization, as gradients are only computed after all rollouts complete.
  • High-variance, low-signal updates due to delayed sparse rewards and unanchored sequence prefixes.

Partial Reasoning Optimization addresses these inefficiencies with the observation that not all tokens in a reasoning trajectory contribute equally to the final task reward. By "replaying" high-reward prefixes from an experience cache and only generating suffixes de novo for optimization, RPO focuses computational effort where it is most impactful (Yi et al., 27 Jan 2026).

2. Formal Problem Statement and Objective

Let $\mathcal{D} = \{ x_k \}_{k=1}^N$ be a dataset of queries. The LLM policy $\pi_\theta$ defines a distribution over trajectories $r = (r_1, \dots, r_T) \sim \pi_\theta(\cdot \mid x)$, where the final token $r_T$ yields a sparse final reward $R(r)$. The canonical RL objective is augmented with a Kullback–Leibler regularization term toward a reference policy $\pi_{\mathrm{ref}}$: $$J(\theta) = \mathbb{E}_{r \sim \pi_\theta(\cdot \mid x)}\big[ R(r) \big] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big),$$ with the gradient estimator decomposed token-wise using a clipped surrogate as in PPO/GRPO/DAPO: $$\nabla_\theta J(\theta) = \mathbb{E} \left[ \frac{1}{T} \sum_{t=1}^T \min \big( r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \big) \right] - \beta\, \nabla_\theta D_{\mathrm{KL}}.$$ Here, $r_t(\theta)$ is the policy ratio at token $t$, and $\hat{A}_t$ is a normalized advantage (Yi et al., 27 Jan 2026).
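As a concrete illustration, the clipped token-wise surrogate can be computed in NumPy. This is a minimal sketch, not the paper's implementation: flat per-token arrays and a precomputed scalar `kl` penalty are simplifying assumptions.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2, beta=0.0, kl=0.0):
    """Token-wise clipped surrogate (PPO/GRPO/DAPO style), minus a KL penalty.

    logp_new / logp_old: per-token log-probs under the current and behavior
    policies; advantages: per-token normalized advantage estimates A_hat_t.
    """
    ratio = np.exp(logp_new - logp_old)                          # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic (min) term averaged over the T tokens, then the KL penalty.
    return np.mean(np.minimum(unclipped, clipped)) - beta * kl
```

When the current and behavior policies coincide (ratio 1 at every token), the objective reduces to the mean advantage, which serves as a quick sanity check.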

3. RPO Algorithm: Experience Replay and Truncated Rollouts

The core algorithmic contribution of RPO consists of:

  • Experience Cache $\mathcal{C}$: For each query $x$, store the best-known full reasoning path $r^\ast$ by achieved reward.
  • Truncated Rollout: Instead of generating the full trajectory, sample a truncation length $t$, retrieve the cached prefix $r^\ast_{\le t}$ (first $t$ tokens), and only generate the remaining $(T - t)$-token suffix. This drastically limits the length of rollouts needed during each optimization step.
  • Suffix Optimization: For each truncated prefix, sample $G$ suffixes, compute intermediate rewards (optionally with length-aware shaping), and apply the clipped policy-gradient update.
  • Cache Update: After each gradient step, the cache is refreshed in an $\varepsilon$-greedy manner, storing the highest-reward newly observed trajectory with probability $1 - \varepsilon$.
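The steps above can be sketched as a single rollout-construction routine. This is illustrative only: the cache layout, the `policy_sample` interface, and the exact ε-greedy refresh rule are assumptions, not the paper's API.

```python
import random

def rpo_step(query, cache, policy_sample, reward_fn, G=4, eps_greedy=0.1):
    """One RPO rollout-construction step (sketch, not the paper's code).

    cache maps query -> (best_trajectory, best_reward).
    policy_sample(prefix, query) generates a suffix continuing the prefix.
    """
    best_traj, best_reward = cache.get(query, ([], float("-inf")))
    # Truncated rollout: keep a random-length prefix of the cached best path.
    t = random.randint(0, len(best_traj))
    prefix = best_traj[:t]
    # Generate G fresh suffixes from the prefix and score the full paths.
    candidates = []
    for _ in range(G):
        suffix = policy_sample(prefix, query)
        traj = prefix + suffix
        candidates.append((traj, reward_fn(traj)))
    # Epsilon-greedy cache refresh with the best newly observed trajectory.
    new_best = max(candidates, key=lambda c: c[1])
    if new_best[1] > best_reward or random.random() < eps_greedy:
        cache[query] = new_best
    return candidates  # (trajectory, reward) pairs fed to the policy update
```

Only the suffix tokens are generated de novo; everything before the truncation point is replayed verbatim from the cache.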

The process is formally described in a multi-level pseudocode (see (Yi et al., 27 Jan 2026)), with notation preserved for reproducibility.

Token reduction is quantified by the expected fraction of each trajectory that is replayed from the cache rather than regenerated, $\mathbb{E}[t]/\mathbb{E}[T]$; empirically, generated rollout tokens are reduced by approximately 95% in the main experiments.

4. Theoretical Properties and Analytical Results

Variance Reduction

By conditioning exploration on high-reward prefixes, RPO provably reduces gradient variance compared to full-path policy-gradient approaches, $\mathrm{Var}\big[\hat{g}_{\mathrm{RPO}}\big] \le \mathrm{Var}\big[\hat{g}_{\mathrm{full}}\big]$, improving learning-signal stability and reducing susceptibility to policy collapse (Yi et al., 27 Jan 2026).
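A toy Monte Carlo experiment illustrates the mechanism. This is a simplified model with independent per-token reward contributions, not the paper's analysis: freezing a replayed prefix removes its contribution to the estimator's variance.

```python
import numpy as np

rng = np.random.default_rng(0)
T, t, n = 100, 80, 10_000  # trajectory length, replayed prefix length, samples

# Toy model: the reward is a sum of independent per-token contributions.
full = rng.normal(size=(n, T)).sum(axis=1)                   # sample all T tokens
prefix = rng.normal(size=t).sum()                            # frozen cached prefix
partial = prefix + rng.normal(size=(n, T - t)).sum(axis=1)   # sample suffix only

# Conditioning on the prefix shrinks the variance from ~T to ~(T - t).
print(full.var(), partial.var())
```

Under this toy model the variance drops by roughly the ratio of suffix length to full length, mirroring the conditioning argument in the inequality above.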

Reward Shaping and Bias-Variance Trade-offs

Inclusion of length-aware reward shaping, which assigns slightly higher rewards to shorter correct completions, further reduces the mean squared error (MSE) of gradient estimation by trading a small, controlled bias for a larger reduction in variance. A plausible implication is that the choice of shaping parameter $\lambda$ allows explicit control over the stability–diversity trade-off in learning dynamics (Yi et al., 27 Jan 2026).
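A minimal sketch of such shaping follows; the multiplicative form and the parameter name `lam` are assumptions for illustration, and the paper's exact shaping function may differ.

```python
def shaped_reward(base_reward, length, max_length, lam=0.05):
    """Length-aware shaping (sketch): shorter correct completions
    earn a small multiplicative bonus over longer ones."""
    if base_reward <= 0:          # only shape rewards of correct completions
        return base_reward
    return base_reward * (1.0 + lam * (1.0 - length / max_length))
```

The shaping parameter `lam` sets how strongly brevity is rewarded; `lam = 0` recovers the unshaped sparse reward.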

5. Integration with Existing RLHF Frameworks

RPO is a minimally invasive modification that changes only the rollout sampling strategy; the policy-gradient formulation (including advantage normalization, clipped surrogate, and KL regularizer) remains identical. This plug-and-play property allows integration into a wide spectrum of RLHF algorithms:

  • GRPO (Group-Relative Policy Optimization): RPO replaces the “generate full rollout” step with “retrieve-and-truncate + suffix generation” within the same group-based advantage estimation.
  • DAPO (Divergence-Aware Policy Optimization): The surrogate loss and trust-region parameters are untouched; only rollout construction is optimized.

All existing hyperparameter schedules, trust-region settings, and optimization recipes carry over without change.
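The plug-and-play property can be made concrete with a generic training step in which only the rollout function is swapped; all names below are illustrative stand-ins, not the frameworks' real APIs.

```python
def train_step(queries, rollout_fn, update_fn):
    """Generic RLHF step: only rollout_fn differs between the baseline and
    RPO variants; advantage estimation and the clipped update are untouched."""
    batch = [rollout_fn(q) for q in queries]
    return update_fn(batch)

# Hypothetical stand-ins for the two rollout strategies and the optimizer step.
full_rollout = lambda q: ("full path", q)            # e.g. plain GRPO
rpo_rollout = lambda q: ("cached prefix + suffix", q)  # GRPO + RPO
count_update = lambda batch: len(batch)              # placeholder update
```

Swapping `full_rollout` for `rpo_rollout` leaves every other component of the step unchanged, which is exactly the minimally invasive property claimed above.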

6. Experimental Evaluation and Empirical Impact

Experiments in (Yi et al., 27 Jan 2026) employ DeepSeek-R1-Qwen-Distill models (1.5B and 7B parameters) across mathematics and reasoning-focused benchmarks (AIME25, AIME24, MATH500, AMC23, Minerva, OlympiadBench), with zero-shot evaluation via LightEval. Key findings include:

| Model + Method | Token Reduction | Training Time (hours) | Zero-shot Accuracy (avg, 6 datasets) |
|---|---|---|---|
| 1.5B + GRPO (base) | baseline | 77.3 | 49.1% |
| 1.5B + RPO + shaping | 95% | 8.4 | 51.7% |
| 7B + GRPO (base) | baseline | 84.5 | 65.6% |
| 7B + RPO + shaping | 95% | 23.5 | 67.8% |

Further:

  • Average generated tokens per sample drop from 2689 to 146 (1.5B) and from 2458 to 147 (7B).
  • RPO achieves a 90% (1.5B) and 72% (7B) reduction in wall-clock training time while slightly improving final accuracy.
  • In long-run training, GRPO exhibits response-length collapse and ∼8.6% accuracy degradation, while RPO maintains length stability and delivers up to +5.4% accuracy gain (see Figure 1 and Table 2 in (Yi et al., 27 Jan 2026)).
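The per-sample token counts above are consistent with the headline figure of roughly 95% fewer generated tokens, as a quick check confirms:

```python
# Reduction in average generated tokens per sample (Section 6 figures).
for model, before, after in [("1.5B", 2689, 146), ("7B", 2458, 147)]:
    print(f"{model}: {100 * (1 - after / before):.1f}% fewer tokens")
# 1.5B: 94.6% fewer tokens; 7B: 94.0% fewer tokens
```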

7. Limitations, Trade-offs, and Future Work

One limitation of RPO is a reduction in sample diversity due to reusing identical prefixes. To mitigate collapsed exploration, length-aware reward shaping is essential. Prospective directions include:

  • Dynamic scheduling of prefix truncation lengths to encourage broader exploration.
  • Multi-query caching (tree/graph-structured replay) for further sample efficiency.
  • Application of RPO to multi-modal RL, code generation, and instruction-following settings (Yi et al., 27 Jan 2026).

8. Relation to Other Partial-Optimization Approaches

Partial Reasoning Optimization is distinct from broader "partial reward" and "partial path" approaches in RL for LLMs:

  • Partial Reward Functions: In text-to-SQL and multi-stage tasks, RL frameworks (e.g. (Pourreza et al., 29 Mar 2025)) design fine-grained reward signals (schema-linking, syntax check, execution correctness), but still generate full output sequences.
  • Branch-based Reasoning Optimization: Reasoning Paths Optimization (Chia et al., 2024) samples diverse trajectory continuations for contrastive learning, but focuses on preference objectives at each step and requires generating multiple alternatives for every prompt prefix.

RPO, by contrast, directly reduces rollout length through experience-truncation, with theoretical guarantees on variance and integrability with major RLHF algorithms (Yi et al., 27 Jan 2026). A plausible implication is that this approach sets a new baseline for efficient, scalable RL fine-tuning of LLMs in reasoning-heavy domains.


Key Reference: Yi et al., "RPO: Partial Reasoning Optimization for LLMs" (27 Jan 2026).
