RPO: Partial Reasoning Optimization for LLMs
- The paper demonstrates that RPO reduces token generation by 95% while improving model performance compared to full-path RL methods.
- RPO employs an experience cache and truncated rollouts to lower computational overhead and reduce gradient variance.
- Integrated with GRPO and DAPO frameworks, RPO achieves up to +5.4% accuracy gains with significantly shorter training times.
Reinforcement Fine-Tuning with Partial Reasoning Optimization (RPO) is a class of reinforcement learning (RL) algorithms that improves the efficiency and stability of LLM fine-tuning by selectively generating and optimizing only parts of the reasoning trajectory rather than entire chain-of-thought (CoT) sequences. RPO substantially reduces the computational burden typical of RL-based fine-tuning, maintains or improves model performance relative to full-path RL algorithms, and integrates seamlessly with existing policy optimization frameworks such as Group-Relative Policy Optimization (GRPO) and Divergence-Aware Policy Optimization (DAPO) (Yi et al., 27 Jan 2026).
1. Motivation and Conceptual Background
Traditional RL fine-tuning approaches for LLMs (e.g., PPO, GRPO, DAPO) require rolling out a full reasoning sequence for each query in every training step. This paradigm incurs:
- Excessive token-generation overhead: rollouts may require thousands of tokens per step.
- Compute under-utilization, as gradients are only computed after all rollouts complete.
- High-variance, low-signal updates due to delayed sparse rewards and unanchored sequence prefixes.
Partial Reasoning Optimization addresses these inefficiencies with the observation that not all tokens in a reasoning trajectory contribute equally to the final task reward. By "replaying" high-reward prefixes from an experience cache and only generating suffixes de novo for optimization, RPO focuses computational effort where it is most impactful (Yi et al., 27 Jan 2026).
2. Formal Problem Statement and Objective
Let $\mathcal{D}$ be a dataset of queries. The LLM policy $\pi_\theta$ defines a distribution over trajectories $y = (y_1, \dots, y_T)$, where $y$ yields a sparse final reward $r(x, y)$. The canonical RL objective is augmented with a Kullback–Leibler regularization term to a reference policy $\pi_{\mathrm{ref}}$:

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] - \beta\, \mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),$$

with the gradient estimator decomposed token-wise using a clipped surrogate as in PPO/DAPO/GRPO:

$$\mathcal{L}(\theta) = \mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} \min\big(\rho_t \hat{A}_t,\ \mathrm{clip}(\rho_t, 1 - \epsilon, 1 + \epsilon)\, \hat{A}_t\big)\right].$$

Here, $\rho_t = \pi_\theta(y_t \mid x, y_{<t}) / \pi_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t})$ is the policy ratio at time $t$, and $\hat{A}_t$ is a normalized advantage (Yi et al., 27 Jan 2026).
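The token-level clipped surrogate above is standard PPO machinery and carries over to RPO unchanged. A minimal NumPy sketch (names such as `eps_clip` are illustrative, not from the paper):

```python
# Hedged sketch: PPO-style token-level clipped surrogate objective, as
# used by GRPO/DAPO and inherited by RPO. Illustrative only.
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, eps_clip=0.2):
    """Token-wise clipped policy-gradient objective (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)                   # rho_t per token
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps_clip, 1 + eps_clip) * advantages
    # Pessimistic min over clipped/unclipped terms, averaged over tokens.
    return np.minimum(unclipped, clipped).mean()

# Toy usage: three tokens with normalized advantages.
logp_old = np.log(np.array([0.20, 0.50, 0.30]))
logp_new = np.log(np.array([0.25, 0.45, 0.30]))
adv = np.array([1.0, -0.5, 0.2])
obj = clipped_surrogate(logp_new, logp_old, adv)
```

Here the first token's ratio (1.25) exceeds the clip bound 1.2, so its contribution is clipped; the others pass through unclipped.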
3. RPO Algorithm: Experience Replay and Truncated Rollouts
The core algorithmic contribution of RPO consists of:
- Experience Cache: For each query $x$, store the best-known full reasoning path $y^*$ by achieved reward.
- Truncated Rollout: Instead of generating the full trajectory, sample a truncation length $t$, retrieve the cached prefix $y^*_{\le t}$ (the first $t$ tokens), and only generate the $(T - t)$-token suffix. This drastically limits the length of rollouts needed during each optimization step.
- Suffix Optimization: For each truncated prefix, sample $G$ suffixes, compute intermediate rewards (optionally with length-aware shaping), and apply the clipped policy gradient update.
- Cache Update: After each gradient step, the cache is refreshed in an $\epsilon$-greedy manner, storing the highest-reward newly observed trajectory with probability $1 - \epsilon$.
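The cache-and-truncate loop above can be sketched as follows. The interfaces (`generate_suffix`, `eps`, `min_keep`) are placeholder assumptions for illustration, not the paper's implementation:

```python
# Hedged sketch of RPO's experience cache and truncated rollout.
# Policy, reward, and sampling interfaces are stand-ins.
import random

class ExperienceCache:
    """Maps each query to its best-known (reward, trajectory) pair."""
    def __init__(self):
        self.best = {}

    def update(self, query, trajectory, reward, eps=0.1):
        # Epsilon-greedy refresh: keep the highest-reward path, but with
        # probability eps accept the new one to preserve exploration.
        cur = self.best.get(query)
        if cur is None or reward > cur[0] or random.random() < eps:
            self.best[query] = (reward, trajectory)

def truncated_rollout(cache, query, generate_suffix, min_keep=0):
    """Replay a cached prefix and generate only the suffix de novo."""
    entry = cache.best.get(query)
    if entry is None:                         # cold start: full rollout
        return generate_suffix(query, prefix=[])
    _, traj = entry
    t = random.randint(min_keep, len(traj))   # sampled truncation point
    prefix = traj[:t]                         # replayed, not regenerated
    return prefix + generate_suffix(query, prefix=prefix)
```

Only the suffix call touches the model; the prefix tokens are replayed from the cache at zero generation cost.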
The process is formally described in a multi-level pseudocode (see (Yi et al., 27 Jan 2026)), with notation preserved for reproducibility.
Token reduction is quantified by the average fraction of tokens no longer generated, $1 - t/T$, yielding empirical rollouts reduced by approximately 95% in major experiments.
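A quick arithmetic check against the per-sample token averages reported in Section 6 (2689 to 146 for the 1.5B model, 2458 to 147 for the 7B model) confirms the roughly 95% figure:

```python
# Sanity check of the reported ~95% token reduction, using the
# per-sample averages quoted in the experimental section.
def reduction(full_tokens, rpo_tokens):
    """Fraction of generation eliminated relative to full rollouts."""
    return 1 - rpo_tokens / full_tokens

r_15b = reduction(2689, 146)   # 1.5B model
r_7b = reduction(2458, 147)    # 7B model
```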
4. Theoretical Properties and Analytical Results
Variance Reduction
By conditioning exploration on high-reward prefixes, RPO provably reduces gradient variance compared to full-path policy gradient approaches, $\mathrm{Var}[\hat{g}_{\mathrm{RPO}}] \le \mathrm{Var}[\hat{g}_{\mathrm{full}}]$, improving learning-signal stability and reducing susceptibility to policy collapse (Yi et al., 27 Jan 2026).
Reward Shaping and Bias-Variance Trade-offs
Inclusion of length-aware reward shaping, which grants slightly higher rewards to shorter correct completions, further reduces the mean squared error (MSE) of gradient estimation: $\mathrm{MSE}[\hat{g}_{\mathrm{shaped}}] \le \mathrm{MSE}[\hat{g}_{\mathrm{unshaped}}]$. A plausible implication is that the choice of shaping coefficient allows explicit control over the stability–diversity trade-off in learning dynamics (Yi et al., 27 Jan 2026).
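A minimal sketch of length-aware shaping, assuming a linear length penalty and an illustrative coefficient `lam` (both assumptions; the paper's exact shaping form may differ):

```python
# Hedged sketch of length-aware reward shaping: correct completions
# receive a small bonus that shrinks as the suffix gets longer.
def shaped_reward(base_reward, suffix_len, max_len, lam=0.05):
    """Add a length-decaying bonus to positive (correct) rewards only."""
    if base_reward <= 0:
        return base_reward                       # never shape failures
    bonus = lam * (1 - suffix_len / max_len)     # linear decay, assumed
    return base_reward + bonus
```

Under this form, a 10-token correct answer outranks a 90-token correct answer by a margin controlled entirely by `lam`, which is how the coefficient mediates the stability–diversity trade-off.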
5. Integration with Existing RLHF Frameworks
RPO is a minimally invasive modification that changes only the rollout sampling strategy; the policy-gradient formulation (including advantage normalization, clipped surrogate, and KL regularizer) remains identical. This plug-and-play property allows integration into a wide spectrum of RLHF algorithms:
- GRPO (Group-Relative Policy Optimization): RPO replaces the “generate full rollout” step with “retrieve-and-truncate + suffix generation” within the same group-based advantage estimation.
- DAPO (Divergence-Aware Policy Optimization): The surrogate loss and trust-region parameters are untouched; only rollout construction is optimized.
All existing hyperparameter schedules, trust-region settings, and optimization recipes carry over without change.
6. Experimental Evaluation and Empirical Impact
Experiments in (Yi et al., 27 Jan 2026) employ DeepSeek-R1-Qwen-Distill models (1.5B and 7B parameters) across mathematics and reasoning-focused benchmarks (AIME25, AIME24, MATH500, AMC23, Minerva, OlympiadBench), with zero-shot evaluation via LightEval. Key findings include:
| Model + Method | Token Reduction | Training Time (hours) | Zero-shot Accuracy (avg, 6 datasets) |
|---|---|---|---|
| 1.5B + GRPO | base | 77.3 | 49.1% |
| 1.5B + RPO+shaping | 95% | 8.4 | 51.7% |
| 7B + GRPO | base | 84.5 | 65.6% |
| 7B + RPO+shaping | 95% | 23.5 | 67.8% |
Further:
- Average tokens per sample reduced from 2689 to 146 (1.5B), 2458 to 147 (7B).
- RPO reduces wall-clock training time by approximately 89% (1.5B) and 72% (7B), while slightly improving final accuracy.
- In long-run training, GRPO exhibits response-length collapse and ∼8.6% accuracy degradation, while RPO maintains length stability and delivers up to +5.4% accuracy gain (see Figure 1 and Table 2 in (Yi et al., 27 Jan 2026)).
7. Limitations, Trade-offs, and Future Work
One limitation of RPO is a reduction in sample diversity due to reusing identical prefixes. To mitigate collapsed exploration, length-aware reward shaping is essential. Prospective directions include:
- Dynamic scheduling of prefix truncation lengths to encourage broader exploration.
- Multi-query caching (tree/graph-structured replay) for further sample efficiency.
- Application of RPO to multi-modal RL, code generation, and instruction-following settings (Yi et al., 27 Jan 2026).
8. Related Methodologies and Contrasts
Partial Reasoning Optimization is distinct from broader "partial reward" or "partial path" approaches in RL for LLMs:
- Partial Reward Functions: In text-to-SQL and multi-stage tasks, RL frameworks (e.g. (Pourreza et al., 29 Mar 2025)) design fine-grained reward signals (schema-linking, syntax check, execution correctness), but still generate full output sequences.
- Branch-based Reasoning Optimization: Reasoning Paths Optimization (Chia et al., 2024) samples diverse trajectory continuations for contrastive learning, but focuses on preference objectives at each step and requires generating multiple alternatives for every prompt prefix.
RPO, by contrast, directly reduces rollout length through experience-based truncation, with theoretical guarantees on variance reduction and plug-in compatibility with major RLHF algorithms (Yi et al., 27 Jan 2026). A plausible implication is that this approach sets a new baseline for efficient, scalable RL fine-tuning of LLMs in reasoning-heavy domains.
Key Reference:
- "RPO: Reinforcement Fine-Tuning with Partial Reasoning Optimization" (Yi et al., 27 Jan 2026)