Bridging Offline and Online Reinforcement Learning for LLMs (2506.21495v1)
Abstract: We investigate the effectiveness of reinforcement learning methods for finetuning LLMs when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Relative Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.
Summary
- The paper demonstrates that semi-online RL achieves similar performance to fully online methods while significantly improving computational efficiency.
- It rigorously compares DPO and GRPO across verifiable and non-verifiable tasks, revealing distinct training dynamics and hyperparameter impacts.
- Findings highlight that combining mixed reward signals and careful tuning is essential for mitigating issues like response bias and entropy collapse.
Bridging Offline and Online Reinforcement Learning for LLMs: An Expert Analysis
This paper presents a comprehensive empirical study of reinforcement learning (RL) methods for post-training LLMs, focusing on the transition from offline to semi-online to fully online regimes. The work systematically compares Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) across both verifiable (e.g., math) and non-verifiable (e.g., open-ended instruction following) tasks, providing detailed insights into training dynamics, hyperparameter selection, and the interplay between reward types.
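For context, the two objectives under comparison have the following standard formulations from the broader literature; the paper's exact variants may add normalization, clipping, or reference-model handling not shown here. DPO minimizes

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses and $\pi_{\mathrm{ref}}$ is a frozen reference policy. GRPO samples a group of $G$ responses per prompt, scores each with a reward $r_i$, and plugs the group-relative advantage

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
$$

into a PPO-style clipped policy-gradient objective.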
Summary of Methods
The paper evaluates three principal RL-based post-training paradigms (a code sketch of this spectrum follows the list):
- Offline DPO: Trains on a static dataset of preference pairs, with no model updates during response generation.
- Semi-online DPO: Periodically synchronizes the generation model with the training model, allowing for more up-to-date responses while retaining some offline efficiency.
- Online DPO/GRPO: Synchronizes after every update, ensuring responses are always generated from the latest model parameters. GRPO, an on-policy variant inspired by PPO, leverages group-based relative advantages.
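A minimal Python sketch of how these regimes sit on a single axis, the synchronization interval between the frozen generation snapshot and the trainable policy. The helper callables (`generate_pairs`, `dpo_step`) and the PyTorch-style `state_dict` interface are illustrative assumptions rather than the paper's implementation:

```python
import copy
import random
from typing import Callable, Sequence

def train_dpo(
    policy,                              # assumed torch.nn.Module-like trainable model
    prompts: Sequence[str],
    generate_pairs: Callable,            # (generator, prompts) -> preference pairs (hypothetical)
    dpo_step: Callable,                  # (policy, pairs) -> one gradient update (hypothetical)
    sync_interval: int,                  # s = 1: online; 1 < s < num_steps: semi-online;
    num_steps: int,                      # s >= num_steps: effectively offline
    batch_size: int = 8,
):
    """The three regimes differ only in how often the generation snapshot is refreshed."""
    generator = copy.deepcopy(policy)    # frozen snapshot used for response generation
    for step in range(num_steps):
        if step % sync_interval == 0:
            generator.load_state_dict(policy.state_dict())  # periodic weight sync
        batch = random.sample(list(prompts), k=min(batch_size, len(prompts)))
        pairs = generate_pairs(generator, batch)            # responses come from the snapshot
        dpo_step(policy, pairs)                             # update applied to the live policy
    return policy
```

Setting `sync_interval=1` recovers fully online DPO, while a sufficiently large interval degenerates to offline training on increasingly stale generations; the semi-online regime studied in the paper lies in between.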
Both DPO and GRPO are evaluated on verifiable tasks (using binary rewards from a math verifier) and non-verifiable tasks (using scalar rewards from an LLM-based reward model).
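A hedged sketch of how the two reward types might be routed during training; `verify_math` and `score_with_reward_model` are hypothetical stand-ins for a Math-Verify wrapper and an Athene-RM-8B scorer, and the paper's actual interfaces may differ:

```python
from typing import Callable, Optional

def compute_reward(
    prompt: str,
    response: str,
    task_type: str,                                        # "verifiable" or "non_verifiable"
    verify_math: Callable[[str, Optional[str]], bool],     # e.g. a Math-Verify wrapper (assumed)
    score_with_reward_model: Callable[[str, str], float],  # e.g. an Athene-RM-8B scorer (assumed)
    gold_answer: Optional[str] = None,
) -> float:
    """Binary reward for verifiable (math) prompts, scalar reward-model score otherwise."""
    if task_type == "verifiable":
        # 1.0 if the extracted answer matches the gold answer, else 0.0
        return 1.0 if verify_math(response, gold_answer) else 0.0
    # Non-verifiable: scalar preference score from an LLM-based reward model
    return score_with_reward_model(prompt, response)
```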
Key Empirical Findings
The paper reports several strong and, in some cases, counterintuitive results:
- Semi-online and online methods consistently outperform offline DPO across both verifiable and non-verifiable tasks. For example, on the Math500 benchmark, online DPO achieves 58.7% accuracy versus 53.7% for offline DPO, and on AlpacaEval 2.0, online DPO achieves an 83.1% win rate versus 53.2% for offline DPO.
- Semi-online DPO matches the performance of fully online DPO and GRPO, suggesting that frequent but not continuous synchronization is sufficient for optimal performance. This challenges the prevailing assumption that fully online RL is necessary for best results.
- Combining verifiable and non-verifiable rewards during training yields improved generalization. Models trained jointly on both reward types outperform those trained on either type alone, particularly on non-verifiable benchmarks.
- No significant performance difference is observed between online DPO and GRPO in most settings, despite their differing theoretical underpinnings and sample efficiency characteristics.
- Response length and entropy collapse are critical failure modes. In verifiable tasks, a lack of reference-model synchronization can lead to degenerate short responses and reduced entropy, while in non-verifiable tasks, models may exploit reward-model biases by generating excessively long outputs; a simple monitoring sketch follows this list.
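Because these failure modes surface as shrinking token entropy and drifting response length, a cheap monitoring pass over sampled generations can flag them early. The sketch below assumes per-token log-probabilities are available from the sampler; the thresholds are illustrative, not values from the paper:

```python
from typing import Dict, Sequence

def generation_health(
    token_logprobs: Sequence[Sequence[float]],   # token_logprobs[i][t] = log p(sampled token t) in response i
    min_mean_entropy: float = 0.5,               # nats; illustrative threshold
    max_mean_length: float = 2048.0,             # tokens; illustrative threshold
) -> Dict[str, float]:
    """Flag entropy collapse (degenerate, low-diversity outputs) and length blow-up
    (a common symptom of reward-model length bias) from a batch of sampled responses.
    The negative mean log-probability of sampled tokens is a cheap Monte Carlo proxy
    for the policy's token-level entropy."""
    lengths = [len(lp) for lp in token_logprobs]
    mean_length = sum(lengths) / max(len(lengths), 1)
    proxy_entropy = -sum(sum(lp) for lp in token_logprobs) / max(sum(lengths), 1)
    return {
        "mean_length": mean_length,
        "proxy_entropy_nats": proxy_entropy,
        "entropy_collapse_suspected": float(proxy_entropy < min_mean_entropy),
        "length_blowup_suspected": float(mean_length > max_mean_length),
    }
```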
Practical Implications
The findings have several direct implications for practitioners:
- Semi-online DPO offers a compelling trade-off between computational efficiency and performance. By decoupling response generation from every training step, it enables parallelization and reduced inference costs, which is especially valuable for large-scale LLMs.
- Joint training on mixed reward types is feasible and beneficial. This supports the development of more generalist LLMs capable of both precise reasoning and open-ended instruction following.
- Hyperparameter tuning is critical for stability. The paper highlights the importance of Adam epsilon, learning rate, and gradient clipping, especially for DPO, to prevent training divergence; a hedged configuration sketch follows this list.
- Reward model selection and normalization strategies are essential to mitigate reward hacking and length bias, particularly in non-verifiable tasks.
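A minimal PyTorch sketch of the stability knobs called out above; the specific values are illustrative defaults rather than the paper's reported hyperparameters:

```python
import torch

def make_optimizer(
    model: torch.nn.Module,
    lr: float = 1e-6,          # illustrative; small learning rates are typical for DPO-style finetuning
    adam_eps: float = 1e-8,    # the paper flags Adam epsilon as a stability-relevant knob
) -> torch.optim.Optimizer:
    return torch.optim.AdamW(model.parameters(), lr=lr, eps=adam_eps)

def clipped_step(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    loss: torch.Tensor,
    max_grad_norm: float = 1.0,  # gradient clipping guards against divergence
) -> None:
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
```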
Implementation Considerations
The authors provide detailed implementation notes:
- Hardware: Experiments are conducted on 32 NVIDIA H200 GPUs for training and up to 16 for inference, reflecting the substantial computational requirements for large-scale RL post-training.
- Software: The training pipeline is built on fairseq2 and vLLM, with Ray used for distributed inference and reward computation (a minimal sketch of this pattern follows the list).
- Data: WildChat-1M is used for non-verifiable tasks, and NuminaMath for verifiable tasks, with Math-Verify as the verifier and Athene-RM-8B as the reward model.
- Batching and Synchronization: Semi-online DPO allows for asynchronous, batched response generation, which can be scaled with the synchronization interval parameter s.
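To make the distributed reward computation concrete, here is a minimal Ray sketch of scoring response shards in parallel. The `score_batch` body is a placeholder for a verifier or reward-model call; this illustrates the pattern rather than reproducing the authors' pipeline:

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def score_batch(prompts, responses):
    """Placeholder reward worker: in the paper's setup this slot is filled by a
    verifier (Math-Verify) or an LLM reward model (Athene-RM-8B), which would
    consume both prompts and responses."""
    return [float(len(r) > 0) for r in responses]   # dummy scores, for illustration only

def score_all(prompts, responses, num_shards: int = 4):
    """Shard the generations and score the shards in parallel Ray tasks."""
    shard = max(1, len(prompts) // num_shards)
    futures = [
        score_batch.remote(prompts[i:i + shard], responses[i:i + shard])
        for i in range(0, len(prompts), shard)
    ]
    results = ray.get(futures)                      # block until every shard is scored
    return [score for batch in results for score in batch]
```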
Limitations and Open Questions
- Theoretical guarantees for semi-online DPO and GRPO remain underexplored. While empirical results are strong, off-policy updates in GRPO and the impact of synchronization frequency in DPO lack formal analysis and warrant further study.
- Reward hacking and entropy collapse are persistent challenges, especially in the absence of robust reward models or verifiers.
- Generality across model architectures and tasks is not fully established, as experiments are limited to Llama-3.1-8B-Instruct.
Future Directions
- Extending semi-online RL to more complex, agentic, or asynchronous environments could further improve efficiency and applicability.
- Developing more robust, less bias-prone reward models for non-verifiable tasks would mitigate reward hacking and improve alignment.
- Exploring multi-reward and multi-task training at scale may yield more generalist and robust LLMs.
- Formalizing the theoretical properties of semi-online and hybrid RL methods would provide stronger guarantees and inform best practices.
Conclusion
This work provides a rigorous empirical foundation for the design of RL-based LLM post-training pipelines. The demonstration that semi-online DPO can match fully online methods in both verifiable and non-verifiable domains, while offering significant efficiency gains, is particularly notable. The results support a shift toward more flexible, asynchronous RL paradigms for LLM alignment, and highlight the importance of joint reward optimization for generalist model development. The detailed analysis of training dynamics, failure modes, and implementation strategies offers valuable guidance for both researchers and practitioners seeking to advance the state of LLM alignment.