Understanding the performance gap between online and offline alignment algorithms
Abstract: Reinforcement learning from human feedback (RLHF) is the canonical framework for LLM alignment. However, the rising popularity of offline alignment algorithms challenges the need for on-policy sampling in RLHF. In the context of reward over-optimization, we begin with a set of experiments that demonstrate a clear advantage of online methods over offline methods. This prompts us to investigate the causes of the performance discrepancy through a series of carefully designed experimental ablations. We show empirically that hypotheses such as offline data coverage and data quality cannot, on their own, convincingly explain the performance difference. We also find that while offline algorithms train policies that become good at pairwise classification, they are worse at generation; meanwhile, policies trained by online algorithms are good at generation while being worse at pairwise classification. This hints at a unique interplay between discriminative and generative capabilities, which is greatly impacted by the sampling process. Lastly, we observe that the performance discrepancy persists for both contrastive and non-contrastive loss functions, and is not addressed by simply scaling up policy networks. Taken together, our study sheds light on the pivotal role of on-policy sampling in AI alignment and hints at fundamental challenges facing offline alignment algorithms.
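To make the offline/online distinction concrete, below is a minimal sketch (not the paper's code) of the standard DPO-style contrastive loss computed on a fixed preference dataset; the tensor names and toy values are illustrative placeholders. The online counterpart would instead sample fresh responses from the current policy each step and score them with a reward model before a policy-gradient (e.g., PPO) update.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Offline contrastive (DPO-style) loss: pushes the policy's log-ratio of the
    preferred response over the dispreferred one above that of the frozen reference."""
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()

# Toy usage: random sequence log-probabilities stand in for real model outputs.
chosen, rejected = torch.randn(8), torch.randn(8)
ref_chosen, ref_rejected = torch.randn(8), torch.randn(8)
print(dpo_loss(chosen, rejected, ref_chosen, ref_rejected).item())
```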