Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards (2403.07708v2)
Abstract: Reinforcement learning from human feedback (RLHF) is the mainstream paradigm used to align large language models (LLMs) with human preferences. Yet existing RLHF relies heavily on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources, e.g., human labeling errors, making the pipeline fragile. In this work, we improve the effectiveness of the reward model by introducing a penalty term on the reward, named contrastive rewards. Our approach involves two steps: (1) an offline sampling step that obtains responses to prompts, which serve as a baseline for reward calculation, and (2) a contrastive reward computed from these baseline responses and used in the Proximal Policy Optimization (PPO) step. We show that contrastive rewards enable the LLM to penalize reward uncertainty, improve robustness, encourage improvement over the baseline, calibrate according to task difficulty, and reduce variance in PPO. We show empirically that contrastive rewards improve RLHF substantially, as evaluated by both GPT models and humans, and that our method consistently outperforms strong baselines.
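As a rough illustration of the two-step recipe described in the abstract, the sketch below caches a per-prompt baseline reward from offline samples and subtracts it from the reward model's score during PPO. This is a minimal sketch, not the authors' implementation: `reward_fn`, `baseline_policy`, and the choice of `k` samples per prompt are assumptions introduced for illustration.

```python
# Hypothetical sketch of the contrastive-reward idea: the callables below
# (reward_fn, baseline_policy) are assumed interfaces, not the paper's code.
from statistics import mean
from typing import Callable, Dict, List


def build_baseline_rewards(
    prompts: List[str],
    baseline_policy: Callable[[str], str],   # e.g. an SFT model used for offline sampling
    reward_fn: Callable[[str, str], float],  # learned reward model r(x, y)
    k: int = 4,                              # number of offline baseline samples per prompt
) -> Dict[str, float]:
    """Step 1 (offline): sample k baseline responses per prompt and cache their mean reward."""
    baselines: Dict[str, float] = {}
    for x in prompts:
        samples = [baseline_policy(x) for _ in range(k)]
        baselines[x] = mean(reward_fn(x, y) for y in samples)
    return baselines


def contrastive_reward(
    prompt: str,
    response: str,
    reward_fn: Callable[[str, str], float],
    baselines: Dict[str, float],
) -> float:
    """Step 2 (online, inside PPO): penalize the raw reward with the cached baseline,
    so the policy is credited only for improving over the offline baseline responses."""
    return reward_fn(prompt, response) - baselines[prompt]
```

In this reading, the subtraction acts as the penalty term on the reward: responses that do not beat the offline baseline receive a non-positive signal, which also reduces per-prompt reward variance during PPO.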
Authors: Wei Shen, Xiaoying Zhang, Yuanshun Yao, Rui Zheng, Hongyi Guo, Yang Liu