Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning (2404.19409v1)
Abstract: While Reinforcement Learning (RL) has proven essential for tuning LLMs, it can lead to reward over-optimization (ROO). Existing approaches address ROO by adding KL regularization, which requires computationally expensive hyperparameter tuning. Moreover, KL regularization focuses solely on regularizing the language policy, neglecting a potential source of regularization: the reward function itself. Inspired by demonstration-guided RL, we introduce Reward Calibration from Demonstration (RCfD), which leverages human demonstrations and a reward model to recalibrate the reward objective. Formally, given a prompt, the RCfD objective minimizes the distance between the demonstrations' and the LLM's rewards rather than directly maximizing the reward function. This objective shift avoids incentivizing the LLM to exploit the reward model and promotes more natural and diverse language generation. We demonstrate the effectiveness of RCfD on three language tasks, where it achieves performance comparable to carefully tuned baselines while mitigating ROO.
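The sketch below illustrates one reading of the RCfD objective described in the abstract: instead of maximizing the reward model's score, the policy is rewarded for matching the score obtained by a human demonstration on the same prompt. All names (`reward_model.score`, `rcfd_reward`) are hypothetical placeholders, not the authors' implementation.

```python
import torch

def rcfd_reward(reward_model, prompt, completion, demonstration):
    """Per-sample RCfD reward (illustrative sketch, not the paper's code).

    Standard RLHF maximizes reward_model.score(prompt, completion).  RCfD
    instead maximizes -|r(prompt, completion) - r(prompt, demonstration)|,
    i.e. it rewards matching the demonstration's reward level, which removes
    the incentive to push the reward model into unreliable regions.
    """
    with torch.no_grad():
        # Score the policy's completion and the human demonstration
        # with the same frozen reward model.
        r_policy = reward_model.score(prompt, completion)
        r_demo = reward_model.score(prompt, demonstration)
    # Negative distance: the closer the policy's reward is to the
    # demonstration's reward, the higher the RCfD reward.
    return -(r_policy - r_demo).abs()
```

Under this reading, the returned scalar would simply replace the raw reward inside any standard policy-gradient loop (e.g., PPO), leaving the rest of the RL pipeline unchanged.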
- Mathieu Rita
- Florian Strub
- Rahma Chaabouni
- Paul Michel
- Emmanuel Dupoux
- Olivier Pietquin