Reward Model Ensembles Help Mitigate Overoptimization (2310.02743v2)

Published 4 Oct 2023 in cs.LG

Abstract: Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning LLMs to follow instructions. As part of this process, learned reward models are used to approximately model human preferences. However, as imperfect representations of the "true" reward, these learned reward models are susceptible to overoptimization. Gao et al. (2023) studied this phenomenon in a synthetic human feedback setup with a significantly larger "gold" reward model acting as the true reward (instead of humans) and showed that overoptimization remains a persistent problem regardless of the size of the proxy reward model and training data used. Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (BoN) and (b) proximal policy optimization (PPO). We additionally extend the setup of Gao et al. (2023) to include 25% label noise to better mirror real-world conditions. Both with and without label noise, we find that conservative optimization practically eliminates overoptimization and improves performance by up to 70% for BoN sampling. For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization. Moreover, combining it with a small KL penalty successfully prevents overoptimization at no performance cost. Overall, our results demonstrate that ensemble-based conservative optimization can effectively counter overoptimization.
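The abstract names two ensemble-based conservative objectives, worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), evaluated under best-of-n (BoN) sampling and PPO. The sketch below is a minimal Python illustration of one plausible reading of these objectives: WCO scores a candidate by the minimum reward across ensemble members, while UWO penalizes the ensemble mean by the intra-ensemble variance. The function names, array shapes, and the weighting coefficient `lam` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def wco(ensemble_scores: np.ndarray) -> np.ndarray:
    """Worst-case optimization (WCO): score each candidate by the minimum
    reward across ensemble members. ensemble_scores has shape (members, candidates)."""
    return ensemble_scores.min(axis=0)

def uwo(ensemble_scores: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Uncertainty-weighted optimization (UWO): mean reward minus the
    intra-ensemble variance, scaled by an assumed coefficient `lam`."""
    return ensemble_scores.mean(axis=0) - lam * ensemble_scores.var(axis=0)

def best_of_n(candidates, ensemble_scores, objective=wco):
    """Best-of-n (BoN) sampling: pick the candidate whose conservative
    ensemble score is highest."""
    return candidates[int(np.argmax(objective(ensemble_scores)))]

# Toy usage: 3 proxy reward models scoring 3 sampled completions.
completions = ["resp_a", "resp_b", "resp_c"]
scores = np.array([
    [0.1, 1.2, 0.60],
    [0.2, 0.2, 0.55],
    [0.0, 1.0, 0.58],
])
print(best_of_n(completions, scores, objective=wco))  # worst-case picks resp_c (highest minimum)
print(best_of_n(completions, scores, objective=uwo))  # mean-minus-variance picks resp_b here
```

With this toy data the two objectives select different completions, which mirrors the design choice behind them: WCO is maximally conservative, while UWO trades off average reward against ensemble disagreement.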

References (55)
  1. Improving multimodal interactive agents with reinforcement learning from human feedback. arXiv preprint arXiv:2211.11602, 2022.
  2. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
  3. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022b.
  4. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp.  2397–2430. PMLR, 2023.
  5. Convex optimization. Cambridge university press, 2004.
  6. Disagreement-regularized imitation learning. In International Conference on Learning Representations, 2019.
  7. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.
  8. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  9. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in neural information processing systems, 31, 2018.
  10. Faulty reward functions in the wild. https://openai.com/research/faulty-reward-functions, 2016.
  11. Learning and policy search in stochastic dynamical systems with bayesian neural networks. arXiv preprint arXiv:1605.07127, 2016.
  12. Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp.  1–15. Springer, 2000.
  13. RAFT: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.
  14. AlpacaFarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023.
  15. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, International Conference on Machine Learning, 2016.
  16. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp.  10835–10866. PMLR, 2023.
  17. Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
  18. Mikael Henaff. Explicit explore-exploit algorithms in continuous state spaces. Advances in Neural Information Processing Systems, 32, 2019.
  19. Reward learning from human preferences and demonstrations in atari. Advances in neural information processing systems, 31, 2018.
  20. OpenAssistant conversations – democratizing large language model alignment. arXiv preprint arXiv:2304.07327, 2023.
  21. Specification gaming: The flip side of AI ingenuity. April 2020. URL https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity.
  22. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017.
  23. PEBBLE: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training. arXiv preprint arXiv:2106.05091, 2021.
  24. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. Artificial life, 26(2):274–306, 2020.
  25. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
  26. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
  27. Timothy Prickett Morgan. Counting the cost of training large language models, 2022. URL https://www.nextplatform.com/2022/12/01/counting-the-cost-of-training-large-language-models/.
  28. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
  29. OpenAI. OpenAI models. https://platform.openai.com/docs/models/, 2023a.
  30. OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023b.
  31. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  32. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. Advances in neural information processing systems, 32, 2019.
  33. The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv preprint arXiv:2201.03544, 2022.
  34. Self-supervised exploration via disagreement. In International conference on machine learning, pp.  5062–5071. PMLR, 2019.
  35. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  36. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802, 2022.
  37. Training language models with language feedback at scale. arXiv preprint arXiv:2303.16755, 2023.
  38. John Schulman. Approximating kl divergence, 2020. URL http://joschu.net/blog/kl-approx.html.
  39. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  40. Model-based active exploration. In International conference on machine learning, pp.  5779–5788. PMLR, 2019.
  41. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022.
  42. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  43. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  44. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
  45. Mosaic LLMs (Part 2): GPT-3 quality for <$500k, 2022. URL https://www.mosaicml.com/blog/gpt-3-quality-for-500k.
  46. Stephen J Wright. Numerical optimization. 2006.
  47. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862, 2021a.
  48. Uncertainty weighted actor-critic for offline reinforcement learning. arXiv preprint arXiv:2105.08140, 2021b.
  49. MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020a.
  50. MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020b.
  51. RRHF: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023.
  52. SLiC-HF: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.
  53. Secrets of RLHF in large language models, Part I: PPO. arXiv preprint arXiv:2307.04964, 2023.
  54. Consequences of misaligned AI. Advances in Neural Information Processing Systems, 33:15763–15773, 2020.
  55. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
Authors (4)
  1. Thomas Coste (5 papers)
  2. Usman Anwar (14 papers)
  3. Robert Kirk (21 papers)
  4. David Krueger (75 papers)
Citations (81)