Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation (2403.05171v2)

Published 8 Mar 2024 in cs.LG and cs.AI

Abstract: We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward overoptimization in Reinforcement Learning from Human Feedback (RLHF) for LLMs. Overoptimization occurs when a reward model serves as an imperfect proxy for human preference, and RL-driven policy optimization erroneously exploits reward inaccuracies. In this paper, we begin by introducing a lightweight way to quantify uncertainties in rewards, relying solely on the last-layer embeddings of the reward model, without the need for computationally expensive reward ensembles. AdvPO then addresses a distributionally robust optimization problem centred around the confidence interval of the reward model's predictions for policy improvement. Through comprehensive experiments on the Anthropic HH and TL;DR summarization datasets, we illustrate the efficacy of AdvPO in mitigating the overoptimization issue, resulting in enhanced performance under human-assisted evaluation.
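
The abstract describes two ingredients: a lightweight reward-uncertainty estimate built only from the reward model's last-layer embeddings, and policy improvement that respects the resulting confidence interval. The sketch below is a minimal illustration of that idea, not the authors' implementation: the names `ref_embeddings`, `ridge`, and `beta` are assumptions, the uncertainty is the standard bandit-style width sqrt(phi^T (Phi^T Phi + lambda I)^{-1} phi) over last-layer embeddings, and the final function shows a simple uncertainty-penalized reward rather than AdvPO's adversarial (distributionally robust) objective.

```python
import torch

# Minimal sketch (assumed names, not the paper's code): reward uncertainty from
# last-layer reward-model embeddings, and a pessimistic reward adjustment.

def fit_inverse_covariance(ref_embeddings: torch.Tensor, ridge: float = 1.0) -> torch.Tensor:
    """Return A^{-1} = (Phi^T Phi + ridge * I)^{-1}, where Phi is the (N, d)
    matrix of last-layer embeddings collected from a reference/training set."""
    d = ref_embeddings.shape[1]
    A = ref_embeddings.T @ ref_embeddings + ridge * torch.eye(d, dtype=ref_embeddings.dtype)
    return torch.linalg.inv(A)

def reward_uncertainty(phi: torch.Tensor, a_inv: torch.Tensor) -> torch.Tensor:
    """Bandit-style width sqrt(phi_b^T A^{-1} phi_b) for each row of phi (B, d)."""
    return torch.sqrt(torch.einsum("bd,de,be->b", phi, a_inv, phi))

def penalized_reward(reward: torch.Tensor, phi: torch.Tensor,
                     a_inv: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Shrink the proxy reward by beta times its uncertainty before policy
    optimization -- a simple stand-in for optimizing against the confidence
    interval, not AdvPO's distributionally robust formulation."""
    return reward - beta * reward_uncertainty(phi, a_inv)
```

In an RLHF loop, such an uncertainty-adjusted reward would replace the raw proxy reward fed to a PPO-style optimizer, so that responses whose rewards the model is least certain about are not over-exploited.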

Authors (5)
  1. Xiaoying Zhang (32 papers)
  2. Jean-Francois Ton (25 papers)
  3. Wei Shen (181 papers)
  4. Hongning Wang (107 papers)
  5. Yang Liu (2253 papers)
Citations (9)