
Abstract

The success of AI assistants built on LLMs hinges on Reinforcement Learning from Human Feedback (RLHF), which enables them to generate responses better aligned with human preferences. Because they serve as general-purpose assistants, these models are increasingly expected to perform consistently across diverse domains. However, previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards while overlooking challenging samples. This focus on quick reward gains undermines both training stability and the model's ability to generalize to new, unseen data. In this work, we propose a novel approach that learns a consistent policy via RL across diverse data groups or domains. Because group annotations are difficult to obtain, our method automatically partitions the data into groups, deliberately maximizing the variance in performance across them. We then optimize the policy to perform well on the most challenging groups. Finally, leveraging the established groups, our approach adaptively adjusts the exploration space, allocating more learning capacity to harder data while preventing the model from over-optimizing on simpler data. Experimental results show that our approach significantly improves training stability and model generalization.
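
To make the three mechanisms described above concrete, the sketch below shows how automatic group assignment, group-robust reweighting, and a per-group exploration budget could be wired together. It is a minimal illustration under stated assumptions, not the paper's implementation: the quantile-based group assignment, the Group-DRO-style exponential weights, the per-group KL coefficients, and all function names and hyperparameters are illustrative choices.

```python
# Minimal sketch (not the authors' released code) of a group-robust RLHF-style
# objective: samples are assigned to groups so that per-group performance differs,
# harder groups receive larger objective weights (Group-DRO style), and harder
# groups also get a looser KL/exploration budget. All names are illustrative.
import numpy as np

def infer_groups(rewards: np.ndarray, n_groups: int = 2) -> np.ndarray:
    """Assign samples to groups so that per-group mean rewards differ.
    A simple quantile split stands in for the variance-maximizing
    group assignment described in the abstract."""
    edges = np.quantile(rewards, np.linspace(0, 1, n_groups + 1)[1:-1])
    return np.digitize(rewards, edges)  # group id per sample

def group_weights(rewards: np.ndarray, groups: np.ndarray, temp: float = 1.0) -> np.ndarray:
    """Group-DRO-style weights: groups with lower mean reward
    (harder groups) get exponentially larger weight in the objective."""
    ids = np.unique(groups)
    means = np.array([rewards[groups == g].mean() for g in ids])
    w = np.exp(-means / temp)
    return w / w.sum()

def adaptive_kl_coeffs(weights: np.ndarray, base_kl: float = 0.1) -> np.ndarray:
    """Per-group exploration budget: harder groups (larger weight) get a
    smaller KL penalty, i.e. more room to move away from the reference policy."""
    slack = 1.0 - weights
    return base_kl * slack / slack.sum() * len(weights)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy rewards: one easy cluster and one hard cluster of prompts.
    rewards = np.concatenate([rng.normal(0.8, 0.1, 50), rng.normal(0.2, 0.1, 50)])
    groups = infer_groups(rewards, n_groups=2)
    w = group_weights(rewards, groups)
    kl = adaptive_kl_coeffs(w)
    print("group weights:", w)
    print("per-group KL coefficients:", kl)
```

In this toy setup, the low-reward group receives the larger objective weight and the looser KL penalty, matching the abstract's description of allocating more learning capacity to harder data while preventing over-optimization on simpler data.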

