Reinforcement Learning from LLM Feedback to Counteract Goal Misgeneralization (2401.07181v1)
Abstract: We introduce a method to address goal misgeneralization in reinforcement learning (RL) by leveraging LLM feedback during training. Goal misgeneralization, a type of robustness failure in RL, occurs when an agent retains its capabilities out-of-distribution yet pursues a proxy goal rather than the intended one. Our approach uses LLMs to analyze an RL agent's policies during training and to identify potential failure scenarios. The RL agent is then deployed in these scenarios, and a reward model is learned from the LLM's preferences and feedback. This LLM-informed reward model is used to further train the RL agent on the original dataset. We apply our method to a maze navigation task and show marked improvements in goal generalization, especially in cases where the true and proxy goals are somewhat distinguishable and behavioral biases are pronounced. This study demonstrates how an LLM, despite its lack of task proficiency, can efficiently supervise RL agents, providing scalable oversight and valuable insights for enhancing goal-directed learning in RL.
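The abstract describes the training loop only at a high level. The sketch below is a minimal, assumption-laden illustration (not the authors' implementation) of how a pairwise preference reward model could be fit from LLM feedback in PyTorch. The names `RewardModel` and `query_llm_preference`, the stubbed rollouts, and the network sizes are all hypothetical placeholders: a real pipeline would summarize agent trajectories from the LLM-proposed failure scenarios, query an LLM for a preference between pairs, and then use the learned reward to continue training the agent (e.g., with PPO) on the original task.

```python
# Hypothetical sketch of the LLM-feedback reward-model loop; all names and
# shapes are placeholders, and the LLM query and rollouts are stubbed.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps an observation to a scalar reward, learned from LLM preferences."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

def query_llm_preference(summary_a: str, summary_b: str) -> int:
    """Stub for asking an LLM which trajectory better matches the intended goal.
    A real implementation would send text descriptions of the two rollouts to an
    LLM API; here we simply return a dummy preference (0 = trajectory A)."""
    return 0

def preference_loss(reward_model: RewardModel,
                    obs_a: torch.Tensor,
                    obs_b: torch.Tensor,
                    preferred: int) -> torch.Tensor:
    """Bradley-Terry style loss: the preferred trajectory should accumulate
    higher predicted reward than the other one."""
    r_a = reward_model(obs_a).sum()
    r_b = reward_model(obs_b).sum()
    logits = torch.stack([r_a, r_b]).unsqueeze(0)        # shape [1, 2]
    target = torch.tensor([preferred])                    # shape [1]
    return nn.functional.cross_entropy(logits, target)

# Toy training loop over hypothetical failure-scenario rollouts.
obs_dim = 8
reward_model = RewardModel(obs_dim)
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(100):
    # Stand-ins for two rollouts collected in an LLM-proposed failure scenario.
    obs_a = torch.randn(16, obs_dim)  # trajectory A observations
    obs_b = torch.randn(16, obs_dim)  # trajectory B observations
    preferred = query_llm_preference("summary of A", "summary of B")
    loss = preference_loss(reward_model, obs_a, obs_b, preferred)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The learned reward model would then replace or augment the environment
# reward when further training the RL agent on the original task.
```

The Bradley-Terry style objective above follows the standard learning-from-preferences recipe used in RLHF; it is one plausible reading of "a reward model is learned from the LLM's preferences and feedback," not a confirmed detail of the paper.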
- Houda Nait El Barj
- Theophile Sautory