Language Instructed Reinforcement Learning for Human-AI Coordination (2304.07297v2)

Published 13 Apr 2023 in cs.AI, cs.CL, cs.LG, and cs.MA

Abstract: One of the fundamental quests of AI is to produce agents that coordinate well with humans. This problem is challenging, especially in domains that lack high quality human behavioral data, because multi-agent reinforcement learning (RL) often converges to different equilibria from the ones that humans prefer. We propose a novel framework, instructRL, that enables humans to specify what kind of strategies they expect from their AI partners through natural language instructions. We use pretrained LLMs to generate a prior policy conditioned on the human instruction and use the prior to regularize the RL objective. This leads to the RL agent converging to equilibria that are aligned with human preferences. We show that instructRL converges to human-like policies that satisfy the given instructions in a proof-of-concept environment as well as the challenging Hanabi benchmark. Finally, we show that knowing the language instruction significantly boosts human-AI coordination performance in human evaluations in Hanabi.
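The abstract describes generating an instruction-conditioned prior policy with a pretrained LLM and using that prior to regularize the RL objective. Below is a minimal sketch of how such a regularization could look at action-selection time; the softmax prior, the additive log-prior term, the weight `lam`, and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def prior_from_scores(action_logprobs):
    """Turn per-action log-probabilities (e.g., an LLM scoring each action
    description under the human instruction) into a normalized prior policy."""
    logps = np.asarray(action_logprobs, dtype=float)
    p = np.exp(logps - logps.max())
    return p / p.sum()

def regularized_action(q_values, prior, lam=1.0):
    """Select argmax_a [ Q(s, a) + lam * log prior(a) ].
    The additive log-prior is one simple way to bias a value-based agent
    toward the instruction-conditioned reference policy; lam trades off
    environment reward against instruction-following."""
    return int(np.argmax(np.asarray(q_values) + lam * np.log(prior + 1e-8)))

# Toy usage: the LLM scores the instruction-consistent action 0 much higher,
# so the prior nudges the agent toward it even though Q slightly favors action 1.
prior = prior_from_scores([-0.1, -2.3])   # hypothetical LLM log-scores
q = [1.0, 1.2]
print(regularized_action(q, prior, lam=1.0))  # -> 0
```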

Authors (2)
  1. Hengyuan Hu (22 papers)
  2. Dorsa Sadigh (162 papers)
Citations (51)