Evaluating Language Model Agency through Negotiations (2401.04536v2)

Published 9 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce an approach to evaluate language model (LM) agency using negotiation games. This approach better reflects real-world use cases and addresses some of the shortcomings of alternative LM benchmarks. Negotiation games enable us to study multi-turn and cross-model interactions, modulate complexity, and side-step accidental evaluation data leakage. We use our approach to test six widely used and publicly accessible LMs, evaluating performance and alignment in both self-play and cross-play settings. Noteworthy findings include: (i) only the closed-source models tested here were able to complete these tasks; (ii) cooperative bargaining games proved to be most challenging to the models; and (iii) even the most powerful models sometimes "lose" to weaker opponents.


Summary

  • The paper introduces a negotiation-game framework that dynamically assesses LM decision-making in multi-turn, interactive settings.
  • Cooperative bargaining games prove the most challenging, exposing limitations even in the strongest models tested.
  • Self-play and cross-play experiments show that the open-source models tested were unable to complete the negotiation tasks.

Introduction

Language models (LMs) are being integrated into a growing number of systems in which they behave strikingly like human agents. This shift has created a need for benchmarks that assess not only the functional accuracy of LMs but also their decision-making in dynamic scenarios; traditional benchmarks, which offer largely static evaluation, fall short in this respect. The paper introduces a method of evaluating LMs through negotiation games, marking a move toward complex, interaction-based assessments that better reflect real-world applications.

Negotiation Games as Benchmarks

The proposed evaluation framework uses negotiation games because they mirror the intricate, multi-turn interactions that LMs increasingly have with users and with other models. They also allow both performance and alignment with intended behavior to be analyzed, including in cooperative settings. Unlike static benchmarks, negotiation games are dynamic: complexity can be modulated, and accidental evaluation data leakage, which could otherwise skew results, is side-stepped. A sketch of such a game loop follows.
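
To make the setup concrete, here is a minimal sketch of a two-agent negotiation loop in the spirit of the paper's framework. It is not the authors' actual library: the `complete` chat-completion call, the `ACCEPT` agreement marker, and the `Agent` container are illustrative assumptions.

```python
# Minimal sketch of a multi-turn negotiation between two LM agents.
# Assumptions (not the paper's code): a generic `complete(model, messages)`
# chat API and an "ACCEPT" token signalling agreement.
from dataclasses import dataclass

@dataclass
class Agent:
    model: str          # model identifier passed to the chat API
    system_prompt: str  # role, game rules, and private payoff description

def complete(model: str, messages: list[dict]) -> str:
    """Placeholder for a chat-completion call (e.g. an OpenAI-style API)."""
    raise NotImplementedError

def negotiate(a: Agent, b: Agent, max_turns: int = 10) -> list[tuple[Agent, str]]:
    """Alternate turns between two agents until agreement or the turn limit."""
    dialogue: list[tuple[Agent, str]] = []
    speaker, listener = a, b
    for _ in range(max_turns):
        # Rebuild the chat history from the speaker's point of view:
        # its own past turns are "assistant", the opponent's are "user".
        messages = [{"role": "system", "content": speaker.system_prompt}]
        for who, text in dialogue:
            role = "assistant" if who is speaker else "user"
            messages.append({"role": role, "content": text})
        utterance = complete(speaker.model, messages)
        dialogue.append((speaker, utterance))
        if "ACCEPT" in utterance:   # agreement reached, stop early
            break
        speaker, listener = listener, speaker
    return dialogue
```

Because complexity lives in the game definition (issues, payoffs, turn limits) rather than the loop, harder or easier benchmarks can be generated on demand, which is what lets this style of evaluation side-step data leakage.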

Empirical Results

The paper presents empirical studies in which six widely used, publicly accessible LMs from several providers were tested on a range of negotiation games. The models were assessed in self-play, where a model negotiates against a copy of itself, and in cross-play, where different models negotiate against one another. The results showed that the open-source models tested were not yet able to complete the negotiation tasks. Cooperative bargaining games were identified as particularly demanding, suggesting that tasks requiring collaboration pose significant challenges. Intriguingly, the most capable models did not consistently dominate: raw capability does not guarantee better negotiation outcomes, and even the strongest models sometimes "lose" to weaker opponents. The pairing scheme is sketched below.
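
As a rough illustration of how such an evaluation might be organized, the sketch below enumerates self-play and cross-play pairings over a pool of models and records which matchups reached agreement. It reuses the hypothetical `Agent` and `negotiate` from the earlier sketch; the model identifiers, prompts, and `reached_agreement` helper are assumptions, not the paper's code.

```python
# Illustrative self-play and cross-play pairings over a pool of models,
# reusing the hypothetical `Agent` and `negotiate` sketched earlier.
from itertools import combinations

MODELS = ["model-a", "model-b", "model-c"]  # placeholder identifiers

def reached_agreement(dialogue: list[tuple]) -> bool:
    """Did the final utterance accept the standing offer?"""
    return bool(dialogue) and "ACCEPT" in dialogue[-1][1]

def run_matchups() -> dict[tuple[str, str], bool]:
    self_play = [(m, m) for m in MODELS]            # model vs. a clone of itself
    cross_play = list(combinations(MODELS, 2))      # every distinct pairing
    results = {}
    for m1, m2 in self_play + cross_play:
        a = Agent(model=m1, system_prompt="You are negotiating; your payoffs are ...")
        b = Agent(model=m2, system_prompt="You are negotiating; your payoffs are ...")
        results[(m1, m2)] = reached_agreement(negotiate(a, b))
    return results
```

Comparing the self-play diagonal against cross-play pairs is what surfaces the paper's third finding: a model that negotiates well against itself can still concede too much to a nominally weaker opponent.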

Conclusion and Framework Availability

The observations in this paper indicate that there is no direct correlation between a model's raw capability and its ability to negotiate effectively, underscoring the need for new kinds of benchmarks that evaluate LMs more comprehensively. The authors have released their framework as an open library, inviting other researchers and the open-source community to replicate and build on the findings; the code and the produced data are available via a dedicated GitHub repository. This keeps progress in LM evaluation transparent and accessible to the broader research community.
