Evaluating Language Model Agency through Negotiations (2401.04536v2)

Published 9 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce an approach to evaluate language model (LM) agency using negotiation games. This approach better reflects real-world use cases and addresses some of the shortcomings of alternative LM benchmarks. Negotiation games enable us to study multi-turn and cross-model interactions, modulate complexity, and side-step accidental evaluation data leakage. We use our approach to test six widely used and publicly accessible LMs, evaluating performance and alignment in both self-play and cross-play settings. Noteworthy findings include: (i) only the closed-source models tested here were able to complete these tasks; (ii) cooperative bargaining games proved to be most challenging to the models; and (iii) even the most powerful models sometimes "lose" to weaker opponents.


Summary

  • The paper introduces a negotiation-game framework that dynamically assesses LM decision-making in multi-turn, interactive settings.
  • Cooperative bargaining games prove the most challenging, exposing limitations even in the strongest models tested.
  • Self-play and cross-play experiments show that the open-source models tested were unable to complete the negotiation tasks.

Introduction

Language models (LMs) are being integrated into a growing number of systems in which they behave strikingly like human agents. This shift has created a need for benchmarks that assess not only the functional accuracy of LMs but also their decision-making in dynamic scenarios; traditional benchmarks, which offer largely static evaluation, fall short in this respect. The paper introduces a method of evaluating LMs through negotiation games, marking a move toward complex, interaction-based assessments that better reflect real-world applications.

Negotiation Games as Benchmarks

The proposed evaluation framework uses negotiation games because they mirror the intricate, multi-turn interactions that LMs increasingly have with users and with other models. They also allow both performance and alignment with intended behavior to be analyzed, including in cooperative settings. Unlike static benchmarks, negotiation games are dynamic: complexity can be modulated, and accidental evaluation data leakage, which could otherwise skew results, is side-stepped. A sketch of such a game loop follows.
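
To make the setup concrete, here is a minimal sketch of a two-agent negotiation loop in the spirit of the paper's framework. It is not the authors' actual library: the `complete` chat-completion call, the `ACCEPT` agreement marker, and the `Agent` container are illustrative assumptions.

```python
# Minimal sketch of a multi-turn negotiation between two LM agents.
# Assumptions (not the paper's code): a generic `complete(model, messages)`
# chat API and an "ACCEPT" token signalling agreement.
from dataclasses import dataclass

@dataclass
class Agent:
    model: str          # model identifier passed to the chat API
    system_prompt: str  # role, game rules, and private payoff description

def complete(model: str, messages: list[dict]) -> str:
    """Placeholder for a chat-completion call (e.g. an OpenAI-style API)."""
    raise NotImplementedError

def negotiate(a: Agent, b: Agent, max_turns: int = 10) -> list[tuple[Agent, str]]:
    """Alternate turns between two agents until agreement or the turn limit."""
    dialogue: list[tuple[Agent, str]] = []
    speaker, listener = a, b
    for _ in range(max_turns):
        # Rebuild the chat history from the speaker's point of view:
        # its own past turns are "assistant", the opponent's are "user".
        messages = [{"role": "system", "content": speaker.system_prompt}]
        for who, text in dialogue:
            role = "assistant" if who is speaker else "user"
            messages.append({"role": role, "content": text})
        utterance = complete(speaker.model, messages)
        dialogue.append((speaker, utterance))
        if "ACCEPT" in utterance:   # agreement reached, stop early
            break
        speaker, listener = listener, speaker
    return dialogue
```

Because complexity lives in the game definition (issues, payoffs, turn limits) rather than the loop, harder or easier benchmarks can be generated on demand, which is what lets this style of evaluation side-step data leakage.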

Empirical Results

The paper presents empirical studies in which six widely used, publicly accessible LMs from several providers were tested on a range of negotiation games. The models were assessed in self-play, where a model negotiates against a copy of itself, and in cross-play, where different models negotiate against one another. The results showed that the open-source models tested were not yet able to complete the negotiation tasks. Cooperative bargaining games were identified as particularly demanding, suggesting that tasks requiring collaboration pose significant challenges. Intriguingly, the most capable models did not consistently dominate: raw capability does not guarantee better negotiation outcomes, and even the strongest models sometimes "lose" to weaker opponents. The pairing scheme is sketched below.
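
As a rough illustration of how such an evaluation might be organized, the sketch below enumerates self-play and cross-play pairings over a pool of models and records which matchups reached agreement. It reuses the hypothetical `Agent` and `negotiate` from the earlier sketch; the model identifiers, prompts, and `reached_agreement` helper are assumptions, not the paper's code.

```python
# Illustrative self-play and cross-play pairings over a pool of models,
# reusing the hypothetical `Agent` and `negotiate` sketched earlier.
from itertools import combinations

MODELS = ["model-a", "model-b", "model-c"]  # placeholder identifiers

def reached_agreement(dialogue: list[tuple]) -> bool:
    """Did the final utterance accept the standing offer?"""
    return bool(dialogue) and "ACCEPT" in dialogue[-1][1]

def run_matchups() -> dict[tuple[str, str], bool]:
    self_play = [(m, m) for m in MODELS]            # model vs. a clone of itself
    cross_play = list(combinations(MODELS, 2))      # every distinct pairing
    results = {}
    for m1, m2 in self_play + cross_play:
        a = Agent(model=m1, system_prompt="You are negotiating; your payoffs are ...")
        b = Agent(model=m2, system_prompt="You are negotiating; your payoffs are ...")
        results[(m1, m2)] = reached_agreement(negotiate(a, b))
    return results
```

Comparing the self-play diagonal against cross-play pairs is what surfaces the paper's third finding: a model that negotiates well against itself can still concede too much to a nominally weaker opponent.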

Conclusion and Framework Availability

The observations in this paper indicate that there is no direct correlation between a model's raw capability and its ability to negotiate effectively, underscoring the need for new kinds of benchmarks that evaluate LMs more comprehensively. The authors have released their framework as an open library, inviting other researchers and the open-source community to replicate and build on the findings; the code and the produced data are available via a dedicated GitHub repository. This keeps progress in LM evaluation transparent and accessible to the broader research community.
