- The paper introduces LLM-Hanabi, a benchmark that quantifies Theory-of-Mind and rationale inference through natural language gameplay in Hanabi.
- It demonstrates that first-order ToM, in which the hint's recipient infers the hinter's intent, is more predictive of game success than second-order reasoning.
- Experimental results indicate that large reasoning models outperform standard LLMs, suggesting that enhancing direct rationale inference is crucial for improved collaborative AI.
Motivation and Context
The paper introduces LLM-Hanabi, a benchmark designed to rigorously evaluate the Theory-of-Mind (ToM) and rationale inference capabilities of LLMs and Large Reasoning Models (LRMs) in dynamic, multi-agent collaborative environments. Unlike prior ToM benchmarks that focus on static, text-based tasks, LLM-Hanabi leverages the cooperative card game Hanabi, which is characterized by imperfect information and the necessity for agents to infer intent from sparse, ambiguous communication. This environment provides a controlled yet challenging testbed for assessing the inferential and collaborative reasoning skills of LLMs.
LLM-Hanabi Benchmark Design
LLM-Hanabi operationalizes the Hanabi game by translating all game states and actions into natural language, enabling LLM-driven agents to interact with the environment and each other through structured prompts. The benchmark is fully automated, supporting scalable evaluation across a wide range of models.
Figure 1: Overview of the gameplay and evaluation process of LLM-Hanabi.
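As a rough illustration of this translation step, the sketch below renders a toy game state as text from one player's imperfect-information viewpoint. The dataclass fields and prompt wording are illustrative assumptions, not the benchmark's actual schema.

```python
# Minimal sketch of turning a Hanabi game state into a natural-language
# prompt. Field names and phrasing are assumptions for illustration.
from dataclasses import dataclass, field


@dataclass
class HanabiState:
    hint_tokens: int
    fuse_tokens: int
    fireworks: dict            # color -> highest rank played, e.g. {"red": 2}
    visible_hands: dict        # player name -> list of card strings
    discard_pile: list = field(default_factory=list)


def state_to_prompt(state: HanabiState, player: str) -> str:
    """Render the state as text from one player's (imperfect) viewpoint."""
    lines = [
        f"You are {player}. You cannot see your own cards.",
        f"Hint tokens: {state.hint_tokens}, fuse tokens: {state.fuse_tokens}.",
        "Fireworks: " + ", ".join(
            f"{color} up to {rank}" for color, rank in state.fireworks.items()
        ),
    ]
    for other, hand in state.visible_hands.items():
        if other != player:  # only partners' hands are visible
            lines.append(f"{other}'s hand: {', '.join(hand)}.")
    lines.append("Choose an action: play, discard, or hint.")
    return "\n".join(lines)


state = HanabiState(
    hint_tokens=6,
    fuse_tokens=3,
    fireworks={"red": 1, "blue": 0},
    visible_hands={"Alice": ["red 2", "blue 1"], "Bob": ["white 5", "green 3"]},
)
print(state_to_prompt(state, player="Alice"))
```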
The evaluation framework is centered on two axes:
- Game Performance: Quantified by the final Hanabi score, reflecting the agents' ability to achieve the cooperative objective under imperfect information.
- ToM Performance: Assessed via structured extraction of rationale and ToM statements during gameplay, followed by post-hoc scoring using an LLM-as-a-judge.
Theory-of-Mind Evaluation Protocol
For each hint action, the following are extracted (a parsing sketch follows the list):
- Rationale: The hinter's explicit justification for the hint.
- First-Order ToM: The recipient's interpretation of the hinter's intent.
- Second-Order ToM: The hinter's prediction of how the recipient will interpret the hint.
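A minimal parsing sketch for this extraction step, assuming agents are prompted to wrap these statements in simple tags; the tag names and example outputs are hypothetical, not the benchmark's exact format.

```python
# Sketch of extracting the three statements logged for each hint.
# Tags and sample outputs are illustrative assumptions.
import re

HINTER_OUTPUT = """
<rationale>I hint "red" so Bob plays his red 2 on the red stack.</rationale>
<second_order>Bob will read this as "play the red card now".</second_order>
"""

RECIPIENT_OUTPUT = """
<first_order>Alice is telling me my red card is playable.</first_order>
"""


def extract(tag: str, text: str) -> str:
    """Return the content of <tag>...</tag>, or an empty string."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else ""


record = {
    "rationale": extract("rationale", HINTER_OUTPUT),
    "second_order_tom": extract("second_order", HINTER_OUTPUT),
    "first_order_tom": extract("first_order", RECIPIENT_OUTPUT),
}
print(record)
```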
After each game, the LLM-as-a-judge computes:
- First-Order ToM Score: Alignment between rationale and recipient's interpretation.
- Second-Order ToM Score: Alignment between recipient's interpretation and hinter's prediction.
This protocol enables quantitative, scalable assessment of both direct and higher-order inferential reasoning.
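The following sketch shows the shape of this judging step. The 1-5 alignment rubric is an assumption, and `call_judge` is a stub standing in for whatever judge-model API is actually used.

```python
# LLM-as-a-judge sketch: score how well one statement matches another.
# The rubric wording and 1-5 scale are assumptions for illustration.
JUDGE_TEMPLATE = """Rate from 1 to 5 how well statement B matches the intent
of statement A. Reply with the number only.
A: {a}
B: {b}"""


def call_judge(prompt: str) -> str:
    # Placeholder: in practice this would query a judge model's API.
    return "4"


def alignment_score(a: str, b: str) -> float:
    reply = call_judge(JUDGE_TEMPLATE.format(a=a, b=b))
    return float(reply.strip())


rationale = "I hint red so Bob plays his red 2 next turn."
first_order = "Alice is telling me my red card is playable."
prediction = "Bob will read the hint as: play your red card."

# First-order ToM score: rationale vs. the recipient's interpretation.
s1 = alignment_score(rationale, first_order)
# Second-order ToM score: recipient's interpretation vs. hinter's prediction.
s2 = alignment_score(first_order, prediction)
print(s1, s2)
```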
Experimental Evaluation
Model Selection and Setup
The benchmark evaluates a diverse set of LLMs and LRMs, including open-source and proprietary models spanning a range of parameter counts and architectures (e.g., Llama-3.1, Llama-4, Deepseek-R1, QwQ-32B, Qwen3, GPT-4.1, Gemini-2.5). All models are prompted using Chain-of-Thought (CoT) reasoning and are evaluated in a 5-player Hanabi configuration to maximize interaction complexity.
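A hedged sketch of what such a harness setup might look like in code; the model list comes from the paper, but the remaining settings and the CoT instruction wording are placeholders rather than the paper's reported configuration.

```python
# Illustrative evaluation configuration. Only the model names are from
# the paper; repetition count, temperature, and prompt are assumptions.
MODELS = [
    "Llama-3.1", "Llama-4", "Deepseek-R1",
    "QwQ-32B", "Qwen3", "GPT-4.1", "Gemini-2.5",
]

EVAL_CONFIG = {
    "num_players": 5,       # 5-player games maximize interaction complexity
    "games_per_model": 10,  # assumed repetition count
    "temperature": 0.0,     # assumed decoding setting
}

COT_SYSTEM_PROMPT = (
    "You are playing 5-player Hanabi. Before each action, reason step by "
    "step about what your partners' hints imply about your hidden cards."
)
```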
Metrics
- Game Score: Sum of the highest card rank successfully played on each color stack (maximum 25).
- ToM Score: Mean of the first- and second-order ToM scores across all hint interactions (computed as in the sketch below).
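Both metrics reduce to simple aggregations; here is a minimal sketch, assuming fireworks are stored as a color-to-rank map and per-hint ToM scores share a common scale.

```python
# Minimal computation of the two benchmark metrics under assumed shapes.
def game_score(fireworks: dict) -> int:
    """Sum of the highest rank played per color stack (max 25)."""
    return sum(fireworks.values())


def tom_score(first_order: list, second_order: list) -> float:
    """Mean of all first- and second-order alignment scores."""
    scores = first_order + second_order
    return sum(scores) / len(scores)


print(game_score({"red": 5, "blue": 4, "green": 3, "white": 5, "yellow": 2}))  # 19
print(tom_score([4, 5, 3], [2, 3, 3]))  # 3.33...
```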
Results
LRMs consistently outperform standard LLMs on both game and ToM metrics. Deepseek-R1 achieves the highest average game score, with QwQ-32B and GPT-4.1 also demonstrating strong performance. In the ToM evaluation, GPT-4.1 and Deepseek-R1 are the top performers among standard LLMs and LRMs, respectively.
A salient finding is the persistent gap between first-order and second-order ToM scores across all models: first-order ToM (recipient's inference of intent) is significantly higher than second-order ToM (hinter's prediction of recipient's inference). This suggests that while models are adept at direct rationale inference, they struggle with recursive mental state modeling.
Figure 2: Correlation analysis between game score and theory-of-mind performance.
Correlation analysis reveals a strong positive relationship between ToM proficiency and game performance. Notably, first-order ToM exhibits a higher correlation with game success (r=0.76) than second-order ToM (r=0.58), indicating that the ability to accurately interpret a partner's rationale is more critical for effective collaboration than higher-order belief modeling.
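For concreteness, this analysis reduces to Pearson coefficients over per-model scores, as sketched below; the arrays hold made-up placeholder numbers, not the paper's data (the reported values are r=0.76 and r=0.58).

```python
# Pearson correlation between game score and ToM scores across models.
# The data below are placeholders, not the benchmark's results.
import numpy as np

game = np.array([12.0, 15.5, 18.0, 21.0, 23.5])      # per-model game scores
first_order = np.array([3.1, 3.4, 3.9, 4.2, 4.6])    # first-order ToM scores
second_order = np.array([2.5, 2.4, 3.0, 3.2, 3.1])   # second-order ToM scores

r_first = np.corrcoef(game, first_order)[0, 1]
r_second = np.corrcoef(game, second_order)[0, 1]
print(f"first-order r = {r_first:.2f}, second-order r = {r_second:.2f}")
```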
Implications and Future Directions
The results have several implications for the development of collaborative AI systems:
- First-Order ToM as a Priority: The empirical evidence that first-order ToM is a stronger predictor of collaborative success than second-order ToM suggests that future model development and alignment efforts should prioritize direct rationale inference capabilities.
- Benchmark Utility: LLM-Hanabi provides a scalable, automated, and ecologically valid framework for evaluating multi-agent reasoning, filling a gap left by static or adversarial benchmarks.
- Model Architecture: The superior performance of LRMs, particularly those trained with explicit reasoning objectives or reinforcement learning, indicates that architectural and training innovations targeting inferential reasoning yield tangible benefits in collaborative settings.
- Generalization: While Hanabi is an idealized environment, the methodology can be extended to other imperfect information games and real-world multi-agent scenarios, including negotiation, coordination, and human-AI teaming.
Limitations
The benchmark is currently restricted to the Hanabi environment and language-based reasoning. It does not address multimodal or non-linguistic collaboration, and the reliance on an LLM-as-a-judge introduces potential biases. Fixed hyperparameters and prompt formats may also limit generalizability. Future work should explore broader environments, alternative evaluation modalities, and human-in-the-loop assessments.
Conclusion
LLM-Hanabi establishes a rigorous, automated benchmark for evaluating rationale inference and Theory-of-Mind in collaborative agents. The findings demonstrate that large reasoning models substantially outperform standard LLMs in both gameplay and ToM, and that first-order ToM is a more significant determinant of collaborative success than higher-order reasoning. These insights provide a concrete direction for the development of more effective, interpretable, and collaborative AI systems.