- The paper introduces a dynamic benchmarking framework using a multilingual adaptation of Spyfall to assess LLMs' cultural and linguistic capabilities.
- It employs turn-based rounds with role-specific interactions and localized entity pools in Indonesian, Simplified Chinese, and Egyptian Arabic to evaluate model performance.
- Empirical findings reveal pronounced performance drops in non-English scenarios, underscoring key weaknesses in LLMs' cultural understanding and inference.
Dynamic Multicultural Benchmarking of LLMs via Multilingual Social Deduction Games
Introduction and Motivation
The evolution of LLMs toward advanced multilingual and multicultural understanding exposes the limitations of traditional, static NLP benchmarks. These benchmarks are vulnerable to both training-data overlap ("leakage") and saturation, limiting their ability to robustly distinguish fine-grained real-world capabilities. "Multicultural Spyfall: Assessing LLMs through Dynamic Multilingual Social Deduction Game" (2601.09017) addresses these gaps by deploying a turn-based, multilingual adaptation of the social deduction game Spyfall as a dynamic benchmarking framework. The benchmark probes not only linguistic competence but also deep, contextually situated cultural knowledge, tasking models with strategic, high-stakes inference and dialogue in contextually rich, non-English environments.
Methodology: Multicultural, Multilingual Spyfall Design
Game Adaptation and Dynamic Interactivity
The benchmark adapts Spyfall for LLM play in three major respects:
- Turn-based, multi-phase structure: To accommodate LLM latency constraints and enforce controlled, observable reasoning sequences, the authors implement a deterministic, turn-limited cycle: (1) Round Robin Q&A; (2) Free Cycle for further interrogation and strategic voting; (3) Opportunity for the spy to guess the target entity.
- Culturally enriched, multilingual entity pools: Instead of only generic English locales, the game incorporates locally specific places and foods in three target languages (Indonesian, Simplified Chinese, Egyptian Arabic), with 30 localized entities per scenario.
- Prompt engineering and structured I/O: Models receive full context, history, and detailed, language-specific format instructions. Invalid outputs or rule-breaking automatically trigger losses, operationalizing compliance and prompt-following as part of model capability.
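The three-phase cycle and the forfeit-on-invalid-output rule described above can be sketched as a minimal game loop. This is an illustrative reconstruction, not the paper's code; all class and function names (`Player`, `play_round`, `validate_output`) are hypothetical, and the rule checks are simplified to non-empty output and no direct entity disclosure by non-spies.

```python
from dataclasses import dataclass

@dataclass
class Player:
    name: str
    is_spy: bool = False

def validate_output(text: str) -> bool:
    """Simplified rule check: a reply must be non-empty."""
    return bool(text.strip())

def play_round(players, entity, ask, vote, spy_guess):
    """One turn-limited cycle: (1) round-robin Q&A, (2) voting, (3) spy guess.

    `ask`, `vote`, and `spy_guess` stand in for LLM calls; they are injected
    here so the loop itself stays model-agnostic.
    """
    # Phase 1: round-robin Q&A; an invalid reply, or a non-spy naming the
    # secret entity outright, is an automatic loss for that player's side.
    for p in players:
        answer = ask(p, entity)
        if not validate_output(answer) or (not p.is_spy and entity in answer):
            return "spy" if not p.is_spy else "non_spy"  # rule-breaker forfeits
    # Phase 2: free cycle ends in a vote; accusing the actual spy wins the round.
    accused = vote(players)
    if accused.is_spy:
        return "non_spy"
    # Phase 3: the undetected spy may still win by guessing the secret entity.
    return "spy" if spy_guess() == entity else "non_spy"
```

Injecting the Q&A, voting, and guessing behaviors as callables keeps the round logic deterministic and testable independently of any particular model backend.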
Evaluation Strategy
The benchmark covers six leading LLMs: two strong proprietary models (Gemini 2.5-Pro/Flash) and four open models of varying size (Qwen3-30B, Qwen3-8B, Gemma-12B, Llama3.1-8B), evaluated over 9,000 games and ranked with the Bradley-Terry paired-comparison model. Performance is decomposed by entity class (generic, local location, local food), by language, and by player role (spy vs. non-spy), with win rate, information leakage, and voting behavior as the core metrics.
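The Bradley-Terry model used for ranking can be fit from pairwise game outcomes with the classic MM (Zermelo) iteration. The sketch below is illustrative of the general technique, under the assumption of a simple win-count matrix; it is not the paper's implementation.

```python
def bradley_terry(wins, n_players, iters=200):
    """Estimate Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] = number of games in which player i beat player j.
    Uses the MM (Zermelo) update; returns strengths normalized to sum to 1,
    so that P(i beats j) is estimated as p[i] / (p[i] + p[j]).
    """
    p = [1.0 / n_players] * n_players
    for _ in range(iters):
        new_p = []
        for i in range(n_players):
            w_i = sum(wins[i])  # total wins for player i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_players) if j != i)
            new_p.append(w_i / denom if denom else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]
    return p
```

With two players and an 8-2 head-to-head record, the fit converges to strengths of 0.8 and 0.2, reproducing the empirical 80% win probability.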
Key Empirical Findings
Alignment with Human Preference Benchmarks
Model capability rankings in Multicultural Spyfall exhibit strong concordance with those from Chatbot Arena, an established human-preference-based evaluation platform, implying that dynamic, adversarial dialog closely tracks general model skill.
Sharp Proficiency Drop in Non-English Cultural Contexts
A core result is the substantial degradation of nearly all LLMs in non-English, culturally specific scenarios. Even top-tier models commit strategic errors, rule violations, or outright entity leaks significantly more often when reasoning in Indonesian, Egyptian Arabic, and Simplified Chinese than in English. The effect is accentuated in food-related (vs. location) scenarios.
For instance:
- Gemini models, which lead the rankings, maintain 0% information leakage in all scenarios, but their win rate as spies drops sharply in Egyptian Arabic food and location rounds.
- Llama3.1-8B demonstrates very high leakage rates (up to 48%), especially in Indonesian, often inadvertently disclosing the target entity.
Distinct Model Behavior and Voting Dynamics
- Spy vs. non-spy roles: Models show asymmetric competence depending on their assigned role. Top models excel at blending in as spies, manipulating voting to avoid detection, especially via subtle, generic responses and vote dispersion. Weaker models playing the spy are reliably detected and voted out due to behavioral artifacts or unconvincing answers.
- Language compliance: Most models adhere strongly to the target language, except in dialectal settings (Egyptian Arabic); here, models frequently slip into Modern Standard Arabic even after being instructed to stick to the dialect, likely reflecting training data imbalances.
- Entity guessability and entropy: Local food entities in Egyptian Arabic and Indonesian prove most difficult to infer, with high-entropy distributions of guesses and low spy accuracy, underlining present LLMs' lack of regional cultural grounding.
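The "high-entropy distributions of guesses" mentioned above can be quantified with Shannon entropy over the spy's guess distribution. A minimal sketch (the function name and aggregation are illustrative assumptions, not the paper's metric definition):

```python
import math
from collections import Counter

def guess_entropy(guesses):
    """Shannon entropy (in bits) of a list of spy guesses.

    Higher values mean guesses are spread across many candidate entities,
    i.e. the spy has little idea which localized entity is in play; 0 bits
    means every guess targeted the same entity.
    """
    counts = Counter(guesses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, guesses concentrated on one entity give 0 bits, while guesses spread uniformly over four entities give 2 bits.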
Error Modes and Model Weaknesses
Qualitative error analysis highlights failure cases such as:
- Spies giving inappropriately generic or contextually inaccurate answers, failing to grasp local semantic nuances.
- Weaker models playing as non-spies misunderstand cultural references or ignore game rules, leading to high leakage or incorrect votes.
- Spy detection often hinges on specific entity knowledge (e.g., failing to identify "Binus" as a renowned private university in Jakarta), with strategic "fishy" questions betraying the spy to attentive non-spy models.
Theoretical and Practical Implications
Saturation and Data Leakage Resistance
The framework is robust to data contamination: it is inherently difficult, if not impossible, for a static training set to encode the combinatorial, interactive game histories, overcoming a critical weakness of existing static benchmarks.
Scalability and Extensibility
Localizing the benchmark to new languages or cultures is straightforward via substitution of entity lists, eliminating the need for costly manual annotation or curation. This property supports ongoing and future scaling as LLMs reach new markets and locales.
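Localization by entity-list substitution could look like the sketch below. The pool structure, helper name, and example entities are all hypothetical illustrations of the idea, not the paper's data format.

```python
# Hypothetical entity pools keyed by (language, category); extending the
# benchmark to a new locale is just adding a new list, with no annotation
# pipeline required.
ENTITY_POOLS = {
    ("id", "food"): ["rendang", "gudeg", "pempek"],   # Indonesian foods
    ("zh", "location"): ["故宫", "外滩", "长城"],       # Chinese locations
}

def add_locale(pools, lang, category, entities):
    """Return a new pool mapping with one additional localized entity list."""
    extended = dict(pools)  # copy so the original pools stay untouched
    extended[(lang, category)] = list(entities)
    return extended
```

A usage example: `add_locale(ENTITY_POOLS, "ar", "food", ["كشري"])` yields a pool set that also covers Egyptian Arabic food entities.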
Diagnostic Resolution
By forcing LLMs to engage in adversarial settings where subtlety, inference, and high-context cultural knowledge are necessary, the approach exposes weaknesses not revealed by static QA or cloze benchmarks. It also provides interpretable, fine-grained diagnosis through game logs and voting patterns.
Notable Quantitative Results
- Non-spy wins are driven mostly by spies making incorrect entity guesses rather than by spies being voted out, especially among stronger models (e.g., 74.5% of non-spy wins come via incorrect spy guesses).
- Spy win rates as a function of scenario/language drop by up to 35 percentage points moving from generic English to Egyptian Arabic food rounds.
- Vote entropy and dispersion show higher "voting chaos" when stronger models play the spy, indicating emergent capabilities in manipulating multiagent dynamics.
Future Directions
Expansion to additional languages, heterogeneous agent pools, and more varied cultural proxies is both technically feasible and in progress. The methodology could be extended to complex multi-agent games or interactive negotiation/debate settings for even richer capability probing. Systematic evaluation on other aspects of "culture," such as humor, etiquette, or folklore, will provide deeper understanding of LLMs’ true global generalization.
Conclusion
This work establishes dynamic, multiplayer, culturally nuanced social deduction games as an incisive, practical instrument for assessing LLMs' cross-linguistic and cross-cultural competence. The results underscore persistent deficits in non-English, localized contexts, even among frontier models, and motivate the development of models better aligned with the linguistic and cultural realities of a global user base. The benchmark's scalability and resistance to saturation make it a strong candidate for ongoing, forward-looking LLM evaluation.