- The paper introduces a dynamic benchmarking framework using a multilingual adaptation of Spyfall to assess LLMs' cultural and linguistic capabilities.
- It employs turn-based rounds with role-specific interactions and localized entity pools in Indonesian, Simplified Chinese, and Egyptian Arabic to evaluate model performance.
- Empirical findings reveal pronounced performance drops in non-English scenarios, underscoring key weaknesses in LLMs' cultural understanding and inference.
Dynamic Multicultural Benchmarking of LLMs via Multilingual Social Deduction Games
Introduction and Motivation
The evolution of LLMs toward advanced multilingual and multicultural understanding exposes the limitations of traditional, static NLP benchmarks. These benchmarks are vulnerable to both training-data overlap ("leakage") and saturation, limiting their ability to robustly distinguish fine-grained real-world capabilities. "Multicultural Spyfall: Assessing LLMs through Dynamic Multilingual Social Deduction Game" (2601.09017) addresses these gaps by deploying a turn-based, multilingual adaptation of the social deduction game Spyfall as a dynamic benchmarking framework. The benchmark probes not only linguistic competence but also deep, contextually situated cultural knowledge, tasking models with strategic, high-stakes inference and dialogue in contextually rich, non-English environments.
Methodology: Multicultural, Multilingual Spyfall Design
Game Adaptation and Dynamic Interactivity
The benchmark adapts Spyfall for LLM play in three major respects:
- Turn-based, multi-phase structure: To accommodate LLM latency constraints and enforce controlled, observable reasoning sequences, the authors implement a deterministic, turn-limited cycle: (1) Round Robin Q&A; (2) Free Cycle for further interrogation and strategic voting; (3) Opportunity for the spy to guess the target entity.
- Culturally enriched, multilingual entity pools: Instead of only generic English locales, the game incorporates locally specific places and foods in three target languages (Indonesian, Simplified Chinese, Egyptian Arabic), with 30 localized entities per scenario.
- Prompt engineering and structured I/O: Models receive full context, history, and detailed, language-specific format instructions. Invalid outputs or rule-breaking automatically trigger losses, operationalizing compliance and prompt-following as part of model capability.
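The three-phase cycle and the forfeit-on-invalid-output rule described above can be sketched as a minimal game loop. This is an illustrative reconstruction, not the paper's code; all class and function names (`Player`, `play_round`, `validate_output`) are hypothetical, and the rule checks are simplified to non-empty output and no direct entity disclosure by non-spies.

```python
from dataclasses import dataclass

@dataclass
class Player:
    name: str
    is_spy: bool = False

def validate_output(text: str) -> bool:
    """Simplified rule check: a reply must be non-empty."""
    return bool(text.strip())

def play_round(players, entity, ask, vote, spy_guess):
    """One turn-limited cycle: (1) round-robin Q&A, (2) voting, (3) spy guess.

    `ask`, `vote`, and `spy_guess` stand in for LLM calls; they are injected
    here so the loop itself stays model-agnostic.
    """
    # Phase 1: round-robin Q&A; an invalid reply, or a non-spy naming the
    # secret entity outright, is an automatic loss for that player's side.
    for p in players:
        answer = ask(p, entity)
        if not validate_output(answer) or (not p.is_spy and entity in answer):
            return "spy" if not p.is_spy else "non_spy"  # rule-breaker forfeits
    # Phase 2: free cycle ends in a vote; accusing the actual spy wins the round.
    accused = vote(players)
    if accused.is_spy:
        return "non_spy"
    # Phase 3: the undetected spy may still win by guessing the secret entity.
    return "spy" if spy_guess() == entity else "non_spy"
```

Injecting the Q&A, voting, and guessing behaviors as callables keeps the round logic deterministic and testable independently of any particular model backend.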
Evaluation Strategy
The benchmark covers six leading LLMs: two strong proprietary models (Gemini 2.5-Pro/Flash) and four open models of varying size (Qwen3-30B, Qwen3-8B, Gemma-12B, Llama3.1-8B), evaluated over 9,000 games and ranked with the Bradley-Terry paired-comparison model. Performance is decomposed by entity class (generic, local location, local food), by language, and by player role (spy vs. non-spy), with win rate, information leakage, and voting behavior as the core metrics.
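The Bradley-Terry model used for ranking can be fit from pairwise game outcomes with the classic MM (Zermelo) iteration. The sketch below is illustrative of the general technique, under the assumption of a simple win-count matrix; it is not the paper's implementation.

```python
def bradley_terry(wins, n_players, iters=200):
    """Estimate Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] = number of games in which player i beat player j.
    Uses the MM (Zermelo) update; returns strengths normalized to sum to 1,
    so that P(i beats j) is estimated as p[i] / (p[i] + p[j]).
    """
    p = [1.0 / n_players] * n_players
    for _ in range(iters):
        new_p = []
        for i in range(n_players):
            w_i = sum(wins[i])  # total wins for player i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_players) if j != i)
            new_p.append(w_i / denom if denom else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]
    return p
```

With two players and an 8-2 head-to-head record, the fit converges to strengths of 0.8 and 0.2, reproducing the empirical 80% win probability.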
Key Empirical Findings
Alignment with Human Preference Benchmarks
Model capability rankings in Multicultural Spyfall exhibit strong concordance with those from Chatbot Arena, an established human-preference-based evaluation platform, implying that dynamic, adversarial dialog closely tracks general model skill.
Sharp Proficiency Drop in Non-English Cultural Contexts
A core result is the substantial degradation of nearly all LLMs in non-English, culturally specific scenarios. Even top-tier models commit strategic errors, rule violations, or outright entity leaks significantly more often when reasoning in Indonesian, Egyptian Arabic, and Simplified Chinese than in English. The effect is accentuated in food-related (vs. location) scenarios.
For instance:
- Gemini models, which lead the rankings, maintain 0% information leakage in all scenarios, but their win rate as spies drops sharply in Egyptian Arabic food and location rounds.
- Llama3.1-8B demonstrates very high leakage rates (up to 48%), especially in Indonesian, often inadvertently disclosing the target entity.
Distinct Model Behavior and Voting Dynamics
- Spy vs. non-spy roles: Models show asymmetric competence depending on their assigned role. Top models excel at blending in as spies, manipulating voting to avoid detection, especially via subtle, generic responses and vote dispersion. Weaker models playing the spy are reliably detected and voted out due to behavioral artifacts or unconvincing answers.
- Language compliance: Most models adhere strongly to the target language, except in dialectal settings (Egyptian Arabic); here, models frequently slip into Modern Standard Arabic even after being instructed to stick to the dialect, likely reflecting training data imbalances.
- Entity guessability and entropy: Local food entities in Egyptian Arabic and Indonesian prove most difficult to infer, with high-entropy distributions of guesses and low spy accuracy, underlining present LLMs' lack of regional cultural grounding.
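The "high-entropy distributions of guesses" mentioned above can be quantified with Shannon entropy over the spy's guess distribution. A minimal sketch (the function name and aggregation are illustrative assumptions, not the paper's metric definition):

```python
import math
from collections import Counter

def guess_entropy(guesses):
    """Shannon entropy (in bits) of a list of spy guesses.

    Higher values mean guesses are spread across many candidate entities,
    i.e. the spy has little idea which localized entity is in play; 0 bits
    means every guess targeted the same entity.
    """
    counts = Counter(guesses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, guesses concentrated on one entity give 0 bits, while guesses spread uniformly over four entities give 2 bits.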
Error Modes and Model Weaknesses
Qualitative error analysis highlights failure cases such as:
- Spies giving inappropriately generic or contextually inaccurate answers, failing to grasp local semantic nuances.
- Weaker models playing as non-spies misunderstand cultural references or ignore game rules, leading to high leakage or incorrect votes.
- Spy detection often hinges on specific entity knowledge (e.g., failing to identify "Binus" as a renowned private university in Jakarta), with strategic "fishy" questions betraying the spy to attentive non-spy models.
Theoretical and Practical Implications
Saturation and Data Leakage Resistance
The framework is robust to data contamination: it is inherently difficult, if not impossible, for a static training set to encode the combinatorial, interactive game histories, overcoming a critical weakness of existing static benchmarks.
Scalability and Extensibility
Localizing the benchmark to new languages or cultures is straightforward via substitution of entity lists, eliminating the need for costly manual annotation or curation. This property supports ongoing and future scaling as LLMs reach new markets and locales.
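Localization by entity-list substitution could look like the sketch below. The pool structure, helper name, and example entities are all hypothetical illustrations of the idea, not the paper's data format.

```python
# Hypothetical entity pools keyed by (language, category); extending the
# benchmark to a new locale is just adding a new list, with no annotation
# pipeline required.
ENTITY_POOLS = {
    ("id", "food"): ["rendang", "gudeg", "pempek"],   # Indonesian foods
    ("zh", "location"): ["故宫", "外滩", "长城"],       # Chinese locations
}

def add_locale(pools, lang, category, entities):
    """Return a new pool mapping with one additional localized entity list."""
    extended = dict(pools)  # copy so the original pools stay untouched
    extended[(lang, category)] = list(entities)
    return extended
```

A usage example: `add_locale(ENTITY_POOLS, "ar", "food", ["كشري"])` yields a pool set that also covers Egyptian Arabic food entities.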
Diagnostic Resolution
By forcing LLMs to engage in adversarial settings where subtlety, inference, and high-context cultural knowledge are necessary, the approach exposes weaknesses not revealed by static QA or cloze benchmarks. It also provides interpretable, fine-grained diagnosis through game logs and voting patterns.
Notable Quantitative Results
- Non-spy wins are driven mostly by spies making incorrect entity guesses rather than by spies being voted out, especially among stronger models (e.g., 74.5% of non-spy wins come via incorrect spy guesses).
- Spy win rates as a function of scenario/language drop by up to 35 percentage points moving from generic English to Egyptian Arabic food rounds.
- Vote entropy and dispersion show higher "voting chaos" when stronger models play the spy, indicating emergent capabilities in manipulating multiagent dynamics.
Future Directions
Expansion to additional languages, heterogeneous agent pools, and more varied cultural proxies is both technically feasible and in progress. The methodology could be extended to complex multi-agent games or interactive negotiation/debate settings for even richer capability probing. Systematic evaluation on other aspects of "culture," such as humor, etiquette, or folklore, will provide deeper understanding of LLMs’ true global generalization.
Conclusion
This work establishes dynamic, multiplayer, culturally nuanced social deduction games as an incisive, practical instrument for assessing LLMs' cross-linguistic and cross-cultural competence. The results underscore persistent deficits in non-English, localized contexts, even among frontier models, and motivate the development of models better aligned with the linguistic and cultural realities of a global user base. The benchmark's scalability and resistance to saturation make it a strong candidate for ongoing, forward-looking LLM evaluation.