Clembench: LLM Evaluation via Dialogue Games
- Clembench is a dialogue game-based benchmarking framework that systematically evaluates LLMs using rule-driven, multi-turn interactions.
- It employs modular components and dynamic scenarios to test capabilities like instruction following, strategic reasoning, and contextual understanding.
- The framework quantifies performance using composite metrics such as clemscore and supports both text-based and multimodal evaluations across multiple languages.
Clembench is a benchmarking framework and paradigm for evaluating LLMs via interactive dialogue games, designed to probe capabilities including instruction following, strategic reasoning, and conversational grounding. Unlike static reference-based benchmarks or user-driven preference-based evaluations, clembench uses repeatable, rule-governed, multi-turn scenarios mediated by a "Game Master", systematically challenging models as situated language agents. The framework supports text-based and multimodal games, is extensible across languages and environments, and quantifies performance using composite metrics such as "clemscore". Clembench operates as both a research instrument and ongoing leaderboard, providing actionable data for model selection and development.
1. Paradigm: Dialogue Game-Based Evaluation
Clembench introduces a third paradigm for LLM evaluation: dialogue game-based assessment (Schlangen et al., 11 Jul 2025). This approach is distinguished from reference-based methods (predetermined inputs/outputs scored by comparison to reference answers) and preference-based systems (such as LM-Arena, with user voting in open-ended contexts). Dialogue game-based evaluation leverages structured, repeatable interactions, placing models within tightly specified multi-turn games governed by explicit rules.
In clembench, models self-play in various roles (e.g., Describer/Guesser, Instruction Giver/Follower), with each move validated and scored by the Game Master. This paradigm enables controlled testing of strategic goal orientation, rule adherence, and the accretion of context over turns, integrating strengths of both control (reference-based) and ecological validity (preference-based) in evaluation settings (Chalamalasetti et al., 2023).
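To make the interaction pattern concrete, the sketch below shows a minimal, hypothetical Game Master loop in Python. It is not the clemcore API; the class and method names (Player, GameMaster, rules, scorer) are illustrative assumptions that mirror the described flow: prompt a player, validate the move against the game rules, log it, and either continue or abort.

```python
# Minimal, hypothetical sketch of the Game Master pattern described above.
# Names (Player, GameMaster, rules, scorer) are illustrative, not the clemcore API.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Player:
    """Wraps a model endpoint; respond() would call the actual LLM backend."""
    name: str
    respond: Callable[[str], str]


@dataclass
class GameMaster:
    rules: Callable[[str], bool]          # move validator (formatting/rule adherence)
    scorer: Callable[[List[str]], float]  # episode-level quality score (0-100)
    max_turns: int = 10
    transcript: List[dict] = field(default_factory=list)

    def play(self, players: List[Player], initial_prompt: str) -> dict:
        context = initial_prompt
        for turn in range(self.max_turns):
            for player in players:
                move = player.respond(context)
                self.transcript.append({"turn": turn, "player": player.name, "move": move})
                if not self.rules(move):  # rule violation -> episode is aborted
                    return {"status": "aborted", "quality": None}
                context += f"\n{player.name}: {move}"
        return {"status": "played",
                "quality": self.scorer([t["move"] for t in self.transcript])}
```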
2. Framework Architecture and Implementation
Clembench comprises two modular components: clemcore and clembench proper (Schlangen et al., 11 Jul 2025).
- clemcore: A pip-installable, backend-agnostic library mediating LLM access, experiment management, turn-wise logging, and automated scoring. The clemcore abstraction enables integration of proprietary APIs (OpenAI, Anthropic, Mistral), open-source inference stacks (Hugging Face, vLLM, llama.cpp), and bespoke engines via extensible interfaces.
- clembench: A repository containing benchmarked games (14 text-only, 5 multimodal as of mid-2025), model registry specifications, and pre-compiled game instances. Model registration uses structured JSON entries (including model id, backend, EOS tokens) for reproducibility.
Game logic and instance data are encoded in specification files (e.g., clemgame.json). Initiating an experiment requires only configuration and a single command, with results output as machine-readable logs and visualized via browser tools. The architecture is engineered for extensibility: new games (including environments like AI2-THOR), stimulus sets, backends, and languages can be integrated with minimal overhead.
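As an illustration of the registration format described above, the snippet below writes a plausible model-registry entry containing the fields named in this section (model id, backend, EOS token). The key names and file layout are assumptions for illustration, not the canonical clembench schema.

```python
import json

# Hypothetical registry entry; key names are assumptions mirroring the fields
# described above (model id, backend, EOS token), not the canonical schema.
entry = {
    "model_name": "my-local-model",
    "backend": "huggingface",           # e.g. "openai", "vllm", "llama.cpp"
    "huggingface_id": "org/model-7b-instruct",
    "eos_token": "</s>",                # EOS marker to strip from responses
}

with open("model_registry.json", "w") as f:
    json.dump([entry], f, indent=2)
```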
3. Evaluation Methodology and Game Types
The clembench methodology centers on a suite of interactive dialogue games, each designed to probe distinct sub-capabilities (Chalamalasetti et al., 2023, Beyer et al., 31 May 2024). Notable exemplars include:
| Game Type | Description | Key Capability Probed |
|---|---|---|
| Taboo | Describe a target word without using forbidden words; the guesser attempts to name the target | Lexical constraint adherence, strategic clue generation |
| Wordle-like | Iterative guessing based on feedback | Incremental reasoning, rule-following, clue integration |
| Drawing Instruction | Instruction giver/follower reconstruct an image grid | Situational/world modeling, instruction execution |
| Picture Reference | Generate referring expressions among distractors | Referential language, goal-directed communication |
| Scorekeeping | Update common-ground/private/shared slots | Discourse modeling, information alignment |
Each game scenario is instantiated dynamically from prompt templates and parsing rules (clembench-2024 (Beyer et al., 31 May 2024)), avoiding contamination from static datasets and facilitating ongoing, robust evaluation of new models and instances.
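A minimal sketch of this dynamic-instantiation idea: game instances are produced by filling prompt templates with freshly sampled parameters rather than drawn from a fixed dataset. The template text and field names below are invented for illustration and do not reproduce the actual clembench templates.

```python
import random

# Hypothetical prompt template; the real clembench templates and parsing rules
# are defined per game in the benchmark repository.
TABOO_TEMPLATE = (
    "You must describe the word '{target}' to another player "
    "without using any of these forbidden words: {forbidden}."
)

def make_instances(vocabulary: dict[str, list[str]], n: int, seed: int = 0) -> list[dict]:
    """Sample n fresh game instances from a target -> forbidden-words mapping."""
    rng = random.Random(seed)
    targets = rng.sample(sorted(vocabulary), k=n)
    return [
        {
            "game": "taboo",
            "target": t,
            "initial_prompt": TABOO_TEMPLATE.format(
                target=t, forbidden=", ".join(vocabulary[t])
            ),
        }
        for t in targets
    ]
```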
4. Scoring Metrics and Quantitative Assessment
Clembench employs a dual-layer scoring system, separating formatting compliance (rule-following) from qualitative gameplay performance (Chalamalasetti et al., 2023, Beyer et al., 31 May 2024, Schlangen et al., 11 Jul 2025).
- Quality Score ($Q$): Game-specific, measuring strategic success on a 0–100 scale (e.g., for Taboo, a speed score $Q = 100/t$, where $t$ is the number of moves taken; an F1 score for the drawing task).
- % Played ($F$): Proportion of episodes in which formatting/rule adherence was satisfactory (i.e., the episode was not aborted).
- Aggregated Score ($O$): Per game, the quality score discounted by the proportion of episodes played to completion, $O = Q \cdot F / 100$.
A further macro-aggregation, the clemscore, summarizes performance across the set of games $G$ as the mean of the per-game aggregated scores, $\text{clemscore} = \frac{1}{|G|}\sum_{g \in G} O_g$.
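A short worked example of the scoring layers as reconstructed above: per-game aggregated scores combine quality with the played rate, and the clemscore macro-averages them across games. The numbers are invented purely for illustration.

```python
# Illustrative numbers only; formulas follow the reconstruction above:
# O_g = Q_g * F_g / 100, clemscore = mean over games of O_g.
results = {
    "taboo":   {"quality": 72.0, "played_pct": 95.0},
    "wordle":  {"quality": 38.5, "played_pct": 100.0},
    "drawing": {"quality": 61.0, "played_pct": 80.0},
}

per_game = {g: r["quality"] * r["played_pct"] / 100 for g, r in results.items()}
clemscore = sum(per_game.values()) / len(per_game)

print(per_game)             # {'taboo': 68.4, 'wordle': 38.5, 'drawing': 48.8}
print(round(clemscore, 2))  # 51.9
```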
Human expert performance averages near 87 (out of 100) on clembench-2024 (Beyer et al., 31 May 2024); leading LLMs fall substantially below this upper bound. For multi-turn binary slot tasks, summary metrics including Cohen's $\kappa$ are used to adjust for strong chance performance. Rankings across models demonstrate robust stability over time, as measured by Kendall's $\tau$ in longitudinal studies for clembench-2024 (Beyer et al., 31 May 2024).
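For the binary slot tasks mentioned above, chance-corrected agreement can be computed with Cohen's $\kappa$; the sketch below uses scikit-learn's implementation on invented labels to show why raw accuracy overstates performance when one class dominates.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Invented gold vs. predicted binary slot values; the majority class dominates,
# so raw accuracy looks high even for a near-trivial predictor.
gold = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

print(accuracy_score(gold, pred))     # 0.9
print(cohen_kappa_score(gold, pred))  # ~0.62, chance-corrected
```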
5. Multilingual and Instance-Independent Extension
Clembench-2024 introduced extensibility to multilingual evaluation: prompt templates and parsing are translated, enabling scenario reproduction across languages (German, Italian, Japanese, Brazilian Portuguese, Simplified Chinese, etc.) (Beyer et al., 31 May 2024). Commercial models maintain formatting compliance in multiple languages, but quality scores can degrade in less well-supported ones, as observed for GPT-4 in Simplified Chinese.
Games are not tied to static datasets; dynamic generation via programmatic templates ensures evaluations are immune to contamination by training data, and allows flexibility in the face of rapid model advances (Beyer et al., 31 May 2024). This design supports tracking performance evolution and inter-model comparisons without the confound of repeated exposure to a fixed task set.
6. Extensibility, Leaderboard, and Community Utility
The framework is engineered for rapid extensibility (Schlangen et al., 11 Jul 2025). New instances can be sampled or generated programmatically (e.g., using WordNet for Taboo target words). New games require only a specification file and implementation of logic/scoring (simple games implemented in 2–3 hours; complex environments supported). Language adaptation is straightforward by modifying prompts and parsing rules.
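The WordNet-based sampling mentioned above could look roughly like the following sketch using NLTK's WordNet interface; the choice of related lemmas as forbidden words is an assumption made for illustration, not the benchmark's exact procedure.

```python
import random

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # ensure the corpus is available


def sample_taboo_instances(n: int, seed: int = 0) -> list[dict]:
    """Sample noun targets and use related lemmas as (illustrative) forbidden words."""
    rng = random.Random(seed)
    synsets = [s for s in wn.all_synsets("n") if "_" not in s.lemmas()[0].name()]
    instances = []
    for synset in rng.sample(synsets, k=n):
        target = synset.lemmas()[0].name()
        related = {lem.name()
                   for rel in synset.hypernyms() + synset.hyponyms()
                   for lem in rel.lemmas()} - {target}
        instances.append({"target": target, "forbidden": sorted(related)[:3]})
    return instances


print(sample_taboo_instances(3))
```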
Clembench supports a public leaderboard (since 2023), with continuous updates reflecting advances in both proprietary and open-weight models. The transcript browser and documentation facilitate in-depth diagnostics and model analysis. The benchmark is reference-free, yet yields granular, actionable metrics for practitioners. Compared to preference-based (Chatbot Arena) and reference-based (HELM) paradigms, clembench's ranking correlates more strongly, as measured by Kendall's $\tau$, with preference-based rankings than with static reference benchmarks (Beyer et al., 31 May 2024).
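The rank-correlation comparison described above can be reproduced with SciPy's Kendall's $\tau$; the ranks below are invented solely to show the computation.

```python
from scipy.stats import kendalltau

# Invented ranks for illustration: position of each model in two leaderboards.
clembench_rank = [1, 2, 3, 4, 5, 6]  # ranking by clemscore
other_rank     = [1, 3, 2, 4, 6, 5]  # ranking by another benchmark

tau, p_value = kendalltau(clembench_rank, other_rank)
print(round(tau, 2), round(p_value, 3))  # tau close to 1 indicates similar rankings
```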
7. Limitations and Future Directions
Current limitations include restriction to English in standard benchmarks (though multilingual support is active), a modest number of game instances due to cost, strict formatting criteria causing aborts, and limited support for multimodal interaction (in development) (Chalamalasetti et al., 2023, Beyer et al., 31 May 2024). Future work entails expansion to additional languages, extended game complexity (including multimodal and simulated environments), greater community contribution, and backend standardization (aligning with model.yaml formats) (Schlangen et al., 11 Jul 2025).
A plausible implication is that clembench may serve as the diagnostic engine within closed-loop development cycles for conversational agents, providing nuanced, robust feedback on interactive, goal-oriented capabilities, and informing model selection and fine-tuning for applied deployments.
Clembench constitutes a sophisticated, evolving benchmark for LLM evaluation, harnessing interactive dialogue games to probe situational understanding and agentive performance. Its architecture ensures modularity, extensibility, and cross-paradigmatic relevance, establishing clembench as a complementary tool alongside reference and preference-based evaluations, and providing a durable foundation for rigorous, context-sensitive assessment of LLMs.