
Reference Games as Testbeds

Updated 19 January 2026
  • Reference games are tightly controlled communication tasks that enable the study of concept coordination, uncertainty, and adaptive signaling.
  • They span diverse domains (images, spatial reasoning, procedural generation) and support rigorous metrics such as policy adaptation and convergence of conventions.
  • Programmable platforms and controlled evaluation metrics make these games well suited to benchmarking AI performance and diagnosing model-adaptation failures.

Reference games constitute a family of strictly controlled, goal-directed communication tasks in which a "speaker" agent produces an utterance or signal to help a "listener" identify a hidden referent from among a small, closed set of candidates. Their formal structure, domain flexibility, gradable complexity, and rigorous measurability render them uniquely suited as testbeds for cognitive, linguistic, and computational research into concept coordination, grounding, uncertainty, and model adaptation. This article synthesizes key formalizations, experimental methodologies, and findings from the recent literature to illuminate the multifaceted role of reference games as diagnostic testbeds across domains.

1. Formalization and Taxonomy of Reference Games

Reference games have classic origins in the signalling-game paradigm, as defined by Lewis (1969), with the core tuple (S, M, G) comprising a set of referents S, a message space M, and a guess/action space G (Momentè et al., 20 Feb 2025). Each round features: (i) hidden referent selection, (ii) utterance choice by the speaker, and (iii) referent identification by the listener, often with an explicit correctness reward.
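
The three-step round structure above can be sketched as a minimal simulation; the tabular policies, referent names, and message alphabet below are illustrative assumptions, not drawn from any cited implementation.

```python
import random

def play_round(referents, speaker_policy, listener_policy, rng=random):
    """Simulate one round of a Lewis-style signalling game.

    speaker_policy: dict mapping referent -> message
    listener_policy: dict mapping message -> guessed referent
    Returns (referent, message, guess, reward).
    """
    target = rng.choice(referents)          # (i) hidden referent selection
    msg = speaker_policy[target]            # (ii) utterance choice by the speaker
    guess = listener_policy[msg]            # (iii) referent identification by the listener
    reward = 1 if guess == target else 0    # explicit correctness reward
    return target, msg, guess, reward

# A separating equilibrium: each referent gets a distinct message.
referents = ["circle", "square"]
speaker = {"circle": "a", "square": "b"}
listener = {"a": "circle", "b": "square"}
outcome = play_round(referents, speaker, listener)
```

Under a separating policy like this one, every round earns the correctness reward; a pooling or mismatched policy drives the reward toward chance, which is what the game's metrics detect.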

The design can be extended across domains:

  • Image-based reference games: Agents describe distinguishing features (attributes) of visual targets, often using perceptual encoders φ : X → [0,1]^{|A|} (Corona et al., 2019).
  • Simple language games: Reference games with noun–adjective pairs (as in Codenames) allow modeling of associative meaning through collocation, embedding, or knowledge-graph statistics (Shen et al., 2018).
  • Spatial and logical games: Tic-Tac-Toe–style multi-move games probe strategic reasoning, opponent modeling, and spatial planning using formal board positions, transition functions, and winning patterns (Mishra et al., 11 Jun 2025).
  • Procedural generation and rule construction: Reference games extend to generator–validator frameworks where the task is to communicate or specify content according to tightly controlled parameters (Khalifa et al., 27 Mar 2025).
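
For the image-based setting, a greedy speaker over a perceptual attribute table can be sketched as follows; the φ table, object names, and gap-based scoring rule are hypothetical stand-ins for a learned encoder and policy.

```python
def discriminative_attribute(phi, target, distractors, attributes):
    """Pick the attribute whose value for the target differs most, on average,
    from its value over the distractor set (a greedy stand-in for a learned
    speaker policy)."""
    def score(a):
        gap = sum(abs(phi[target][a] - phi[d][a]) for d in distractors)
        return gap / len(distractors)
    return max(attributes, key=score)

# Toy attribute table phi: object -> {attribute: degree in [0, 1]}
attributes = ["red", "round", "large"]
phi = {
    "red_ball": {"red": 1.0, "round": 1.0, "large": 0.2},
    "red_cube": {"red": 1.0, "round": 0.0, "large": 0.3},
}
best = discriminative_attribute(phi, "red_ball", ["red_cube"], attributes)
```

Here "red" is shared by target and distractor, so the speaker prefers "round", the attribute that actually discriminates.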

Reference games permit parameterization of message cost, response uncertainty, and domain complexity, supporting both by-hand and programmatic generation of large, diverse test suites (Corona et al., 2019, Mishra et al., 11 Jun 2025).

2. Key Methodological Approaches and Metrics

The highly structured nature of reference games allows for rigorous methodology, quantitative metrics, and statistical modeling.

  • Policy and adaptation: Speakers often employ adaptive policies π_S(s_k, a_k) that condition on both perceptual difference and inferred listener embeddings h_{k−1} to select discriminative signals (Corona et al., 2019).
  • Feedback and learning dynamics: Metrics track convergence of conventions (e.g., word entropy H(P), utterance-length reduction Δ_r), clustering behaviors, and feedback-dependent dropout of syntactic units (Hawkins et al., 2019).
  • Associativity and pragmatic modeling: Association scores s_{n,a} (from bigrams, embeddings, graphs) quantify the underlying lexical resources, while literal vs. RSA-style pragmatic agents are compared using predictive likelihood, top-answer accuracy, and rank correlation ρ (Shen et al., 2018).
  • Uncertainty and clarification: Sampling-based and softmax-based confidence proxies (e.g., consistency-based conf_i and maximum softmax probability, MSP) enable direct assessment of model calibration and its relation to clarification-request rates (Ali et al., 12 Jan 2026).
  • Reasoning and strategic play: Pass@1 accuracy on programmatically generated move-selection puzzles contrasts with baseline reasoning on math and logic benchmarks (Mishra et al., 11 Jun 2025).
  • Quality, diversity, controllability in PCG: Evaluation metrics R_q, R_d, R_t establish cross-method comparability for procedurally generated content, reflecting both feasibility and solution-space coverage (Khalifa et al., 27 Mar 2025).
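
Two of the convention metrics above, word entropy H(P) and utterance-length reduction Δ_r, can be computed directly from utterance transcripts; the split-half definition of Δ_r used here is an illustrative simplification, not the cited papers' exact estimator.

```python
from collections import Counter
from math import log2

def word_entropy(utterances):
    """Shannon entropy H(P), in bits, over the empirical word distribution."""
    counts = Counter(w for u in utterances for w in u.split())
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def length_reduction(utterances):
    """Delta_r as a split-half contrast: mean token length of the first half
    of the interaction minus the mean token length of the second half."""
    lengths = [len(u.split()) for u in utterances]
    half = len(lengths) // 2
    early = sum(lengths[:half]) / half
    late = sum(lengths[half:]) / len(lengths[half:])
    return early - late
```

A positive Δ_r and a falling H(P) over rounds are the quantitative signatures of conventionalization reported in repeated-play studies.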

Quantitative studies consistently show that reference game metrics—rewards, convention stability, semantic similarity, or strategic win/block/fork accuracy—are tractable, interpretable, and discriminative of agent ability (Hawkins et al., 2019, Mishra et al., 11 Jun 2025, Momentè et al., 20 Feb 2025).

3. Applications as Benchmarks and Testbeds in AI Evaluation

Reference games perform a dual function: benchmarking fundamental capabilities (efficiency, concept alignment, interactive reasoning) and acting as programmable environments for directed model diagnosis.

  • LLM and VLM benchmarking: Reference games discriminate LLMs more effectively than static QA benchmarks. Interactive games yield size-dependent performance gains of 40–80%, compared to 20–30% for static benchmarks; correlations with working memory and theory-of-mind tasks (τ ≈ 0.6) further illustrate their diagnostic power (Momentè et al., 20 Feb 2025).
  • Alignment and adaptation tests: They rigorously expose both the strengths and failure modes of models in handling conceptual misalignment, listener adaptation, or perceptual mismatch. For example, LSTM-embedded agent histories enable rapid clustering and reward maximization (Corona et al., 2019).
  • Uncertainty–clarification alignment: Model confidence and systematic clarification request behaviors can be traced and measured, revealing that, even in highly controlled setups, SOTA vision–LLMs frequently fail to issue task-relevant clarifications or calibrate their confidence appropriately (Ali et al., 12 Jan 2026).
  • Procedural and rule generation: In PCG testbeds, reference-game abstractions support comparison of search-based, evolutionary, or ML-based content generators against metrics of quality, diversity, and designer-goal controllability (Khalifa et al., 27 Mar 2025).
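
The two confidence proxies referenced above admit compact implementations; the clarification threshold below is a hypothetical decision rule for illustration, not one proposed in the cited work.

```python
from collections import Counter
from math import exp

def consistency_confidence(sampled_answers):
    """Consistency-based proxy: share of samples agreeing with the modal answer."""
    counts = Counter(sampled_answers)
    return counts.most_common(1)[0][1] / len(sampled_answers)

def max_softmax_probability(logits):
    """MSP proxy: maximum probability after a softmax over the candidate set."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [exp(l - m) for l in logits]
    return max(exps) / sum(exps)

def should_clarify(confidence, threshold=0.5):
    """Hypothetical decision rule: request clarification when confidence is low."""
    return confidence < threshold
```

A well-calibrated agent would show a tight coupling between these proxies and its clarification-request rate; the cited findings indicate current vision-language models do not.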

Reference games are robust to training data contamination due to the finite, combinatorial interaction space, making them suitable for repeatable, fault-isolating empirical studies (Momentè et al., 20 Feb 2025).

4. Dynamics of Convention Formation, Efficiency, and Interaction

Repeated-play studies illuminate the process by which agents (human or artificial) negotiate, stabilize, and streamline communicative conventions.

  • Convention emergence: Word choice rapidly converges on discriminative, context-sensitive labels, with utterance lengths dropping from ~7.5 to ~2.5 tokens over brief exposures (Hawkins et al., 2019).
  • Stability vs divergence: Within dyads, semantic similarity and entropy measures confirm conventionalization; across dyads, divergence persists, with path-dependent symmetry breaking and multiple equilibria (Hawkins et al., 2019).
  • Syntax and semantic pruning: Dropout of syntactic units occurs in clusters (mean path ~2.77), exceeding random or function-word-biased baselines, and aligns with positive feedback. This supports models that trace survival of diagnostic open-class words and contextual adaptivity (Hawkins et al., 2019).
  • Strategic reasoning breakdown: Even trivial two-player games such as Tic-Tac-Toe variants highlight deficiencies in multi-step reasoning, blocking threat anticipation, and optimal move selection in state-of-the-art models—a marked contrast with high performance on STEM benchmarks (Mishra et al., 11 Jun 2025).
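
The win/block checks used to score strategic play can be generated programmatically; this sketch covers immediate wins and blocks on a standard 3×3 board (fork detection is omitted for brevity, and the string board encoding is an assumption).

```python
# All eight winning lines of a 3x3 board, in row-major cell indices.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winning_moves(board, player):
    """Cells that immediately complete a line for `player`.
    board: 9-char string of 'X', 'O', or '.' in row-major order."""
    moves = []
    for i, cell in enumerate(board):
        if cell != ".":
            continue
        trial = board[:i] + player + board[i + 1:]
        if any(all(trial[j] == player for j in line) for line in LINES):
            moves.append(i)
    return moves

def blocking_moves(board, player):
    """Cells `player` must take to block the opponent's immediate win."""
    opponent = "O" if player == "X" else "X"
    return winning_moves(board, opponent)
```

A move-selection puzzle then asks a model for the winning or blocking cell and scores it Pass@1 against these ground-truth sets.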

These findings suggest that future models should incorporate context-sensitive informativity, feedback-driven update rules, stochasticity, and memory-based stability criteria (Hawkins et al., 2019).

5. Architectural and Platform Considerations for Testbed Design

The transition from classic reference games to full-fledged diagnostic platforms involves architectural choices and API design facilitating white-box testing, intervention, and compositional scenario construction.

  • ToyBox and semantic state control: By reimplementing Atari games with semantic variables and explicit intervention interfaces, ToyBox expands reference-game environments to support direct state manipulation, test-function definition T : S × A* → {pass, fail}, and systematic coverage evaluation. Examples include spatial symmetry tests, last-brick elimination, ballistic recovery, and tunnel exploitation (Foley et al., 2018).
  • Open-source environments and APIs: Platforms such as OpenAI Gym, PettingZoo, Ludii, OpenSpiel, PCGRL, and GDMC organize environments into single-/multi-agent, cooperative/adversarial, and creative design modalities with standardized interfaces, promoting rapid benchmarking and systematic testing (Hu et al., 2023).
  • Control of complexity and variation: Programmatic generation pipelines, modular representation schemes (content/control spaces), and variable evaluation metrics allow for scalable, domain-independent challenge design and controlled difficulty grading (Khalifa et al., 27 Mar 2025, Mishra et al., 11 Jun 2025).
  • Best practices: Stable prompts, robust output parsing, clear scoring scripts, and reproducible code bases are integral to reliable testbed deployment in LLM and RL evaluation (Momentè et al., 20 Feb 2025, Foley et al., 2018).
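
The test-function interface T : S × A* → {pass, fail} described above can be sketched as plain callables over a semantic state; the Breakout-style state encoding and "hit:<brick>" action strings are hypothetical, not ToyBox's actual API.

```python
from typing import Callable, Sequence

# A behavioural test maps a start state and an action trace to pass/fail,
# mirroring the T : S x A* -> {pass, fail} interface described above.
State = dict
Test = Callable[[State, Sequence[str]], bool]

def last_brick_eliminated(start: State, actions: Sequence[str]) -> bool:
    """Hypothetical Breakout-style check: after replaying `actions` from
    `start`, no bricks remain."""
    state = dict(start)
    for a in actions:
        if a.startswith("hit:"):
            brick = a.split(":", 1)[1]
            state["bricks"] = [b for b in state["bricks"] if b != brick]
    return len(state["bricks"]) == 0

def run_suite(tests, start, actions):
    """Report pass/fail per named test over one intervention scenario."""
    return {name: ("pass" if t(start, actions) else "fail")
            for name, t in tests.items()}
```

Because tests are ordinary functions over semantic state, suites can be composed, parameterized, and swept over programmatically generated start states.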

These architectural advances serve to lower barriers to entry, facilitate curriculum construction, and bridge the gap between black-box score evaluation and white-box behavioral inspection (Hu et al., 2023).

6. Limitations and Emerging Directions

While reference games offer unmatched precision and control, their scope and applicability must be contextualized by known limitations and emerging trends.

  • Domain transfer limitations: Success in tightly bounded reference games (e.g., color-grid identification, Codenames-style association) does not necessarily imply competence in open-dialogue or real-world applications (Ali et al., 12 Jan 2026).
  • Representational constraints: The effectiveness of mutation and search strategies in PCG or RL testbeds is sensitive to representation choices; complex genotypes such as extensive 2D slices (e.g., Mario levels) pose rugged fitness landscapes impeding convergence (Khalifa et al., 27 Mar 2025).
  • Automated challenge generation: Manual test definition remains a bottleneck; integrating symbolic logic or property-based test languages may facilitate scalable family construction for semantic requirements (Foley et al., 2018).
  • Calibration and clarification: Current confidence proxies (consistency, max-softmax) are coarse and insufficiently predictive of when a model should request clarification, underscoring the need for more granular uncertainty estimation (Ali et al., 12 Jan 2026).
  • Multi-agent and open-world extensions: Many reference-game testbeds remain limited to single-agent or simple multi-agent settings. Richer open-world and multi-agent dynamics will be necessary to simulate scenarios demanding lifelong learning, adaptive reasoning, and co-creative design (Hu et al., 2023).

Trends suggest a movement toward higher generality, seamless interactivity, and the inclusion of cognitive and social-emotional challenge dimensions—informed by the discriminative and diagnostic strengths of reference games as core testbeds for evolving AI systems (Momentè et al., 20 Feb 2025, Hu et al., 2023).


In summary, reference games provide a rigorously quantifiable, adaptable, and multifaceted framework for evaluating conceptual alignment, communicative efficiency, strategic reasoning, content generation, and uncertainty calibration in both artificial and human agents. Their role as testbeds continues to shape methodologies and benchmarks at the intersection of cognitive science, computational linguistics, reinforcement learning, and AI architecture research.
