Referential Games: Symbolic vs. Pixel Input
- The paper shows that referential games enable neural agents to develop structured communication protocols shaped by input modality and discrete bottlenecks.
- It details architectures employing CNNs for pixel inputs and MLPs for symbolic inputs, trained with a mix of supervised and reinforcement learning techniques.
- Empirical findings indicate that protocol compositionality and generalization vary with input type, highlighting challenges in scaling emergent language systems.
Referential games with symbolic and pixel input constitute a core experimental paradigm for studying how artificial communication protocols emerge under different input modalities. In this setting, agents—typically neural networks—play cooperative games in which a sender encodes a referent into a variable-length message constrained by a discrete bottleneck and a receiver must decode it, with inputs ranging from structured symbolic descriptions to raw pixels. The structure, efficiency, and compositionality of the emergent languages are shaped directly by the input modality, architecture, training protocol, and explicit or implicit information bottlenecks.
1. Game Formalization and Modalities
Referential games in this domain are typically played between two agents, a sender (speaker) and a receiver (listener). A canonical formalization is as follows:
- Let O denote the set of candidate objects (referents).
- The sender observes a target t ∈ O, presented either as a symbolic encoding (e.g., attribute vectors, natural language definitions) or as a pixel-based embedding (e.g., images processed by convolutional networks) (Lazaridou et al., 2018, Evtimova et al., 2017).
- The receiver observes either a candidate set C ⊆ O containing the target among distractors (discriminative games), a set of symbolic descriptions, or, in multimodal games, a distinct view of the referent, such as raw text when the sender observes pixels (Evtimova et al., 2017).
- The message space M is discrete, e.g., sequences of tokens drawn from a finite alphabet, or binary strings, with configurable length and vocabulary size.
- The objective for both agents is to maximize communication success, i.e., correct identification of the referent, by jointly optimizing a reward or cross-entropy signal, possibly subject to reinforcement learning constraints and entropy regularization.
Critically, games may be unidirectional (single-shot), bidirectional (multi-step dialog), or hierarchical, and may vary in the number of distractors, message length constraints, and whether sender and receiver share modality or operate across modalities (symbolic → symbolic, pixel → symbolic, etc.) (Evtimova et al., 2017).
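The single-shot discriminative variant above can be sketched as a minimal game loop. This is a toy illustration only: the attribute-based referent encoding and the `sender`/`receiver` interfaces are hypothetical placeholders, not any cited paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_referent(n_attrs=4, n_vals=5):
    """A symbolic referent: flattened one-hot encoding of attribute values
    (a hypothetical toy encoding)."""
    vals = rng.integers(0, n_vals, size=n_attrs)
    onehot = np.zeros((n_attrs, n_vals))
    onehot[np.arange(n_attrs), vals] = 1.0
    return onehot.ravel()

def play_round(sender, receiver, n_distractors=3):
    """One round of a single-shot discriminative game: the sender sees only
    the target, the receiver sees the message plus all candidates, and the
    shared reward is 1 iff the receiver picks the target."""
    candidates = [make_referent() for _ in range(n_distractors + 1)]
    target_idx = rng.integers(0, n_distractors + 1)
    message = sender(candidates[target_idx])   # discrete token sequence
    choice = receiver(message, candidates)     # index into candidate set
    return float(choice == target_idx)
```

Any sender/receiver pair with these signatures can be dropped in; training then amounts to maximizing the average returned reward.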
2. Architectures and Training Regimes
Sender and receiver networks are often composed of modality-specific encoders and a shared or compositional decoder:
- Symbolic input encoders tend to be shallow MLPs or embedding averages (GloVe, word2vec) for structured attribute vectors or text.
- Pixel input encoders employ deeper CNN backbones, sometimes with attention, to extract semantically meaningful features (Lazaridou et al., 2018, Evtimova et al., 2017, Gupta et al., 2021).
- Message generation employs LSTMs or shallow affine heads (for binary vector output), often factorized across tokens/bits, and sampling is sometimes made differentiable by Gumbel-Softmax with straight-through estimators (Denamganaï et al., 2020, Gupta et al., 2021).
- Receivers typically combine message embeddings (e.g., via LSTM/GRU or transformer) with visual or symbolic referent candidates, either scoring them (cosine similarity, softmax over dot products) or classifying into the referent space.
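The discretization step above can be made concrete with a per-token Gumbel-Softmax draw. This numpy-only sketch shows the forward pass; in practice it runs inside an autodiff framework so the straight-through estimator can pass gradients through the hard one-hot output.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0, hard=True):
    """One Gumbel-Softmax sample over a token vocabulary (1-D logits).
    With hard=True this mimics the straight-through forward pass:
    the output is a discrete one-hot vector."""
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u + 1e-20) + 1e-20)   # Gumbel(0, 1) noise
    y = np.exp((logits + g) / tau)
    y /= y.sum()
    if hard:
        onehot = np.zeros_like(y)
        onehot[y.argmax()] = 1.0
        return onehot
    return y

def sample_message(token_logits, tau=0.5):
    """Factorized message generation: one independent draw per token position."""
    return [int(gumbel_softmax(l, tau).argmax()) for l in token_logits]
```

Lower temperatures `tau` sharpen the soft distribution toward its argmax; the argmax of a Gumbel-perturbed logit vector is itself an exact sample from the softmax of the logits.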
Agent pairs are trained with a mix of supervised classification loss, policy-gradient reinforcement learning (REINFORCE), entropy bonuses, and auxiliary baselines (for variance reduction) (Evtimova et al., 2017, Lazaridou et al., 2018). Training often involves curriculum components, such as adaptive stopping or variable dialog length, to capture the dynamic nature of referential communication.
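Per batch, the policy-gradient part of this recipe reduces to a simple surrogate loss. A hedged sketch with a batch-mean baseline and entropy bonus follows; the function name and coefficient value are illustrative, not taken from any of the cited papers.

```python
import numpy as np

def sender_surrogate_loss(logprobs, entropies, rewards, entropy_coef=0.01):
    """REINFORCE with a mean-reward baseline and an entropy bonus:
    loss = -mean((R - b) * log pi(m|t)) - beta * mean(H(pi)).
    The baseline b (batch-mean reward) reduces gradient variance
    without biasing the gradient; the entropy term delays collapse
    onto a degenerate protocol."""
    baseline = rewards.mean()
    advantages = rewards - baseline
    return -(advantages * logprobs).mean() - entropy_coef * entropies.mean()
```

Differentiating this loss with respect to the sender's parameters recovers the REINFORCE gradient estimate with variance reduction.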
3. Metrics for Language Structure and Generalization
Evaluation of emergent protocols combines success on the nominal task with structural and generalization metrics:
| Metric | Definition/Implementation | Purpose |
|---|---|---|
| Topographic Similarity (TS) | Spearman rank correlation between pairwise distances in meaning space (e.g., Hamming or cosine distance over attributes) and message space (e.g., Levenshtein distance), computed over a batch of referents and messages (Lazaridou et al., 2018, Denamganaï et al., 2020). | Measures alignment between meaning-space and message-space structure (proxy for compositionality). |
| Zero-shot Compositional Accuracy | Accuracy on held-out or OOD (out-of-distribution) attribute combinations or referents not seen during training, controlled by designed splits (Denamganaï et al., 2020, Lazaridou et al., 2018, Evtimova et al., 2017). | Assesses systematic generalization. |
| Lexicon Size/Novelty | Number or proportion of unique/novel messages, especially on novel inputs (Lazaridou et al., 2018). | Proxy for language productivity |
| Communication Bandwidth/Mutual Info | I(t; m), the mutual information between the sender's view t and the messages m (Evtimova et al., 2017). | Quantifies information transfer |
| Matching Accuracy | Fraction of trials in which the receiver correctly identifies the referent, optionally counting a hit if the target ranks in the top K candidates (Accuracy@K) (Evtimova et al., 2017, Gupta et al., 2021, Clark et al., 2021). | Direct measure of functional success |
Notably, high TS does not guarantee generalization (zero-shot), especially in visual/pixel regimes (Denamganaï et al., 2020).
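As a concrete reading of the TS row above, one common implementation pairs Hamming distance over attribute vectors with Levenshtein distance over messages. The sketch below assumes that pairing; for simplicity it uses ordinal rather than average ranks, so exact ties are broken arbitrarily (a small simplification relative to a full Spearman implementation).

```python
import itertools
import numpy as np

def levenshtein(a, b):
    """Edit distance between two token sequences (Wagner-Fischer, one row)."""
    dp = np.arange(len(b) + 1)
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return int(dp[-1])

def spearman(x, y):
    """Spearman correlation via ordinal ranks (ties broken by position)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def topographic_similarity(meanings, messages):
    """Correlation between pairwise meaning distances (Hamming) and
    message distances (Levenshtein) over all referent pairs."""
    md, sd = [], []
    for i, j in itertools.combinations(range(len(meanings)), 2):
        md.append(int(np.sum(meanings[i] != meanings[j])))
        sd.append(levenshtein(messages[i], messages[j]))
    return spearman(md, sd)
```

A perfectly compositional protocol, where each attribute maps to one token position, yields TS near 1; a random lookup-table protocol yields TS near 0.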
4. Empirical Findings and Modality Effects
Symbolic Input
Agents operating on structured (symbolic) inputs reliably invent protocols with nontrivial lexicon size, partial compositionality, and strong topographic similarity between input and message spaces. Empirical results consistently show high in-domain accuracy, increased productivity with greater message length, and relatively robust generalization to held-out attributes when the underlying compositional structure is preserved (Lazaridou et al., 2018, Denamganaï et al., 2020).
Pixel Input
For raw-pixel regimes, performance and structure are more sensitive to task setup and architectural choices:
- Communication accuracy and lexicon size can approach symbolic levels on tasks with abundant structure in the pixel space (e.g., position-encoding games) (Lazaridou et al., 2018).
- Compositionality (TS) is weakly or inconsistently tied to batch size, channel capacity, or message length when compared to symbolic inputs. CNNs with batch normalization tend to regularize the emergent protocols, sometimes diluting information bottleneck effects (Denamganaï et al., 2020).
- As the input entanglement increases (e.g., pixel images with little attribute disentanglement), protocols become ad-hoc and less compositional, often overfitting to incidental features of the data or game (Lazaridou et al., 2018, Denamganaï et al., 2020).
- Multi-step, bidirectional games with multimodal access (e.g., sender: image, receiver: text bag-of-words) can support systematic protocol emergence and improved OOD transfer by leveraging richer message spaces and adaptive dialog length (Evtimova et al., 2017).
Bandwidth and Message Structure
Increasing discrete channel capacity (either by message length or vocabulary) generally aids transfer and generalization, but only up to a point. Overcomplete channels facilitate disentangling and compositionality in the Gumbel-Softmax setting, although excessive vocabulary size can undermine compositional reuse (Denamganaï et al., 2020).
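The capacity trade-off sketched here has a simple quantitative reading: the number of bits a discrete channel can carry grows linearly in message length but only logarithmically in vocabulary size (the function name below is illustrative).

```python
import math

def channel_capacity_bits(vocab_size, msg_len):
    """Upper bound on the information a discrete message can carry:
    log2 of the number of distinct messages, vocab_size ** msg_len."""
    return msg_len * math.log2(vocab_size)

# Two channels of equal capacity but different shape: a long message over a
# small alphabet vs. a short message over a large one (both carry 10 bits).
```

Equal-capacity channels can nonetheless behave very differently in practice, since token reuse across positions (not raw capacity) is what compositional protocols exploit.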
5. Specialized Variants and Practical Implementations
Several notable variants adapt the referential game formalism to richer practical scenarios:
- PatchGame: Communicates via discrete, ranked patch-symbols representing salient image regions; messages are sequences of sampled tokens indicating mid-level textures or object parts. The protocol that emerges is semantically meaningful (“visual words”), supports rapid matching across augmented views, and provides acceleration for downstream ViT models by enabling inference on selected patches only (Gupta et al., 2021).
- Multimodal Multi-Step Games: Sender and receiver communicate via binary vectors across image and text modalities, taking bidirectional turns until the receiver signals sufficient information has been gathered. Rich protocols emerge, mirroring variable-length, progressively specific information exchange seen in human dialogs. Attention mechanisms further support cross-domain generalization (Evtimova et al., 2017).
- Iconary: Expands the referential game into explicitly multimodal, human-AI settings: the "Drawer" (agent or human) composes icon-based drawings of phrases, and the "Guesser" attempts to identify the phrase via text. Both roles are trainable via sequence-to-sequence LMs (e.g., T5), with explicit scaffolding to handle OOV generalization, fill-in-the-blank reasoning, and variable-length, iterative exchange (Clark et al., 2021).
6. Theoretical Insights and Open Questions
The degree of compositionality in emergent protocols is tightly controlled by the structure and disentanglement of the input representation. Symbolic regimes systematically produce stable, compositional encodings; pixel regimes typically yield more fragile and idiosyncratic codes, unless explicit architectural or training biases are introduced (Lazaridou et al., 2018, Denamganaï et al., 2020). Overcomplete communication channels and strong bottlenecks can support (but do not guarantee) compositionality, especially when differentiable estimators like Gumbel-Softmax are used, but the mapping from high TS to robust generalization is not reliable in pixel or multimodal environments (Denamganaï et al., 2020).
Significant challenges remain in engineering inductive biases or environmental pressures that foster robust, human-like compositionality—especially with raw, entangled sensory inputs. Open questions include methods for disentanglement pretraining, multi-task or curriculum learning regimes, and adaptive dialog protocols that exploit variable-length, bidirectional signaling. Emerging results in Iconary and PatchGame suggest that communication across modalities via structured discrete message spaces is feasible and practically beneficial, but that performance gaps to humans (especially on generative, non-literal, and abstract tasks) persist and highlight fertile ground for future study (Clark et al., 2021, Gupta et al., 2021).
7. Summary Table: Core Contrasts Across Input Modalities
| Aspect | Symbolic Input | Pixel Input |
|---|---|---|
| Protocol Compositionality | Robust, high TS, generalization strong | Sensitive, often low TS, generalization inconsistent |
| Role of Channel Capacity | Bottleneck and overcompleteness boost structure | Overcomplete channels help, but to a lesser extent; vocab size effects mixed |
| Task Generalization | Predictable, strong on systematic tests | Highly sensitive to train/test split, easily overfits |
| Metrics (TS ↔ Generalization) | Strongly correlated | Weak or no correlation; TS cannot proxy zero-shot performance |
These patterns underline the crucial dependence of emergent communication structure on the statistics and encoding of the input space, and the enduring challenge of scaling artificial referential games to settings that demand robust, generalizable, and interpretable language emergence (Lazaridou et al., 2018, Denamganaï et al., 2020, Evtimova et al., 2017, Gupta et al., 2021, Clark et al., 2021).