Metrics for Emergent Communication
- Metrics for emergent communication are quantitative measures that assess the structural and functional properties of artificial languages in multi-agent systems.
- They encompass basic statistics like message accuracy and vocabulary size as well as advanced measures such as compositionality, context independence, and causal influence.
- These metrics guide research by diagnosing language emergence, evaluating model interpretability, and informing design choices for efficient agent communication.
Metrics for emergent communication provide a quantitative foundation for understanding how artificial agents develop, utilize, and optimize information exchange protocols in multi-agent and human-agent settings. As this field spans deep reinforcement learning, cognitive modeling, and information theory, the development and application of such metrics have become central to evaluating the emergence, structure, function, and interpretability of artificial communication systems. Metrics range from straightforward statistics like message accuracy and vocabulary size to sophisticated measures of compositionality, context independence, morphological structure, and pragmatic utility, offering rigorous insight into the mechanisms underpinning artificial language emergence.
1. Fundamental Categories of Metrics
Research on metrics for emergent communication delineates evaluation along multiple linguistic and functional axes. A comprehensive survey (Peters et al., 4 Sep 2024) categorizes metrics into four principal dimensions:
- Morphological metrics quantify word formation, vocabulary diversity, code reuse (e.g., compression) and ambiguity (e.g., perplexity, distinct appearances, active word count).
- Syntax metrics address grammatical structure, typically through unsupervised grammar induction and syntactic distinctness of emergent messages.
- Semantic metrics capture the degree to which emergent codes are meaning-grounded, compositional, and consistent. These include measures such as topographic similarity, tree reconstruction error, disentanglement, and mutual information.
- Pragmatic metrics focus on the functional utility of communication—its coordination, influence on behavior, efficiency (sparsity), and symmetry within agent populations.
The following table summarizes major metric families, their operational definitions, and typical measurement spaces:
| Metric Family | Core Concept | Example Implementation |
|---|---|---|
| Morphological | Token structure & diversity | DA, Perplexity, Message Distinctness |
| Syntax | Formal structure/grammar | UGI/CGI tree induction, parse tree comparison |
| Semantic | Meaning, compositionality | TopSim, TRE, posdis, bosdis, context independence |
| Pragmatic | Utility & coordination | Speaker Consistency, Causal Influence of Communication |
Each axis provides a distinct but complementary lens for understanding emergent protocols, reflecting the multifaceted nature of language.
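To make the morphological family concrete, here is a minimal, dependency-free Python sketch (function and variable names are illustrative, not from any cited implementation) that computes three of the statistics named above — active word count, message distinctness, and unigram perplexity — from a toy corpus of emergent messages:

```python
import math
from collections import Counter

def morphological_stats(messages):
    """Basic morphological statistics for a corpus of emergent messages.

    messages: list of token tuples, e.g. [(0, 1), (0, 2), ...]
    Returns (active word count, message distinctness, unigram perplexity).
    """
    tokens = [t for m in messages for t in m]
    counts = Counter(tokens)
    total = len(tokens)
    # Active word count: number of distinct symbols actually used.
    active_words = len(counts)
    # Message distinctness: fraction of unique messages in the corpus.
    distinctness = len(set(messages)) / len(messages)
    # Unigram perplexity: exp of the entropy of the symbol distribution.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    perplexity = math.exp(entropy)
    return active_words, distinctness, perplexity

corpus = [(0, 1), (0, 2), (0, 1), (3, 2)]
aw, dist, ppl = morphological_stats(corpus)
```

Low distinctness or a perplexity far below the vocabulary size both signal heavy symbol reuse or degenerate, ambiguous codes.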
2. Metrics for Compositionality and Semantic Structure
Compositionality metrics are a central focus, driven by the goal of distinguishing simple codebook mappings from communication systems that recursively and systematically encode complex concepts (Korbak et al., 2020, Boldt et al., 3 Jul 2024, Carmeli et al., 17 Mar 2024). Several classes have been operationalized:
- Topographic Similarity (TopSim): The Spearman correlation coefficient ρ between pairwise distances in referent (object/concept) space and message (signal) space, typically using cosine similarity for entities and Levenshtein or edit distance for signals:

  $$\rho = \mathrm{Spearman}\big(\{d_{\mathrm{meaning}}(o_i, o_j)\}_{i<j},\ \{d_{\mathrm{message}}(m_i, m_j)\}_{i<j}\big)$$

  High ρ reflects systematic compositional encoding: similar meanings yield similar signals.
- Bag-of-Symbols Disentanglement (BoSDis): Measures the mutual information between the presence of a symbol and particular attributes, averaged across symbols, to assess if each symbol uniquely encodes an attribute value (Gilberti et al., 7 Aug 2025).
- Tree Reconstruction Error (TRE): Evaluates whether a compositional function can be trained to reconstruct the mapping from derivations to messages:

  $$\mathrm{TRE} = \sum_{i} \delta\big(\hat{f}_{\eta}(d_i),\ f(d_i)\big)$$

  where $\delta$ is a distance function (e.g., cross-entropy), $\hat{f}_{\eta}$ is the learned compositional approximation, and $f$ is the ground-truth mapping from derivations $d_i$ to messages. Low TRE indicates non-trivial compositionality (Korbak et al., 2020).
- Context Independence: Typically computed through IBM Model 1-style alignments between symbols and concepts (Bogin et al., 2018), where a context-independence score near 1 signals near one-to-one and context-invariant correspondences.
- Best-Match Algorithm for Concept Mapping: Constructs a weighted bipartite graph between emergent words and natural language concepts, solves for a globally optimal matching (via the Hungarian algorithm), and provides both a global normalized compositionality score and an explicit translation map (Carmeli et al., 17 Mar 2024).
- Fusionality (F-TopSim): Quantifies the tendency of protocols to merge grammatical attributes into single symbols, measuring the increase in topographic similarity when features are treated as fused (Gilberti et al., 7 Aug 2025).
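The TopSim definition above can be sketched without external libraries. The following Python example (illustrative names; Hamming distance over attribute vectors stands in for the cosine-similarity variant) computes the Spearman correlation between pairwise meaning distances and pairwise Levenshtein message distances:

```python
from itertools import combinations

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def _ranks(values):
    # Average ranks with tie handling, as required by Spearman's rho.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def topographic_similarity(meanings, messages):
    """Spearman rho between pairwise meaning distances (Hamming here)
    and pairwise message distances (Levenshtein)."""
    pairs = list(combinations(range(len(meanings)), 2))
    d_meaning = [sum(x != y for x, y in zip(meanings[i], meanings[j]))
                 for i, j in pairs]
    d_message = [levenshtein(messages[i], messages[j]) for i, j in pairs]
    rx, ry = _ranks(d_meaning), _ranks(d_message)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / var

# A perfectly compositional toy language: one symbol per attribute value.
meanings = [(0, 0), (0, 1), (1, 0), (1, 1)]
messages = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
rho = topographic_similarity(meanings, messages)
```

For this perfectly compositional mapping the two distance vectors coincide, so ρ = 1; a protocol with holistic, unstructured codes would drive ρ toward 0.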
3. Metrics for Communication Efficiency, Robustness, and Pragmatics
Pragmatic and functional metrics evaluate task-related performance and the causal impact of communication.
- Success Rate and Reward: Basic but indispensable, measuring the fraction of correctly completed referential/discriminative tasks (e.g., Li et al., 2020; Carmeli et al., 2022; Karten et al., 2022).
- Speaker Consistency (SC): The mutual information between an agent's message and its subsequent action:

  $$\mathrm{SC} = I(m_t;\ a_t)$$

  High SC identifies positive signaling but does not guarantee positive listening (Lowe et al., 2019).
- Causal Influence of Communication (CIC): Directly assesses whether a message, when intervened, changes the receiver’s subsequent action or policy distribution. CIC relies on causal interventions, not just correlations.
- Instantaneous Coordination (IC): The mutual information between the speaker's message and the listener's next action; it strengthens the causal diagnosis relative to SC but remains vulnerable to confounds (e.g., averaging over states rather than conditioning on them).
- L1 and L2 Loss in Competitive Games: In partially competitive emergent settings, loss reduction (e.g., angular error in target selection) and fairness (sum of squared losses) become primary communication diagnostics (Noukhovitch et al., 2021).
- Entropy and Message Length: Lower entropy in message distributions or adaptive message lengths (especially in multi-step referential games) correspond to more efficient, targeted dialogues (Evtimova et al., 2017).
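The mutual-information quantities underlying SC and IC can be estimated with a simple plug-in estimator over observed (message, action) pairs. A minimal Python sketch (toy data and names are illustrative, not from the cited papers):

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(M; A) in nats from (message, action) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_m = Counter(m for m, _ in pairs)
    count_a = Counter(a for _, a in pairs)
    mi = 0.0
    for (m, a), c in joint.items():
        p_ma = c / n
        # p(m,a) * log[ p(m,a) / (p(m) p(a)) ], with counts rescaled by n.
        mi += p_ma * math.log(c * n / (count_m[m] * count_a[a]))
    return mi

# Deterministic signaling: each message fully predicts the action.
consistent = [("go", "left"), ("stop", "right")] * 50
# No signaling: the message carries no information about the action.
uninformative = [("go", "left"), ("go", "right"),
                 ("stop", "left"), ("stop", "right")] * 25
hi_sc = mutual_information(consistent)
lo_sc = mutual_information(uninformative)
```

Here `hi_sc` equals ln 2 (the full entropy of the binary message) while `lo_sc` is 0. Note that this estimator captures correlation only; diagnosing causal influence (CIC) additionally requires intervening on the message and comparing the listener's resulting policy distributions.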
4. Task-Specific and Structural Metrics
Many recent works tailor their metrics to structural or domain-specific phenomena:
- HASLen and BPELen (Concatenativity): Extracted by segmenting emergent messages into morpheme-like units (via Harris Articulation Scheme or Byte-Pair Encoding) and averaging the number of symbols per message—lower values indicate greater concatenativity, mirroring natural language inflection (Gilberti et al., 7 Aug 2025).
- Redundancy, Ambiguity, and Coverage: Quantified via perplexity, active word coverage, and rates of ambiguous translation in best-match mapping (Carmeli et al., 17 Mar 2024, Peters et al., 4 Sep 2024).
- Sparsity/Efficiency: The fraction of time-steps with nonzero messages (e.g., ComSpar in human-agent teams), which relates protocol efficiency to human cognitive load (Karten et al., 2022).
- Symmetry Metrics: Inter-agent and within-agent divergence measures (using Jensen-Shannon divergence) assess whether the semantics expressed remain agent-invariant.
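The Jensen-Shannon divergence used by such symmetry metrics is straightforward to compute. A small Python sketch (distributions are toy examples; names are illustrative) comparing two agents' message distributions for the same referent:

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence in nats; terms with p_i = 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (nats) between two message distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Per-agent distributions over three messages for the same referent.
agent_a = [0.7, 0.2, 0.1]
agent_b = [0.1, 0.2, 0.7]   # disagrees with agent_a
agent_c = [0.7, 0.2, 0.1]   # matches agent_a exactly
d_asym = jensen_shannon(agent_a, agent_b)
d_sym = jensen_shannon(agent_a, agent_c)
```

A divergence near 0 (as for `d_sym`) indicates agent-invariant semantics; values approaching the upper bound of ln 2 indicate agents that express the same meaning with disjoint codes.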
5. Metrics Beyond Shannon: Semantic and Causal Foundations
Recent proposals argue for semantic metrics, extending beyond classical information-theoretic approaches:
- Semantic Information Measure: Defined using category-theoretic constructs such as representable copresheaves, which encode logical entailments and measure "semantic surprise" (Thomas et al., 2022). This permits the quantification of conveyed meaning under logical models, not merely statistical uncertainty.
- Semantic Reliability and Distortion: Evaluates the extent to which the intended meaning or causal explanation is preserved at the receiver, irrespective of bit-level errors (Thomas et al., 2022).
6. Pitfalls, Limitations, and Best Practices in Metric Design
Empirical studies caution against overreliance on any single metric. For example, high speaker consistency may persist in the absence of true inter-agent impact if network architectures share representations (“positive signaling” without “positive listening”) (Lowe et al., 2019). It is recommended to:
- Use multiple, complementary metrics (e.g., combine SC, CIC, and architectural ablation).
- Apply causal interventions to verify communicative effect.
- Visualize communication and action distributions to check for spurious correlations.
- Tailor metrics to both agent-only and human-in-the-loop settings for practical validation (Karten et al., 2022).
A unified framework for benchmarking across task types, agent architectures, and linguistic structures remains an open challenge (Peters et al., 4 Sep 2024).
7. Implications and Future Directions
The diversity of metrics illuminates key avenues for emergent communication research:
- Advancing interpretability and systematicity by adopting metrics that detect compositionality, fusionality, and context independence, while mapping emergent symbols onto human-understandable concepts (Carmeli et al., 17 Mar 2024, Gilberti et al., 7 Aug 2025).
- Bridging formal semantics and machine learning via tree reconstruction and semantic information metrics, aligning the emergent protocols with theoretical properties of natural languages (Korbak et al., 2020, Thomas et al., 2022).
- Improving cross-domain transfer by correlating communicative efficiency metrics (e.g., TopSim, generalization accuracy) with performance in downstream tasks such as few-shot translation or embodied imitation (Li et al., 2020, Mu et al., 2023).
- Designing interventions and inductive biases (e.g., cognitive mapping, attention, ease-of-articulation pressures) that steer learned languages toward natural or application-specific structures (Chen et al., 28 May 2025, Gilberti et al., 7 Aug 2025).
- Standardizing evaluation and benchmarking to overcome reproducibility and comparability barriers across the field (Peters et al., 4 Sep 2024, Boldt et al., 3 Jul 2024).
Metrics for emergent communication thus not only enable rigorous evaluation but actively inform the design, diagnosis, and refinement of artificial language systems, facilitating progress towards human-compatible, efficient, and interpretable multi-agent communication.