Info-Theoretic Framework for Object Naming
- The paper formalizes discrete naming as a stochastic channel balancing communicative accuracy and lexicon complexity via IB and MDL principles.
- It employs iterative algorithms such as Blahut–Arimoto and MDL-inspired measures to optimize naming efficiency, validated across domains like color, kinship, and artifacts.
- The framework unifies cognitive, linguistic, and algorithmic insights, showing that optimally aligned speaker–listener systems closely approximate human naming behavior.
An information-theoretic framework for discrete object naming systems formalizes the lexicon formation and naming behavior observed in natural languages as an optimal trade-off between expressivity (communicative accuracy or informativeness) and parsimony (lexicon complexity). This approach connects discrete category systems in domains such as color, artifact, animal, and kinship naming to principles such as the Information Bottleneck (IB), minimum description length (MDL), and efficient coding. The framework rigorously defines the underlying communication channel, codifies essential cost/accuracy trade-offs, and is empirically validated on a range of linguistic phenomena, demonstrating that human naming systems often closely approximate theoretical optima.
1. Formal Structure and Core Modeling Paradigms
At its foundation, the framework models object naming as a stochastic channel in which a speaker, presented with a stimulus $x$ from a discrete set (e.g., kinship roles, color chips, artifacts), emits a label or signal $w$ chosen according to a conditional distribution $q(w \mid x)$, and a listener decodes this back to an intended meaning or object.
Key random variables:
- $X$: Intended object (e.g., kin term, color chip)
- $W$: Emitted label (word, signal)
- $U$: Set of perceptual or semantic features, or communicative targets
- $q(w \mid x)$: Encoding channel of the speaker
- $r(x \mid w)$: Listener's decoding channel (possibly Bayesian)
The IB framework frames the design or emergence of the encoder $q(w \mid x)$ as an optimization that interpolates between two objectives:
- Complexity ($I(X;W)$): Mutual information between objects and labels, quantifying the cost or size of the lexicon.
- Accuracy ($I(W;U)$): Mutual information reflecting how well the label preserves the communicative target or semantic content.
The central IB Lagrangian is:
$$\mathcal{F}_\beta[q(w \mid x)] = I(X;W) - \beta\, I(W;U),$$
where $\beta \geq 0$ governs the trade-off between compression and informativeness (Zaslavsky et al., 2018, Zaslavsky et al., 2019).
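As a concrete rendering of these quantities, the following minimal NumPy sketch computes both information terms and the IB value for a candidate encoder; `p_u_given_x` stands in for whatever meaning distributions a given domain supplies, and the code is illustrative rather than drawn from the cited papers.

```python
import numpy as np

def mutual_information(p_joint, eps=1e-12):
    """I(A;B) in bits from a joint distribution table."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    return float((p_joint * np.log2((p_joint + eps) / (p_a * p_b + eps))).sum())

def ib_objective(p_x, p_u_given_x, q_w_given_x, beta):
    """Complexity I(X;W), accuracy I(W;U), and the IB value F_beta.

    p_x          : (n_x,)      prior over objects
    p_u_given_x  : (n_x, n_u)  meaning distributions p(u|x)
    q_w_given_x  : (n_x, n_w)  candidate speaker encoder q(w|x)
    """
    p_xw = p_x[:, None] * q_w_given_x       # joint p(x, w)
    p_wu = p_xw.T @ p_u_given_x             # joint p(w, u) = sum_x p(x,w) p(u|x)
    complexity = mutual_information(p_xw)   # I(X;W)
    accuracy = mutual_information(p_wu)     # I(W;U)
    return complexity, accuracy, complexity - beta * accuracy
```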
Alternatively, in MDL-inspired accounts, the total cost
$$\mathcal{C}(\mathcal{L}) \;=\; \mathrm{DL}(\mathcal{L}) \;+\; \mathbb{E}_{x}\!\left[\mathrm{DL}(x \mid \mathcal{L})\right]$$
sums the fixed cost of storing a library $\mathcal{L}$ of part-concepts with the average description length for objects (Wong et al., 2022).
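Schematically, and with `description_length` as a hypothetical placeholder for the program-induction coding scheme (which this sketch does not reproduce), the two-part cost reads:

```python
def mdl_cost(library, objects, description_length):
    """Two-part MDL cost: fixed cost of storing the library of part-concepts
    plus the average description length of objects encoded with it.

    `description_length(item, library)` is a hypothetical placeholder for a
    coding scheme (e.g., program length in bits under the library).
    """
    library_cost = sum(description_length(part, ()) for part in library)
    avg_object_cost = sum(
        description_length(obj, library) for obj in objects
    ) / len(objects)
    return library_cost + avg_object_cost
```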
In referential game formulations, the joint cost for speaker–listener pairs is the expected listener log-loss
$$\mathcal{J}(q, r) \;=\; \mathbb{E}_{x \sim p(x)}\,\mathbb{E}_{w \sim q(w \mid x)}\!\left[-\log r(x \mid w)\right],$$
which is minimized subject to a constraint or dual penalty on complexity (Le et al., 24 Nov 2025).
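In code, the joint cost is a single expectation over the object prior and the speaker's channel; the array names below are illustrative, matching the earlier sketch:

```python
import numpy as np

def referential_game_loss(p_x, q_w_given_x, r_x_given_w, eps=1e-12):
    """Expected listener log-loss E_{x,w}[-log r(x|w)] in bits.

    r_x_given_w : (n_w, n_x) listener decoder, rows normalized over x.
    """
    p_xw = p_x[:, None] * q_w_given_x       # joint p(x, w)
    return float(-(p_xw * np.log2(r_x_given_w.T + eps)).sum())
```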
2. Optimization Objectives and Theoretical Guarantees
Under these frameworks, system optimality is characterized via the Pareto frontier or information plane defined by the pairs $(I(X;W),\, I(W;U))$. The trade-off curve arises because maximal compression (small $I(X;W)$) yields trivial or ambiguous naming, while maximal informativeness (large $I(W;U)$) demands a large, often individuated lexicon.
For the referential game paradigm, a main theorem establishes that the optimal trade-off is achievable if and only if the listener's decoder matches the speaker's Bayesian posterior $q(x \mid w) \propto p(x)\, q(w \mid x)$:
$$\mathbb{E}_{x,w}\!\left[-\log r(x \mid w)\right] \;=\; H(X) - I(X;W) \;+\; \mathbb{E}_{w}\!\left[D_{\mathrm{KL}}\!\big(q(\cdot \mid w)\,\|\,r(\cdot \mid w)\big)\right],$$
where $H(X)$ is the entropy of the object prior (Le et al., 24 Nov 2025). The KL-divergence term between $q(x \mid w)$ and $r(x \mid w)$ quantifies listener suboptimality; the bound is tight when the listener is optimal.
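The identity is straightforward to verify numerically: with the Bayesian listener the KL term vanishes and the log-loss collapses to $H(X) - I(X;W)$. A toy check on random distributions (not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_w = 6, 4
p_x = rng.dirichlet(np.ones(n_x))
q_w_given_x = rng.dirichlet(np.ones(n_w), size=n_x)   # speaker q(w|x)

p_xw = p_x[:, None] * q_w_given_x                     # joint p(x, w)
p_w = p_xw.sum(axis=0)                                # marginal q(w)
q_x_given_w = p_xw / p_w                              # Bayesian posterior q(x|w)

h_x = -(p_x * np.log2(p_x)).sum()                     # H(X)
i_xw = (p_xw * np.log2(p_xw / (p_x[:, None] * p_w))).sum()

# With the Bayesian listener the KL term vanishes, so the expected
# log-loss equals H(X) - I(X;W) exactly.
log_loss = -(p_xw * np.log2(q_x_given_w)).sum()
assert np.isclose(log_loss, h_x - i_xw)
```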
In IB-based models, optimization is performed via iterative Blahut–Arimoto-style updates:
$$q_{t+1}(w \mid x) \;\propto\; q_t(w)\, \exp\!\big(-\beta\, D_{\mathrm{KL}}\!\left[m_x \,\|\, \hat m_w\right]\big),$$
where $m_x(u) = p(u \mid x)$ is the meaning of object $x$, with simultaneous updates for the marginal $q_t(w)$ and the prototype meanings $\hat m_w$ (Zaslavsky et al., 2019). The number of effective categories exhibits phase transitions as $\beta$ increases, capturing the hierarchical emergence of basic, intermediate, and fine-grained categories (Zaslavsky et al., 2018).
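A minimal sketch of these coupled updates, assuming the same inputs as the `ib_objective` sketch above; the update rule follows the standard IB form, and convergence is declared when the encoder stops changing:

```python
import numpy as np

def blahut_arimoto_ib(p_x, p_u_given_x, n_w, beta, tol=1e-8, max_iter=5000):
    """Iterate the IB self-consistent equations for the encoder q(w|x).

    p_x         : (n_x,)      prior over objects
    p_u_given_x : (n_x, n_u)  meaning distributions m_x(u) = p(u|x)
    """
    rng = np.random.default_rng(0)
    q = rng.dirichlet(np.ones(n_w), size=len(p_x))     # random init of q(w|x)
    for _ in range(max_iter):
        p_xw = p_x[:, None] * q
        q_w = p_xw.sum(axis=0) + 1e-30                 # marginal q(w)
        q_x_given_w = (p_xw / q_w).T                   # Bayes inverse, (n_w, n_x)
        m_hat = q_x_given_w @ p_u_given_x              # prototype meanings m_hat_w(u)
        # D_KL(m_x || m_hat_w) for every (x, w) pair (in nats; only rescales beta)
        log_ratio = np.log(p_u_given_x[:, None, :] + 1e-30) \
                  - np.log(m_hat[None, :, :] + 1e-30)
        d = (p_u_given_x[:, None, :] * log_ratio).sum(axis=-1)
        new_q = q_w[None, :] * np.exp(-beta * d)       # q(w|x) prop. to q(w) exp(-beta D)
        new_q /= new_q.sum(axis=1, keepdims=True)
        if np.abs(new_q - q).max() < tol:              # convergence: encoder stable
            return new_q
        q = new_q
    return q
```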
3. Empirical Evaluation and Domain Instantiations
This framework has been instantiated and empirically validated across multiple domains:
- Color naming: Naming patterns in the World Color Survey and American English are shown to lie near the IB theoretical bound, with category emergence governed by structural phase transitions (Zaslavsky et al., 2018).
- Artifact and animal categories: Container and animal naming in Dutch and French align with the IB-optimal trade-off curve, showing inefficiency values under 20% and gNID values near 0.1. IB-derived hierarchies mirror established cross-linguistic taxonomic growth trajectories (Zaslavsky et al., 2019).
- Object part concepts via concept libraries: The trade-off between library complexity (number of reusable primitives/subroutines) and average description length yields a U-shaped cost curve, with optimal intermediate abstractions closely matching human-chosen lexica. Library–language alignment is quantitatively assessed via held-out log-likelihood using IBM Model 1 alignment, as sketched after this list (Wong et al., 2022).
- Kinship naming: Neural agents trained in referential games over 32 kinship roles learn policies that empirically trace the analytic lower bound relating code complexity $I(X;W)$ and listener log-loss. Variation in real-language communicative need reproduces cross-linguistic differences, and listener suboptimality directly degrades trade-off optimality (Le et al., 24 Nov 2025).
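As referenced in the concept-library bullet above, the alignment step can be illustrated with textbook IBM Model 1 EM over toy parallel pairs; this is a generic sketch, not the exact pipeline of Wong et al. (2022):

```python
from collections import defaultdict

def ibm_model1(parallel_pairs, n_iters=10):
    """EM for IBM Model 1 lexical translation probabilities t(f|e).

    parallel_pairs: list of (e_tokens, f_tokens) pairs, e.g. concept-library
    primitives paired with the tokens of a naming phrase. NULL alignment is
    omitted for brevity.
    """
    t = defaultdict(float)
    for e_sent, f_sent in parallel_pairs:            # uniform-equivalent init
        for f in f_sent:
            for e in e_sent:
                t[(f, e)] = 1.0
    for _ in range(n_iters):
        count, total = defaultdict(float), defaultdict(float)
        for e_sent, f_sent in parallel_pairs:        # E-step: expected counts
            for f in f_sent:
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():              # M-step: renormalize per e
            t[(f, e)] = c / total[e]
    return t
```

Held-out alignment log-likelihood under Model 1 is then $\sum_j \log\big(\tfrac{1}{|e|}\sum_i t(f_j \mid e_i)\big)$ per pair, summed over a test set.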
A table summarizing representative empirical metrics:
| Domain | Complexity–Accuracy Bound | Empirical Departure | Emergence Pattern |
|---|---|---|---|
| Color naming | IB frontier | Close to the IB bound | Phase transitions in $\beta$ |
| Artifact/Animal | IB frontier | Inefficiency $< 20\%$, gNID $\approx 0.1$ | Hierarchical category splits |
| Kinship | Analytic log-loss bound | Near analytic bound | Lexicon clusters by type |
4. Algorithmic and Implementation Aspects
Practical computation relies on scalable mutual information estimation and iterative minimization techniques:
- Blahut–Arimoto algorithm: For IB optimizations, iteratively updates the encoder, label marginal, and prototype meanings using KL-divergence terms. Each iteration scales roughly as $O(|X|\,|W|\,|U|)$, tractable for moderate domain sizes (Zaslavsky et al., 2018, Zaslavsky et al., 2019).
- MDL and library search: U-shaped trade-offs in code description are evaluated over hierarchically constructed libraries, with log-likelihood alignment to human language measured by machine-translation alignment models such as IBM Model 1 (Wong et al., 2022).
- Neural referential games: Neural encoders (e.g., RGCN) parameterize both speaker and listener; object–label mappings are sampled and gradients propagated (via REINFORCE or Gumbel-Softmax) to minimize the joint loss (Le et al., 24 Nov 2025).
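A minimal PyTorch sketch of the Gumbel-Softmax variant, with plain embeddings standing in for RGCN encoders and illustrative sizes (32 objects echoing the kinship domain); the cited work's exact architecture and coupled complexity penalty are not reproduced here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_OBJECTS, N_WORDS, EMB = 32, 16, 64   # illustrative sizes, not the paper's

class Speaker(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Embedding(N_OBJECTS, EMB),
                                 nn.ReLU(), nn.Linear(EMB, N_WORDS))
    def forward(self, x, tau=1.0):
        # Differentiable (approximately one-hot) word sample
        return F.gumbel_softmax(self.net(x), tau=tau, hard=True)

class Listener(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(N_WORDS, N_OBJECTS)
    def forward(self, w_onehot):
        return self.net(w_onehot)      # logits over candidate objects

speaker, listener = Speaker(), Listener()
opt = torch.optim.Adam([*speaker.parameters(), *listener.parameters()], lr=1e-3)

for step in range(2000):
    x = torch.randint(0, N_OBJECTS, (128,))          # objects from a uniform prior
    loss = F.cross_entropy(listener(speaker(x)), x)  # joint listener log-loss
    # (a coupled objective would add a complexity penalty on word usage here)
    opt.zero_grad()
    loss.backward()
    opt.step()
```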
Convergence is detected via small changes in the objective (e.g., successive values of $\mathcal{F}_\beta$ or of the joint loss $\mathcal{J}$ differing by less than a fixed tolerance), and held-out validation confirms that fitted systems do not overfit empirical naming data.
5. Theoretical and Cognitive Implications
The information-theoretic approach explains a suite of phenomena:
- Basic-level categories: The optimum lies at intermediate abstraction, supporting the emergence of cognitively basic-level names (drawers, wheels, etc.) (Wong et al., 2022).
- Soft categories and inconsistency: Observed "ambiguities" and probabilistic naming correspond to efficient occupancy of the naming manifold, not error (Zaslavsky et al., 2019).
- Hierarchical evolution: As communicative demand (modeled by $\beta$ or the need distribution $p(x)$) changes, systems traverse the trade-off curve, yielding implicational hierarchies (e.g., color, animal naming) that match cross-linguistic progression (Zaslavsky et al., 2019).
- Listener-optimality is necessary: The theoretical bound is achieved only when the listener's decoder mirrors the speaker's Bayesian posterior, confirmed in both artificial and human communication experiments (Le et al., 24 Nov 2025).
A plausible implication is that variation in naming systems across domains and languages is largely determined by differences in source distributions and perceptual or cognitive metric embeddings.
6. Methodological Unification and Broad Applicability
The framework unifies MDL/efficient coding approaches to vocabulary structure with program-induction models of perceptual organization and provides quantitative tools for evaluating empirical lexica (Wong et al., 2022). Theoretical, algorithmic, and empirical components are generalizable to any discrete domain that admits a prior over entities and a similarity kernel or meaning embedding (Zaslavsky et al., 2019, Zaslavsky et al., 2018).
Among its methodological contributions:
- Leveraging free-form language data and machine-translation alignment to infer latent inventories of mental concepts (Wong et al., 2022)
- Evaluating natural and artificial naming systems by plotting their accuracy/complexity coordinates against the IB frontier, with deviation indicating inefficiency, over-complexity, or suboptimal listener architecture (Zaslavsky et al., 2019, Le et al., 24 Nov 2025)
- Demonstrating experimentally that neural emergent communication systems robustly realize theoretically optimal naming policies when trained with coupled objectives (Le et al., 24 Nov 2025)
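As an illustration of the second bullet, the IB frontier can be traced by sweeping $\beta$ and empirical systems placed on the same plane; this sketch assumes the `blahut_arimoto_ib` and `ib_objective` functions from the earlier sketches are in scope, on a toy domain with random meaning distributions:

```python
import numpy as np

# Toy domain: random meaning distributions (illustrative only)
rng = np.random.default_rng(1)
n_x, n_u, n_w = 20, 10, 8
p_x = rng.dirichlet(np.ones(n_x))
p_u_given_x = rng.dirichlet(np.ones(n_u), size=n_x)

frontier = []
for beta in np.geomspace(0.5, 50.0, 25):       # sweep the trade-off parameter
    q = blahut_arimoto_ib(p_x, p_u_given_x, n_w, beta)
    comp, acc, _ = ib_objective(p_x, p_u_given_x, q, beta)
    frontier.append((comp, acc))

# An attested naming system q_emp would be scored with the same ib_objective
# call; its distance from the frontier quantifies inefficiency or
# over-complexity.
```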
The information-theoretic framework thus provides a mathematically grounded, empirically validated, and broadly applicable theory of how discrete lexical categories in human languages arise from fundamental trade-offs intrinsic to the problem of efficient communication.