
Compositionality Gap: Theory and Practice

Updated 6 October 2025
  • Compositionality gap is the discrepancy where composite structures do not systematically inherit properties from individual components, affecting fields like NLP, logic, and multi-modal systems.
  • Diagnostic methods such as modulus of continuity, tree reconstruction error, and canonical correlation analysis rigorously quantify how far practical models deviate from theoretical compositionality.
  • Emerging solutions—including compositionality-aware networks and hybrid symbolic-connectionist models—aim to close the gap and improve generalization and robustness in artificial systems.

The compositionality gap denotes the observed discrepancy between the theoretical or desirable property that the meaning, behavior, or representation of composite structures can be determined systematically from those of their components, and the practical failure of models, algorithms, or semantics to ensure this property in complex or real-world systems. The gap has been identified and scrutinized across a spectrum of settings: behavioral metrics for probabilistic processes, logical consequence and truth-functional reductions, neural representation learning, emergent communication, natural language processing, structured knowledge embeddings, and multi-modal systems. This entry provides a structured overview encompassing formal characterizations, core methodologies, empirical studies, diagnostic metrics, theoretical developments, and prospects for closing the compositionality gap.

1. Formal Characterizations of Compositionality and the Gap

Compositionality is typically understood as the principle that the properties (e.g., meaning, behavior, or function) of a complex system are determined by those of its parts and the rules used to combine them. In linguistics, this often refers to the classical Fregean dictum; in process theory, to the propagation of errors or distances under composition; in logic, to the determination of compound formulae’s truth values; and in representation learning, to the reconstructibility of representations from parts.

The compositionality gap arises when this principle is not instantiated in practice despite being theoretically possible or desirable. Notable formalizations include:

  • Modulus of Continuity for Operators: In probabilistic process algebra, the modulus of continuity $z_f$ of an operator $f$ quantitatively bounds the amplification of errors under composition. Uniform continuity (i.e., the existence of such a modulus, with $z_f(0,\ldots,0)=0$ and continuity at $0$) is required for stable compositional reasoning; the defining inequality is displayed after this list. When operators fail to be uniformly continuous, the compositionality gap manifests as potentially unbounded propagation of errors (Gebler et al., 2014).
  • Truth-Functionality in Logical Semantics: In the context of Suszko’s problem, the gap is observed in the reduction of logic semantics to minimal truth values; the Scott–Suszko reduction, while capturing consequence relations, often fails to preserve truth-functionality, rendering the meaning of compound formulae not truth-functional with respect to their components. The truth-functional Scott–Suszko reduction restores compositionality under additional constraints such as regularity and compactness (Chemla et al., 2017).
  • Additivity and Set-Theoretic Structure in Embeddings: For embeddings in NLP, the expectation is that their geometric structure respects set-theoretic operations (intersection, difference, union), as in “TextOverlap,” “TextDifference,” and “TextUnion.” The gap is exposed when embedding models, especially LLMs, deviate from these linear (additive) or geometric relationships, even though task performance remains high (Bansal et al., 28 Feb 2025, Guo et al., 14 Sep 2025).
  • Homotopy-Theoretic Invariants: In categorical models, compositionality is formalized via the strength of (op)lax functors: the compositionality gap is measured by the failure of certain structural maps (laxators) to be isomorphisms. The degree of non-invertibility is captured by the zeroth and first homotopy posets $\pi_0$ and $\pi_1$, giving a graded measure of deviation from perfect compositionality (Puca et al., 2023).
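
For concreteness, the uniform-continuity condition in the first bullet can be written as a single displayed inequality. The rendering below is a schematic reconstruction from the definitions above (with $d$ a bisimulation metric and $f$ an $n$-ary operator), not a verbatim formula from Gebler et al. (2014):

```latex
% f is uniformly continuous with modulus z_f : [0,1]^n \to [0,1],
% where z_f is continuous at 0 and z_f(0,\ldots,0) = 0, and for all
% process arguments s_i, t_i:
d\bigl(f(s_1,\ldots,s_n),\, f(t_1,\ldots,t_n)\bigr)
  \;\le\; z_f\bigl(d(s_1,t_1),\ldots,d(s_n,t_n)\bigr)
```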

2. Diagnostic Metrics and Methodologies

Quantifying the compositionality gap requires methodological rigor and carefully constructed metrics:

  • Bisimulation Metrics and Upper Bounds: In probabilistic processes, behavior is compared by the bisimulation metric, with the compositionality property of operators computed via recursive denotational semantics that tracks process variable multiplicities. Upper bounds on process distance after composition provide actionable diagnostics for the magnitude of compositionality failure (Gebler et al., 2014).
  • Tree Reconstruction Error (TRE): For learned representations, TRE measures the discrepancy between the actual representation and a recursively composed surrogate built from hypothetical primitives and a fixed or parameterized operation, optimized globally. Low TRE is strong evidence for compositionality in the representation space, while high TRE highlights the gap (Andreas, 2019, Korbak et al., 2020); a minimal sketch appears after this list.
  • Canonical Correlation Analysis (CCA) and Leave-One-Out Reconstruction: For embeddings and knowledge graphs, CCA measures the global linear alignment between semantic attributes and learned vectors, while additive generalization is tested by reconstructing entity embeddings from subsets of their attributes using leave-one-out cross-validation, with metrics including L2 loss, cosine similarity, and retrieval accuracy (Guo et al., 14 Sep 2025).
  • Logical Rank Analysis & Truth-Functionality: In logic, the compositionality gap is quantified by determining Suszko rank and checking whether truth-functional semantics of minimal rank can be achieved without sacrificing the compositional nature of the connectives (Chemla et al., 2017).
  • Systematic Task Protocols: In sequence-to-sequence neural models, the gap is exposed by orthogonal test batteries—systematicity (recombination), productivity (length generalization), localism (composition order), substitutivity (synonym invariance), and overgeneralization (handling exceptions)—with empirical gaps often reaching 20–50% even when model-level performance is high (Hupkes et al., 2019).
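
As a concrete illustration of the TRE bullet above: a minimal sketch, assuming vector addition as the composition function and squared L2 distance as the discrepancy; the names (`compose`, `tre`, the tree encoding) are illustrative, not taken from the cited implementations.

```python
import torch

def compose(tree, prims):
    """Recursively compose a derivation: a tree is either a primitive
    symbol (str) or a pair of subtrees; composition here is addition."""
    if isinstance(tree, str):
        return prims[tree]
    left, right = tree
    return compose(left, prims) + compose(right, prims)

def tre(data, symbols, dim, steps=2000, lr=0.1):
    """Tree reconstruction error: fit primitive vectors so that composed
    surrogates match the observed representations; the residual is the TRE.
    data: list of (tree, observed torch tensor) pairs."""
    prims = {s: torch.randn(dim, requires_grad=True) for s in symbols}
    opt = torch.optim.Adam(list(prims.values()), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = sum(((compose(t, prims) - x) ** 2).sum() for t, x in data)
        loss.backward()
        opt.step()
    return loss.item()  # low TRE: representations decompose additively

```

For instance, `tre([(("red", "circle"), x)], symbols={"red", "circle"}, dim=x.shape[0])` asks whether `x` is approximately a sum of a "red" vector and a "circle" vector; Andreas (2019) also permits learned, nonlinear composition functions in place of addition.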

3. Empirical Manifestations Across Domains

The compositionality gap has been empirically demonstrated across a diverse set of tasks, architectures, and modalities.

  • NLP and Embeddings: Additive compositionality, as measured by the geometric alignment of sentence or word embeddings to their set-theoretic compositions, is observed strongly in models like SBERT but not in state-of-the-art LLM-based encoders (Bansal et al., 28 Feb 2025). Even when LLMs excel in downstream tasks, their embedding spaces can lack set-theoretic or additive structure (Guo et al., 14 Sep 2025); a sketch of such an additivity probe appears after this list.
  • Vision-Language Models and Multimodal Systems: In MMCOMPOSITION, leading VLMs underperform on fine-grained compositional reasoning (e.g., counting, difference spotting), with state-of-the-art models achieving only 67–68% accuracy, whereas humans reach 90%. Deficits are attributed to architectural choices (e.g., visual encoder downsampling) and insufficiently diverse training data (Hua et al., 13 Oct 2024).
  • Representation Learning and Emergent Communication: Non-trivial compositionality—the ability to reflect complex, context-dependent operations in representations or protocols—remains undetected by most prevailing metrics, except for tree reconstruction error. Current emergent communication systems often only encode trivial compositionality, failing to model phenomena such as ordering, negation, or context-dependence akin to human language (Korbak et al., 2020).
  • Probabilistic and Logical Systems: While robust fixed-point bisimulation metrics characterize the amplification of process distances under composition, failure modes are linked to operator specifications in the underlying structural operational semantics that induce unbounded or discontinuous error propagation (Gebler et al., 2014).
  • Meta-Learning and Systematic Generalization: Recent neural architectures intended for systematic meta-learning generalization (e.g., episodic grammars) show pervasive errors in episodes demanding true compositional inference, as indicated by pattern memorization, non-systematic mapping, and critical failures on out-of-distribution rules, sustaining the critique posed by Fodor and Pylyshyn (Woydt et al., 2 Jun 2025).
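
The additivity failures described in the first bullet can be probed directly. The following is a minimal sketch of a leave-one-out additive reconstruction test in the spirit of Section 2, assuming binary entity-attribute indicators and a shared linear map; the cited work (Guo et al., 14 Sep 2025) reports additional metrics (L2 loss, retrieval accuracy), and the function name here is illustrative.

```python
import numpy as np

def loo_additive_cosine(A, E):
    """Leave-one-out additive reconstruction probe.
    A: (n, k) binary entity-by-attribute indicator matrix.
    E: (n, d) learned entity embeddings.
    For each entity, fit a linear attribute-to-embedding map on the
    remaining entities, reconstruct the held-out embedding additively,
    and score cosine similarity against the true vector."""
    n = len(E)
    scores = []
    for i in range(n):
        mask = np.arange(n) != i
        W, *_ = np.linalg.lstsq(A[mask], E[mask], rcond=None)
        pred = A[i] @ W
        cos = pred @ E[i] / (np.linalg.norm(pred) * np.linalg.norm(E[i]) + 1e-12)
        scores.append(cos)
    return float(np.mean(scores))  # near 1.0: additively compositional space
```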

4. Theoretical Advances and Rule-Based Closures

Several theoretical developments have been advanced to close or analyze the compositionality gap:

  • Modulus of Continuity Rule Formats: By associating to each process combinator a modulus of continuity that is explicitly computed from the denotational semantics (i.e., tracking variable replication), it is possible to guarantee bounded error amplification and thereby close the compositionality gap for complex probabilistic systems (Gebler et al., 2014).
  • Truth-Functional Semantics with Minimal Rank: The truth-functional Scott–Suszko reduction shows that for compact logics with regular connectives, full compositionality (truth-functionality) can be achieved at minimal logical rank, provided additional semantic constraints are met. This addresses the gap inherent in classical reductions that collapse truth values but lose functional compositionality (Chemla et al., 2017).
  • Homotopy Posets for Obstruction Classification: Categorical analysis via the zeroth and first homotopy posets provides a systematic language to classify and measure the obstruction to compositionality—including the distinction between failure of existence (π₀) and uniqueness (π₁) of constructions—enabling a fine-grained diagnosis in categorical, quantum, and graph-based systems (Puca et al., 2023).
  • Task-Explicit Networks and Inductive Biases: The design of compositionality-aware transformers, such as CAT, which integrates explicit decomposition and reconstruction via codebooks coupled to sememe supervision, shows measurable improvement in compositional generalization and robustness while retaining standard task performance (Huang et al., 2023).
  • Benchmarks and Synthetic Datasets: Comprehensive compositionality evaluation suites, such as MMCOMPOSITION for VLMs, carefully curated synthetic fusion datasets for set-theoretic operations, and episodic meta-learning protocols for rule-based transduction, provide the empirical infrastructure to systematically probe and track the gap (Hua et al., 13 Oct 2024, Bansal et al., 28 Feb 2025, Hupkes et al., 2019).

5. Generalization, Transmission, and Functional Compositionality

Key empirical findings challenge assumptions about the intrinsic link between compositionality and desired properties such as generalization and learning speed:

  • Compositionality is Sufficient but Not Necessary for Generalization: In emergent communication and deep multi-agent systems, compositional languages do not always correlate with better generalization; non-compositional codes can be equally or more effective depending on the distributional or task alignment properties of the environment (Kharitonov et al., 2020, Chaabouni et al., 2020).
  • Transmission Advantage: Highly compositional protocols emerge as more transmissible and learnable by diverse agents, providing an explanation for the evolutionary stability of compositionality in natural language as opposed to emergent artificial languages (Chaabouni et al., 2020).
  • Functional Compositionality in PLMs: Zero-shot functional composition—where a model is expected to directly solve composite tasks without explicit intermediate outputs (e.g., cross-lingual summarization)—remains largely unsolved. PLMs without explicit compositional training show severe performance drops on diagonal compositions, and naive prompt concatenation strategies fail to endow them with human-like task chaining (Yu et al., 2023); a schematic contrast appears after this list.
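
To make the contrast in the last bullet concrete, a hypothetical sketch of the two strategies for a composite task such as cross-lingual summarization; `generate` stands in for any pretrained LM's text-generation call and is an assumption, not an interface from the cited work.

```python
def generate(prompt: str) -> str:
    """Stand-in for a pretrained LM's text-generation call (assumed)."""
    raise NotImplementedError

def summary_direct(doc: str) -> str:
    # Zero-shot functional composition: a single composite instruction,
    # no intermediate output -- the setting reported to degrade severely.
    return generate("Summarize the following document in French:\n" + doc)

def summary_chained(doc: str) -> str:
    # Explicit chaining: materialize the intermediate summary, then
    # translate it -- the human-like task-chaining baseline.
    summary = generate("Summarize the following document:\n" + doc)
    return generate("Translate the following text to French:\n" + summary)
```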

6. Ongoing Challenges, Open Problems, and Future Directions

Bridging the compositionality gap remains a foundational objective in both theoretical and applied domains. Active research directions and open challenges include:

  • Expanding Beyond Trivial Compositionality: Most metrics and benchmarks (e.g., in emergent communication) only account for trivial intersection or concatenative compositionality; advancing toward measuring and inducing non-trivial, structure-sensitive, and contextually entangled composition is critical (Korbak et al., 2020).
  • Diagnostic Isolation of the Gap: The development of metrics sensitive both to additive (linear) and non-linear or context-dependent forms of compositionality is necessary to diagnose fine-grained failures—especially in deep representations and multi-modal models.
  • Closing via Inductive Biases and Modularization: Hybrid symbolic-connectionist models, regularization for intermediate layer compositionality, explicit codebook partitioning in transformers, and categorical frameworks for obstruction elimination offer promising avenues but require further generalization and validation (Huang et al., 2023, Puca et al., 2023).
  • Empirical Infrastructure: Scalable, modality-agnostic benchmarks—incorporating controlled, task-independent criteria—are increasingly available, but extension to more languages, modalities, and system architectures is needed to obtain a comprehensive understanding (Bansal et al., 28 Feb 2025, Hua et al., 13 Oct 2024).
  • Reconciling Generalization and Compositionality: The empirical dissociation between compositionality and generalization implies that system design should explicitly align task demands, inductive pressures, and data distributions to elicit desired compositional strategies (Kharitonov et al., 2020).
  • Reflection, Iterative Processing, and Memory: Iterative meta-reasoning, explicit rule extraction, and symbolic memory integration are being explored to equip neural systems with capabilities for systematic compositional inference, with the goal of approaching the flexibility, productivity, and causality of human cognition (Woydt et al., 2 Jun 2025).

7. Summary Table: Domains and Manifestations of the Compositionality Gap

| Domain | Gap Manifestation | Diagnostic/Closure |
| --- | --- | --- |
| Probabilistic processes | Error amplification under operators | Modulus of continuity (Gebler et al., 2014) |
| Logical consequence | Loss of truth-functionality in reduction | Truth-functional Scott–Suszko reduction (Chemla et al., 2017) |
| Representation learning | High TRE; failure of additive reconstruction | Tree reconstruction error, CCA (Andreas, 2019, Guo et al., 14 Sep 2025) |
| Sentence/word embeddings | Deviation from set-theoretic geometry | Set-theoretic operation tests (Bansal et al., 28 Feb 2025) |
| Multi-agent/emergent language | Trivial composition; lack of systematic syntax | Non-trivial metrics, TRE (Korbak et al., 2020) |
| Vision-language models | Failure on fine-grained composition | MMCOMPOSITION accuracy gap (Hua et al., 13 Oct 2024) |

The presence, persistence, and ongoing characterization of the compositionality gap mark it as a central structural and functional challenge at the intersection of compositional semantics, systemic generalization, robust learning, and reliable reasoning in artificial systems.
