The Coverage Principle: A Framework for Understanding Compositional Generalization (2505.20278v1)

Published 26 May 2025 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs excel at pattern matching, yet often fall short in systematic compositional generalization. We propose the coverage principle: a data-centric framework showing that models relying primarily on pattern matching for compositional tasks cannot reliably generalize beyond substituting fragments that yield identical results when used in the same contexts. We demonstrate that this framework has strong predictive power for the generalization capabilities of Transformers. First, we derive and empirically confirm that the training data required for two-hop generalization grows at least quadratically with the token set size, and the training data efficiency does not improve with 20x parameter scaling. Second, for compositional tasks with path ambiguity where one variable affects the output through multiple computational paths, we show that Transformers learn context-dependent state representations that undermine both performance and interpretability. Third, Chain-of-Thought supervision improves training data efficiency for multi-hop tasks but still struggles with path ambiguity. Finally, we outline a mechanism-based taxonomy that distinguishes three ways neural networks can generalize: structure-based (bounded by coverage), property-based (leveraging algebraic invariances), and shared-operator (through function reuse). This conceptual lens contextualizes our results and highlights where new architectural ideas are needed to achieve systematic compositionality. Overall, the coverage principle provides a unified lens for understanding compositional reasoning, and underscores the need for fundamental architectural or training innovations to achieve truly systematic compositionality.

Summary

  • The paper reveals that Transformers reliably generalize to in-domain examples with high k-cutoff values while struggling on out-of-domain inputs.
  • It demonstrates that models form latent clusters of functionally equivalent components, with the IICG metric correlating with generalization success.
  • The study establishes a power-law (at least quadratic) scaling of required training data with token-set size and shows that chain-of-thought supervision improves data efficiency in multi-hop tasks.

This paper introduces the "Coverage Principle," a data-centric framework to understand and predict the compositional generalization capabilities of neural networks, particularly Transformers, when they primarily rely on pattern matching (2505.20278). The authors argue that while LLMs excel at pattern matching, they often fail at systematic compositional generalization, meaning they struggle to apply learned rules to new combinations of familiar components.

The core idea is that models can reliably generalize by substituting input fragments that are "functionally equivalent." Two fragments are functionally k-equivalent if they produce identical outputs in at least k shared contexts observed in the training data. "Coverage" is then defined as the set of all inputs reachable from the training data by performing chains of such substitutions. The Coverage Principle posits that a model relying on pattern matching can only generalize reliably to inputs within this coverage. Predictions for inputs outside this boundary are unconstrained by the training data.
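
To make the definition concrete, here is a minimal Python sketch. It is not from the paper: the (fragment, context) -> output dictionary encoding is our own, and requiring agreement in every shared context (with at least k of them) is our reading of "produce identical outputs in at least k shared contexts."

```python
def k_equivalent(train, frag_a, frag_b, k):
    """Return True if frag_a and frag_b are functionally k-equivalent under `train`:
    they appear in at least k shared contexts and produce identical outputs in
    every shared context. `train` maps (fragment, context) -> output, where a
    "context" is the rest of the input outside the fragment slot."""
    outputs_a, outputs_b = {}, {}
    for (frag, ctx), out in train.items():
        if frag == frag_a:
            outputs_a[ctx] = out
        elif frag == frag_b:
            outputs_b[ctx] = out

    shared = set(outputs_a) & set(outputs_b)   # contexts seen with both fragments
    if len(shared) < k:
        return False
    return all(outputs_a[c] == outputs_b[c] for c in shared)


# Toy 2-Hop-style data: fragment = (x1, x2), context = x3, output = t.
train = {
    (("a1", "a2"), "c1"): "t1",
    (("a1", "a2"), "c2"): "t2",
    (("b1", "b2"), "c1"): "t1",
    (("b1", "b2"), "c2"): "t2",
}
print(k_equivalent(train, ("a1", "a2"), ("b1", "b2"), k=2))  # True
```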

To test this principle, the researchers designed four synthetic compositional tasks with varying structures: 2-Hop, Parallel 2-Hop, 3-Hop, and Non-tree. These tasks involve composing primitive functions (random mappings) to produce an output token from input tokens. For example, a 2-Hop task is (x_1, x_2, x_3) ↦ t, where t = f_2(f_1(x_1, x_2), x_3). Training datasets are constructed by sampling combinations where primitive functions operate on "seen" domains. GPT-2 models of varying sizes (68M to 1.5B parameters) are trained on these tasks. Evaluation is done on In-Domain (ID) test sets (unseen combinations of seen primitive operations) and Out-of-Domain (OOD) test sets (at least one unseen primitive operation).
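
A minimal sketch of this kind of task construction is below. It is illustrative only: the vocabulary size, uniform sampling, and the omission of the paper's seen/unseen domain split and ID/OOD evaluation sets are our simplifications.

```python
import random

def make_2hop_task(vocab_size=50, seed=0):
    """Toy 2-Hop task t = f2(f1(x1, x2), x3) built from random primitive mappings."""
    rng = random.Random(seed)
    tokens = list(range(vocab_size))
    f1 = {(a, b): rng.choice(tokens) for a in tokens for b in tokens}
    f2 = {(m, c): rng.choice(tokens) for m in tokens for c in tokens}

    def target(x1, x2, x3):
        return f2[(f1[(x1, x2)], x3)]
    return tokens, target

def sample_training_data(tokens, target, n_examples, seed=1):
    """Sample (x1, x2, x3) -> t examples (no seen/unseen domain split here)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_examples):
        x1, x2, x3 = rng.choice(tokens), rng.choice(tokens), rng.choice(tokens)
        data.append(((x1, x2, x3), target(x1, x2, x3)))
    return data

tokens, target = make_2hop_task()
train_data = sample_training_data(tokens, target, n_examples=5000)
```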

Key experimental findings include:

  1. Predictive Power of k-Coverage:
    • The framework has strong predictive power: models generalize well to ID examples with high k-cutoff values (stronger evidence of functional equivalence), but much more slowly, or not at all, for low k-cutoff values.
    • OOD examples, which are outside coverage, consistently show chance-level accuracy, empirically validating the coverage principle.
    • This suggests that robust coverage (high k) is necessary for reliable generalization, which can explain struggles with long-tail distributions where rare combinations have low k.
  2. Latent Representation Clustering:
    • Models that generalize successfully learn to map functionally equivalent components (e.g., (x_1, x_2) pairs yielding the same intermediate b = f_1(x_1, x_2)) into tight clusters in their latent space.
    • The Intra-Inter Cosine Gap (IICG) metric, which measures this clustering, correlates with k-cutoff values: higher k leads to higher IICG (a sketch of the metric appears after this list).
    • Causal tracing confirms that these clustered representations are causally involved in the model's predictions for ID examples.
    • Importantly, these clustered representations may not align with vocabulary embeddings, meaning standard interpretability tools like logit lens might fail to detect them unless the model is also trained on partial computations (see the paper's appendix).
  3. Power-Law Lower Bound for ID Generalization:
    • For the 2-Hop task, the paper theoretically derives and empirically confirms that the training data N_req required for full ID generalization scales at least quadratically with the token set size |X|. Specifically, N_req(|X|, k) = Ω̃(|X|^α(k)), where α(k) = 2.5 - 0.5/k (worked values for small k are given after this list).
    • Empirically, for 2-Hop, the exponent is ~2.26. For Parallel 2-Hop, it's ~2.43, and for 3-Hop, it's ~2.58.
    • Crucially, this scaling relationship is invariant to model size (tested up to 20x parameter scaling from 68M to 1.5B), suggesting the limitation is inherent to data properties and the pattern-matching mechanism, not model capacity.
  4. Path Ambiguity Hinders Generalization:
    • Tasks with "path ambiguity," where a variable affects the output through multiple computational paths (e.g., the Non-tree task: t=f2(f1(x1,x2),x2,x3)t=f_2(f_1(x_1,x_2), x_2, x_3)), pose significant challenges.
    • Transformers struggle to form unified representations of theoretically equivalent intermediate states. Instead, they develop context-dependent representations (e.g., representations for b=f1(x1,x2)b=f_1(x_1,x_2) become conditioned on x2x_2).
    • This leads to poor generalization performance and interpretability issues. Even with near-exhaustive ID training data and larger models, performance on Non-tree tasks is significantly worse than on 2-Hop tasks. IICG analysis shows clustering based on (b,x2)(b, x_2) rather than just bb. This helps explain LLM failures in planning tasks.
  5. Chain-of-Thought (CoT) Supervision:
    • CoT supervision (training models to predict intermediate steps, e.g., (x_1, x_2, x_3) ↦ (b, t)) improves data efficiency. For the 3-Hop task, the power-law exponent drops from 2.58 to 1.76.
    • CoT effectively "flattens" multi-hop tasks into sequences of single-hop problems.
    • However, CoT-trained models still struggle with path ambiguity in Non-tree tasks. While performance improves, it doesn't reach levels seen in simpler structures, and representations remain partially context-dependent.
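
Item 2 above refers to the Intra-Inter Cosine Gap (IICG). A generic sketch of such a metric is below: mean intra-cluster minus mean inter-cluster cosine similarity over hidden states grouped by the ground-truth intermediate value b. The paper's exact computation may differ (e.g., in which layer and token position the hidden states are taken from).

```python
import numpy as np

def iicg(hidden_states, intermediate_labels):
    """Intra-Inter Cosine Gap (sketch): mean cosine similarity between hidden
    states sharing the same intermediate value b, minus the mean cosine
    similarity between hidden states with different b. Higher values indicate
    tighter clustering of functionally equivalent components.
    Assumes at least two distinct labels and at least one repeated label."""
    h = np.asarray(hidden_states, dtype=float)
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    sims = h @ h.T                                  # pairwise cosine similarities
    labels = np.asarray(intermediate_labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)     # ignore self-similarity
    intra = sims[same & off_diag].mean()            # same intermediate value b
    inter = sims[~same].mean()                      # different intermediate values
    return intra - inter
```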
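
To make the lower-bound exponent from item 3 concrete, plugging small values of k into α(k) = 2.5 - 0.5/k (our arithmetic, not a quoted result) gives:

  α(1) = 2.5 - 0.5/1 = 2.00
  α(2) = 2.5 - 0.5/2 = 2.25
  α(4) = 2.5 - 0.5/4 = 2.375
  α(k) → 2.5 as k → ∞

So the theoretical lower bound interpolates between quadratic scaling at k = 1 and |X|^2.5 as the evidence threshold grows; the empirical 2-Hop exponent of ~2.26 reported above falls within this range.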

The paper proposes a mechanism-based taxonomy for generalization:

  • Type-I: Structure-based generalization: Relies on observed functional equivalences. This is bounded by the coverage principle.
  • Type-II: Property-based generalization: Exploits intrinsic algebraic invariances of primitive functions (e.g., commutativity, group-theoretic structure in modular arithmetic, input irrelevance). Can go beyond coverage. The Reversal Curse is framed as a Type-I failure that might be partially addressed by Type-II mechanisms.
  • Type-III: Shared-operator generalization: Reuses the same computation (primitive function) across different positions, often enabled by parameter sharing (e.g., in recurrent networks or Universal Transformers). Can also extend beyond coverage.

The authors argue this taxonomy helps distinguish phenomena and clarify when new architectural ideas are needed. They suggest that real-world tasks often involve a mix of these types.

Practical Implications:

  • Data Requirements: Explains the data-hungry nature of compositional tasks.
  • Long-Tail Knowledge: Failures on rare data can be attributed to insufficient evidence for functional equivalence (low k).
  • Complex Reasoning: Difficulties in tasks like planning might stem from path ambiguities.
  • Reversal Curse: Predicted by the coverage principle, as "A is B" provides no functional equivalence evidence for "B is A⁻¹".
  • Interpretability: Standard techniques (e.g., logit lens) may fail if intermediate representations aren't aligned with vocabulary space, which can happen without explicit training on partial computations.
  • Data Augmentation: Suggests strategies to maximize coverage by ensuring diverse shared contexts for equivalent components.

Conclusion: The coverage principle provides a framework for understanding the limits of pattern-matching learners in compositional generalization. It highlights that achieving systematic compositionality likely requires innovations beyond scaling current architectures, potentially incorporating explicit variable binding or mechanisms that robustly leverage all three types of generalization. The authors emphasize that the systematicity challenge posed by Fodor & Pylyshyn and Marcus remains largely open.

An algorithm for determining k-coverage is provided in the paper: build behavior maps for subsequences, identify functionally equivalent subsequences with Union-Find based on the k-evidence threshold, construct a substitution graph, and find its connected components.
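
A simplified sketch of such a procedure is below, reusing the (fragment, context) -> output encoding from the earlier sketch. It handles a single fragment slot, whereas the paper's algorithm considers all subsequences and chains substitutions across slots, so treat it as illustrative rather than a reimplementation.

```python
from collections import defaultdict

class UnionFind:
    """Minimal Union-Find over hashable items."""
    def __init__(self, items):
        self.parent = {x: x for x in items}
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def k_coverage(train, k):
    """Sketch of k-coverage for a single fragment slot.
    Steps mirror the description above: (1) build a behavior map
    fragment -> {context: output}; (2) union fragments with at least k shared
    contexts and identical outputs in all of them; (3) treat each Union-Find
    component as a substitution class; (4) return every input reachable from a
    training input by swapping its fragment within its class."""
    behavior = defaultdict(dict)
    for (frag, ctx), out in train.items():
        behavior[frag][ctx] = out

    frags = list(behavior)
    uf = UnionFind(frags)
    for i, a in enumerate(frags):
        for b in frags[i + 1:]:
            shared = set(behavior[a]) & set(behavior[b])
            if len(shared) >= k and all(behavior[a][c] == behavior[b][c] for c in shared):
                uf.union(a, b)

    classes = defaultdict(set)
    for f in frags:
        classes[uf.find(f)].add(f)

    covered = set()
    for frag, ctx in train:
        for substitute in classes[uf.find(frag)]:
            covered.add((substitute, ctx))
    return covered
```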
