- The paper shows that Transformers reliably generalize to in-domain examples with high k-cutoff values while remaining at chance level on out-of-domain inputs.
- It demonstrates that models form latent clusters of functionally equivalent components, with the IICG metric correlating with generalization success.
- The study establishes an at-least-quadratic power-law lower bound on the training data required for full in-domain generalization and shows that chain-of-thought supervision improves data efficiency on multi-hop tasks.
This paper introduces the "Coverage Principle," a data-centric framework to understand and predict the compositional generalization capabilities of neural networks, particularly Transformers, when they primarily rely on pattern matching (2505.20278). The authors argue that while LLMs excel at pattern matching, they often fail at systematic compositional generalization, meaning they struggle to apply learned rules to new combinations of familiar components.
The core idea is that models can reliably generalize by substituting input fragments that are "functionally equivalent." Two fragments are functionally k-equivalent if they produce identical outputs in at least k shared contexts observed in the training data. "Coverage" is then defined as the set of all inputs reachable from the training data by performing chains of such substitutions. The Coverage Principle posits that a model relying on pattern matching can only generalize reliably to inputs within this coverage. Predictions for inputs outside this boundary are unconstrained by the training data.
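To make the definition concrete, the following is a minimal sketch (not the paper's code) of a functional k-equivalence check; the data layout, a dictionary mapping (fragment, context) pairs to observed outputs, is an illustrative assumption.

```python
def functionally_k_equivalent(frag_a, frag_b, observations, k):
    """Illustrative check of functional k-equivalence.

    `observations` maps (fragment, context) -> output token as observed in
    the training data (an assumed layout).  Following the definition above,
    the two fragments must share at least k contexts and produce identical
    outputs in every shared context.
    """
    out_a = {ctx: out for (frag, ctx), out in observations.items() if frag == frag_a}
    out_b = {ctx: out for (frag, ctx), out in observations.items() if frag == frag_b}
    shared = set(out_a) & set(out_b)
    if len(shared) < k:
        return False  # not enough shared evidence
    return all(out_a[c] == out_b[c] for c in shared)

# Toy usage: fragments are (x1, x2) pairs, contexts are x3 values.
obs = {
    (("a1", "a2"), "c1"): "t1",
    (("a1", "a2"), "c2"): "t2",
    (("b1", "b2"), "c1"): "t1",
    (("b1", "b2"), "c2"): "t2",
}
print(functionally_k_equivalent(("a1", "a2"), ("b1", "b2"), obs, k=2))  # True
```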
To test this principle, the researchers designed four synthetic compositional tasks with varying structures: 2-Hop, Parallel 2-Hop, 3-Hop, and Non-tree. These tasks involve composing primitive functions (random mappings) to produce an output token from input tokens. For example, a 2-Hop task is (x1,x2,x3)↦t, where t=f2(f1(x1,x2),x3). Training datasets are constructed by sampling combinations where primitive functions operate on "seen" domains. GPT-2 models of varying sizes (68M to 1.5B parameters) are trained on these tasks. Evaluation is done on In-Domain (ID) test sets (unseen combinations of seen primitive operations) and Out-of-Domain (OOD) test sets (at least one unseen primitive operation).
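As a rough illustration of how such a task can be materialized (the token-set size, seen/unseen split, and sampling scheme below are arbitrary choices for the sketch, not the paper's exact construction):

```python
import random

random.seed(0)

X = list(range(100))                 # token set for each input slot
seen = sorted(random.sample(X, 80))  # tokens whose primitive operations appear in training
unseen = [x for x in X if x not in set(seen)]

# Random primitive mappings f1, f2 : X x X -> X.
f1 = {(a, b): random.choice(X) for a in X for b in X}
f2 = {(b, c): random.choice(X) for b in X for c in X}

def make_example(x1, x2, x3, cot=False):
    """2-Hop composition t = f2(f1(x1, x2), x3).

    With cot=True the target also exposes the intermediate value b,
    i.e. (x1, x2, x3) -> (b, t), mirroring the CoT supervision discussed later.
    """
    b = f1[(x1, x2)]
    t = f2[(b, x3)]
    return (x1, x2, x3), ((b, t) if cot else (t,))

# Training data: sampled combinations over the seen domain.
train = [make_example(*random.sample(seen, 3)) for _ in range(50_000)]

# ID test: further combinations of seen tokens (deduplication against the
# training set is omitted for brevity); OOD test: at least one unseen token.
id_test = [make_example(*random.sample(seen, 3)) for _ in range(1_000)]
ood_test = [make_example(random.choice(unseen), *random.sample(seen, 2))
            for _ in range(1_000)]
```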
Key experimental findings include:
- Predictive Power of k-Coverage:
- The framework has strong predictive power: models generalize quickly and reliably to ID examples with high k-cutoff values (stronger evidence of functional equivalence), but much more slowly, or not at all, to examples with low k-cutoff values.
- OOD examples, which are outside coverage, consistently show chance-level accuracy, empirically validating the coverage principle.
- This suggests that robust coverage (high k) is necessary for reliable generalization, which helps explain failures on long-tail distributions where rare combinations have low k.
- Latent Representation Clustering:
- Models that generalize successfully learn to map functionally equivalent components (e.g., (x1,x2) pairs yielding the same intermediate b=f1(x1,x2)) into tight clusters in their latent space.
- The Intra-Inter Cosine Gap (IICG) metric, which measures this clustering, correlates with the k-cutoff: higher k leads to higher IICG (a minimal sketch of this metric follows this list).
- Causal tracing confirms that these clustered representations are causally involved in the model's predictions for ID examples.
- Importantly, these clustered representations may not align with vocabulary embeddings, meaning standard interpretability tools like logit lens might fail to detect them unless the model is also trained on partial computations (see Appendix \ref{app:chimeric}).
- Power-Law Lower Bound for ID Generalization:
- For the 2-Hop task, the paper theoretically derives and empirically confirms that the training data N_req required for full ID generalization scales at least quadratically with the token-set size |X|. Specifically, N_req(|X|, k) = Ω̃(|X|^α(k)), where α(k) = 2.5 − 0.5/k.
- Empirically, for 2-Hop, the exponent is ~2.26. For Parallel 2-Hop, it's ~2.43, and for 3-Hop, it's ~2.58.
- Crucially, this scaling relationship is invariant to model size (tested up to 20x parameter scaling from 68M to 1.5B), suggesting the limitation is inherent to data properties and the pattern-matching mechanism, not model capacity.
- Path Ambiguity Hinders Generalization:
- Tasks with "path ambiguity," where a variable affects the output through multiple computational paths (e.g., the Non-tree task: t=f2(f1(x1,x2),x2,x3)), pose significant challenges.
- Transformers struggle to form unified representations of theoretically equivalent intermediate states. Instead, they develop context-dependent representations (e.g., representations for b=f1(x1,x2) become conditioned on x2).
- This leads to poor generalization performance and interpretability issues. Even with near-exhaustive ID training data and larger models, performance on Non-tree tasks is significantly worse than on 2-Hop tasks. IICG analysis shows clustering based on (b,x2) rather than just b. This helps explain LLM failures in planning tasks.
- Chain-of-Thought (CoT) Supervision:
- CoT supervision (training models to predict intermediate steps, e.g., (x1,x2,x3)↦(b,t)) improves data efficiency. For the 3-Hop task, the power-law exponent drops from 2.58 to 1.76.
- CoT effectively "flattens" multi-hop tasks into sequences of single-hop problems.
- However, CoT-trained models still struggle with path ambiguity in Non-tree tasks. While performance improves, it doesn't reach levels seen in simpler structures, and representations remain partially context-dependent.
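Regarding the IICG metric referenced in the clustering findings above, this summary only describes it at a high level; the sketch below computes one plausible reading of it, namely the mean cosine similarity between hidden states whose inputs share the same true intermediate value b, minus the mean cosine similarity across different values of b. Treat the exact formula as an assumption rather than the paper's definition.

```python
import numpy as np

def iicg(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """Intra-Inter Cosine Gap (one plausible formulation).

    hidden_states: (n, d) array of residual-stream states at the position
                   where the intermediate value should be encoded.
    labels:        (n,) array giving the true intermediate value b per input.
    """
    h = hidden_states / np.linalg.norm(hidden_states, axis=1, keepdims=True)
    sims = h @ h.T                            # pairwise cosine similarities
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = sims[same & off_diag].mean()      # pairs with the same intermediate value
    inter = sims[~same].mean()                # pairs with different intermediate values
    return float(intra - inter)

# Toy usage with random states labelled by their intermediate value b.
rng = np.random.default_rng(0)
states = rng.normal(size=(6, 16))
b_values = np.array([0, 0, 0, 1, 1, 1])
print(iicg(states, b_values))
```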
The paper proposes a mechanism-based taxonomy for generalization:
- Type-I: Structure-based generalization: Relies on observed functional equivalences. This is bounded by the coverage principle.
- Type-II: Property-based generalization: Exploits intrinsic algebraic invariances of primitive functions (e.g., commutativity, group-theoretic structure in modular arithmetic, input irrelevance). Can go beyond coverage. The Reversal Curse is framed as a Type-I failure that might be partially addressed by Type-II mechanisms.
- Type-III: Shared-operator generalization: Reuses the same computation (primitive function) across different positions, often enabled by parameter sharing (e.g., in recurrent networks or Universal Transformers). Can also extend beyond coverage.
The authors argue this taxonomy helps distinguish phenomena and clarify when new architectural ideas are needed. They suggest that real-world tasks often involve a mix of these types.
Practical Implications:
- Data Requirements: Explains the data-hungry nature of compositional tasks.
- Long-Tail Knowledge: Failures on rare data can be attributed to insufficient evidence for functional equivalence (low k).
- Complex Reasoning: Difficulties in tasks like planning might stem from path ambiguities.
- Reversal Curse: Predicted by the coverage principle, since training on "A is B" provides no functional-equivalence evidence for the inverse statement "B is A".
- Interpretability: Standard techniques (e.g., logit lens) may fail if intermediate representations aren't aligned with vocabulary space, which can happen without explicit training on partial computations.
- Data Augmentation: Suggests strategies to maximize coverage by ensuring diverse shared contexts for equivalent components.
Conclusion: The coverage principle provides a framework for understanding the limits of pattern-matching learners in compositional generalization. It highlights that achieving systematic compositionality likely requires innovations beyond scaling current architectures, potentially incorporating explicit variable binding or mechanisms that robustly leverage all three types of generalization. The authors emphasize that the systematicity challenge posed by Fodor & Pylyshyn and Marcus remains largely open.
An algorithm for determining k-coverage is provided (Algorithm \ref{alg:coverage_determination}), involving building behavior maps for subsequences, identifying functionally equivalent subsequences using Union-Find based on the k-evidence threshold, constructing a substitution graph, and then finding connected components.
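A condensed Python sketch of these steps, specialized to 2-Hop-style inputs where the substitutable fragment is the (x1, x2) prefix (the paper's algorithm handles general subsequences, so this is an illustration of the structure rather than a faithful reimplementation of the referenced algorithm):

```python
from collections import defaultdict
from itertools import combinations

class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def k_coverage(train, k):
    """train: list of ((x1, x2, x3), t) examples.  Returns the set of inputs
    reachable from the training data by substituting k-equivalent prefixes."""
    # 1. Behaviour map: prefix (x1, x2) -> {context x3: output t}.
    behaviour = defaultdict(dict)
    for (x1, x2, x3), t in train:
        behaviour[(x1, x2)][x3] = t

    # 2. Union-Find over prefixes with >= k shared contexts and identical outputs.
    uf = UnionFind()
    for p, q in combinations(behaviour, 2):
        shared = behaviour[p].keys() & behaviour[q].keys()
        if len(shared) >= k and all(behaviour[p][c] == behaviour[q][c] for c in shared):
            uf.union(p, q)

    # 3. Connected components of the substitution graph: any prefix in a class
    #    may be paired with any context seen for some member of that class.
    class_contexts = defaultdict(set)
    for p, ctxs in behaviour.items():
        class_contexts[uf.find(p)] |= set(ctxs)

    covered = set()
    for p in behaviour:
        for x3 in class_contexts[uf.find(p)]:
            covered.add((p[0], p[1], x3))
    return covered
```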