Systematic Generalization in AI

Updated 24 June 2026

Systematic generalization is the ability of models to recombine learned primitives into novel configurations, driving compositional reasoning across diverse domains.
Architectural mechanisms like neural-symbolic hybrids, modular networks, and attention-based models actively embed inductive biases to enhance out-of-distribution performance.
Data strategies such as increased diversity, controlled burstiness, and targeted curriculum design yield significant improvements in reducing the generalization gap.

Systematic generalization refers to the capacity of a learning system to correctly and robustly process novel combinations of components, rules, or concepts—generalizing to configurations never seen during training, so long as their parts and their modes of combination were sufficiently covered. This capability, which underpins human compositional reasoning, is widely regarded as essential for flexible intelligence and remains a core challenge for neural models across domains including language, vision, grounded reasoning, reinforcement learning, and combinatorial optimization. Recent progress has illuminated both architectural and data-driven inductive biases that support or hinder systematic generalization, though even state-of-the-art deep networks often fall short in compositional OOD settings.

1. Foundational Definitions and Theoretical Scope

Systematic generalization, sometimes termed compositional generalization, is distinguished from i.i.d. generalization by its focus on structured out-of-distribution (OOD) extrapolation: a model trained on a subset of all possible compositions of primitives must generate correct outputs for previously unseen but permissible recombinations (Li, 2022, Wold et al., 19 May 2025). Precise definitions vary by domain, but canonical instances include:

Language: After seeing verbs {walk, jump} and adverbs {cautiously, quickly} in some combinations, the learner must predict the meaning of "jump cautiously" having never seen it during training (Ruis et al., 2020, Mondorf et al., 2 Apr 2025).
Vision and VQA: Answering questions about new pairings such as "red cylinder" or "left_of blue triangle" when all constituent attributes and relations occurred only in isolation or in different pairings at training time (Bahdanau et al., 2018, Bahdanau et al., 2019, Rahimi et al., 2023).
Reinforcement Learning: Solving planning tasks in unseen environments whose transition dynamics factor into shared and environment-specific causal mechanisms (Mutti et al., 2022).

Mathematically, one typical formalization is: for input space $X$, output space $Y$, and some compositional splitting of $X$, a model $f: X \to Y$ achieves systematic generalization if, trained on $(x,y)$ pairs from a restricted $p_\text{train}(x,y)$, it outputs correct $y$ for all $x$ in a test set that comprises novel combinations of primitive factors seen during training (Wold et al., 19 May 2025). Quantitatively, the systematic generalization gap is often measured as $\Delta = \text{Acc}_\text{train} - \text{Acc}_\text{ood}$ (Rahimi et al., 2023).
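The formalization above can be made concrete with a small sketch. It builds a compositional split over the verb and adverb example and computes the gap $\Delta$ for two toy learners. All names here (`compositional_split`, `meaning`, the two learners) are hypothetical and chosen for illustration; the "semantics" is a stand-in, not taken from any cited benchmark.

```python
from itertools import product

def compositional_split(verbs, adverbs, held_out):
    """Split all verb-adverb pairs into train and OOD test sets.

    Pairs in `held_out` go to the test set. The split is rejected if a
    held-out pair contains a primitive that never appears in training.
    """
    all_pairs = list(product(verbs, adverbs))
    test = [p for p in all_pairs if p in held_out]
    train = [p for p in all_pairs if p not in held_out]
    train_verbs = {v for v, _ in train}
    train_adverbs = {a for _, a in train}
    for v, a in test:
        if v not in train_verbs or a not in train_adverbs:
            raise ValueError(f"primitive in {(v, a)} never seen in training")
    return train, test

def meaning(pair):
    # Hypothetical ground-truth semantics: one symbol per primitive.
    verb, adverb = pair
    return (verb.upper(), adverb.upper())

def accuracy(predict, examples):
    return sum(predict(x) == meaning(x) for x in examples) / len(examples)

def generalization_gap(predict, train, test):
    # Delta = Acc_train - Acc_ood
    return accuracy(predict, train) - accuracy(predict, test)

train, test = compositional_split(
    ["walk", "jump"], ["cautiously", "quickly"],
    held_out={("jump", "cautiously")},
)

# A lookup-table learner memorizes whole training pairs (None for unseen pairs).
table = {pair: meaning(pair) for pair in train}
memorizer = table.get

# A compositional learner stores one entry per primitive and recombines them.
verb_lexicon = {v: meaning((v, a))[0] for v, a in train}
adverb_lexicon = {a: meaning((v, a))[1] for v, a in train}

def compositional(pair):
    return (verb_lexicon[pair[0]], adverb_lexicon[pair[1]])

gap_memorizer = generalization_gap(memorizer, train, test)
gap_compositional = generalization_gap(compositional, train, test)
```

In this toy setting the memorizing learner has a gap of 1.0 (perfect on training pairs, wrong on "jump cautiously"), while the learner that recombines per-primitive entries has a gap of 0.0.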

2. Inductive Biases and Architectural Mechanisms

Systematic generalization critically depends on the presence of correct inductive biases in the learner. Several mechanisms and classes of architectures have been studied:

Symbolic and Hybrid (Neural-Symbolic) Approaches

Symbolic methods and hybrid systems—such as Neural-Symbolic Recursive Machines (NSR) (Li et al., 2022)—hardwire or induce compositional structure explicitly. NSR constructs a latent Grounded Symbol System: a directed tree whose nodes encode symbols with semantic values, learned jointly with neural perception, syntactic parsing, and semantic program induction modules. Its deduction–abduction learning algorithm iteratively samples and refines tree-structured symbolic representations, introducing inductive biases of equivariance and recursive compositionality via pointwise, factorized likelihoods.

Other neural-symbolic frameworks include Neural Module Networks (NMN), which use explicit program-defined module layouts; the degree and location of modularity in these layouts are key determinants of systematic generalization (D'Amario et al., 2021, Bahdanau et al., 2018).
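To illustrate what a program-defined module layout means, the following is a minimal sketch with a chain-shaped layout. In an actual NMN each module is a learned neural network operating on image features; here the modules are hand-written functions over a symbolic toy scene, and the names `SCENE`, `MODULES`, and `execute` are assumptions made for this example only.

```python
# Toy symbolic scene standing in for an image.
SCENE = [
    {"color": "red", "shape": "cube"},
    {"color": "red", "shape": "cylinder"},
    {"color": "blue", "shape": "cylinder"},
]

# Each entry plays the role of one reusable module.
MODULES = {
    "filter_color": lambda objs, color: [o for o in objs if o["color"] == color],
    "filter_shape": lambda objs, shape: [o for o in objs if o["shape"] == shape],
    "count": lambda objs: len(objs),
    "exists": lambda objs: len(objs) > 0,
}

def execute(program, scene):
    """Run a chain layout: a list of (module_name, *args) steps applied in order."""
    state = scene
    for name, *args in program:
        state = MODULES[name](state, *args)
    return state

# "How many red cylinders are there?"
how_many_red_cylinders = [
    ("filter_color", "red"),
    ("filter_shape", "cylinder"),
    ("count",),
]
answer = execute(how_many_red_cylinders, SCENE)
```

Because the layout is data rather than code, a question about a pairing that never co-occurred in training (for example "blue cube") reuses the same `filter_color` and `filter_shape` modules in a new combination, which is the property the modular design is meant to provide.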

Attention and Relational Mechanisms

Edge Transformers (Bergen et al., 2021) augment the Transformer paradigm with edge-centric representations and triangular attention, enabling explicit compositional chaining over relations (edges) in a discrete or relational structure—addressing the shortfalls of standard self-attention and GNN message-passing for recombining relational primitives.
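The idea of chaining over relations can be sketched as follows. This is a heavily simplified, single-head version that omits all learned projections, so it should be read as an illustration of the information flow rather than the parametrization used by Bergen et al. Each edge (i, j) attends over intermediate nodes l and aggregates the elementwise product of the features of edges (i, l) and (l, j). The function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def triangular_attention(x):
    """x[i][j] is the feature vector of edge (i, j); returns updated edge features.

    Simplification (assumption): queries, keys, and values are the raw edge
    features, i.e. all projection matrices are the identity.
    """
    n = len(x)
    d = len(x[0][0])
    out = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Score each intermediate node l by matching edge (i, l) against edge (l, j).
            scores = [
                sum(a * b for a, b in zip(x[i][l], x[l][j])) / math.sqrt(d)
                for l in range(n)
            ]
            weights = softmax(scores)
            # Value for the triangle (i, l, j): elementwise product of the two edges.
            out[i][j] = [
                sum(weights[l] * x[i][l][k] * x[l][j][k] for l in range(n))
                for k in range(d)
            ]
    return out

# Three nodes, two feature channels; only edges (0, 1) and (1, 2) carry a relation.
zero = [0.0, 0.0]
rel = [1.0, 0.0]
edges = [
    [zero, rel, zero],
    [zero, zero, rel],
    [zero, zero, zero],
]
updated = triangular_attention(edges)
```

After one update, edge (0, 2) receives a positive value in the first channel because a two-hop path 0 → 1 → 2 exists, while edge (0, 1) stays at zero because no such path connects its endpoints.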

Modular and Compositional Models

Chart-based parsers (Bogin et al., 2020), meta-learned compositional architectures (Mondorf et al., 2 Apr 2025), and tree-structured NMNs (Bahdanau et al., 2018) introduce explicit or induced intermediate representations corresponding to compositional structure (e.g., constituent parse trees, functional programs, or modular reasoning layouts), substantially boosting systematicity in both language and visual grounding domains.

3. Data Distributional Properties as Inductive Bias

Inductive bias is also imparted by data properties. Systematic generalization can be heavily influenced—even enabled—by specific properties of the training distribution (Rio et al., 27 Feb 2025, Rahimi et al., 2023). Key observed factors:

Diversity: Expanding the support size of latent factors (e.g., increasing the number of unique colors or adverbs in the training data) reduces reliance on spurious statistical associations and forces the learner to disentangle attributes, yielding large gains (up to +89% OOD accuracy) (Rio et al., 27 Feb 2025, Rahimi et al., 2023).
Burstiness and Interventions: Controlled within-context attribute diversity (burstiness) and random per-sample interventions break co-occurrence biases, further encouraging factorized representations (Rio et al., 27 Feb 2025).
Mutual Information and Parallelism: Systematic generalization is strongly predicted by normalized mutual information (NMI) between latent factors in the data; lower NMI prompts more parallel, analogy-friendly neural geometries (Rio et al., 27 Feb 2025).
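As a rough illustration of the mutual-information criterion, the sketch below estimates NMI between two latent factors from a list of (factor_a, factor_b) training samples. Normalizing by the mean of the two marginal entropies is an assumption made here; the cited work may use a different normalization. Function names are hypothetical.

```python
import math
from collections import Counter

def entropy(counts, total):
    """Shannon entropy in bits of an empirical distribution given as counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def normalized_mutual_information(pairs):
    """NMI between the two components of `pairs`, in [0, 1]."""
    total = len(pairs)
    h_left = entropy(Counter(a for a, _ in pairs), total)
    h_right = entropy(Counter(b for _, b in pairs), total)
    h_joint = entropy(Counter(pairs), total)
    mutual_info = h_left + h_right - h_joint
    denom = (h_left + h_right) / 2  # assumed normalization
    return mutual_info / denom if denom > 0 else 0.0

# Color fully determines shape: the factors are entangled.
entangled = [("red", "cube"), ("blue", "sphere")] * 2

# Every color appears with every shape equally often: the factors are independent.
independent = [(c, s) for c in ("red", "blue") for s in ("cube", "sphere")]

nmi_entangled = normalized_mutual_information(entangled)
nmi_independent = normalized_mutual_information(independent)
```

The entangled dataset yields the maximum NMI and the fully crossed dataset yields zero, matching the intuition that low NMI removes the co-occurrence shortcuts a learner could otherwise exploit.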

Empirically, mixing simple but diverse compositions into training (D3 principle) provides dramatic systematicity improvements regardless of the similarity between train and test, and is more data-efficient in modular than monolithic architectures (Rahimi et al., 2023).

4. Benchmarking and Quantitative Evaluation

Systematic generalization is probed with compositional and OOD splits designed to require robust recombination:

SCAN and gSCAN: Sequence-to-sequence mappings in synthetic language-to-action or grounded command settings, with compositional splits holding out verb-adverb, color-shape, or directional combinations (Ruis et al., 2020, Gao et al., 2020).
CLOSURE and CLEVR-CoGenT: Visual reasoning and VQA tasks with novel attribute-relation or program-template combinations (Bahdanau et al., 2019, D'Amario et al., 2021).
CFQ and SyGNS: Semantic parsing benchmarks controlling for graph-structural or syntactic novelty (Yanaka et al., 2021, Bergen et al., 2021).
Meta-learning for Compositionality: SYGAR demonstrates systematicity in spatial reasoning with dynamically composed transformation grammars (Mondorf et al., 2 Apr 2025).

Metrics are typically Exact Match accuracy, OOD accuracy, and the systematicity gap $\Delta$. Notable results include NSR achieving 100% accuracy on all SCAN and PCFG splits (far exceeding conventional and prior hybrid models) (Li et al., 2022), and GLT's latent CKY compositionality achieving 96.1% accuracy on CLOSURE, outperforming both end-to-end and neuro-symbolic baselines (Bogin et al., 2020).
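Exact Match scoring on a SCAN-style task can be sketched as below. The interpreter covers only a small, hand-picked subset of SCAN-like commands (four verbs, "twice"/"thrice", and a single "and" or "after"), and is an illustrative stand-in rather than the benchmark's full grammar; the helper names are hypothetical.

```python
ACTIONS = {"walk": "I_WALK", "jump": "I_JUMP", "run": "I_RUN", "look": "I_LOOK"}
REPEATS = {"twice": 2, "thrice": 3}

def interpret(command):
    """Map a command in the toy subset to its reference action sequence."""
    if " and " in command:
        first, second = command.split(" and ", 1)
        return interpret(first) + interpret(second)
    if " after " in command:
        # "x after y" means: do y first, then x.
        first, second = command.split(" after ", 1)
        return interpret(second) + interpret(first)
    words = command.split()
    actions = [ACTIONS[words[0]]]
    if len(words) == 2:
        actions = actions * REPEATS[words[1]]
    return actions

def exact_match_accuracy(predictions, references):
    """Fraction of examples whose entire output sequence equals the reference."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

commands = ["jump twice", "walk and jump", "run after look thrice"]
references = [interpret(c) for c in commands]

# Hypothetical model outputs: the last prediction drops one repeated action.
predictions = [
    ["I_JUMP", "I_JUMP"],
    ["I_WALK", "I_JUMP"],
    ["I_LOOK", "I_LOOK", "I_RUN"],
]
score = exact_match_accuracy(predictions, references)
```

Exact Match gives no partial credit: the third prediction is wrong in a single token yet counts as a full miss, so the score here is two out of three.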

5. Limitations, Ablations, and Open Challenges

Despite progress, several limitations remain:

Architectural fragility: Many mechanisms for modularity, compositionality, or variable binding are sensitive to layout, parametrization, or hard-coded priors. NMNs require carefully chosen module layouts; end-to-end approaches often converge to configurations with poor systematic generalization unless strongly regularized (Bahdanau et al., 2018, D'Amario et al., 2021).
Data-hungriness: Vanilla seq2seq architectures can achieve perfect systematicity if data support is information-rich (high entropy), but performance collapses at low entropy—suggesting the impossibility of data-only solutions in structured, sparse regimes (Wold et al., 19 May 2025).
Interpretability and reasoning transparency: Models capable of implicit generalization (e.g., "no-proof" Transformer models in theorem-proving) may succeed in OOD tasks but render their computation opaque (Gontier et al., 2020).
Generalization beyond limited compositionality: Even compositional or modular models often fail on deeper recursive structures, variable binding, or higher-arity compositional splits if the requisite inductive bias is absent or data is insufficiently rich (Yanaka et al., 2021, Bahdanau et al., 2019).

Ablation studies consistently confirm the necessity of recursive, pointwise likelihoods, explicit modularity, and compositional interpretations at all levels of the network. Removing any of these components (e.g., abduction, parsing, equivariance constraints, or modularization) sharply degrades systematicity (Li et al., 2022, D'Amario et al., 2021, Bahdanau et al., 2018).

6. Unifying Mechanisms and Future Directions

Synthesizing across domains, inductive bias supporting systematic generalization arises from the factorization of computation—embodied in module architectures, neural-symbolic hybrids, or meta-learning regimes—and is strengthened by designed data distributions to encourage independent, analogical, or algebraically structured representations (Li et al., 2022, Mondorf et al., 2 Apr 2025, Rio et al., 27 Feb 2025). Key principles with strong empirical and sometimes provable support include:

Equivariance and Compositionality: Recursive, pointwise processing (as in NSR) guarantees permutation equivariance and compositionality by construction (Li et al., 2022, Mondorf et al., 2 Apr 2025).
Causality and Factorization: Factoring models per the causal graph of the domain—e.g., per-variable in time series (Bansal et al., 2021), or via shared causal transition models in reinforcement learning (Mutti et al., 2022)—provably supports transfer and systematic extrapolation.
Data-centric approaches: Manipulation of data entropy and mutual independence of latent factors provides a powerful and easily deployable lever for making systematic generalization possible even for generic neural nets (Wold et al., 19 May 2025, Rahimi et al., 2023, Rio et al., 27 Feb 2025).

Challenges remain in producing scalable, flexible architectures that retain interpretability and modularity while handling the intricacies of real-world, high-dimensional data. Key open directions are the unification of symbolic and distributed methods, meta-learning of structural priors, data-efficient curriculum design for compositionality, extension to cross-domain, temporally coupled, and multi-agent settings, and a principled mathematical characterization of systematic generalization's necessary and sufficient conditions (Li, 2022, Memon et al., 3 May 2026).

References:

(Li et al., 2022): "Neural-Symbolic Recursive Machine for Systematic Generalization"
(Bergen et al., 2021): "Systematic Generalization with Edge Transformers"
(Wold et al., 19 May 2025): "Systematic Generalization in LLMs Scales with Information Entropy"
(Rio et al., 27 Feb 2025): "Data Distributional Properties As Inductive Bias for Systematic Generalization"
(Rahimi et al., 2023): "D3: Data Diversity Design for Systematic Generalization in Visual Question Answering"
(Ruis et al., 2020): "A Benchmark for Systematic Generalization in Grounded Language Understanding"
(Bahdanau et al., 2019): "CLOSURE: Assessing Systematic Generalization of CLEVR Models"
(D'Amario et al., 2021): "How Modular Should Neural Module Networks Be for Systematic Generalization?"
(Bahdanau et al., 2018): "Systematic Generalization: What Is Required and Can It Be Learned?"
(Yanaka et al., 2021): "SyGNS: A Systematic Generalization Testbed Based on Natural Language Semantics"
(Mutti et al., 2022): "Provably Efficient Causal Model-Based Reinforcement Learning for Systematic Generalization"
(Mondorf et al., 2 Apr 2025): "Enabling Systematic Generalization in Abstract Spatial Reasoning through Meta-Learning for Compositionality"
(Li, 2022): "A Short Survey of Systematic Generalization"
(Bansal et al., 2021): "Systematic Generalization in Neural Networks-based Multivariate Time Series Forecasting Models"