
Compositional Generalization in Object-Oriented Environments

Updated 14 January 2026
  • Compositional generalization is defined as an agent's ability to recombine object-centric representations, relations, or substructures to interpret novel scenes and tasks.
  • Architectural mechanisms, such as slot attention, permutation-invariant processing, and slot-wise decoding, enable models to achieve additive and modular recombination of object features.
  • Empirical benchmarks demonstrate improved reconstruction accuracy and policy generalization, while also highlighting challenges in handling occlusion and multi-modal real-world environments.

Compositional generalization in object-oriented environments denotes an agent’s capacity to flexibly recombine familiar object-centric representations, relations, or substructures to interpret, predict, or act in novel scenes, tasks, or concept configurations that are out-of-distribution relative to the training regime. Achieving robust generalization across novel object types, counts, or combinations—without exhaustive retraining or memorization—constitutes a core challenge for neural networks and forms a critical bridge between symbolic reasoning and perceptual learning. This article surveys formal definitions, model architectures, learning frameworks, benchmarks, and empirical findings characterizing state-of-the-art compositional generalization in object-oriented settings.

1. Formal Definitions and Algebraic Frameworks

Compositional generalization is mathematically formalized as the model’s ability to extend from a training support consisting of all marginal configurations of primitive “slots” (factors or objects) to the full combinatorial product (unseen compositions or arrangements).

Let $Z = Z_1 \times \cdots \times Z_K$ be a latent space factored into $K$ slots, and let $G: Z \to X$ denote the ground-truth scene generator. The training set $S \subseteq Z$ is "slot-supported" if it covers every configuration of each slot individually but only some inter-slot (joint) compositions. An autoencoder $(f, g)$ achieves compositional generalization if it can faithfully reconstruct all $G(z)$ for any $z \in Z$, not just those in $S$. This is guaranteed under the following conditions:

  • The decoder $g$ is additive and compositional: $g(z) = \sum_{k=1}^K g_k(z_k)$, with each $g_k$ slot-specific.
  • The encoder–decoder pair $(f, g)$ is slot-identifiable, i.e., there exists a permutation $\pi \in S_K$ and slot-wise diffeomorphisms $h_k$ such that $g_k \circ f_k(x) = h_k(z_{\pi(k)})$ for $x = G(z)$.
  • Consistency regularization ensures $f \circ g$ is the identity not only on training points but also on slot-recombined latent codes (Wiedemer et al., 2023).
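The additivity condition above has a concrete consequence: swapping one slot's latent code between two scenes changes only that slot's contribution to the output. A minimal numerical sketch, using hypothetical linear slot decoders (any slot-specific network would do):

```python
import numpy as np

# Minimal sketch of an additive, slot-wise decoder: each slot latent z_k is
# decoded independently and the per-slot outputs are summed into the scene.
# The linear decoders W[k] are illustrative stand-ins for the g_k.

K, D, P = 3, 4, 16           # slots, latent dim per slot, output pixels
rng = np.random.default_rng(0)
W = [rng.standard_normal((P, D)) for _ in range(K)]  # one decoder g_k per slot

def decode(z):
    """g(z) = sum_k g_k(z_k), with z of shape (K, D)."""
    return sum(W[k] @ z[k] for k in range(K))

# Additivity means recombined slot codes decode to recombined scenes:
za, zb = rng.standard_normal((K, D)), rng.standard_normal((K, D))
z_mix = za.copy()
z_mix[0] = zb[0]             # swap slot 0 from scene b into scene a
x_mix = decode(z_mix)
# Only slot 0's contribution changed relative to decode(za):
assert np.allclose(x_mix, decode(za) - W[0] @ za[0] + W[0] @ zb[0])
```

This is exactly why additive decoders extrapolate to unseen slot combinations: the output on a recombined code is determined by per-slot terms already seen in training.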

For sequential decision processes, compositionality is formalized via MDP homomorphisms: there exists a mapping $h$ between a full object-library MDP $\mathcal{M}_L$ and a canonical, slot-indexed MDP $\mathcal{M}_{[K]}$ such that the transition and reward structure is preserved up to object-permutation (symmetric group $\Sigma_N$). Compositional generalization then requires the learned transition model $\widehat{T}_L$ to be equivariant under $\Sigma_N$ (Zhao et al., 2022).
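Permutation equivariance of a transition model can be checked directly. A sketch with hypothetical linear dynamics: a model that applies the same per-object update plus a symmetric (sum-pooled) interaction term satisfies $\widehat{T}(P s) = P\,\widehat{T}(s)$ for any object permutation $P$.

```python
import numpy as np

# Sketch (hypothetical dynamics): shared per-object weights plus a symmetric
# aggregation over the other objects make the transition model equivariant
# under relabeling of objects.

rng = np.random.default_rng(1)
N, D = 5, 3
A = rng.standard_normal((D, D))   # shared per-object dynamics
B = rng.standard_normal((D, D))   # shared interaction map

def transition(s):
    """s: (N, D) object states -> next states; weights shared across objects."""
    pooled = s.sum(axis=0)                    # permutation-invariant aggregate
    return s @ A.T + (pooled - s) @ B.T       # each object sees the others' sum

s = rng.standard_normal((N, D))
perm = rng.permutation(N)
# Equivariance: permuting objects before or after the model gives the same result.
assert np.allclose(transition(s[perm]), transition(s)[perm])
```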

2. Architectural Mechanisms for Compositionality

Object-oriented models achieve compositional generalization through explicit architectural factorization:

  • Object-centric Representation: Each object $i$ is encoded independently, typically via object detectors, slot attention modules, or bounding-box proposals.
  • Interaction Modules: Pairwise (or higher-order) interactions are handled additively, as in the Neural Physics Engine (NPE), where the dynamics of object $f$ depend on learned embeddings of $(o_f, o_{c_j})$ pairs within a spatial neighborhood (Chang et al., 2016).
  • Permutation-invariant (or equivariant) Processing: Graph Neural Networks (GNNs) are employed over object-centric graphs to guarantee all objects are treated homogeneously and the network remains oblivious to the absolute ordering of slots or object identities.
  • Slot-wise Decoding: Independent decoders reconstruct each object (or part) in isolation before recombination, as in Slot Attention architectures (Montero et al., 2024), guaranteeing modularity.
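The interaction-module pattern can be made concrete. A sketch of an NPE-style pairwise update (weights and the tanh embedding are illustrative, not the published architecture): because the update for object $f$ is a sum of a shared pairwise function over its neighbors, the same module applies unchanged to scenes with any number of objects.

```python
import numpy as np

# Sketch of an NPE-style interaction module (hypothetical weights): the update
# for object f sums a learned pairwise embedding over its neighbors, so adding
# or removing objects requires no architectural change.

rng = np.random.default_rng(2)
D = 3
Wp = rng.standard_normal((D, 2 * D))   # shared pairwise embedding weights

def pairwise_update(states, f):
    """Sum the pairwise term phi(o_f, o_j) over all neighbors j != f."""
    o_f = states[f]
    return sum(
        np.tanh(Wp @ np.concatenate([o_f, o_j]))
        for j, o_j in enumerate(states) if j != f
    )

# The same module handles scenes with different object counts:
small = rng.standard_normal((3, D))
large = rng.standard_normal((7, D))
assert pairwise_update(small, 0).shape == (D,)
assert pairwise_update(large, 0).shape == (D,)
```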

In language–vision models, CLIP-style architectures achieve compositionality via contrastive objectives and language-induced disentanglement: text prompts “a rusty elephant” factorize into object and attribute subspaces, and both encoders are trained to maximize mutual information across these factors (Abbasi et al., 2024).

3. Empirical Benchmarks and Evaluation Protocols

The study of compositional generalization leverages testbeds designed to probe systematic recombination:

  • Held-out Combinations: Models are exposed during training to all atoms (objects, attributes), but some combinations (pairs, triples) are held out and used exclusively for evaluation (e.g., “heart” shapes at unseen rotations in dSprites, or attribute-object pairs in ImageNet-AO) (Montero et al., 2024, Abbasi et al., 2024).
  • Scene Construction: ConceptWorld and Object Library generate images or MDPs through programmatic DSLs, controlling compositional depth and atom substitutions (e.g., scaling up shape types, swapping relational partners) (Klinger et al., 2020, Zhao et al., 2022).
  • Zero-shot Category and Count Generalization: 3D shape prediction pipelines are trained on one category (e.g., chairs) and tested on structurally diverse categories (e.g., beds, cabinets) to assess transfer of part-based predictions (Han et al., 2020).
  • Compositional Control and Policy Learning: RL environments require policies that scale with object number or arrangement (e.g., Multi-MNIST, Pacman, BigFish, FallingDigit), evaluating reward or classification metrics for scale-up and background distractors (Mu et al., 2020).
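The held-out-combinations protocol is simple to state in code. A sketch with hypothetical shape and color atoms: every atom appears during training (slot support), but selected attribute-object pairs are reserved for OOD evaluation.

```python
from itertools import product

# Sketch of a held-out-combinations split: every atom is seen in training,
# but some attribute-object pairs appear only at test time. The atoms and
# held-out pairs below are illustrative.

shapes = ["square", "ellipse", "heart"]
colors = ["red", "green", "blue"]
held_out = {("heart", "blue"), ("square", "green")}  # hypothetical OOD pairs

all_pairs = set(product(shapes, colors))
train = all_pairs - held_out
test = held_out

# Sanity checks: each atom is still covered in training (slot support),
# and no test combination leaks into the training set.
assert all(any(s == sh for s, _ in train) for sh in shapes)
assert all(any(c == co for _, c in train) for co in colors)
assert train.isdisjoint(test)
```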

Typical metrics include reconstruction error (MSE), generalization gap (train vs. OOD accuracy), ability to infer latent object properties, and policy reward under novel scene layouts.

4. Theoretical Guarantees and Inductive Bias

Provable compositional generalization is only attainable under strong architectural and training assumptions:

  • The decoder must be both compositional (disjoint influence per pixel/slot) and additive (outputs summed, not blended).
  • Encoder–decoder consistency across OOD slot re-combinations is essential; without it, models may arbitrarily entangle slot information, breaking generalization (Wiedemer et al., 2023).
  • Slot attention mechanisms relying on softmax masking violate additivity via inter-slot competition; substituting sigmoid gating partially restores compositionality.
  • Permutation-equivariant GNNs and attention-based action–slot bindings in world models are critical for scaling equivariance from $\Sigma_K$ to the much larger ambient group $\Sigma_N$ (Zhao et al., 2022).
  • Factorizing the learning pipeline into unary and pairwise modules (e.g., for object parts, relations) supplies stable components for transfer, provided the primitives are shared across categories (Han et al., 2020).
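The softmax-versus-sigmoid point above can be demonstrated numerically: softmax normalizes masks across slots, so one slot's mask depends on every other slot's logits, whereas independent sigmoid gates keep each slot's contribution self-contained. A minimal sketch:

```python
import numpy as np

# Sketch of why softmax slot masks break additivity: perturbing slot 1's
# logits changes slot 0's softmax mask (inter-slot competition), but leaves
# slot 0's independent sigmoid gate untouched.

def softmax(logits):
    e = np.exp(logits - logits.max(axis=0))
    return e / e.sum(axis=0)

def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))

logits = np.array([[2.0, -1.0], [0.5, 1.5]])   # (K=2 slots, 2 pixels)
perturbed = logits.copy()
perturbed[1] += 3.0                             # change only slot 1's logits

# Softmax: slot 0's mask changes even though its own logits did not.
assert not np.allclose(softmax(logits)[0], softmax(perturbed)[0])
# Sigmoid: slot 0's gate is unaffected by slot 1.
assert np.allclose(sigmoid(logits)[0], sigmoid(perturbed)[0])
```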

Limitations arise when object occlusion, attribute entanglement, or domain-specific mixing (e.g., shadows, context-sensitive background) break the independence assumptions of the decoder.

5. Empirical Findings: Strengths and Limitations

Recent work demonstrates that object-centric and slot-structured models dramatically outperform both “disentangled” VAEs and pure relational networks on compositional OOD tasks (Montero et al., 2024, Wiedemer et al., 2023):

  • In scene and property recombination (e.g., shape × color), Slot Attention models maintain visually faithful reconstructions with low pixel error ($L_{\mathrm{rec}} < 20$ on 3DShapes), compared to baseline VAE errors an order of magnitude higher.
  • For policy generalization, GNN-based students with self-supervised object proposals generalize to larger object counts and novel arrangements, with accuracy scaling from 30% (CNN) to >80% (GNN with ground-truth boxes) (Mu et al., 2020).
  • In vision–language, CLIP’s compositional OOD accuracy on unseen attribute-object pairs reaches ≈60% if trained with hundreds of millions of captions, while image-supervised models fail entirely, performing at chance (Abbasi et al., 2024).
  • The NPE achieves sustained high similarity metrics in physical prediction and is capable of inferring latent properties (e.g., mass estimation with ~90% accuracy) through compositional transfer (Chang et al., 2016).

Failure modes consistently include: insufficient compositional slot coverage limiting generalization; interference permitted by non-additive decoders; slot-identity drift caused by random initialization; and violations of the clean object-separability assumption under occlusion or pixel mixing (Wiedemer et al., 2023, Montero et al., 2024).

6. Open Problems, Directions, and Implications

Despite advancements, several fundamental challenges in compositional generalization for object-oriented environments remain:

  • Extension to Occlusive, Relational, and Multi-Modal Scenes: Current theoretical guarantees require non-overlapping objects and disjoint pixel influence; occlusion and lighting effects in natural vision remain open.
  • Scalable Equivariance: Exact permutation-equivariant models are memory-intractable for large object libraries; leveraging soft or differentiable action-slot bindings is critical but introduces approximation error (Zhao et al., 2022).
  • Discrete Symbolic Abstraction: Slot embeddings are not yet linearly decodable into factors; bridging the gap to symbolic reasoning or neuro-symbolic interfaces is an active research area (Montero et al., 2024).
  • Language Supervision and Factor Disentanglement: Richer, more diverse textual data improves compositionality in vision–LLMs (Abbasi et al., 2024); the design of datasets and curricula for supervised factor disentanglement is non-trivial.
  • Benchmark Development: More challenging benchmarks incorporating real-world images, richer object attributes, and deep relational structure (e.g., program induction, hierarchical concepts) are imperative (Klinger et al., 2020).

A plausible implication is that architectural design privileging object-centric, slot-wise, and permutation-invariant computation is a necessary—though not sufficient—precondition for robust compositional generalization. Hybrid models that incorporate figure–ground grouping, Gestalt principles, and explicit symbolic structure are a likely trajectory for further advances.

