Object-Centric Representation Learning

Updated 12 June 2026

Object-centric representation learning is a framework that decomposes scenes into independent object slots to enable modularity and systematic generalization.
It employs architectures such as slot attention and transformer-based models to iteratively assign latent vectors to distinct objects for robust scene reconstruction.
The approach supports practical applications in vision, robotics, and multi-modal learning by enhancing identifiability, sample efficiency, and compositional reasoning.

Object-centric representation learning is an unsupervised or semi-supervised approach to decomposing visual (and, increasingly, physical) environments into structured sets of latent variables, each associated with an individual object. By inducing models to represent scenes as compositions of object-centric slots, this paradigm aims to produce modular, disentangled, and interpretable scene representations that support systematic generalization, sample-efficient downstream tasks, and compositional reasoning. Object-centric learning spans vision, robotics, and multi-modal machine learning, and has evolved rapidly to encompass diverse architectures, theoretical analyses, and practical tools for both real-world and synthetic domains.

1. Core Principles, Motivation, and Definitions

At its foundation, object-centric representation learning treats scenes not as monolithic pixel configurations or distributed feature vectors, but as structured sets of entities, each an individual object with its own latent variables (slots). This operationalizes a central tenet of cognitive science: compositionality in high-level perception and reasoning is enabled by object-level abstraction (Dittadi et al., 2021, Rubinstein et al., 9 Apr 2025).

Key definitions and desiderata:

Slot: a fixed-dimensional latent vector $z_k$ intended to encapsulate the properties of one object.
Object-centric decomposition: expressing an image $x$ (or a video, or a point cloud) as a set $\{z_1, \dots, z_K\}$, such that each $z_k$ encodes one object and is disentangled from the others.
Modularity: slots can be manipulated independently for compositional reasoning and robust prediction.
Identifiability: ideally, the mapping from pixels to object representations is unique up to permutation of the slots and, possibly, an affine transformation of slot coordinates (Brady et al., 2023, Kori et al., 2024).
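The identifiability desideratum can be written compactly as below. The notation is introduced here for illustration only and is not taken from the cited papers: $\hat{z}_k$ denotes an inferred slot, $z_k$ a ground-truth slot, $\pi$ a permutation of slot indices, and $(A_k, b_k)$ an affine map. The precise class of allowed transformations differs between the cited analyses.

```latex
\exists\, \pi \in S_K \ \text{and invertible } A_k,\ \text{offsets } b_k \ \text{such that} \quad
\hat{z}_{\pi(k)} = A_k z_k + b_k \qquad \text{for all } k = 1, \dots, K .
```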

Motivations include systematic generalization, OOD robustness (by isolating objects from context), sample-efficient composition, and alignment with cognitive models of perception (Dittadi et al., 2021, Rubinstein et al., 9 Apr 2025).

2. Model Classes and Slot-based Architectures

Unsupervised slot decoders and attention:

Canonical approaches utilize an encoder-decoder architecture with a slot-centric bottleneck, typically implemented via slot attention (Dittadi et al., 2021, Didolkar et al., 27 Mar 2025, Vikström et al., 2022). The encoder maps images to patchwise or pixelwise features, which are then iteratively allocated to $K$ object slots via attention-mediated or EM-style clustering (Kori et al., 2024). Each slot is decoded to reconstruct its object mask and appearance. The most influential families are:

Slot-Attention models (Locatello et al.): recurrent attention iteratively binds features to slots (Dittadi et al., 2021, Vikström et al., 2022, Didolkar et al., 2024).
MONet/GENESIS: sequential attention using "scope" masks and autoregressive priors (Dittadi et al., 2021).
Discrete/Transformer-based models: incorporate discrete codebooks and transformer decoders for textured or video scenes (Zhao et al., 2024, Vikström et al., 2022).
Cycle-consistent GAN-based models (ORGAN): map between images and slot-lists via cycle-consistent adversarial training, achieving scalability on low-contrast, dense scenes (Küchler et al., 2 Mar 2026).
Energy-based models (EGO): permutation-invariant energies over slot sets, with inference via gradient-based MCMC (Zhang et al., 2022).
Probabilistic slot-attention: mixture-prior over slots, EM-updates, and identifiability up to permutation/affine transformation (Kori et al., 2024).
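The iterative allocation of encoder features to slots described above can be sketched as follows. This is a minimal NumPy illustration under stated simplifications, not the implementation of any cited model: it omits the learned query/key/value projections, layer normalization, GRU, and MLP of the published Slot Attention module, and the function name `slot_attention` and its parameters are chosen here for illustration.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(features, num_slots=3, num_iters=3, seed=0, eps=1e-8):
    """Simplified slot attention over encoder features.

    features: (N, D) array, one row per patch or pixel.
    Returns (slots, attn) with shapes (K, D) and (N, K).
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    slots = rng.normal(size=(num_slots, d))           # random slot initialization
    for _ in range(num_iters):
        logits = features @ slots.T / np.sqrt(d)      # (N, K) dot-product similarity
        attn = softmax(logits, axis=1)                # normalize over slots: slots compete per feature
        weights = attn / (attn.sum(axis=0, keepdims=True) + eps)  # normalize over inputs
        slots = weights.T @ features                  # (K, D) attention-weighted mean of features
    return slots, attn

feats = np.random.default_rng(1).normal(size=(16, 4))
slots, attn = slot_attention(feats)
print(slots.shape, attn.shape)
```

The softmax is taken over the slot axis rather than the input axis, so slots compete for each feature. Without the learned components, the loop reduces to a soft clustering of the features, which is consistent with the EM-style reading mentioned above.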

Language and Control:

Models such as CTRL-O allow steerable slot discovery, using language embeddings to guide slot allocation and bind slots to user queries, implemented via language-conditioned slot initialization and contrastive loss (Didolkar et al., 27 Mar 2025).
Language-mediated approaches like LORL integrate neuro-symbolic executors and pre-trained semantic parsers for slot-to-concept alignment (Wang et al., 2020).

Temporal and 3D extensions:

DyMON and related models introduce object-centric factorization in spatiotemporal settings, disentangling scene and observer motion for video or multi-view learning (Nanbo et al., 2021), while 3D extensions focus on point-cloud or volumetric object feature encoding for tasks such as scene graph prediction (Heo et al., 6 Oct 2025).

3. Theoretical Guarantees and Identifiability

A central conceptual advance is the clarification of under what conditions object-centric representations are identifiable from data (Brady et al., 2023, Kori et al., 2024).

Compositionality and irreducibility: Slot identifiability (up to permutation of slots and per-slot invertible transformations) is guaranteed when the generative process satisfies (i) pixel-level compositionality (each pixel depends on at most one object slot) and (ii) irreducibility (no slot can be split into independent sub-objects) (Brady et al., 2023).
Probabilistic Slot Attention (Kori et al., 2024) formalizes identifiability of unsupervised slot-based models with mixture priors under mild invertibility/injectivity assumptions about the decoder.
Permutation invariance: Both EBMs and transformer-based attention models enforce permutation invariance over slots by design, preventing degenerate slot binding (Zhang et al., 2022, Kori et al., 2024).
Limitations: Real-world deviations from compositionality (e.g., transparency, shadows, articulated objects) can break theoretical guarantees, motivating research into robust and adaptive slot assignment.
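The pixel-level compositionality condition, and the way transparency-like blending violates it, can be illustrated numerically. The sketch below uses toy decoders invented for this example (one scalar per slot, six pixels); it counts, for each pixel, how many slots have a nonzero finite-difference derivative. It is an illustration of the condition, not a procedure from the cited papers.

```python
import numpy as np

def slots_per_pixel(decoder, z, h=1e-5, tol=1e-6):
    """Count, per pixel, the slots with a nonzero finite-difference derivative."""
    base = decoder(z)
    jac = np.zeros((base.size, z.size))
    for k in range(z.size):
        zp = z.copy()
        zp[k] += h                                   # perturb one slot at a time
        jac[:, k] = (decoder(zp) - base) / h
    return (np.abs(jac) > tol).sum(axis=1)

# Toy setup (all values hypothetical): three scalar slots, six pixels.
z = np.array([0.5, -1.0, 2.0])
owner = np.array([0, 0, 1, 2, 2, 1])                 # which slot renders each pixel
w = np.array([1.0, 0.5, 1.0, 0.3, 0.7, 1.2])         # per-pixel rendering weights
alpha = np.random.default_rng(0).uniform(0.1, 1.0, size=(6, 3))  # blending weights

def compositional(s):
    # Each pixel depends on exactly one slot.
    return np.tanh(w * s[owner])

def blended(s):
    # Each pixel mixes all slots, as with transparency.
    return np.tanh(alpha @ s)

print(slots_per_pixel(compositional, z))
print(slots_per_pixel(blended, z))
```

In this toy setup the first decoder yields a count of 1 for every pixel, while the blended decoder yields 3, which is the kind of deviation the limitations item refers to.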

4. Methodologies: Losses, Training, and Evaluation

Object-centric models are trained predominantly in a self-supervised regime, with the following general methodology:

Reconstruction loss: pixelwise $\ell_2$ or feature-matching objectives (e.g. DINO, CLIP features) as a proxy for instance grouping (Didolkar et al., 27 Mar 2025, Didolkar et al., 2024, Vikström et al., 2022).
Attention/slot bottleneck: enforces object factorization; often no explicit clustering loss is required because the bottleneck suffices.
Contrastive and control losses: language-conditioned or user-guided methods rely on contrastive InfoNCE objectives to bind slots to external queries (Didolkar et al., 27 Mar 2025, Wang et al., 2020).
Discrete codebooks and attribute grouping: transformer-based models (e.g., GDR) utilize grouped codebooks indexed by attribute tuples for scalable compositionality (Zhao et al., 2024).
Energy-based training: MCMC-based inference in slot space with reconstruction supervision (Zhang et al., 2022).
GAN-based cycle consistency: ORGAN employs both adversarial and cycle-consistency (image ↔ list) losses, with losses defined to handle slot permutation via assignment (Küchler et al., 2 Mar 2026).
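To make the contrastive term in the list above concrete, here is a generic InfoNCE loss between slots and query embeddings in plain Python. It assumes that `slots[i]` is the positive match for `queries[i]` and uses cosine similarity with a temperature; these choices, and the function names, are assumptions for illustration rather than the exact objective of CTRL-O or LORL.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(slots, queries, temperature=0.1):
    """Mean InfoNCE loss where slots[i] is the positive for queries[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, s) / temperature for s in slots]
        m = max(logits)                               # log-sum-exp stabilization
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]                # -log softmax of the positive pair
    return total / len(queries)

slots = [[1.0, 0.0], [0.0, 1.0]]
queries = [[1.0, 0.1], [0.1, 1.0]]
print(info_nce(slots, queries))         # slots aligned with their queries
print(info_nce(slots[::-1], queries))   # slots paired with the wrong queries
```

The aligned pairing gives a loss close to zero, while the swapped pairing gives a much larger loss. That difference is the signal that pushes each slot toward its query.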

Evaluation metrics:

Unsupervised object discovery: Foreground ARI (FG-ARI), mean Best Overlap (mBO), and mIoU by instance matching (Didolkar et al., 2024, Rubinstein et al., 9 Apr 2025, Vikström et al., 2022).
Downstream property prediction: slot features as input to linear/MLP regressors/classifiers for object attributes, matching via the Hungarian algorithm (Dittadi et al., 2021).
Segmentation and referential QA: mask alignment, referring expression segmentation, and VQA accuracy (Didolkar et al., 27 Mar 2025, Wang et al., 2020).
Zero-shot transfer: benchmarked across multiple datasets with variable object count, texture, and background (Didolkar et al., 2024, Rubinstein et al., 9 Apr 2025).
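The two object-discovery metrics named above can be computed from flattened per-pixel label maps. The sketch below is a plain-Python illustration and assumes that label 0 marks ground-truth background; function names are chosen here. FG-ARI is the adjusted Rand index restricted to foreground pixels, and mBO averages, over ground-truth objects, the best IoU achieved by any predicted segment.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true, pred):
    n = len(true)
    if n < 2:
        return 1.0
    sum_ij = sum(comb(c, 2) for c in Counter(zip(true, pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(true).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:                        # degenerate case: a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def fg_ari(true, pred, background=0):
    # Drop pixels whose ground-truth label is background, then compute ARI.
    keep = [i for i, t in enumerate(true) if t != background]
    return adjusted_rand_index([true[i] for i in keep], [pred[i] for i in keep])

def mean_best_overlap(true, pred, background=0):
    scores = []
    for t in sorted(set(true) - {background}):
        t_px = {i for i, v in enumerate(true) if v == t}
        best = 0.0
        for p in set(pred):
            p_px = {i for i, v in enumerate(pred) if v == p}
            best = max(best, len(t_px & p_px) / len(t_px | p_px))  # IoU
        scores.append(best)
    return sum(scores) / len(scores)

true = [0, 0, 1, 1, 1, 2, 2, 2]
perfect = [5, 5, 7, 7, 7, 3, 3, 3]    # same grouping, different label ids
merged = [5, 5, 7, 7, 7, 7, 7, 7]     # both objects merged into one segment
print(fg_ari(true, perfect), mean_best_overlap(true, perfect))
print(fg_ari(true, merged), mean_best_overlap(true, merged))
```

Both metrics ignore the specific label ids, so no explicit slot-to-object matching is needed here; matching (for example with the Hungarian algorithm) becomes necessary for the per-object property prediction described above.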

5. Generalization, Robustness, and Scaling

Object-centric representations have empirically demonstrated improved robustness and transfer over monolithic or distributed encodings:

OOD robustness: Slot-based models maintain downstream accuracy when one object is shifted out-of-distribution (novel color/texture) while other slots and predictions are unaffected; larger global scene shifts (e.g., cropping, added clutter) present ongoing challenges (Dittadi et al., 2021, Rubinstein et al., 9 Apr 2025, Zhang et al., 2022).
Scalable zero-shot generalization: Training on large, diverse real-world datasets (e.g., COCO, EntitySeg) yields models that transfer across synthetic, hybrid, and real datasets without fine-tuning, with fine-tuned ViT backbones outperforming even supervised segmentation models on grouping metrics (Didolkar et al., 2024, Rubinstein et al., 9 Apr 2025).

Recent segmentation foundation models (SAM, HQES) demonstrate that pixel-space object-centric decomposition can, in some cases, outperform slot-based models in both zero-shot OOD discovery and robustness (Rubinstein et al., 9 Apr 2025). However, diagnosing true foreground/background separation remains a central challenge in training-free pipelines (OCCAM probe) (Rubinstein et al., 9 Apr 2025).

6. Extensions: Language, 3D, Control, and Online Learning

The field has rapidly broadened to encompass:

Language-guided OCL: CTRL-O and LORL establish architectures which couple slot assignments to natural language, enabling instance-specific binding and downstream reasoning (VQA, referring expression segmentation) via contrastively aligned slots (Didolkar et al., 27 Mar 2025, Wang et al., 2020).
3D and scene graphs: OCL is extended to point clouds and volumetric scenes with geometric-semantic fusion, yielding improved scene graph accuracy and relationship prediction when pretrained object encoders are used (Heo et al., 6 Oct 2025).
Temporal and motion factorization: Models such as DyMON differentiate object and observer dynamics, supporting independent time/view querying for objects (Nanbo et al., 2021).
Online and continual learning: Object Pursuit employs latent codes and hypernetworks to generate discriminative weights for each object appearance, enabling continual object learning with re-identification and anti-forgetting (Pan et al., 2021). Interactive robot tabletop learning achieves efficient, robust online Gaussian process (GP) inference in object-centric frames (Shinde et al., 2023).

7. Current Limitations and Prospective Directions

Despite theoretical and empirical advances, open problems persist:

Foreground selection and grounding: Pixel-space segmentation models can mask but not necessarily select the task-relevant object in a training-free regime; robust OOD and saliency-aware mask scoring remain unsolved (Rubinstein et al., 9 Apr 2025).
Slot assignment and variable $K$: Automatically inferring the number of objects (slots), especially for open-world scenes or variable-structured data, is non-trivial (Brady et al., 2023, Didolkar et al., 2024).
Small-object and part discovery: Hierarchical and multi-scale attention or reverse hierarchy feedback (RHGNet) are critical for recovering rare, low-saliency entities (Zou et al., 2024).
Neural binding and compositional abstraction: Explicit part-whole relationships, physical/causal reasoning, and finer granularity (parts/limbs) require future progress in multi-task and hierarchical OCL (Didolkar et al., 27 Mar 2025, Didolkar et al., 2024, Zou et al., 2024).
Theoretical gap: Ensuring slot identifiability, permutation invariance, and interpretable binding in realistic, noisy, or partially observable environments remains a central mathematical challenge (Brady et al., 2023, Kori et al., 2024).
Integration with reasoning: True synergy with symbolic, memory-based, and graph reasoners has begun with frameworks for video QA and neuro-symbolic execution, but requires more mature methods for scalable integration (Dang et al., 2021, Wang et al., 2020).
Bridging cognitive and computational objectness: Current pipelines mirror but do not fully capture developmental cues (motion, multimodal grounding, human-in-the-loop interactions) underlying human object perception (Rubinstein et al., 9 Apr 2025).

Future directions include segmentation-backed self-supervised representation learning, multimodal OCL incorporating language, video, and depth, adaptive slot allocation, robust unsupervised foreground selection, and the design of compositional learning benchmarks measuring relational, causal, and model-based generalization (Rubinstein et al., 9 Apr 2025, Zhao et al., 2024, Didolkar et al., 27 Mar 2025). The field is now equipped with large-scale toolboxes and foundation models, suggesting the frontier has shifted from raw object separation to harnessing object-centric representations for robust, compositional, and cognitively inspired artificial intelligence.