Object-Centric Latent Action Learning

Updated 14 April 2026

Object-centric latent action learning is a framework that decomposes high-dimensional observations into object-level representations and encodes low-dimensional latent actions for robust policy and predictive models.
It employs methods such as Slot Attention, inverse dynamics models, and flow-based pseudo-labeling to disentangle action-relevant signals from distracting background information.
Empirical results indicate improved sample efficiency, compositional generalization, and interpretable skill discovery, enabling few-shot learning and hierarchical planning in complex environments.

Object-centric latent action learning refers to a family of frameworks that leverage structured visual or state representations, typically at the object level, to infer, encode, and utilize low-dimensional latent actions for policy learning, planning, and prediction in embodied agents. Rooted in advances in unsupervised object-centric scene decomposition, this paradigm seeks to disentangle action-relevant dynamics from background distractors, yielding representations that are robust, interpretable, and sample-efficient across imitation learning (IL), reinforcement learning (RL), and goal-conditioned tasks.

1. Foundations and Problem Formulation

Object-centric latent action learning posits that the agent’s sensory experience—often high-dimensional raw pixels—admits a factorization into an object-level representation space. Specifically, at each timestep, images or states $o_t$ are mapped to sets of “slots” or “particles” $\{s^k_t\}$ corresponding to objects, object parts, or keypoints. The latent action $z_t$ is defined as a compact vector mediating transitions between object-centric states, i.e.,

$\{s^k_t\},\,z_t \rightarrow \{s^k_{t+1}\}\,.$

The latent $z_t$ may encode goal-conditioned affordances, inverse-dynamics explanations, or interaction primitives between objects. Control policies are then trained over the latent space—directly imitating or planning in the space of $z_t$ instead of from pixels or privileged states. This methodology addresses two key issues: minimizing the impact of action-irrelevant distractors (e.g., background motion, lighting) and dramatically reducing the need for dense action labeling via self-supervision or pseudo-labels (Klepach et al., 13 Feb 2025).

2. Core Methodologies

Object-centric scene decomposition

Nearly all frameworks begin with an object-centric encoder, e.g., Slot Attention (Mosbach et al., 2024, Villar-Corrales et al., 11 Feb 2025), SAVi, transformer-based particles (Daniel et al., 4 Mar 2026), or spatial softmax keypoints. These modules decompose an observation $o_t$ into $K$ slots/particles, each embedding spatial, appearance, and/or dynamic information of a candidate object.

Slot Attention: Iterative cross-attention assigns pixels/features to slots, producing $S_t = [s^1_t, \ldots, s^K_t]$, each ideally tracking a consistent object.
Particle Models: Patches or regions in the scene are processed to obtain keypoints, bounding boxes, masks, and per-object appearance/dynamics attributes (Daniel et al., 4 Mar 2026).
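
The slot-assignment step described above can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not the implementation used in any of the cited models: it omits the learned query/key/value projections, layer normalization, and the GRU/MLP slot update of the full Slot Attention module, and the function name `slot_attention_step` and all shapes are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(slots, feats, eps=1e-8):
    """One simplified Slot Attention update (no learned projections or GRU).

    slots: (K, D) current slot vectors; feats: (N, D) pixel/patch features.
    Returns updated slots (K, D) and the attention map (N, K).
    """
    d = slots.shape[1]
    logits = feats @ slots.T / np.sqrt(d)                      # (N, K)
    attn = softmax(logits, axis=1)                             # slots compete per feature
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)   # normalize per slot
    new_slots = weights.T @ feats                              # weighted mean of features
    return new_slots, attn

rng = np.random.default_rng(0)
feats = rng.normal(size=(64, 16))   # N=64 features of dimension 16 (synthetic)
slots = rng.normal(size=(4, 16))    # K=4 slots
for _ in range(3):                  # a few refinement iterations
    slots, attn = slot_attention_step(slots, feats)
```

The softmax is taken over the slot axis, so each feature distributes one unit of attention across slots; this competition is what encourages different slots to bind to different regions.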

Latent action inference

Inverse Dynamics Models (IDMs) are trained to output $z_t$ given current and next object-centric states, inferring a compact explanation of observed transitions.
In unsupervised settings, $z_t$ is learned via self-consistency or pseudo-supervision, e.g., matching optical flow (Bu et al., 20 Nov 2025) or slot transitions. When action labels $a_t$ are available, supervised heads map $z_t \mapsto a_t$ for policy decoding or fine-tuning (Klepach et al., 13 Feb 2025).
Latent Policy Priors: World models often maintain priors $p(z_t \mid \{s^k_t\})$ for sampling plausible actions during planning or video generation (Daniel et al., 4 Mar 2026).
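
As a rough illustration of the data flow through an inverse dynamics model, the sketch below maps a pair of slot sets to a low-dimensional latent action. The linear-plus-tanh form, the mean pooling over slots, and the names `inverse_dynamics` and `W` are assumptions chosen for brevity; the cited works use transformer-based IDMs, and the weights here are untrained placeholders.

```python
import numpy as np

def inverse_dynamics(slots_t, slots_tp1, W):
    """Hypothetical linear IDM: map a slot transition to a latent action z_t.

    slots_t, slots_tp1: (K, D) object-centric states at t and t+1.
    W: (2*D, Z) projection; a small Z acts as the information bottleneck.
    """
    # Pool over slots so z_t does not depend on slot ordering.
    pooled = np.concatenate([slots_t.mean(axis=0),
                             (slots_tp1 - slots_t).mean(axis=0)])
    return np.tanh(pooled @ W)                          # (Z,)

rng = np.random.default_rng(0)
K, D, Z = 4, 16, 3
W = rng.normal(scale=0.1, size=(2 * D, Z))              # untrained placeholder weights
s_t = rng.normal(size=(K, D))
s_tp1 = s_t + 0.05 * rng.normal(size=(K, D))            # synthetic next state
z_t = inverse_dynamics(s_t, s_tp1, W)
z_perm = inverse_dynamics(s_t[::-1], s_tp1[::-1], W)    # same slots, reordered
```

Pooling before projection makes the latent action invariant to slot permutation, which is one simple way to cope with the unordered nature of slot sets.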

Dynamics models

Latent Dynamics: Forward models (FDMs) consume $(\{s^k_t\}, z_t)$ and predict $\{s^k_{t+1}\}$, allowing imagination or planning entirely within object-centric space (Mosbach et al., 2024, Villar-Corrales et al., 11 Feb 2025).
Interaction Modeling: Some models explicitly infer or factor object–object interaction graphs to disentangle object-specific versus relational/interaction dynamics. FIOC-WM (Feng et al., 4 Nov 2025) uses variational graph inference and conditional independence testing to learn factored priors and interaction structure.
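
Imagination in object-centric space amounts to applying a forward model repeatedly to a slot set under a sequence of latent actions. The following sketch assumes a residual linear forward model with hypothetical parameters `A` and `B`; it shows only the rollout loop and omits the slot-to-slot interaction terms that transformer-based dynamics models capture.

```python
import numpy as np

def forward_dynamics(slots, z, A, B):
    """Hypothetical residual FDM: predict next slots from current slots and z_t.

    slots: (K, D); z: (Z,); A: (D, D) per-slot transition; B: (Z, D) action effect.
    """
    return slots + slots @ A + z @ B        # z @ B is broadcast to every slot

def rollout(slots0, latent_actions, A, B):
    """Imagine a trajectory in slot space by applying the FDM repeatedly."""
    traj = [slots0]
    for z in latent_actions:
        traj.append(forward_dynamics(traj[-1], z, A, B))
    return np.stack(traj)                   # (T+1, K, D)

rng = np.random.default_rng(0)
K, D, Z, T = 4, 8, 3, 5
A = 0.01 * rng.normal(size=(D, D))          # untrained placeholder parameters
B = 0.1 * rng.normal(size=(Z, D))
slots0 = rng.normal(size=(K, D))
plan = rng.normal(size=(T, Z))              # a candidate latent-action sequence
traj = rollout(slots0, plan, A, B)
```

A planner could score several candidate `plan` sequences by comparing the final imagined slots against a goal slot set, without ever decoding back to pixels.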

Pseudo-labels from motion cues

Optical Flow Masking: LAOF (Bu et al., 20 Nov 2025) uses RGB-formatted optical flow as a pseudo-label for agent motion, ensuring latent actions are action-aligned and robust to background distractors. Segmentation masks (LangSAM) restrict flow supervision to the agent, enhancing the object-centricity of $z_t$.
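
The effect of restricting flow supervision to the agent can be illustrated with a masked mean-squared error. This is a minimal sketch under stated assumptions: it uses raw two-channel flow rather than the RGB-formatted flow used by LAOF, and the function name `masked_flow_loss` and the toy 8x8 scene are invented for the example.

```python
import numpy as np

def masked_flow_loss(pred_flow, target_flow, agent_mask, eps=1e-8):
    """Mean squared flow error restricted to agent pixels.

    pred_flow, target_flow: (H, W, 2) flow fields; agent_mask: (H, W) in {0, 1}.
    """
    sq_err = ((pred_flow - target_flow) ** 2).sum(axis=-1)      # (H, W)
    return float((sq_err * agent_mask).sum() / (agent_mask.sum() + eps))

H, W = 8, 8
target = np.zeros((H, W, 2))
target[2:5, 2:5] = [1.0, 0.0]          # agent moves right by one pixel
mask = np.zeros((H, W))
mask[2:5, 2:5] = 1.0                   # segmentation mask of the agent

pred = target.copy()
pred[6:, 6:] = [0.0, 3.0]              # prediction error only in the background
loss_masked = masked_flow_loss(pred, target, mask)
loss_unmasked = masked_flow_loss(pred, target, np.ones((H, W)))
```

In this toy case the masked loss is zero because all error lies in the background, while the unmasked loss is not, which is the sense in which masking keeps distractor motion out of the training signal.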

3. Representative Architectures

Model	Object Encoder	Latent Action Module	Downstream Use
LAOF (Bu et al., 20 Nov 2025)	DINOv2 + optical flow, mask	Spatio-temporal transformer IDM, flow decoder	Imitation learning, RL, label-scarce adaptation
SOLD (Mosbach et al., 2024)	CNN + Slot Attention	Slot-based transformer dynamics	RL, multi-object relational reasoning
PlaySlot (Villar-Corrales et al., 11 Feb 2025)	SAVi + Slot Attention	Per-slot invertible action modules (VQ)	Controllable prediction, planning
LPWM (Daniel et al., 4 Mar 2026)	Patchwise keypoints + masks	Per-particle stochastic latent action	Stochastic video modeling, control
FIOC-WM (Feng et al., 4 Nov 2025)	Pretrained ViT + Slot Attn	Interaction-structured dynamic slots	Hierarchical policy learning
OC-LALO (Klepach et al., 13 Feb 2025)	VideoSAUR + Slot Attn	Slot-wise FDM/IDM (proxy action labels)	Imitation learning from video

Variants may emphasize slot-based deterministic encodings (e.g., SOLD), probabilistic/variational models (e.g., LPWM, DLPWM), or leverage object-centric affordance learning via segmentation and affordance prediction (e.g., PLATO (Belkhale et al., 2022)).

4. Losses, Supervision, and Training Protocols

Key loss functions across this literature include:

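State reconstruction: e.g., $\mathcal{L}_{\text{rec}} = \sum_k \|\hat{s}^k_{t+1} - s^k_{t+1}\|^2$ between predicted slots $\hat{s}^k_{t+1}$ and encoded slots $s^k_{t+1}$, ensuring world models produce faithful rollouts in object space (Bu et al., 20 Nov 2025, Mosbach et al., 2024).
Optical-flow/transition consistency: e.g., $\mathcal{L}_{\text{flow}} = \|\hat{f}_t - f_t\|^2$ between decoded flow $\hat{f}_t$ and measured flow $f_t$, aligning latent transitions to measured motion (Bu et al., 20 Nov 2025).
Pseudo-action supervision: e.g., $\mathcal{L}_{\text{act}} = \|\hat{a}_t - a_t\|^2$ between decoded actions $\hat{a}_t$ and labels $a_t$, applied when labels are available.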
KL divergence over latent actions and states for variational models (Daniel et al., 4 Mar 2026, Ferraro et al., 8 Nov 2025).
Object-centric mask/slot regularization to encourage disentanglement and ignore distractors (Klepach et al., 13 Feb 2025).
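
One plausible way to combine the terms listed above is a weighted sum, with the action term switched on only for labeled transitions. The sketch below is an assumption-laden illustration: the weights `w_flow`, `w_kl`, and `w_act` are arbitrary, the KL term is taken against a standard normal prior (variational models in this literature may instead use a learned prior), and the slot regularizer is omitted.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error over all elements."""
    return float(np.mean((pred - target) ** 2))

def kl_diag_gaussian_to_standard(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

def total_loss(pred_slots, slots, pred_flow, flow, mu, log_var,
               pred_action=None, action=None,
               w_flow=1.0, w_kl=1e-3, w_act=1.0):
    """Weighted sum of loss terms; the weights are illustrative assumptions."""
    loss = mse(pred_slots, slots)                        # state reconstruction
    loss += w_flow * mse(pred_flow, flow)                # flow consistency
    loss += w_kl * kl_diag_gaussian_to_standard(mu, log_var)
    if action is not None:                               # only for labeled data
        loss += w_act * mse(pred_action, action)
    return loss

z = np.zeros(3)
perfect = total_loss(np.ones((4, 8)), np.ones((4, 8)),
                     np.zeros((8, 8, 2)), np.zeros((8, 8, 2)),
                     mu=z, log_var=z)
with_labels = total_loss(np.ones((4, 8)), np.ones((4, 8)),
                         np.zeros((8, 8, 2)), np.zeros((8, 8, 2)),
                         mu=z, log_var=z,
                         pred_action=np.ones(2), action=np.zeros(2))
```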

Supervision strategies vary:

Fully unsupervised: using only video, object masks, or optical flow signals.
Pseudo-supervised: using motion-derived labels or proxy actions.
Weakly/few-shot supervised: minimal action labels enhance mapping from latent $z_t$ to $a_t$.

Pseudo-labeling via motion (e.g., object-centric flow) is especially effective in label-poor settings (Bu et al., 20 Nov 2025).
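
The few-shot regime can be illustrated by fitting a small action head on a labeled subset of latent actions and evaluating it on the rest. The example below runs on synthetic data and rests on a strong assumption, namely that true actions are approximately linear in the latent; the 5% label fraction, `fit_action_head`, and `M_true` are all hypothetical and reproduce no reported result.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Z, A_DIM = 500, 4, 2
M_true = rng.normal(size=(Z, A_DIM))          # synthetic ground-truth mapping
latents = rng.normal(size=(N, Z))             # stand-in for IDM outputs z_t
actions = latents @ M_true + 0.01 * rng.normal(size=(N, A_DIM))

def fit_action_head(z, a):
    """Least-squares linear head mapping latent actions to true actions."""
    head, *_ = np.linalg.lstsq(z, a, rcond=None)
    return head                                # (Z, A_DIM)

# Only 5% of transitions carry action labels.
labeled = rng.choice(N, size=N // 20, replace=False)
head = fit_action_head(latents[labeled], actions[labeled])

# Evaluate on the transitions that were never labeled.
unlabeled = np.setdiff1d(np.arange(N), labeled)
test_mse = float(np.mean((latents[unlabeled] @ head - actions[unlabeled]) ** 2))
```

When the latent already captures the action-relevant variation, very few labels are needed to anchor it to the real action space; when it does not, no amount of head fitting recovers the missing information.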

5. Empirical Insights, Strengths, and Limitations

Empirical studies consistently find that object-centric latent action learning enhances:

Robustness to distractors: Object-centric masking and flow-based constraints yield 2–3$\times$ improvements in proxy-action quality and BC success rates over pixel-centric methods in the presence of dynamic backgrounds (Klepach et al., 13 Feb 2025).
Interpretability: Slot and particle latents bind to consistent objects/parts, enabling per-object and relational policy analysis (Mosbach et al., 2024, Daniel et al., 4 Mar 2026).
Sample efficiency: Few-shot label regimes (1–10%) suffice to match fully supervised baselines in both RL and IL on complex benchmarks (e.g., LIBERO, PROCGEN) (Bu et al., 20 Nov 2025).
Compositionality and long-horizon planning: Decomposing policy and world models into object-level dynamics and interaction primitives promotes compositional generalization (novel object combinations, temporally extended skills) (Feng et al., 4 Nov 2025, Mosbach et al., 2024).
Transfer: Distilled object-centric latent action models generalize across robot embodiments with strong performance in real-world manipulation from few demonstrations (Li et al., 28 Nov 2025).

Relevant metrics include MSE on action prediction, success rate in multi-object manipulation, LPIPS/FVD for video prediction, and policy learning curves/returns across unseen attribute or relational settings (Bu et al., 20 Nov 2025, Mosbach et al., 2024, Daniel et al., 4 Mar 2026).

6. Variants, Extensions, and Analysis

Interaction learning and hierarchy

FIOC-WM (Feng et al., 4 Nov 2025) explicitly models interaction graphs, learning both object slot latents and adjacency structures (via variational masks or conditional MI). The resulting primitives correspond to subgoals ("push A to B"), allowing hierarchical policies: a high-level module sequences latent interaction goals, while low-level controllers execute them in object space.
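
The two-level control structure described above can be sketched with a toy example in which subgoals are (source, target) object pairs and the low-level controller moves the source toward the target. This is an idealized illustration, not FIOC-WM's method: object positions are edited directly instead of being driven through a learned world model, and `low_level_step`, `run_hierarchy`, and the 2D scene are assumptions.

```python
import numpy as np

def low_level_step(positions, goal, gain=0.5):
    """Move object `goal[0]` a fraction of the way toward object `goal[1]`."""
    src, dst = goal
    new_positions = positions.copy()
    new_positions[src] += gain * (positions[dst] - positions[src])
    return new_positions

def run_hierarchy(positions, subgoals, tol=0.05, max_steps=50):
    """High level iterates over interaction subgoals; low level executes each."""
    completed = []
    for goal in subgoals:
        src, dst = goal
        for _ in range(max_steps):
            if np.linalg.norm(positions[src] - positions[dst]) < tol:
                completed.append(goal)          # subgoal reached, move on
                break
            positions = low_level_step(positions, goal)
    return positions, completed

positions = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])   # objects A, B, C
subgoals = [(0, 1), (2, 0)]       # "push A to B", then "push C to A"
final, done = run_hierarchy(positions, subgoals)
```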

Intrinsic motivation

Object-centric latent action models also serve as the basis for intrinsic motivation and curriculum building. By tracking learning progress across distinct object-action-outcome regions, agents self-organize their exploration, producing a stage-wise emergence of skills that resembles trajectories observed in human development (Sener et al., 2020).
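
Learning-progress-based exploration can be sketched as a tracker that records prediction error per region and prefers the region whose error is falling fastest. The class below is a minimal illustration under assumptions: the region names, the window size, and the "older-half minus recent-half" progress estimate are choices made for the example, not details taken from the cited work.

```python
from collections import deque

class LearningProgressTracker:
    """Track prediction error per region and prefer regions that are improving."""

    def __init__(self, regions, window=4):
        self.window = window
        self.errors = {r: deque(maxlen=2 * window) for r in regions}

    def record(self, region, error):
        self.errors[region].append(error)

    def progress(self, region):
        """Older-half mean error minus recent-half mean error (0 if too few)."""
        hist = list(self.errors[region])
        if len(hist) < 2 * self.window:
            return 0.0
        old, new = hist[:self.window], hist[self.window:]
        return sum(old) / self.window - sum(new) / self.window

    def choose_region(self):
        """Pick the region with the largest recent drop in error."""
        return max(self.errors, key=self.progress)

tracker = LearningProgressTracker(["push_block", "open_drawer", "noisy_tv"])
for step in range(8):
    tracker.record("push_block", 1.0 / (step + 1))   # error falling: learnable
    tracker.record("open_drawer", 0.05)              # already mastered
    tracker.record("noisy_tv", 1.0)                  # high error, no improvement
```

Selecting on progress rather than raw error steers the agent away from both mastered regions and unlearnable ones, which is what produces a staged curriculum.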

Multimodal and language grounding

Slot- or particle-based world models are being integrated with LLMs for language-guided planning and simulation. Conditioning generative models on language-embedded instructions enables flexible, goal-directed object manipulation (Jeong et al., 8 Mar 2025).

Failure modes and ongoing challenges

Latent drift (slot or particle identity switching or jitter around contact events) destabilizes policy learning, as shown in representation shift analyses (Ferraro et al., 8 Nov 2025). Regularization (EMA of slots; strong slot-identity priors; denoising objectives) and end-to-end finetuning are under investigation to enhance control stability.
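
The EMA-of-slots idea can be illustrated by first aligning slot identities to a running estimate and then smoothing. The sketch below is an assumption for illustration only: it matches slots by brute-force search over permutations, which is feasible only for small K (a practical system would more likely use Hungarian matching), and the names `match_slots` and `ema_slots` are hypothetical.

```python
from itertools import permutations
import numpy as np

def match_slots(prev, curr):
    """Reorder rows of `curr` to best match `prev` (brute force; small K only)."""
    K = len(prev)
    cost = ((prev[:, None, :] - curr[None, :, :]) ** 2).sum(axis=-1)   # (K, K)
    best = min(permutations(range(K)),
               key=lambda p: sum(cost[i, p[i]] for i in range(K)))
    return curr[list(best)]

def ema_slots(prev_ema, curr, decay=0.8):
    """Align slot identities to the running estimate, then smooth with an EMA."""
    aligned = match_slots(prev_ema, curr)
    return decay * prev_ema + (1.0 - decay) * aligned

rng = np.random.default_rng(0)
base = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])   # three separated slots
ema = base.copy()
for _ in range(10):
    noisy = base + 0.1 * rng.normal(size=base.shape)      # per-frame jitter
    shuffled = noisy[rng.permutation(3)]                  # simulated identity switch
    ema = ema_slots(ema, shuffled)
```

Without the matching step, averaging shuffled slots would blend different objects together; with it, the smoothed slots stay close to their original identities despite jitter and reordering.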

Limitations persist in scaling to uncurated video, dealing with highly complex scenes (slot assignment errors), learning richer interaction primitives, and transferring to new object categories without strong object priors (Klepach et al., 13 Feb 2025, Daniel et al., 4 Mar 2026). Performance may degrade with high proportions of noisy pseudo-labels; label sweeps indicate a regime where flow or mask pseudo-supervision is maximally beneficial (up to $\sim$10% action labels) (Bu et al., 20 Nov 2025).

7. Outlook and Impact

Object-centric latent action learning is emerging as a unifying mechanism for robust, scalable policy and predictive model training in embodied AI. By providing reusable, interpretable, and transferable object-level abstractions, these methods enable:

Zero/few-shot agent adaptation in real-world manipulation and compositional multi-object environments.
Compositional generalization across object sets, tasks, and modalities (vision, language, action).
Efficient model-based RL and planning with tractable sample complexity.
Interpretable skill discovery for curriculum and lifelong learning.

Ongoing research is focused on: scaling up to large-scale real video, richer dynamic scene changes, free-form particle/object tracking, unified multimodal context (action, language, audio), and on-policy end-to-end learning with explicit task rewards (Daniel et al., 4 Mar 2026, Bu et al., 20 Nov 2025, Li et al., 28 Nov 2025, Mosbach et al., 2024, Klepach et al., 13 Feb 2025).