
Object-Centric Latent Action Learning

Updated 14 April 2026
  • Object-centric latent action learning is a framework that decomposes high-dimensional observations into object-level representations and encodes low-dimensional latent actions for robust policy and predictive models.
  • It employs methods such as Slot Attention, inverse dynamics models, and flow-based pseudo-labeling to disentangle action-relevant signals from distracting background information.
  • Empirical insights show enhanced sample efficiency, compositional generalization, and interpretable skill discovery, enabling few-shot learning and hierarchical planning in complex environments.

Object-centric latent action learning refers to a family of frameworks that leverage structured visual or state representations, typically at the object level, to infer, encode, and utilize low-dimensional latent actions for policy learning, planning, and prediction in embodied agents. Rooted in advances in unsupervised object-centric scene decomposition, this paradigm seeks to disentangle action-relevant dynamics from background distractors, yielding representations that are robust, interpretable, and sample-efficient across imitation learning (IL), reinforcement learning (RL), and goal-conditioned tasks.

1. Foundations and Problem Formulation

Object-centric latent action learning posits that the agent’s sensory experience (often high-dimensional raw pixels) admits a factorization into an object-level representation space. Specifically, at each timestep, images or states $o_t$ are mapped to sets of “slots” or “particles” $\{s^k_t\}$ corresponding to objects, object parts, or keypoints. The latent action $z_t$ is defined as a compact vector mediating transitions between object-centric states, i.e.,

$$\{s^k_t\},\, z_t \rightarrow \{s^k_{t+1}\}.$$

The latent $z_t$ may encode goal-conditioned affordances, inverse-dynamics explanations, or interaction primitives between objects. Control policies are then trained over the latent space, directly imitating or planning in the space of $z_t$ instead of from pixels or privileged states. This methodology addresses two key issues: minimizing the impact of action-irrelevant distractors (e.g., background motion, lighting) and dramatically reducing the need for dense action labeling via self-supervision or pseudo-labels (Klepach et al., 13 Feb 2025).
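The transition above can be made concrete with a small sketch. Everything here is an illustrative assumption rather than any cited model: the sizes `K`, `D`, `Z`, the weight matrices `W_s` and `W_z`, and the residual `tanh` update are hypothetical stand-ins for a learned dynamics network. The point is only the interface: a set of slots plus one compact latent action yields the next set of slots.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, Z = 4, 8, 3  # number of slots, slot dim, latent-action dim (illustrative)

def transition(slots, z, W_s, W_z):
    """Hypothetical residual transition ({s^k_t}, z_t) -> {s^k_{t+1}}.

    The same latent action z is broadcast to every slot, so permuting the
    input slots permutes the output slots in the same way.
    """
    return slots + np.tanh(slots @ W_s + z @ W_z)

# Random weights stand in for trained parameters (assumption).
W_s = rng.normal(scale=0.1, size=(D, D))
W_z = rng.normal(scale=0.1, size=(Z, D))

slots_t = rng.normal(size=(K, D))   # object-centric state at time t
z_t = rng.normal(size=(Z,))         # compact latent action
slots_next = transition(slots_t, z_t, W_s, W_z)
```

Because the update is applied row by row, the sketch treats the slots as an unordered set, which is one common reason for choosing slot-wise dynamics.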

2. Core Methodologies

Object-centric scene decomposition

Nearly all frameworks begin with an object-centric encoder, e.g., Slot Attention (Mosbach et al., 2024, Villar-Corrales et al., 11 Feb 2025), SAVi, transformer-based particles (Daniel et al., 4 Mar 2026), or spatial softmax keypoints. These modules decompose an observation $o_t$ into $K$ slots/particles, each embedding spatial, appearance, and/or dynamic information of a candidate object.

  • Slot Attention: Iterative cross-attention assigns pixels/features to slots, producing $S_t = [s^1_t, \ldots, s^K_t]$, each ideally tracking a consistent object.
  • Particle Models: Patches or regions in the scene are processed to obtain keypoints, bounding boxes, masks, and per-object appearance/dynamics attributes (Daniel et al., 4 Mar 2026).
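The iterative assignment performed by Slot Attention can be sketched in a few lines. This is a deliberately simplified version under stated assumptions: it omits the learned query/key/value projections, the GRU update, and the MLP of the full module, and the feature and slot arrays are random placeholders. It keeps the two defining steps: a softmax over the slot axis, so slots compete for each feature, followed by a per-slot weighted mean of the features.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(features, slots, n_iters=3, eps=1e-8):
    """Simplified Slot Attention (no learned projections, GRU, or MLP).

    features: (N, D) per-pixel or per-patch features.
    slots:    (K, D) initial slot vectors.
    Returns updated slots (K, D) and the final attention map (N, K).
    """
    dim = features.shape[1]
    for _ in range(n_iters):
        logits = features @ slots.T / np.sqrt(dim)   # (N, K)
        attn = softmax(logits, axis=1)               # normalise over slots
        weights = attn / (attn.sum(axis=0, keepdims=True) + eps)  # normalise per slot
        slots = weights.T @ features                 # weighted mean of features
    return slots, attn

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 16))      # placeholder encoder output
init_slots = rng.normal(size=(3, 16))  # placeholder slot initialisation
slots, attn = slot_attention(feats, init_slots)
```

Each row of `attn` sums to one, which is what makes the slots compete to explain each input location.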

Latent action inference

  • Inverse Dynamics Models (IDMs) are trained to output $z_t$ given current and next object-centric states, inferring a compact explanation of observed transitions.
  • In unsupervised settings, $z_t$ is learned via self-consistency or pseudo-supervision, e.g., matching optical flow (Bu et al., 20 Nov 2025) or slot transitions. When action labels $a_t$ are available, supervised heads map $z_t \mapsto a_t$ for policy decoding or fine-tuning (Klepach et al., 13 Feb 2025).
  • Latent Policy Priors: World models often maintain priors $p(z_t \mid S_t)$ for sampling plausible actions during planning or video generation (Daniel et al., 4 Mar 2026).
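A toy sketch of the two-stage recipe in the bullets above: latents are inferred from slot transitions without labels, then a small supervised head maps latents to actions using only a few labels. All specifics are assumptions made for illustration: the environment is linear (an action shifts every slot by `a @ B`), a frozen random projection `P` stands in for a trained IDM, and the head is fit by least squares. In this linear toy, ten labels are enough to decode actions almost exactly; real systems are nonlinear and need trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, Z, A = 4, 8, 2, 2  # slots, slot dim, latent dim, action dim (illustrative)

B = rng.normal(size=(A, D))  # hypothetical environment: action a shifts each slot by a @ B
P = rng.normal(size=(D, Z))  # frozen random projection standing in for a trained IDM

def idm(slots_t, slots_next):
    """Infer z_t from the mean slot displacement between consecutive states."""
    return (slots_next - slots_t).mean(axis=0) @ P

def make_transition(a):
    """Sample a random slot state and apply the toy environment dynamics."""
    s = rng.normal(size=(K, D))
    return s, s + a @ B

# Latents are computed for all transitions; no action labels are needed here.
actions = rng.normal(size=(200, A))
latents = np.stack([idm(*make_transition(a)) for a in actions])

# Supervised head z_t -> a_t, fit on only the first 10 labelled transitions.
n_labeled = 10
head, *_ = np.linalg.lstsq(latents[:n_labeled], actions[:n_labeled], rcond=None)

decoded = latents[n_labeled:] @ head
decode_error = np.abs(decoded - actions[n_labeled:]).max()
```

The design point is that label cost is paid only for the small head, while the latent space itself is shaped by unlabeled transitions.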

Dynamics models

  • Latent Dynamics: Forward models (FDMs) consume $(S_t, z_t)$ and predict $S_{t+1}$, allowing imagination or planning entirely within object-centric space (Mosbach et al., 2024, Villar-Corrales et al., 11 Feb 2025).
  • Interaction Modeling: Some models explicitly infer or factor object–object interaction graphs to disentangle object-specific versus relational/interaction dynamics. FIOC-WM (Feng et al., 4 Nov 2025) uses variational graph inference and conditional independence testing to learn factored priors and interaction structure.
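Planning inside object-centric space can be illustrated with a random-shooting planner over a forward model. This is a sketch under assumptions, not the planner of any cited paper: `fdm` is a hypothetical linear forward model, the candidate latent actions are random samples, and the cost is mean squared distance to a goal slot state.

```python
import numpy as np

def fdm(slots, z, W_z):
    """Hypothetical forward model: every slot shifts by a linear function of z."""
    return slots + z @ W_z

def plan(slots, goal, W_z, candidates, horizon=3):
    """Random shooting: roll each candidate latent action forward for
    `horizon` steps in slot space and keep the one that ends closest to the goal."""
    costs = []
    for z in candidates:
        s = slots
        for _ in range(horizon):
            s = fdm(s, z, W_z)
        costs.append(float(np.mean((s - goal) ** 2)))
    best = int(np.argmin(costs))
    return candidates[best], costs

rng = np.random.default_rng(0)
W_z = rng.normal(size=(3, 8))             # placeholder for learned dynamics weights
slots = rng.normal(size=(4, 8))           # current object-centric state
candidates = rng.normal(size=(16, 3))     # sampled latent actions
goal = slots + 3 * (candidates[5] @ W_z)  # a goal reachable by repeating candidate 5
best_z, costs = plan(slots, goal, W_z, candidates)
```

No pixels are rendered during the rollout; the imagined trajectory lives entirely in slot space, which is the property the bullet on latent dynamics describes.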

Pseudo-labels from motion cues

  • Optical Flow Masking: LAOF (Bu et al., 20 Nov 2025) uses RGB-formatted optical flow as a pseudo-label for agent motion, ensuring latent actions are action-aligned and robust to background distractors. Segmentation masks (LangSAM) restrict flow supervision to the agent, enhancing the object-centricity of $z_t$.
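The effect of restricting flow supervision to the agent can be shown with a masked regression loss. This is a generic sketch, not LAOF's loss: it assumes a two-channel (dx, dy) flow field rather than the RGB-formatted flow mentioned above, and the function name and array shapes are hypothetical.

```python
import numpy as np

def masked_flow_loss(pred_flow, target_flow, agent_mask, eps=1e-8):
    """Mean squared flow error computed over agent pixels only.

    pred_flow, target_flow: (H, W, 2) flow fields.
    agent_mask:             (H, W) with 1 on agent pixels, 0 elsewhere.
    """
    sq_err = ((pred_flow - target_flow) ** 2).sum(axis=-1)  # (H, W)
    return float((agent_mask * sq_err).sum() / (agent_mask.sum() + eps))

H, W = 6, 6
target = np.zeros((H, W, 2))
mask = np.zeros((H, W))
mask[2:4, 2:4] = 1.0  # a 2x2 "agent" region (illustrative)

# Error placed only on a background pixel: ignored by the masked loss.
pred_bg = target.copy()
pred_bg[0, 0] = [5.0, 5.0]
loss_background = masked_flow_loss(pred_bg, target, mask)

# Error placed on one of the four agent pixels: contributes to the loss.
pred_agent = target.copy()
pred_agent[2, 2] = [1.0, 0.0]
loss_agent = masked_flow_loss(pred_agent, target, mask)
```

Background motion contributes nothing to the gradient under this loss, which is the mechanism by which masking pushes the latent toward agent-caused motion.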

3. Representative Architectures

| Model | Object Encoder | Latent Action Module | Downstream use |
|---|---|---|---|
| LAOF (Bu et al., 20 Nov 2025) | DINOv2 + optical flow, mask | Spatio-temporal transformer IDM, flow decoder | Imitation learning, RL, label-scarce adaptation |
| SOLD (Mosbach et al., 2024) | CNN + Slot Attention | Slot-based transformer dynamics | RL, multi-object relational reasoning |
| PlaySlot (Villar-Corrales et al., 11 Feb 2025) | SAVi + Slot Attention | Per-slot invertible action modules (VQ) | Controllable prediction, planning |
| LPWM (Daniel et al., 4 Mar 2026) | Patchwise keypoints + masks | Per-particle stochastic latent action | Stochastic video modeling, control |
| FIOC-WM (Feng et al., 4 Nov 2025) | Pretrained ViT + Slot Attn | Interaction-structured dynamic slots | Hierarchical policy learning |
| OC-LALO (Klepach et al., 13 Feb 2025) | VideoSAUR + Slot Attn | Slot-wise FDM/IDM (proxy action labels) | Imitation learning from video |

Variants may emphasize slot-based deterministic encodings (e.g., SOLD), probabilistic/variational models (e.g., LPWM, DLPWM), or leverage object-centric affordance learning via segmentation and affordance prediction (e.g., PLATO (Belkhale et al., 2022)).
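The "(VQ)" entry for PlaySlot in the table above refers to vector quantization of latent actions. As a rough illustration of the generic technique only (not PlaySlot's implementation), the sketch below snaps each slot's continuous latent to its nearest entry in a small codebook. The codebook values and shapes are made up for the example.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-neighbour vector quantization.

    z:        (K, Z) continuous per-slot latents.
    codebook: (C, Z) discrete latent-action prototypes.
    Returns the quantized latents (K, Z) and the chosen code indices (K,).
    """
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (K, C)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

# Hypothetical 4-entry codebook of 2-D latent actions.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z = np.array([[0.9, 0.1], [0.1, -0.1], [0.2, 0.8]])
z_q, idx = quantize(z, codebook)
```

A discrete codebook gives a finite menu of latent actions, which can make controllable prediction and planning easier to enumerate.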

4. Losses, Supervision, and Training Protocols

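Key loss functions across this literature include forward-dynamics (next-state prediction) losses in object-centric space, flow-decoding losses against optical-flow pseudo-labels, and supervised action-decoding losses when action labels are available.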

Supervision strategies vary:

  • Fully unsupervised: using only video, object masks, or optical flow signals.
  • Pseudo-supervised: using motion-derived labels or proxy actions.
  • Weakly/few-shot supervised: minimal action labels enhance mapping from latent $z_t$ to $a_t$.

Pseudo-labeling via motion (e.g., object-centric flow) is especially effective in label-poor settings (Bu et al., 20 Nov 2025).
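One way to picture the pseudo-supervised and few-shot regimes together is a combined objective: a pseudo-label term on every sample plus an action term on the labelled subset only. The sketch below is an assumption-laden illustration; the function name, the `action_weight` parameter, and the additive weighting are hypothetical and not taken from the cited papers.

```python
import numpy as np

def mixed_loss(flow_loss_per_sample, action_pred, action_true, has_label,
               action_weight=1.0):
    """Pseudo-label loss on all samples plus action MSE on labelled samples.

    flow_loss_per_sample: (B,) pseudo-label loss for each sample.
    action_pred:          (B, A) decoded actions.
    action_true:          (B, A) ground-truth actions (may be NaN where unlabelled).
    has_label:            (B,) boolean mask of labelled samples.
    """
    pseudo = flow_loss_per_sample.mean()
    if not has_label.any():
        return float(pseudo)
    sq = ((action_pred - action_true) ** 2).mean(axis=1)  # (B,), NaN where unlabelled
    supervised = sq[has_label].mean()                     # unlabelled rows are dropped
    return float(pseudo + action_weight * supervised)

flow_losses = np.array([0.2, 0.4, 0.6, 0.8])
action_pred = np.zeros((4, 2))
action_true = np.array([[1.0, 1.0], [np.nan, np.nan], [np.nan, np.nan], [0.0, 0.0]])
has_label = np.array([True, False, False, True])
total = mixed_loss(flow_losses, action_pred, action_true, has_label)
```

Indexing with `has_label` before averaging keeps missing labels from contaminating the supervised term.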

5. Empirical Insights, Strengths, and Limitations

Empirical studies consistently find that object-centric latent action learning enhances sample efficiency, compositional generalization, and interpretable skill discovery.

Relevant metrics include MSE on action prediction, success rate in multi-object manipulation, LPIPS/FVD for video prediction, and policy learning curves/returns across unseen attribute or relational settings (Bu et al., 20 Nov 2025, Mosbach et al., 2024, Daniel et al., 4 Mar 2026).

6. Variants, Extensions, and Analysis

Interaction learning and hierarchy

FIOC-WM (Feng et al., 4 Nov 2025) explicitly models interaction graphs, learning both object slot latents and adjacency structures (via variational masks or conditional MI). The resulting primitives correspond to subgoals ("push A to B"), allowing hierarchical policies: a high-level module sequences latent interaction goals, while low-level controllers execute them in object space.

Intrinsic motivation

Object-centric latent action models also serve as the basis for intrinsic motivation and curriculum building. By tracking learning progress across distinct object-action-outcome regions, agents self-organize their exploration into stage-wise skill emergence, mirroring trajectories observed in human development (Sener et al., 2020).
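Learning-progress-driven exploration can be sketched as follows. The specifics are assumptions for illustration: progress is measured as the absolute change in mean prediction error between two consecutive windows, the agent greedily picks the region with the highest progress, and the region names are invented.

```python
import numpy as np

def learning_progress(errors, window=5):
    """Absolute change in mean prediction error between the two most recent windows."""
    e = np.asarray(errors, dtype=float)
    if len(e) < 2 * window:
        return 0.0  # not enough history to estimate progress
    older = e[-2 * window:-window].mean()
    recent = e[-window:].mean()
    return float(abs(older - recent))

def choose_region(error_history, window=5):
    """Pick the object-action-outcome region with the highest learning progress."""
    progress = {name: learning_progress(errs, window)
                for name, errs in error_history.items()}
    return max(progress, key=progress.get), progress

# Hypothetical regions: one being learned, one unlearnable, one already mastered.
history = {
    "push_block": np.linspace(1.0, 0.1, 10),  # error is dropping
    "background_noise": np.full(10, 0.5),     # error is flat and high
    "reach_target": np.full(10, 0.01),        # error is flat and low
}
region, progress = choose_region(history)
```

Regions that are either mastered or unlearnable show no progress and are skipped, so attention moves between skills as each is acquired, which is the staged behaviour described above.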

Multimodal and language grounding

Slot- or particle-based world models are being integrated with LLMs for language-guided planning and simulation. Conditioning generative models on language-embedded instructions enables flexible, goal-directed object manipulation (Jeong et al., 8 Mar 2025).

Failure modes and ongoing challenges

Latent drift (slot or particle identity switching or jitter around contact events) destabilizes policy learning, as shown in representation shift analyses (Ferraro et al., 8 Nov 2025). Regularization (EMA of slots; strong slot-identity priors; denoising objectives) and end-to-end finetuning are under investigation to enhance control stability.
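Of the regularizers listed, an exponential moving average (EMA) over slots is the simplest to illustrate. The sketch below is a generic smoothing filter under assumed shapes and an assumed `decay` value; it damps frame-to-frame jitter but does not by itself fix slot identity switching, which requires matching slots across frames.

```python
import numpy as np

def ema_slots(slot_sequence, decay=0.9):
    """Exponential moving average over time for each slot.

    slot_sequence: (T, K, D) slots per frame, assumed already matched across frames.
    Returns smoothed slots with the same shape.
    """
    smoothed = np.empty_like(slot_sequence, dtype=float)
    smoothed[0] = slot_sequence[0]
    for t in range(1, len(slot_sequence)):
        smoothed[t] = decay * smoothed[t - 1] + (1.0 - decay) * slot_sequence[t]
    return smoothed

rng = np.random.default_rng(0)
T, K, D = 50, 4, 8
base = rng.normal(size=(1, K, D))                 # static underlying objects
raw = base + 0.3 * rng.normal(size=(T, K, D))     # jittery per-frame slots
smooth = ema_slots(raw)

raw_jitter = np.abs(np.diff(raw, axis=0)).mean()
smooth_jitter = np.abs(np.diff(smooth, axis=0)).mean()
```

A higher `decay` gives steadier inputs to the policy at the cost of slower response to genuine changes such as contact events.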

Limitations persist in scaling to uncurated video, dealing with highly complex scenes (slot assignment errors), learning richer interaction primitives, and transferring to new object categories without strong object priors (Klepach et al., 13 Feb 2025, Daniel et al., 4 Mar 2026). Performance may degrade with high proportions of noisy pseudo-labels; label sweeps indicate a regime where flow or mask pseudo-supervision is maximally beneficial (up to roughly 10% action labels) (Bu et al., 20 Nov 2025).

7. Outlook and Impact

Object-centric latent action learning is emerging as a unifying mechanism for robust, scalable policy and predictive model training in embodied AI. By providing reusable, interpretable, and transferable object-level abstractions, these methods enable:

  • Zero/few-shot agent adaptation in real-world manipulation and compositional multi-object environments.
  • Compositional generalization across object sets, tasks, and modalities (vision, language, action).
  • Efficient model-based RL and planning with tractable sample complexity.
  • Interpretable skill discovery for curriculum and lifelong learning.

Ongoing research is focused on: scaling up to large-scale real video, richer dynamic scene changes, free-form particle/object tracking, unified multimodal context (action, language, audio), and on-policy end-to-end learning with explicit task rewards (Daniel et al., 4 Mar 2026, Bu et al., 20 Nov 2025, Li et al., 28 Nov 2025, Mosbach et al., 2024, Klepach et al., 13 Feb 2025).
