Latent Action Models Overview

Updated 23 February 2026
  • Latent Action Models are self-supervised frameworks that learn compact, latent action representations from video data by mapping observable transitions to underlying agent-driven changes.
  • They employ inverse dynamics encoders and forward dynamics models to isolate minimal, action-relevant features while filtering out distractors, thereby enhancing imitation and transfer learning.
  • Integrated into world models and vision-language-action systems, these models enable efficient policy learning, robust planning, and scalable control across diverse tasks and environments.

A latent action model is a self-supervised approach for inferring an internal, low-dimensional action representation from observation-only data, primarily video, with little or no access to explicit action labels. Latent Action Models (LAMs) and their generalizations provide critical interface layers in recent world models, end-to-end vision-language-action (VLA) models, offline reinforcement learning agents, and generative video planners. The latent action space captures the agent-driven, controllable aspects of inter-frame visual transitions while filtering out irrelevant or confounding factors (distractors), thereby supporting efficient imitation, transfer, policy learning, and generalization across tasks, embodiments, and data sources.

1. Core Principles and Mathematical Formulation

Latent Action Models postulate that observable transitions between consecutive high-dimensional observations, $o_t \rightarrow o_{t+1}$, are mediated by unobserved ("latent") actions $z_t$. The canonical learning setup involves:

  • An inverse dynamics encoder (IDM) $z_t = E(o_t, o_{t+1})$ mapping frame pairs to latent actions.
  • A forward dynamics model (FDM) $\hat{o}_{t+1} = D(o_t, z_t)$ reconstructing the next observation from the previous frame and the latent action.

The training objective is commonly an MSE or feature-space reconstruction loss:

$$\mathcal{L}_{\rm recon} = \mathbb{E}_t\left[ \| D(o_t, E(o_t, o_{t+1})) - o_{t+1} \|^2 \right]$$

To prevent trivial solutions (e.g., copying the next frame), capacity bottlenecks are imposed via low latent dimensionality, quantization (VQ-VAE codebooks), information bottlenecks, or regularizing priors. The latent action $z_t$ is thus forced to capture the minimal, action-driven factors essential for predicting the future state (Nikulin et al., 1 Feb 2025, Bu et al., 20 Nov 2025, Ye et al., 2024, Alles et al., 10 Dec 2025, Cai et al., 30 Sep 2025).
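The IDM/FDM setup above can be sketched as a toy linear instance (a minimal numpy illustration; the dimensions, weight matrices, and linear maps here are hypothetical stand-ins for the real encoder/decoder networks):

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM = 64, 4  # tight bottleneck: LATENT_DIM << OBS_DIM

# Toy linear inverse dynamics encoder E and forward dynamics model D.
W_enc = rng.normal(scale=0.1, size=(2 * OBS_DIM, LATENT_DIM))       # E: (o_t, o_{t+1}) -> z_t
W_dec = rng.normal(scale=0.1, size=(OBS_DIM + LATENT_DIM, OBS_DIM)) # D: (o_t, z_t) -> o_{t+1}

def encode(o_t, o_next):
    """IDM: map a frame pair to a low-dimensional latent action."""
    return np.concatenate([o_t, o_next], axis=-1) @ W_enc

def decode(o_t, z_t):
    """FDM: reconstruct the next observation from context and latent action."""
    return np.concatenate([o_t, z_t], axis=-1) @ W_dec

def recon_loss(o_t, o_next):
    """MSE reconstruction objective L_recon over a batch."""
    z = encode(o_t, o_next)
    return float(np.mean((decode(o_t, z) - o_next) ** 2))

o_t = rng.normal(size=(8, OBS_DIM))
o_next = o_t + rng.normal(scale=0.05, size=o_t.shape)  # small agent-driven change
print(round(recon_loss(o_t, o_next), 4))
```

Because the latent has only 4 dimensions against a 64-dimensional observation, the encoder cannot simply copy $o_{t+1}$ through; this is the capacity bottleneck the text describes.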

Depending on use case, $z_t$ may be continuous (preferred in high-dimensional or real-world video (Liang et al., 8 May 2025, Garrido et al., 8 Jan 2026, Alles et al., 10 Dec 2025)), discrete (for efficient tokenization (Ye et al., 2024, Chen et al., 31 Jul 2025)), or factored across entities (Wang et al., 18 Feb 2026).
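For the discrete case, a VQ-style bottleneck simply snaps each continuous latent to its nearest codebook entry; the index becomes the latent-action token. A minimal sketch (codebook size and contents are arbitrary toy values, not taken from any cited model):

```python
import numpy as np

rng = np.random.default_rng(1)
CODEBOOK_SIZE, LATENT_DIM = 16, 4
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))  # learned in a real VQ-VAE

def quantize(z):
    """Snap continuous latents to the nearest codebook entry (VQ-VAE style).

    Returns the quantized vectors and the discrete token indices.
    """
    d = np.sum((z[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)  # (B, K)
    idx = np.argmin(d, axis=-1)  # discrete latent-action tokens
    return codebook[idx], idx

z = rng.normal(size=(5, LATENT_DIM))
z_q, tokens = quantize(z)
print(tokens)
```

In a full VQ-VAE the codebook is trained jointly (with a straight-through gradient and commitment loss); here it is fixed purely to show the lookup.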

2. Methods for Learning and Grounding Latent Actions

Latent Action Discovery

Learning is entirely self-supervised in the observation-only regime. Techniques include:

Grounding to Real Actions

Once a latent space is established, supervised grounding (even with minimal action labels) is commonly employed:
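A minimal sketch of such grounding, assuming a pretrained LAM's latents and a small labeled subset (all data here is synthetic, and the least-squares head is a hypothetical stand-in for the learned action decoder):

```python
import numpy as np

rng = np.random.default_rng(2)
N, LATENT_DIM, ACTION_DIM = 1000, 4, 2
LABEL_FRAC = 0.025  # e.g. ~2.5% of samples carry real action labels

# Pretend these latents came from a pretrained LAM; true actions exist
# but are observed only on a small labeled subset.
A_true = rng.normal(size=(LATENT_DIM, ACTION_DIM))  # hypothetical latent->action map
z = rng.normal(size=(N, LATENT_DIM))
a = z @ A_true + rng.normal(scale=0.01, size=(N, ACTION_DIM))

# Fit a supervised grounding head on the labeled fraction only.
n_lab = int(N * LABEL_FRAC)
A_hat, *_ = np.linalg.lstsq(z[:n_lab], a[:n_lab], rcond=None)

# Evaluate grounding error on the held-out, unlabeled-at-train-time portion.
mse = float(np.mean((z[n_lab:] @ A_hat - a[n_lab:]) ** 2))
print(round(mse, 4))
```

The point of the sketch is that when the latent space is already well structured, even a tiny labeled fraction suffices to map it onto real actions.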

3. Architectures and Factorizations

Latent action architectures have diversified:

A common pipeline in vision-language-action models:

  1. Infer a latent action or token $z_t$ from $(o_t, o_{t+\Delta t})$ (via temporal transformer, VQ-VAE, or diffusion VAE).
  2. Condition VLA backbone or diffusion planner on both current context and $z_t$ to predict next frame, plan trajectory, or generate actions (Chen et al., 31 Jul 2025, Bi et al., 15 Dec 2025, Cai et al., 30 Sep 2025).
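The two-step pipeline above can be sketched end to end (a toy numpy version with hypothetical linear maps and a fixed codebook; real systems use transformer backbones and diffusion heads):

```python
import numpy as np

rng = np.random.default_rng(3)
OBS_DIM, LATENT_DIM, K, ACTION_DIM = 32, 4, 8, 2

W_enc = rng.normal(scale=0.1, size=(2 * OBS_DIM, LATENT_DIM))  # toy IDM
codebook = rng.normal(size=(K, LATENT_DIM))                    # toy VQ codebook
W_policy = rng.normal(scale=0.1, size=(OBS_DIM + K, ACTION_DIM))

def infer_token(o_t, o_next):
    """Step 1: map a frame pair to a discrete latent-action token."""
    z = np.concatenate([o_t, o_next]) @ W_enc
    return int(np.argmin(np.sum((codebook - z) ** 2, axis=-1)))

def act(o_t, token):
    """Step 2: condition the action head on context plus the token (one-hot)."""
    onehot = np.eye(K)[token]
    return np.concatenate([o_t, onehot]) @ W_policy

o_t, o_next = rng.normal(size=OBS_DIM), rng.normal(size=OBS_DIM)
tok = infer_token(o_t, o_next)
action = act(o_t, tok)
print(tok, action.shape)
```

The design point is the interface: the policy never sees raw future frames, only the compact token summarizing the transition.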

4. Addressing Distractors and Information Collapse

A persistent challenge is the entanglement of $z_t$ with action-correlated distractors (e.g., moving backgrounds, camera shake). Key solutions include:

  • Object-centric masking (MaskLAM): Multiply FDM loss with per-pixel segmentation masks to focus gradients on the agent or its manipulated objects (Adnan et al., 2 Feb 2026).
  • Optical flow loss: Reconciling agent-induced optical flow with the learned $z_t$ ensures action-relevance and suppresses training variance under distractors (Bu et al., 20 Nov 2025, Bi et al., 15 Dec 2025).
  • Supervision injection: LAOM demonstrates that incorporating even 2.5% action-labeled samples during LAM training robustly aligns $z_t$ and recovers 4–8× downstream performance over unsupervised baselines (Nikulin et al., 1 Feb 2025).
  • Prompted VLM targets: Conditioning FDM targets on promptable embeddings derived from "ignore background" or "task-centric" VLM queries recovers 6× higher success rates under distractors (Nikulin et al., 30 Jan 2026).
  • Regularization and data augmentation: Multi-step inverse models, large latent dimensions, and strong data augmentation mitigate capacity collapse (Nikulin et al., 1 Feb 2025, Garrido et al., 8 Jan 2026).
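The object-centric masking idea can be illustrated with a mask-weighted per-pixel loss (a toy sketch; the images and mask are synthetic, and a real mask would come from a segmentation model rather than a hand-drawn square):

```python
import numpy as np

rng = np.random.default_rng(4)
H, W = 16, 16
pred = rng.normal(size=(H, W))
target = pred + rng.normal(scale=1.0, size=(H, W))  # error everywhere, incl. background
mask = np.zeros((H, W))
mask[4:12, 4:12] = 1.0  # hypothetical agent/object segmentation

def masked_fdm_loss(pred, target, mask):
    """Weight the per-pixel reconstruction error by the agent mask, so
    gradients concentrate on agent pixels and ignore distractor regions."""
    err = (pred - target) ** 2
    return float(np.sum(mask * err) / np.sum(mask))

plain = float(np.mean((pred - target) ** 2))
masked = masked_fdm_loss(pred, target, mask)
print(round(masked, 4))
```

With an all-ones mask the loss reduces to the plain MSE; with an agent mask, background distractors contribute no gradient at all.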

Models without these controls exhibit catastrophic failure in the presence of action-correlated distractors: action alignment and downstream policy success degrade to near-zero, despite seemingly successful reconstruction (Nikulin et al., 1 Feb 2025, Bu et al., 20 Nov 2025, Adnan et al., 2 Feb 2026, Nikulin et al., 30 Jan 2026).

5. Integrating Latent Actions into World Models and VLA Systems

Latent action spaces have become central to the scalability and transferability of large world models and VLA systems:

6. Experimental Outcomes and Benchmarking

Latent action models consistently set state-of-the-art or near-optimal performance across a wide array of simulation and real-world robotic benchmarks. Key results include:

  • MaskLAM: Up to 4× improvement in downstream control on MuJoCo agents with strong distractor backgrounds; linear probe alignment improved 3× (Adnan et al., 2 Feb 2026).
  • Optical flow-constrained methods (LAOF, Motus): +11–48% enhancements in OOD and real-robot tasks; action alignment MSE of 0.014 relative to 0.044–0.122 for earlier baselines (Bu et al., 20 Nov 2025, Bi et al., 15 Dec 2025).
  • Minimal supervision (LAOM): 2.5% action labels yield a 4× increase in normalized returns under strong noise (Nikulin et al., 1 Feb 2025).
  • Prompted VLM LAMs: 6× increase in downstream task success rate with distractors (Nikulin et al., 30 Jan 2026).
  • End-to-end world models: LAWM achieves 62.4 normalized return (DeepMind Control Suite) with 5% action labels, outperforming model-based and model-free baselines (Alles et al., 10 Dec 2025).

7. Open Challenges and Future Directions

Research convergence highlights several limitations and avenues:

  • Scaling to real-world, in-the-wild video: Architectural (causal ViTs, cross-scene controllers), regularization (sparse, noisy latents), and grounding (camera-relative actions) remain active work (Garrido et al., 8 Jan 2026).
  • Factoring and generalization: Factored LAMs (FLAM) and scene decomposition are essential for multi-agent and complex embodied settings (Wang et al., 18 Feb 2026).
  • Critic/value modeling for latent plans: Hierarchical/planning models with explicit critics depend on further value-function learning in $z_t$ space (Chen et al., 31 Jul 2025).
  • Efficient integration with pretrained world generators: Co-evolving architectures avoid redundant training and allow bidirectional adaptation of action space and world model (Wang et al., 30 Oct 2025, Bi et al., 15 Dec 2025).
  • Zero-shot and sim-to-real transfer: These hinge on the physical grounding of $z_t$ (via proprioceptive, flow, or scene segmentation losses), robust adaptation protocols, and scaling of training on diverse, large-scale data sources (Bi et al., 15 Dec 2025, Li et al., 28 Nov 2025, Cai et al., 30 Sep 2025).

Latent Action Models, by abstracting agent-induced change from raw sensors and text, have become foundational elements for scalable, efficient, and robust control in vision-language-action learning and world modeling pipelines. Their ongoing evolution continues to close the gap between self-supervised video understanding and universally transferable, controllable robotic agents.
