Latent Action Models Overview
- Latent Action Models are self-supervised frameworks that learn compact, latent action representations from video data by mapping observable transitions to underlying agent-driven changes.
- They employ inverse dynamics encoders and forward dynamics models to isolate minimal, action-relevant features while filtering out distractors, thereby enhancing imitation and transfer learning.
- Integrated into world models and vision-language-action systems, these models enable efficient policy learning, robust planning, and scalable control across diverse tasks and environments.
A latent action model is a self-supervised approach for inferring an internal, low-dimensional action representation from observation-only data, primarily video, with no or weak access to explicit action labels. Latent Action Models (LAMs) and their generalizations provide critical interface layers in recent world models, end-to-end vision-language-action (VLA) models, offline reinforcement learning agents, and generative video planners. The latent action space captures the agent-driven, controllable aspects of inter-frame visual transitions while filtering out irrelevant or confounding factors (distractors), thereby supporting efficient imitation, transfer, policy learning, and generalization across tasks, embodiments, and data sources.
1. Core Principles and Mathematical Formulation
Latent Action Models postulate that observable transitions between consecutive high-dimensional observations, $(o_t, o_{t+1})$, are mediated by unobserved ("latent") actions $z_t$. The canonical learning setup involves:
- An inverse dynamics encoder (IDM) mapping frame pairs to latent actions.
- A forward dynamics model (FDM) reconstructing the next observation from the previous frame and the latent action.
The training objective is commonly an MSE or feature-space reconstruction loss:

$$\mathcal{L}_{\text{LAM}} = \left\| f_{\text{FDM}}(o_t, z_t) - o_{t+1} \right\|^2, \qquad z_t = f_{\text{IDM}}(o_t, o_{t+1}).$$

To prevent trivial solutions (e.g. copying the next frame), capacity bottlenecks are imposed—via low latent dimensionality, quantization (VQ-VAE codebooks), information bottlenecks, or regularizing priors. The latent action $z_t$ is thus forced to capture the minimal, action-driven factors essential for predicting the future state (Nikulin et al., 1 Feb 2025, Bu et al., 20 Nov 2025, Ye et al., 2024, Alles et al., 10 Dec 2025, Cai et al., 30 Sep 2025).
Depending on the use case, $z_t$ may be continuous (preferred in high-dimensional or real-world video (Liang et al., 8 May 2025, Garrido et al., 8 Jan 2026, Alles et al., 10 Dec 2025)), discrete (for efficient tokenization (Ye et al., 2024, Chen et al., 31 Jul 2025)), or factored across entities (Wang et al., 18 Feb 2026).
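The IDM+FDM objective above can be sketched with linear stand-ins for both modules. This is a toy NumPy illustration of the bottleneck structure only; the function names, linear maps, and dimensions are illustrative, not from any cited paper:

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, LATENT_DIM = 16, 2  # capacity bottleneck: LATENT_DIM << OBS_DIM

# Hypothetical linear stand-ins for the IDM encoder and FDM decoder.
W_idm = rng.normal(scale=0.1, size=(LATENT_DIM, 2 * OBS_DIM))
W_fdm = rng.normal(scale=0.1, size=(OBS_DIM, OBS_DIM + LATENT_DIM))

def idm(o_t, o_next):
    """Inverse dynamics: infer the latent action z_t from a frame pair."""
    return W_idm @ np.concatenate([o_t, o_next])

def fdm(o_t, z_t):
    """Forward dynamics: predict o_{t+1} from o_t and z_t."""
    return W_fdm @ np.concatenate([o_t, z_t])

def lam_loss(o_t, o_next):
    """One-step reconstruction objective ||FDM(o_t, IDM(o_t, o_{t+1})) - o_{t+1}||^2."""
    z_t = idm(o_t, o_next)
    return float(np.mean((fdm(o_t, z_t) - o_next) ** 2))

o_t, o_next = rng.normal(size=OBS_DIM), rng.normal(size=OBS_DIM)
loss = lam_loss(o_t, o_next)
```

Because the latent has only two dimensions against sixteen observation dimensions, minimizing this loss forces $z_t$ to carry only the transition-relevant information, which is the bottleneck argument made above.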
2. Methods for Learning and Grounding Latent Actions
Latent Action Discovery
Learning is entirely self-supervised in the observation-only regime. Techniques include:
- Reconstruction-based: Force $z_t$ to encode all agent-driven change necessary for one-step or multi-step prediction (Nikulin et al., 1 Feb 2025, Alles et al., 10 Dec 2025).
- Optical flow constraints: Use dense inter-frame flow (e.g., RAFT) as pseudo-supervised targets to enhance action-relevance and robustness to distractors (Bu et al., 20 Nov 2025, Bi et al., 15 Dec 2025).
- Prompted VLM embeddings: Use vision-language models (VLMs) to provide task-centric semantic embeddings as targets, focusing the representation on controllable change (Nikulin et al., 30 Jan 2026).
- Segmentation masking: Suppress gradients on background pixels to disentangle latent actions from action-correlated distractors (Adnan et al., 2 Feb 2026).
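The optical-flow variant of these objectives can be illustrated as regressing a flow head onto precomputed dense flow. A hedged sketch: the linear head and `predict_flow` are hypothetical stand-ins, and random data replaces the RAFT output:

```python
import numpy as np

rng = np.random.default_rng(3)
H, W, LATENT_DIM = 8, 8, 2

# Hypothetical linear head decoding a latent action into a dense 2-D flow field.
W_flow = rng.normal(scale=0.1, size=(H * W * 2, LATENT_DIM))

def predict_flow(z):
    """Decode z_t into a per-pixel (dx, dy) flow field."""
    return (W_flow @ z).reshape(H, W, 2)

def flow_loss(z, target_flow):
    """Pseudo-supervised objective: match the decoded flow to a precomputed
    flow target (e.g. RAFT output; random data stands in for it here)."""
    return float(np.mean((predict_flow(z) - target_flow) ** 2))

z = rng.normal(size=LATENT_DIM)
target_flow = rng.normal(size=(H, W, 2))
loss = flow_loss(z, target_flow)
```

Regressing onto flow rather than pixels means $z_t$ must explain motion, not appearance, which is why the flow target suppresses appearance-level distractors.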
Grounding to Real Actions
Once a latent space is established, supervised grounding (even with minimal action labels) is commonly employed:
- A small decoder MLP maps $z_t$ (possibly together with $o_t$) to the real action $a_t$, learning from a few labeled transitions (Liang et al., 8 May 2025, Nikulin et al., 1 Feb 2025, Bi et al., 15 Dec 2025).
- Jointly pretraining the action decoder during the unsupervised phase greatly improves alignment and sample efficiency (Liang et al., 8 May 2025, Bi et al., 15 Dec 2025).
- In the world-model setting, downstream policies are trained in $z$-space and mapped to controls via a learned decoder (Alles et al., 10 Dec 2025, Garrido et al., 8 Jan 2026, Ye et al., 2024, Wang et al., 18 Feb 2026).
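With a frozen latent encoder, grounding reduces to fitting a small decoder head on the few labeled transitions. A least-squares sketch on synthetic data; all names, dimensions, and the linear ground-truth map are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT_DIM, ACTION_DIM, N_LABELED = 4, 2, 32  # only a few labeled transitions

# Pretend latents from a frozen LAM encoder; a hidden linear map generates
# their scarce real-action labels (purely synthetic stand-in data).
Z = rng.normal(size=(N_LABELED, LATENT_DIM))
W_true = rng.normal(size=(LATENT_DIM, ACTION_DIM))
A = Z @ W_true + 0.01 * rng.normal(size=(N_LABELED, ACTION_DIM))

# Fit the decoder head by least squares on the labeled subset.
W_dec, *_ = np.linalg.lstsq(Z, A, rcond=None)

def decode_action(z):
    """Map a latent action to a real control command."""
    return z @ W_dec

err = float(np.mean((decode_action(Z) - A) ** 2))
```

In practice the decoder is a small MLP rather than a linear map, but the sample-efficiency argument is the same: the hard representation-learning work is done unsupervised, so the labeled fit is a low-dimensional regression.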
3. Architectures and Factorizations
Latent action architectures have diversified:
- Standard two-module IDM+FDM pipeline (Nikulin et al., 1 Feb 2025, Bu et al., 20 Nov 2025).
- Hierarchical multi-branch or factored setups, as in FLAM, assign independent slot-wise latent actions to multiple entities, enabling disentangled multi-agent dynamics (Wang et al., 18 Feb 2026).
- Vision-language-action backbones: Incorporation into VLMs and VLAs as bottleneck or mid-level interface layers; e.g., learnable latent action queries appended to transformer input token sequences (Lian et al., 21 Jan 2026, Chen et al., 31 Jul 2025, Bi et al., 15 Dec 2025, Cai et al., 30 Sep 2025).
- Jointly trained Co-Evolving architectures integrate a pre-trained world model and a LAM, enabling co-adaptation through warm-up and end-to-end fine-tuning (Wang et al., 30 Oct 2025).
A common pipeline in vision-language-action models:
- Infer a latent action or token $z_t$ from $(o_t, o_{t+1})$ (via a temporal transformer, VQ-VAE, or diffusion VAE).
- Condition the VLA backbone or diffusion planner on both the current context and $z_t$ to predict the next frame, plan a trajectory, or generate actions (Chen et al., 31 Jul 2025, Bi et al., 15 Dec 2025, Cai et al., 30 Sep 2025).
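The discrete action tokens used in VQ-VAE-style pipelines amount to nearest-neighbor lookup in a learned codebook. A minimal sketch with a random (untrained) codebook standing in for learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(2)
CODEBOOK_SIZE, LATENT_DIM = 8, 2

# A random codebook stands in for learned VQ-VAE embeddings.
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))

def quantize(z):
    """Map a continuous latent action to its nearest codebook entry,
    yielding a discrete action token plus the quantized vector."""
    dists = np.sum((codebook - z) ** 2, axis=1)
    token = int(np.argmin(dists))
    return token, codebook[token]

z = rng.normal(size=LATENT_DIM)
token, z_q = quantize(z)
```

The integer `token` is what a transformer backbone consumes as an action vocabulary item, while `z_q` is the vector passed onward to a forward model or planner.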
4. Addressing Distractors and Information Collapse
A persistent challenge is the entanglement of $z_t$ with action-correlated distractors (e.g., moving backgrounds, camera shake). Key solutions include:
- Object-centric masking (MaskLAM): Multiply FDM loss with per-pixel segmentation masks to focus gradients on the agent or its manipulated objects (Adnan et al., 2 Feb 2026).
- Optical flow loss: Reconciling agent-induced flow with the learned $z_t$ ensures action-relevance and suppresses training variance under distraction (Bu et al., 20 Nov 2025, Bi et al., 15 Dec 2025).
- Supervision injection: LAOM demonstrates that incorporating even 2.5% action-labeled samples during LAM training robustly aligns the latent space with true actions and recovers 4–8× downstream performance over unsupervised baselines (Nikulin et al., 1 Feb 2025).
- Prompted VLM targets: Conditioning FDM targets on promptable embeddings derived from "ignore background" or "task-centric" VLM queries recovers 6× higher success rates under distractors (Nikulin et al., 30 Jan 2026).
- Regularization and data augmentation: Multi-step inverse models, large latent dimensions, and strong data augmentation mitigate capacity collapse (Nikulin et al., 1 Feb 2025, Garrido et al., 8 Jan 2026).
Models without these controls exhibit catastrophic failure in the presence of action-correlated distractors: action alignment and downstream policy success degrade to near-zero, despite seemingly successful reconstruction (Nikulin et al., 1 Feb 2025, Bu et al., 20 Nov 2025, Adnan et al., 2 Feb 2026, Nikulin et al., 30 Jan 2026).
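The object-centric masking idea above reduces to weighting the per-pixel FDM loss by a segmentation mask, so background error carries no gradient. A toy sketch (the function name and frames are illustrative, not the cited implementation):

```python
import numpy as np

def masked_fdm_loss(pred, target, mask):
    """Per-pixel reconstruction loss weighted by a segmentation mask, so
    background pixels (mask == 0) contribute no error and no gradient."""
    err = (pred - target) ** 2
    return float((mask * err).sum() / max(mask.sum(), 1e-8))

# Toy frames: one real agent-pixel error, one huge background-distractor error.
pred = np.zeros((4, 4))
target = np.zeros((4, 4))
target[0, 0] = 1.0      # agent pixel
target[3, 3] = 100.0    # background distractor
mask = np.zeros((4, 4))
mask[0, 0] = 1.0        # segmentation: only the agent pixel counts
loss = masked_fdm_loss(pred, target, mask)  # → 1.0; the distractor is ignored
```

Without the mask, the 100-unit background error would dominate the objective and push $z_t$ to model the distractor instead of the agent.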
5. Integrating Latent Actions into World Models and VLA Systems
Latent action spaces have become central to the scalability and transferability of large world models and VLA systems:
- Unifying action-free and action-conditioned training: Shared latent-action space enables pretraining on web-scale video and efficient RL with sparse control labels (Alles et al., 10 Dec 2025, Garrido et al., 8 Jan 2026, Tharwat et al., 22 Sep 2025).
- Vision-language-action fusion: Backbones such as VLMs (e.g., Qwen3-VL, PaliGemma) extend to accept latent-action queries, supporting Bayesian decomposition, information-theoretic regularization, and joint vision-action inference (Lian et al., 21 Jan 2026, Chen et al., 31 Jul 2025, Bi et al., 15 Dec 2025).
- Planning via latent spaces: LatentDiffuser formalizes planning as energy-guided sampling in continuous latent space, using score-based diffusion priors, yielding scalable control for both low-dimensional and high-dimensional (Adroit) tasks (Li, 2023).
- Factored modeling: Multi-entity scenes require per-object/action factorization; FLAM demonstrates improved representation quality and controllable multi-entity rollouts (Wang et al., 18 Feb 2026).
- Downstream transfer: The latent action interface supports few-shot adaptation by learning a small action decoder head with minimal real-world data, enabling robust performance across new environments, objects, and embodiments (Bi et al., 15 Dec 2025, Ye et al., 2024, Chen et al., 31 Jul 2025, Cai et al., 30 Sep 2025).
6. Experimental Outcomes and Benchmarking
Latent action models consistently achieve state-of-the-art or near-optimal performance across a wide array of simulated and real-world robotic benchmarks. Key results include:
- MaskLAM: Up to 4× improvement in downstream control on MuJoCo agents with strong distractor backgrounds; linear probe alignment improved 3× (Adnan et al., 2 Feb 2026).
- Optical flow-constrained methods (LAOF, Motus): +11–48% enhancements in OOD and real-robot tasks; action alignment MSE of 0.014 relative to 0.044–0.122 for earlier baselines (Bu et al., 20 Nov 2025, Bi et al., 15 Dec 2025).
- Minimal supervision (LAOM): 2.5% action labels yield a 4× increase in normalized returns under strong noise (Nikulin et al., 1 Feb 2025).
- Prompted VLM LAMs: 6× increase in downstream task success rate with distractors (Nikulin et al., 30 Jan 2026).
- End-to-end world models: LAWM achieves 62.4 normalized return (DeepMind Control Suite) with 5% action labels, outperforming model-based and model-free baselines (Alles et al., 10 Dec 2025).
7. Open Challenges and Future Directions
Research convergence highlights several limitations and avenues:
- Scaling to real-world, in-the-wild video: Architectural choices (causal ViTs, cross-scene controllers), regularization (sparse, noisy latents), and grounding (camera-relative actions) remain active areas of work (Garrido et al., 8 Jan 2026).
- Factoring and generalization: Factored LAMs (FLAM) and scene decomposition are essential for multi-agent and complex embodied settings (Wang et al., 18 Feb 2026).
- Critic/value modeling for latent plans: Hierarchical/planning models with explicit critics depend on further value-function learning in $z$-space (Chen et al., 31 Jul 2025).
- Efficient integration with pretrained world generators: Co-evolving architectures avoid redundant training and allow bidirectional adaptation of action space and world model (Wang et al., 30 Oct 2025, Bi et al., 15 Dec 2025).
- Zero-shot and sim-to-real transfer hinge on the physical grounding of $z_t$ (via proprioceptive, flow, or scene-segmentation losses), robust adaptation protocols, and scaling of training on diverse, large-scale data sources (Bi et al., 15 Dec 2025, Li et al., 28 Nov 2025, Cai et al., 30 Sep 2025).
Latent Action Models, by abstracting agent-induced change from raw sensors and text, have become foundational elements for scalable, efficient, and robust control in vision-language-action learning and world modeling pipelines. Their ongoing evolution continues to close the gap between self-supervised video understanding and universally transferable, controllable robotic agents.