
Latent Action Model: Compact Control Abstractions

Updated 17 December 2025
  • Latent Action Models are machine learning frameworks that represent control signals as compact latent variables, enabling efficient planning and policy optimization.
  • Techniques like VQ-VAE and variational inference extract latent actions from state transitions, reducing decision latency in high-dimensional and unstructured action spaces.
  • Empirical studies across robotics, RL, dialog, and video generation validate LAMs while highlighting challenges such as distractor sensitivity and scaling limits.

A Latent Action Model (LAM) is a machine learning framework in which the actions, policies, or control signals for an agent—whether a robot, dialog system, or generative model—are represented via a learned, often compact latent variable, rather than directly in the original high-dimensional or semantically unstructured action space. LAMs leverage unsupervised, self-supervised, or weakly supervised methods to infer these latent actions from observed transitions, sequences, or demonstrations, enabling more tractable planning, more effective pretraining from unlabelled data, and often improved generalization to new tasks, environments, or modalities. Latent actions can be either continuous or discrete, are typically discovered via variational, autoencoding, or quantization methods, and serve as an intermediary between observations and environment dynamics or downstream policy modules.

1. Architectural Principles and Core Formulations

The typical LAM pipeline consists of three central modules: an inverse-dynamics encoder (IDM) that infers the latent action from pairs or sequences of observed states or observations; a forward-dynamics model (FDM) that predicts future states or observations from the current state and a latent action; and an optional vector-quantization (VQ) or commitment mechanism that enforces discreteness or regularizes the latent space.
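
As a concrete illustration, the sketch below wires these three modules together in PyTorch; the MLP parameterization, layer widths, and loss shown are illustrative assumptions rather than the architecture of any cited work:

```python
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    """Minimal LAM: inverse-dynamics encoder (IDM) + forward-dynamics model (FDM)."""

    def __init__(self, state_dim: int, latent_dim: int, hidden: int = 256):
        super().__init__()
        # IDM: infer the latent action from an observed transition (s_t, s_{t+1}).
        self.idm = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )
        # FDM: predict the next state from the current state and the latent action.
        self.fdm = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s_t, s_next):
        z = self.idm(torch.cat([s_t, s_next], dim=-1))  # latent action
        s_pred = self.fdm(torch.cat([s_t, z], dim=-1))  # predicted next state
        return z, s_pred

model = LatentActionModel(state_dim=32, latent_dim=8)
s_t, s_next = torch.randn(16, 32), torch.randn(16, 32)
z, s_pred = model(s_t, s_next)
recon_loss = torch.mean((s_pred - s_next) ** 2)  # reconstruction term of the LAM loss
```

In a full pipeline, the optional quantization or commitment mechanism would be applied to `z` before it enters the FDM.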

In TAP (Jiang et al., 2022), the LAM is realized as a state-conditional VQ-VAE: the encoder $g$ processes future trajectory tokens into pooled blocks, then projects and quantizes them to yield discrete latent actions $z_i$ via a learned codebook $E \in \mathbb{R}^{K \times D}$:

$$z_i = e_{k^*}, \qquad k^* = \arg\min_{k \in \{1,\dots,K\}} \|q_i - e_k\|_2$$

The decoder $h$ serves as a parametric, state-conditional learned dynamics model, reconstructing long-horizon futures from the current state and the $M$-step tiled latent codes. The total loss comprises an $L_2$ reconstruction term and VQ-style codebook and commitment penalties.
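
This quantization step can be sketched as a generic VQ-VAE nearest-neighbour lookup with straight-through gradients and the standard codebook/commitment losses; it is not TAP's exact implementation, and the commitment weight `beta` is an assumed hyperparameter:

```python
import torch
import torch.nn.functional as F

def vector_quantize(q, codebook, beta: float = 0.25):
    """Nearest-neighbour lookup z_i = e_{k*} with VQ and commitment losses.

    q:        (B, D) continuous encoder outputs
    codebook: (K, D) learned embedding table E
    """
    dists = torch.cdist(q, codebook)           # ||q_i - e_k||_2 for all k, shape (B, K)
    k_star = dists.argmin(dim=-1)              # k* = argmin_k ||q_i - e_k||_2
    z = codebook[k_star]                       # z_i = e_{k*}

    codebook_loss = F.mse_loss(z, q.detach())  # pull codes toward encoder outputs
    commit_loss = F.mse_loss(q, z.detach())    # commitment penalty on the encoder
    z_st = q + (z - q).detach()                # straight-through gradient estimator
    return z_st, k_star, codebook_loss + beta * commit_loss

codebook = torch.nn.Parameter(torch.randn(512, 64))  # K = 512 codes of dimension D = 64
q = torch.randn(16, 64)
z, idx, vq_loss = vector_quantize(q, codebook)
```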

More generally, across LAMs one typically optimizes a variational ELBO or VQ-VAE objective, seeking latents $z$ such that

$$\mathcal{L} = \mathbb{E}[\text{reconstruction error}] + \mathcal{L}_{\text{reg}}$$

with the regularization either a KL-divergence to a prior (for VAEs) or VQ and commitment losses (for VQ-VAEs).
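
Concretely, the two standard instantiations can be written as follows, with $\mathrm{sg}[\cdot]$ the stop-gradient operator, $\beta$ the commitment weight, and $\hat{x}_\theta$ the decoder; this is the generic VAE/VQ-VAE form rather than any single paper's exact loss:

```latex
% VAE: negative ELBO with a KL regularizer toward the prior p(z)
\mathcal{L}_{\mathrm{VAE}}
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \lVert x - \hat{x}_\theta(z) \rVert_2^2 \right]
  + \mathrm{KL}\!\left( q_\phi(z \mid x) \,\Vert\, p(z) \right)

% VQ-VAE: codebook and commitment losses replace the KL term
\mathcal{L}_{\mathrm{VQ}}
  = \lVert x - \hat{x}_\theta(e_{k^*}) \rVert_2^2
  + \lVert \mathrm{sg}[q] - e_{k^*} \rVert_2^2
  + \beta \, \lVert q - \mathrm{sg}[e_{k^*}] \rVert_2^2
```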

2. Planning and Policy Optimization in Latent Spaces

LAMs facilitate tractable planning and RL in domains where the raw action space is intractably large or unstructured. In TAP (Jiang et al., 2022), planning is conducted entirely in the compact latent action space: candidate sequences of latent codes are sampled (or enumerated) and scored via a learned reward model and likelihood under the trained distribution, yielding orders-of-magnitude reduced decision latency in high-dimensional continuous control.
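
The sketch below illustrates such sampling-based planning over discrete latent codes. Here `fdm` and `reward_model` are assumed callables standing in for the learned dynamics and reward models, and uniform code sampling is a simplification: as noted above, candidates can also be scored by their likelihood under the trained distribution:

```python
import torch

@torch.no_grad()
def plan_in_latent_space(s0, fdm, reward_model, codebook,
                         horizon: int = 8, num_candidates: int = 256):
    """Sample candidate latent-action sequences, roll them out with the
    forward-dynamics model, and return the best-scoring first latent action."""
    K = codebook.shape[0]
    idx = torch.randint(K, (num_candidates, horizon))  # candidate code sequences
    z_seqs = codebook[idx]                             # (num_candidates, horizon, D)

    s = s0.expand(num_candidates, -1).clone()
    total_reward = torch.zeros(num_candidates)
    for t in range(horizon):
        s = fdm(s, z_seqs[:, t])                       # latent-conditioned rollout
        total_reward += reward_model(s).squeeze(-1)    # accumulate predicted reward

    best = total_reward.argmax()
    return z_seqs[best, 0]                             # first latent action to execute
```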

In dialog domains (Zhao et al., 2019), LAMs underpin policy optimization by structuring the policy over a discrete or continuous latent $z$, learned in an unsupervised or semi-supervised fashion from supervised dialog data or generated responses. Policy search exploits the latent policy $p_{\theta_e}(z \mid c)$, with only the latent policy updated during RL fine-tuning, thereby stabilizing optimization and increasing both language diversity and task reward.

For RL in continuous control, several works—in particular (Alles et al., 7 Nov 2024, Alles et al., 10 Dec 2025)—integrate LAMs with model-based offline RL architectures by using the latent action space as the support for policy learning, thus imposing a constraint on out-of-distribution queries and eliminating the reliance on explicit uncertainty penalties.
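
A minimal sketch of this design pattern follows, assuming a frozen latent-to-action decoder trained jointly with the LAM; bounding the latent with `tanh` is one illustrative way to keep policy outputs inside the learned support, not necessarily how the cited works implement the constraint:

```python
import torch
import torch.nn as nn

class LatentPolicy(nn.Module):
    """Policy acting in the LAM's latent space; a frozen decoder maps latents
    back to raw actions, keeping dynamics-model queries near the data manifold."""

    def __init__(self, state_dim: int, latent_dim: int, action_decoder: nn.Module):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim), nn.Tanh(),  # bounded latent stays in-support
        )
        self.decoder = action_decoder.eval()        # frozen, trained with the LAM
        for p in self.decoder.parameters():
            p.requires_grad_(False)

    def forward(self, state):
        z = self.net(state)        # policy output lives in the latent action space
        return self.decoder(z)     # decoded raw action for the environment
```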

3. Self-Supervised and Unsupervised Latent Action Discovery

A central motivation for LAMs is the exploitation of unlabeled or partially labeled data for action learning. This is achieved by training world models in which control signals are entirely inferred as latent actions, typically via variational encoders that reverse-engineer actions from observation transitions. For example, LAWM (Tharwat et al., 22 Sep 2025) posits a generative model $p(s_{t+1} \mid s_t, a_t)\, p(a_t \mid s_t)$ while inferring $q(a_t \mid s_t, s_{t+1})$ in place of ground-truth actions.
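
A minimal sketch of such a variational inverse-dynamics encoder, amortizing $q(a_t \mid s_t, s_{t+1})$ as a diagonal Gaussian with a standard-normal prior (the parameterization is a generic assumption, not LAWM's exact architecture):

```python
import torch
import torch.nn as nn

class VariationalIDM(nn.Module):
    """Amortized posterior q(a_t | s_t, s_{t+1}) over latent actions."""

    def __init__(self, state_dim: int, latent_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # mean and log-variance
        )

    def forward(self, s_t, s_next):
        mu, logvar = self.net(torch.cat([s_t, s_next], dim=-1)).chunk(2, dim=-1)
        a = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized sample
        # KL divergence to a standard-normal prior, used as the ELBO regularizer.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return a, kl
```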

LAPA (Ye et al., 15 Oct 2024) and similar paradigms employ encoder–decoder networks with VQ-VAE and NSVQ mechanisms to quantize small “delta” tokens of motion between frames, enabling the learning of a discrete vocabulary of action primitives solely from frame pairs of Internet-scale videos, with subsequent mapping to robot actions requiring only minimal labeled supervision.

Recent advances involve robustifying this process under distractors and in-the-wild videos. LAOM (Nikulin et al., 1 Feb 2025) introduces a multi-step inverse dynamics objective and eschews codebook quantization for high-capacity continuous latents. LAOF (Bu et al., 20 Nov 2025) uses pseudo-supervision from optical flow to bias the latent actions toward true agent motion and away from appearance distractors, with strong empirical improvements in label-sparse regimes.
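
One way to realize flow-based pseudo-supervision is sketched below: the flow field is pooled to a coarse motion summary and regressed from the latent action via an auxiliary head. This pooling-and-regression scheme is an assumed simplification of LAOF's objective, shown only to make the grounding idea concrete:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_grounding_loss(z, flow, head: nn.Module):
    """Bias latent actions toward agent motion via optical-flow pseudo-labels.

    z:    (B, latent_dim) latent actions from the IDM
    flow: (B, 2, H, W) precomputed optical flow between consecutive frames
    head: auxiliary module projecting latents to the pooled flow dimension
    """
    pooled = flow.flatten(2).mean(-1)   # (B, 2) coarse per-frame motion summary
    return F.mse_loss(head(z), pooled)  # latents must predict the dominant motion
```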

4. Extensions to World Modeling, Diffusion, and Generative Architectures

LAMs have been embedded into large-scale world models for both simulation and planning. In CoLA-World (Wang et al., 30 Oct 2025), the LAM is a discrete VQ bottleneck atop a spatial–temporal Transformer, co-trained with a powerful pre-trained video generation model (OpenSora) via a critical warm-up phase that avoids representational collapse. Joint end-to-end training creates a two-way synergy between the action encoder and the world model, yielding robust, high-entropy latent codes with superior video simulation and visual planning performance.

Further, in procedure planning and generative domains, LAMs are paired with denoising diffusion models, e.g., in CLAD (Shi et al., 9 Mar 2025), where a VAE-based constraint encoding from observed start and goal observations guides a diffusion model to synthesize semantically consistent action sequences by directly conditioning the U-Net bottleneck on the learned latent code.
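
A minimal sketch of such bottleneck conditioning, assuming additive injection of a projected latent code into the denoiser's bottleneck features (CLAD's actual conditioning mechanism may differ):

```python
import torch
import torch.nn as nn

class LatentConditionedBottleneck(nn.Module):
    """Condition a denoising U-Net's bottleneck on a learned constraint code."""

    def __init__(self, feat_dim: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Linear(latent_dim, feat_dim)

    def forward(self, h, z):
        # h: (B, T, feat_dim) bottleneck features of the diffusion denoiser
        # z: (B, latent_dim) VAE constraint code from start/goal observations
        return h + self.proj(z).unsqueeze(1)  # broadcast code over the sequence axis
```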

For skeleton-based action segmentation or generation (Yang et al., 2023, Wang et al., 2019), LAMs parameterize primitive or composable motion directions in a disentangled latent space, augmenting transferability and compositional richness of downstream models.

5. Empirical Validation across Domains

Empirical studies demonstrate the effectiveness of LAMs across robotics, RL, language, and video generation tasks. In TAP (Jiang et al., 2022), for high-dimensional Adroit robotic hand manipulation, LAM-based planning surpasses both model-based and strong model-free baselines, and decision latency remains effectively constant as raw action dimensionality increases.

In vision-language-action (VLA) and vision-language models, LAPA (Ye et al., 15 Oct 2024), LAWM (Tharwat et al., 22 Sep 2025), villa-X (Chen et al., 31 Jul 2025), and LatBot (Li et al., 28 Nov 2025) show that LAM pretraining can outperform or match state-of-the-art policies trained on hundreds of thousands of labeled robot actions, especially for transfer to unseen tasks, instructions, or novel robot embodiments.

For dialog policy, LAM-based approaches (Zhao et al., 2019, Lubis et al., 2020) consistently achieve better diversity, lower perplexity, and higher task reward than word-level RL baselines, with lower-variance gradients. In action segmentation and human-action generation (Yang et al., 2023, Wang et al., 2019), LAMs yield significantly higher segmentation mAP and much richer motion diversity than models trained directly in pose or action space.

6. Limitations, Challenges, and Future Directions

Key limitations of LAMs include vulnerability to action-correlated distractors in unconstrained video, codebook collapse or under-utilization in VQ-based models, and sensitivity to the latent action capacity and quantization regime. The presence of strong, non-agent-driven visual distractors necessitates the integration of optical flow or weak action supervision to ground the latent space (Nikulin et al., 1 Feb 2025, Bu et al., 20 Nov 2025). Continuous latents, while high-capacity, can encounter unbounded exploration or instability in RL (Zhao et al., 2019).

Emerging trends include geometry-aware and multiscale LAMs for longer-horizon, more structured action reasoning (Cai et al., 30 Sep 2025), embedding physical or proprioceptive priors (e.g., LatBot’s scene/motion token decomposition (Li et al., 28 Nov 2025)), co-evolution with powerful generative world models (Wang et al., 30 Oct 2025), and integration with diffusion-based planning (Shi et al., 9 Mar 2025). Addressing scaling to more diverse robot embodiments, hierarchical control, and combining VLA with LAMs for foundation models remain active research directions.

7. Representative Algorithms and Empirical Benchmarks

| Model | Domain | Representation | Key Outcome |
|-------|--------|----------------|-------------|
| TAP (Jiang et al., 2022) | Robot RL | Discrete VQ-VAE | Efficient high-dimensional planning |
| LAPA (Ye et al., 15 Oct 2024) | VLA pretraining | VQ-VAE delta tokens | Outperforms on real robots |
| LAWM (Tharwat et al., 22 Sep 2025) | Imitation/RL | Model-agnostic | 94.2% avg. success rate (LIBERO) |
| LAOM (Nikulin et al., 1 Feb 2025) | Vision RL | Continuous latent (no VQ) | 8× latent-probe gain under distractors |
| LatBot (Li et al., 28 Nov 2025) | VLA transfer | Scene/motion tokens | 98% sim., 63% 10-shot real |
| SSM-VLA (Cai et al., 30 Sep 2025) | VLA generation | Multi-scale, geometry-aware | 4.38-step avg. chain-of-thought |
| CLAD (Shi et al., 9 Mar 2025) | Procedure planning | VAE + diffusion | +8 pp success rate vs. state of the art |
| LARNet (Biyani et al., 2021) | Video synthesis | Hierarchical latent motion | 28.4 PSNR, 12.9 FVD (NTU) |

Conclusion

Latent Action Models underpin a broad research agenda at the intersection of efficient planning, scalable learning from observations, self-supervised robotics, offline RL, and generative sequence modeling. By learning compact, semantically grounded control abstractions, LAMs enable effective action reasoning, label-efficient transfer, and robust downstream policy optimization across modalities, tasks, and domains. Their continued development—particularly in handling distractors, scaling, compositionality, and integration with large foundation models—remains central to advancing embodied intelligence frameworks.
