LAPO: Latent-to-Action Policy Optimization
- LAPO is a framework that infers and optimizes policies over latent action spaces derived from observational data without requiring ground-truth action labels.
- It employs a training pipeline with inverse dynamics and forward models using discrete latent codes, designed to promote determinism, disentanglement, and informativeness of the learned latent actions.
- Empirical studies demonstrate that LAPO improves sample efficiency and robustness in complex domains, needing only minimal labeled data for accurate action mapping.
Latent-to-Action Policy Optimization (LAPO) is a suite of methodologies and objectives designed to infer and optimize policies over latent “action” spaces learned directly from observation data (typically videos), often without access to ground-truth action labels. By constructing policy learning pipelines around these latent action codes, LAPO allows scalable pre-training, unified modeling of perception and control, and efficient adaptation to new tasks and real-world environments. Rigorous mathematical analysis has clarified when the structure of the latent space recovers ground-truth actions, while empirical studies have demonstrated the statistical advantages and practical challenges that arise in complex, high-dimensional domains.
1. Formal Problem Setting and Latent Action Recovery
LAPO operates under the assumption that only state/observation transitions (and not agent actions) are observable. In the canonical setup, a dataset of i.i.d. transitions is available, either (s, s′) in state space (Lachapelle, 1 Oct 2025) or (o_t, o_{t+1}) in observation space (Schmidt et al., 2023, Klepach et al., 13 Feb 2025, Nikulin et al., 1 Feb 2025). The underlying data-generating process postulates an (unobserved) action $a$ drawn from a behavior policy $\pi(a \mid s)$ and a deterministic next-state transition $s' = f(s, a)$ (Lachapelle, 1 Oct 2025).
The central objective is to discover an encoder (inverse dynamics model, IDM) $q(\hat{a} \mid s, s')$ (with $\hat{a}$ in a latent action set $\hat{\mathcal{A}}$) and a forward model $\hat{f}(s, \hat{a})$ such that the encoder's output is maximally informative about the underlying action and predictive of future states. In typical vision-based LAPO (Schmidt et al., 2023, Klepach et al., 13 Feb 2025), the models are realized as neural networks equipped with vector quantization (VQ) or variational information bottlenecks.
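To make the quantization step concrete, the following minimal sketch shows nearest-codebook assignment, the core operation of a VQ bottleneck. The codebook values, the sample IDM output, and the function name `quantize` are illustrative assumptions and are not taken from the cited implementations.

```python
def quantize(z, codebook):
    """Return (index, code) of the codebook vector nearest to z (squared L2)."""
    def sq_dist(u, v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

    index = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return index, codebook[index]


# Hypothetical 2-D codebook holding four latent actions.
codebook = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]

# Made-up continuous IDM output for a single transition.
index, code = quantize((0.8, 0.3), codebook)
print(index, code)
```

The discrete index plays the role of the latent action; in a trained model the codebook entries are learned jointly with the encoder rather than fixed by hand.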
The LAPO paradigm imposes three formal desiderata (Lachapelle, 1 Oct 2025):
- Determinism: There exists a function $g$ mapping state/action pairs to a latent code such that the code assigned to a transition satisfies $\hat{a} = g(s, a)$.
- Disentanglement: $g$ is independent of $s$ (i.e., $g(s, a) = h(a)$ for some function $h$), so that the latent code is a function of the action alone.
- Informativeness: The map $h$ is injective, ensuring a one-to-one correspondence between true actions and latent codes.
When these are met, the latent representation captures all action-relevant dynamics; policy learning can proceed over these pseudo-labels and then be mapped to true actions via a compact classifier or regression head using minimal supervision (Lachapelle, 1 Oct 2025, Schmidt et al., 2023).
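The desiderata above can be checked mechanically on a small tabular example. The sketch below assumes the latent assignment is given as a dictionary from (state, action) pairs to codes, so determinism holds by construction; the function name `check_desiderata` and the toy tables are hypothetical.

```python
def check_desiderata(latent_of):
    """latent_of maps (state, action) -> latent code.

    Determinism holds by construction: each pair has exactly one code.
    Returns (disentangled, informative).
    """
    codes_per_action = {}
    for (_state, action), code in latent_of.items():
        codes_per_action.setdefault(action, set()).add(code)

    # Disentanglement: the code for an action does not vary with the state.
    disentangled = all(len(codes) == 1 for codes in codes_per_action.values())

    # Informativeness (injectivity) is only evaluated once disentanglement
    # holds; it is reported as False otherwise.
    informative = False
    if disentangled:
        codes = [next(iter(c)) for c in codes_per_action.values()]
        informative = len(set(codes)) == len(codes)
    return disentangled, informative


good = {("s0", "left"): 0, ("s1", "left"): 0,
        ("s0", "right"): 1, ("s1", "right"): 1}
entangled = {("s0", "left"): 0, ("s1", "left"): 1,
             ("s0", "right"): 1, ("s1", "right"): 0}
collapsed = {key: 0 for key in good}

for name, table in [("good", good), ("entangled", entangled), ("collapsed", collapsed)]:
    print(name, check_desiderata(table))
```

The `entangled` table still predicts the next state perfectly in each state, which illustrates why predictive accuracy alone does not imply an action-aligned latent.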
2. The LAPO Objective and Theoretical Guarantees
The core unsupervised LAPO training objective for the IDM/forward model pair is entropy-regularized next-state reconstruction (Lachapelle, 1 Oct 2025):
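$$\min_{q,\,\hat{f}} \; \mathbb{E}_{(s, s')}\Big[\, \mathbb{E}_{\hat{a} \sim q(\cdot \mid s, s')} \big\| s' - \hat{f}(s, \hat{a}) \big\|^2 + \beta\, H\big(q(\cdot \mid s, s')\big) \Big]$$
Here $q(\hat{a} \mid s, s')$ is the IDM encoder, $\hat{f}(s, \hat{a})$ is the forward model, and $\beta > 0$ weights the entropy term. The entropy penalty $\beta H(q(\cdot \mid s, s'))$ forces the encoder $q$ to produce nearly deterministic (one-hot) latent codes in high-density regions of the data, promoting discrete, disentangled representations (empirically realized by reparameterization or Gumbel-Softmax/VQ bottlenecks). The model class assumptions (continuity, injectivity, topological overlap) guarantee that, at global minima, the latent codes recover a minimal, invertible, and action-aligned representation (Lachapelle, 1 Oct 2025).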
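As a rough illustration of entropy-regularized next-state reconstruction, the sketch below evaluates such a loss on a toy one-dimensional problem with two latent actions. The squared-error reconstruction term, the weight `beta`, and the hand-written `forward`, `sharp_encoder`, and `soft_encoder` functions are assumptions made for the example, not the cited paper's implementation.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def lapo_loss(transitions, encoder, forward, beta):
    """Mean over transitions of expected squared error plus beta * entropy."""
    total = 0.0
    for s, s_next in transitions:
        probs = encoder(s, s_next)  # distribution over latent codes
        recon = sum(p * (s_next - forward(s, k)) ** 2
                    for k, p in enumerate(probs))
        total += recon + beta * entropy(probs)
    return total / len(transitions)


def forward(s, k):
    # Hypothetical forward model: code 0 moves +1, code 1 moves -1.
    return s + (1.0 if k == 0 else -1.0)


def sharp_encoder(s, s_next):
    # One-hot code that matches the direction of the transition.
    return [1.0, 0.0] if s_next > s else [0.0, 1.0]


def soft_encoder(s, s_next):
    # Uninformative encoder: ignores the transition entirely.
    return [0.5, 0.5]


transitions = [(0.0, 1.0), (2.0, 1.0)]
print(lapo_loss(transitions, sharp_encoder, forward, beta=0.1))
print(lapo_loss(transitions, soft_encoder, forward, beta=0.1))
```

The one-hot encoder incurs neither reconstruction error nor entropy cost, while the uniform encoder pays for both, which is the pressure toward deterministic codes described above.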
In practice, related objectives employ a variational ELBO over latent actions using both an IDM and a VAE-style forward model (FDM) (Schmidt et al., 2023, Klepach et al., 13 Feb 2025):
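$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{\hat{a} \sim q_\phi(\hat{a} \mid o_t, o_{t+1})}\big[\log p_\theta(o_{t+1} \mid o_t, \hat{a})\big] - D_{\mathrm{KL}}\big(q_\phi(\hat{a} \mid o_t, o_{t+1}) \,\|\, p(\hat{a})\big)$$
where $\hat{a}$ is the latent action, $q_\phi$ is the IDM posterior, $p(\hat{a})$ is a prior (often standard normal or categorical), and the FDM $p_\theta$ predicts future observations given $\hat{a}$ and current observations.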
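For a categorical latent action, a variational bound of this kind can be computed in closed form. The sketch below assumes the standard ELBO (expected log-likelihood minus KL divergence to the prior); the per-code log-likelihoods are made-up numbers standing in for the FDM output, and the function names are hypothetical.

```python
import math


def kl_categorical(q, p):
    """KL(q || p) for two categorical distributions over the same codes."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def elbo(q, log_liks, prior):
    """Expected log-likelihood under q minus KL(q || prior).

    q        : IDM posterior over latent codes for one transition
    log_liks : log p(o_next | o, code) for each code (from the FDM)
    prior    : prior distribution over latent codes
    """
    expected_ll = sum(qi * ll for qi, ll in zip(q, log_liks))
    return expected_ll - kl_categorical(q, prior)


posterior = [0.9, 0.1]   # assumed IDM output for one transition
log_liks = [-0.1, -3.0]  # assumed FDM log-likelihood for each code
prior = [0.5, 0.5]       # uniform categorical prior

print(elbo(posterior, log_liks, prior))
```

Maximizing this quantity rewards posteriors that concentrate on codes with high predictive likelihood, while the KL term limits how far the posterior can move from the prior.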
3. LAPO Training Pipeline: Unsupervised Pretraining, Policy Learning, and Decoding
LAPO learning proceeds in three modular phases (Schmidt et al., 2023, Klepach et al., 13 Feb 2025, Nikulin et al., 1 Feb 2025):
- Unsupervised Latent-Action Pretraining: Jointly train IDM and FDM to encode transitions into a discrete/action-like latent, using data-driven losses described above. In some variants, vector quantization or Gaussian bottlenecks are used to force codebook assignments and minimize entropy (Schmidt et al., 2023, Klepach et al., 13 Feb 2025).
- Policy Learning in Latent Space: After freezing the IDM/FDM, each transition is pseudo-labeled with its inferred latent code. A policy (e.g., behavior cloning of $\pi(\hat{a} \mid s)$ or $\pi(\hat{a} \mid o_t)$) is trained to predict these codes from the current state or observation. Latent policy optimization can proceed offline at scale.
- Decoding to True Actions and RL Integration: Mapping latent codes back to environment actuators is accomplished via supervised regression/classification (using a small set of labeled transitions) or, increasingly, by further RL/IL fine-tuning in latent or joint latent-action space (Schmidt et al., 2023, Chen et al., 30 Apr 2026). The decoder head that maps latent codes to true actions is typically small.
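The three phases above can be sketched on a toy tabular problem. This is a minimal illustration under strong assumptions: the pretrained IDM is replaced by a hand-written stand-in (`toy_idm`), the latent policy is fit by majority vote rather than gradient-based behavior cloning, and all names and data are hypothetical.

```python
from collections import Counter, defaultdict


def toy_idm(s, s_next):
    # Stand-in for a frozen, pretrained IDM: code 0 = "up", code 1 = "down".
    return 0 if s_next > s else 1


def pseudo_label(transitions, idm):
    """Phase 2a: attach an inferred latent code to every unlabeled transition."""
    return [(s, idm(s, s_next)) for s, s_next in transitions]


def fit_latent_policy(labeled):
    """Phase 2b: tabular behavior cloning, most frequent code per state."""
    counts = defaultdict(Counter)
    for s, code in labeled:
        counts[s][code] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def fit_decoder(labeled_transitions, idm):
    """Phase 3: map each latent code to the true action it co-occurs with most."""
    votes = defaultdict(Counter)
    for s, action, s_next in labeled_transitions:
        votes[idm(s, s_next)][action] += 1
    return {code: c.most_common(1)[0][0] for code, c in votes.items()}


# Action-free data (e.g., extracted from video) and two labeled transitions.
unlabeled = [(0, 1), (1, 2), (2, 1), (0, 1), (2, 1)]
labeled = [(0, "right", 1), (2, "left", 1)]

latent_policy = fit_latent_policy(pseudo_label(unlabeled, toy_idm))
decoder = fit_decoder(labeled, toy_idm)


def act(s):
    return decoder[latent_policy[s]]


print([act(s) for s in (0, 1, 2)])
```

Only the decoder consumes action labels, and it needs at least one labeled example per latent code; everything upstream is learned from action-free transitions.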
This approach is extensible to various RL paradigms. In classic offline RL, latent-variable advantage-weighted extensions of behavior cloning, such as Latent-Variable Advantage-Weighted Policy Optimization (LAPO) (Chen et al., 2022), incorporate advantage-weighted ELBOs and KL regularization between encoded posteriors and priors to manage multi-modal demonstration data and prevent overfitting.
4. LAPO in Vision-Language-Action (VLA) and Sequence Models
Recent innovations extend LAPO to complex sequence architectures, notably for vision-language-action models (Chen et al., 30 Apr 2026). In such settings, LAPO is formulated to optimize over autoregressively sampled latent “thought” tokens (reasoning trajectory) and subsequent action-tokens, with joint policy gradients covering both latent and action spaces.
The training objective in LaST-R1 utilizes a clipped PPO-style surrogate, jointly over both action tokens and latent tokens:
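$$\mathcal{J}(\theta) = \mathbb{E}\Big[\frac{1}{|\mathcal{T}|} \sum_{k \in \mathcal{T}} \min\big(r_k(\theta)\, \mathrm{Adv},\ \operatorname{clip}(r_k(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \mathrm{Adv}\big)\Big], \qquad r_k(\theta) = \frac{\pi_\theta(y_k \mid y_{<k}, x)}{\pi_{\theta_{\mathrm{old}}}(y_k \mid y_{<k}, x)}$$
shown here in the standard clipped form, where $\mathcal{T}$ indexes both latent and action tokens $y_k$, $x$ is the input context, $\mathrm{Adv}$ is the advantage estimate, and $\epsilon$ is the clipping range. The objective is applied with adaptive latent chain-of-thought length: the mechanism learns not only what action to take, but also when to terminate the reasoning process (early exit), balancing inference speed against reasoning horizon. This joint optimization of “thinking” and “acting” enhances policy efficiency and generalization, outperforming action-only policy optimization in VLA models (Chen et al., 30 Apr 2026).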
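A clipped PPO-style surrogate applied jointly to latent and action tokens can be sketched as follows. This is the generic clipped objective, not LaST-R1's exact loss: the shared sequence-level advantage, the token-mean reduction, the clip range `eps`, and the log-probability values are all assumptions made for illustration.

```python
import math


def clipped_surrogate(new_logps, old_logps, advantage, eps=0.2):
    """Mean clipped PPO surrogate over a sequence of tokens.

    new_logps / old_logps : per-token log-probabilities under the current
                            and the behavior (old) policy
    advantage             : one advantage estimate shared by all tokens
    """
    terms = []
    for new_lp, old_lp in zip(new_logps, old_logps):
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms)


# Hypothetical rollout: two latent "thought" tokens followed by three action tokens.
old_latent, old_action = [-1.2, -0.8], [-0.5, -0.7, -0.3]
new_latent, new_action = [-1.0, -0.9], [-0.4, -0.7, -0.2]

# Joint objective: latent and action tokens enter the same surrogate.
joint = clipped_surrogate(new_latent + new_action,
                          old_latent + old_action,
                          advantage=1.0)
print(joint)
```

Dropping the latent tokens from the concatenation would recover an action-only objective, which is the baseline the joint formulation is compared against.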
5. Robustness, Object-Centricity, and Supervision with Distractors
Baseline LAPO assumes that observed transitions are explained primarily by controllable dynamics. In practice, visual distractors or environment-induced confounding factors can cause the latent actions to encode irrelevant information, degrading downstream performance (Klepach et al., 13 Feb 2025, Nikulin et al., 1 Feb 2025). Object-Centric LAPO addresses this by incorporating self-supervised object-centric pretraining (e.g., VideoSAUR), feeding only task-relevant slot representations to the latent action modules (Klepach et al., 13 Feb 2025). This yields substantial improvements in proxy-label quality and downstream performance: masked or slot-based inputs reduce linear-probe MSE by a factor of two to four and increase behavior cloning performance by up to 2.6× compared to standard LAPO in distracted domains.
A critical empirical finding is that, with action-correlated distractors, standard unsupervised LAPO no longer reliably recovers true actions. Modifications such as LAOM (which drops quantization, uses multi-step consistency losses, and introduces data augmentations) substantially improve robustness (Nikulin et al., 1 Feb 2025). Supplying ground-truth action labels for even a small fraction of the data (2.5%) during pretraining with LAOM improves downstream performance by up to 4.2×, indicating that semi-supervised latent action learning is essential in the presence of real-world noise and distractors.
6. Statistical Benefits, Sample Efficiency, and Limitations
Theoretical and empirical evidence demonstrates that LAPO’s pseudo-label-based workflow provides a sharp reduction in sample complexity for behavior cloning: massive amounts of unlabeled video can be converted into (s, latent, s′) datasets, with only a small number of labeled transitions sufficient to ground the latent-to-true action mapping (Lachapelle, 1 Oct 2025, Schmidt et al., 2023). When the identified latent code satisfies determinism, disentanglement, and informativeness, downstream RL, behavior cloning, or hybrid algorithms exhibit rapid convergence and improved robustness to out-of-distribution (OOD) conditions (Lachapelle, 1 Oct 2025, Schmidt et al., 2023, Chen et al., 30 Apr 2026).
Limitations include:
- Sensitivity to distractors: Action-correlated visual or background changes corrupt latent action learning unless object-centric or regularized extensions are employed (Klepach et al., 13 Feb 2025, Nikulin et al., 1 Feb 2025).
- Requirement for manual or semi-supervised slot selection in object-centric methods (Klepach et al., 13 Feb 2025).
- Multi-stage pipelines: Most approaches are non-end-to-end, with separate pretraining and policy learning/fine-tuning steps.
- The need for minimal but nonzero action supervision in complex, noisy domains (Nikulin et al., 1 Feb 2025).
7. Empirical Benchmarks and Notable Results
- On the Procgen benchmark, LAPO (unsupervised pretraining + latent-action policy + minimal action supervision or RL) exceeds PPO-from-scratch by over 2×, reaching or surpassing expert performance in most environments with only a fraction of the required frames (Schmidt et al., 2023).
- In the Distracting Control Suite and Distracting MetaWorld, object-centric LAPO reduces latent action MSE by a factor of two to four and improves sample efficiency in downstream policy finetuning by 2–7× (Klepach et al., 13 Feb 2025).
- In VLA robotic manipulation (LaST-R1+LAPO), the combination of joint latent/action RL and adaptive CoT yields 99.8% average success on the LIBERO benchmark and up to 44% absolute improvement in real-world dual-arm tasks after RL post-training, with only 8% generalization degradation on OOD objects/backgrounds (Chen et al., 30 Apr 2026).
- Latent-variable AWR-style LAPO achieves a 49% improvement over the next-best offline RL method on highly heterogeneous datasets (Chen et al., 2022).
References
- "On the Identifiability of Latent Action Policies" (Lachapelle, 1 Oct 2025)
- "Learning to Act without Actions" (Schmidt et al., 2023)
- "Object-Centric Latent Action Learning" (Klepach et al., 13 Feb 2025)
- "Latent Action Learning Requires Supervision in the Presence of Distractors" (Nikulin et al., 1 Feb 2025)
- "LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models" (Chen et al., 30 Apr 2026)
- "Latent-Variable Advantage-Weighted Policy Optimization for Offline RL" (Chen et al., 2022)
- "LAVA: Latent Action Spaces via Variational Auto-encoding for Dialogue Policy Optimization" (Lubis et al., 2020)