- The paper introduces Residual Latent Action (RLA) to encode visual state transitions compactly, overcoming inefficiencies in pixel and direct feature regression models.
- It presents the RLA World Model (RLA-WM) which uses an attention-based autoencoder and flow matching, achieving superior fidelity and computational efficiency over state-of-the-art methods.
- Empirical studies show that RLA-WM improves visual prediction quality and policy success rates in robot learning, while reducing FLOPs by two to three orders of magnitude.
Residual Latent Action World Models: High-Fidelity, Efficient Feature-Based Dynamics Learning
Introduction
This work introduces a novel approach for visual world modeling that explicitly addresses inefficiencies and accuracy limitations inherent in both pixel-space video generation models and prior direct-regression feature-space models. The paper proposes the Residual Latent Action (RLA), a concise latent action representation derived from DINO token residuals, and leverages RLA in a new world model architecture, the RLA World Model (RLA-WM). Empirical results demonstrate that RLA-WM surpasses state-of-the-art alternatives, including both feature regression and video-diffusion methods, in fidelity, generalization, and computational efficiency. Theoretical and practical implications center on unlocking visual model-based policy learning from purely offline videos, including in the absence of action supervision.
Background: Feature-Based World Models and Limitations
Visual world models typically learn state transition dynamics from image observations and action sequences. Pixel-level generative models (e.g., diffusion or VQ-VAE in video space) are powerful but computationally prohibitive and prone to hallucinated transitions, constraining their viability for closed-loop robot learning and control applications. Feature-based models instead predict transitions in pre-trained visual representation spaces, such as DINO tokens, exploiting the structured, information-rich nature of features distilled via self-supervised learning or joint-embedding architectures.
DINO-WM and related approaches have shown that direct regression on DINO tokens produces fast, effective models for simple 2D manipulation. However, in complex 3D settings, these methods degrade, with regressed features becoming blurry or mode-collapsed when tasked with modeling multimodal, high-variance dynamics. When generative modeling is attempted in feature space (e.g., via diffusion or flow matching), the immense dimensionality of semantic representations (e.g., for DINOv3-L, sequence length × channel count ≫ pixel count) results in severe inefficiency and convergence difficulties.
Residual Latent Action: Derivation and Properties
The authors' central insight is that valid physical transitions reside on a low-dimensional manifold, even within a high-dimensional feature space. They formalize the transition between two DINO token states, $s_t$ and $s_{t+h}$, as a feature-space residual, which they encode into a compact vector z using an attention-based autoencoder. This Residual Latent Action (RLA) serves as a direct, invertible representation of the dynamics required to map from $s_t$ to $s_{t+h}$.
Key properties observed and empirically validated for RLA include:
- Predictive sufficiency: The RLA, jointly with the current state $s_t$, enables a decoder to reconstruct the future state $s_{t+h}$ with high fidelity in a single feedforward pass, in contrast to prior approaches where latent variables function only as weak conditioning signals for iterative generative processes.
- Generalizability: The learned latent space encodes physically coherent dynamics and generalizes to diverse objects, scenes, and embodiment variations, despite being trained across limited data distributions.
- Temporal topology: The latent space is organized such that interpolation between a random code and a true RLA produces temporally intermediate states that correspond to plausible visual interpolants, indicating a smooth, physically meaningful embedding.
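The residual and the interpolation property listed above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the attention-based autoencoder is omitted, tokens are plain lists of floats, and the helper names `token_residual` and `interpolate_codes` are hypothetical. The sketch also assumes a linear blend between codes, which the summary does not specify.

```python
from typing import List

Vector = List[float]
Tokens = List[Vector]  # one feature vector per visual token


def token_residual(s_t: Tokens, s_future: Tokens) -> Tokens:
    """Feature-space residual s_{t+h} - s_t, computed token by token."""
    return [
        [fut - cur for cur, fut in zip(cur_tok, fut_tok)]
        for cur_tok, fut_tok in zip(s_t, s_future)
    ]


def interpolate_codes(z_start: Vector, z_rla: Vector, alpha: float) -> Vector:
    """Linear blend of two codes; alpha=0 gives z_start, alpha=1 gives z_rla."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_start, z_rla)]


# Toy example: 2 tokens with 3 channels each (real DINO tokens are far larger).
s_t = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
s_future = [[0.5, 1.0, 1.0], [1.0, 2.0, 1.0]]
residual = token_residual(s_t, s_future)

# Hypothetical codes; in the paper, z would come from the learned encoder.
z_random = [0.0, 0.0]
z_true = [1.0, -2.0]
z_mid = interpolate_codes(z_random, z_true, 0.5)
```

Under the temporal-topology property, decoding a code such as `z_mid` together with the current state would be expected to give a state partway through the transition.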
RLA World Model (RLA-WM): Architecture and Learning
The RLA-WM leverages the compact structure of RLA to implement efficient and accurate feature-based dynamics modeling. Rather than directly predicting high-dimensional future DINO tokens, the model predicts the low-dimensional RLA code z from the current state $s_t$ and the future action sequence $a_{t:t+h}$. Flow matching is performed within the RLA space: sampling starts from Gaussian noise and proceeds through ODE integration steps driven by the action-conditioned latent velocity field.
The predicted RLA code is then decoded jointly with the current state $s_t$ to recover the future representation $s_{t+h}$. The latent dynamics network and decoder use self-attention and linear projections throughout, significantly reducing computational expense compared to diffusion-based models. Model supervision is provided by mean-squared error in RLA space, without requiring image-space reconstruction losses or auxiliary tasks.
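The sampling procedure described above (Gaussian initialization followed by ODE integration of a velocity field) might look like the following Euler integrator. This is a sketch under stated assumptions: the name `sample_rla`, the step count, and the use of plain Euler steps are illustrative choices, and `oracle_velocity` is a hand-written stand-in for the learned, action-conditioned network, which is not reproduced here.

```python
import random
from typing import Callable, List

Vector = List[float]


def sample_rla(
    velocity_fn: Callable[[Vector, float], Vector],
    dim: int,
    num_steps: int = 10,
    seed: int = 0,
) -> Vector:
    """Integrate dz/dtau = velocity_fn(z, tau) from tau=0 to tau=1 with Euler steps."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian noise initialization
    dt = 1.0 / num_steps
    for k in range(num_steps):
        tau = k * dt  # always strictly below 1
        v = velocity_fn(z, tau)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


# Stand-in for the learned velocity network (illustration only): the exact
# velocity of a straight-line path that ends at a known target code.
target = [0.3, -1.2, 0.7]


def oracle_velocity(z: Vector, tau: float) -> Vector:
    return [(t - zi) / (1.0 - tau) for t, zi in zip(target, z)]


z_pred = sample_rla(oracle_velocity, dim=len(target), num_steps=10)
```

With the oracle field the integrator lands on `target` up to floating-point error. In the actual model, the velocity would be predicted from the current state and the action sequence, and the resulting code would then be passed to the decoder.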
Empirical results on ManiSkill and IWS benchmarks substantiate:
- RLA-WM achieves superior LPIPS, SSIM, and DINO-token L1 scores on both simulated and real-world manipulation tasks, outperforming direct regression (DINO-WM), video-diffusion (Vid2World), and generative feature-space baselines (RAE, FM-WM).
- RLA-WM operates at two to three orders of magnitude lower FLOPs than video-diffusion alternatives, with only a minor overhead compared to direct regression, despite dramatically higher fidelity and long-horizon accuracy.
- Qualitative visualizations of predicted rollouts show no evident hallucination or physical inconsistency.
Implications for Robot Policy Learning
Two downstream applications of RLA-WM are proposed, both extending beyond capabilities unlocked by prior work:
Minimalist World Action Model from Actionless Video
The architecture integrates a linear branch into a standard ResNet-18 behavior cloning policy to predict RLAs given the current observation. This formulation enables learning from actionless demonstrations: RLA regression provides the supervisory signal for trajectories without action labels, while action labels, when present, are used for supervised action prediction. Across the studied tasks, policies trained with RLA-augmented objectives yield notably higher success rates (roughly +8 to +12%) than action-only or state-regression baselines, and outperform competitive latent action methods (UniVLA, AdaWorld, DINO-CLS).
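One way to read this training objective is as a per-sample loss with an RLA term that is always active and an action term that is added only when a label exists. The sketch below assumes mean-squared error for both terms and an `rla_weight` hyperparameter. Neither is confirmed by the summary, and the function names are hypothetical.

```python
from typing import List, Optional

Vector = List[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean-squared error between two equally sized vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def policy_loss(
    pred_action: Vector,
    pred_rla: Vector,
    target_rla: Vector,
    target_action: Optional[Vector] = None,
    rla_weight: float = 1.0,  # assumed weighting, not taken from the paper
) -> float:
    """RLA regression on every sample; action regression only when labeled."""
    loss = rla_weight * mse(pred_rla, target_rla)
    if target_action is not None:
        loss += mse(pred_action, target_action)
    return loss


# Labeled demonstration: both terms contribute.
labeled = policy_loss([0.0, 1.0], [1.0, 1.0], [1.0, 3.0], target_action=[0.0, 0.0])
# Actionless video: only the RLA term contributes.
actionless = policy_loss([0.0, 1.0], [1.0, 1.0], [1.0, 3.0])
```

Here `labeled` evaluates to 2.5 and `actionless` to 2.0, so an actionless clip still produces a training signal through the RLA branch.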
Visual RL Entirely within Learned World Models
A Proximal Policy Optimization (PPO) agent is trained entirely inside the learned RLA-WM. The reward function is defined as the negative DINO-token L1 distance to either a time-synchronized reference state or a terminal goal state from the offline dataset, eliminating any need for online interactions, simulator access, or handcrafted reward shaping. Over large-scale evaluations, world-model RL (WMRL) with RLA-WM consistently outperforms pure behavior cloning by an average of +1.1%, and produces optimal checkpoints reliably across seeds and rollouts, except in isolated task/robot combinations where data or embodiment differences dominate.
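The reward described above can be sketched as a negative L1 distance between predicted and reference token features. Averaging over all elements (rather than summing) and the function name `dino_l1_reward` are assumptions made for illustration. Real DINO tokens would be large tensors rather than nested lists.

```python
from typing import List

Tokens = List[List[float]]  # one feature vector per visual token


def dino_l1_reward(pred_tokens: Tokens, ref_tokens: Tokens) -> float:
    """Negative mean absolute difference between predicted and reference tokens."""
    total = 0.0
    count = 0
    for pred_tok, ref_tok in zip(pred_tokens, ref_tokens):
        for p, r in zip(pred_tok, ref_tok):
            total += abs(p - r)
            count += 1
    return -(total / count)


# Toy tokens: two tokens with two channels each.
predicted = [[1.0, 2.0], [3.0, 4.0]]
reference = [[1.0, 3.0], [2.0, 4.0]]
reward = dino_l1_reward(predicted, reference)
```

For these toy tokens the reward is -0.5, and it approaches zero as the predicted features approach the reference or goal features.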
Theoretical and Practical Implications
This work advances the argument for structured, compact, feature-based transition modeling in high-dimensional visual domains. By shifting the predictive target from absolute state features to latent residual codes, RLA-WM avoids the regression-to-the-mean degeneracies that hinder direct feature-space forecasting, and circumvents the inefficiency of explicit pixel or diffusion-based generative architectures. The tight integration of physics-informed representation (encoding only the meaningful dynamics) with data-efficient, reward-free policy optimization situates RLA-WM as a promising backbone for scalable, generalizable, and resource-efficient visuomotor agents.
The principal theoretical implication is that explicit modeling of state residuals (akin to velocity or displacement in physical systems) in feature space, rather than predicting absolute state, significantly improves feature-based world model expressivity and compressibility. Practically, decoupling world modeling from action supervision and simulator interaction paves the way for leveraging vast, unlabelled video corpora for robot learning.
Limitations and Future Directions
The current formulation is subject to several limitations:
- RLA encoding can be compromised by task-irrelevant visual background motion or occlusions, motivating the need for view-invariant modeling (potentially by projecting features into 3D space) and the incorporation of memory across larger frame sequences.
- The RLA-WM currently only models visual transitions, not full proprioceptive state evolution, which may limit applicability in domains where joint space modeling is essential.
- Benchmarked datasets are relatively small and simulated; extension to Internet-scale, real-world video data and complex scene diversity remains necessary for open-domain generalization.
Conclusion
The paper proposes Residual Latent Action as a concise, predictive encoding of visual state transitions, and leverages it within the RLA World Model to set new standards in visual feature-based world modeling. Strong quantitative results and practical demonstrations in policy optimization and imitation from actionless video highlight the architectural advantages. Addressing the outlined limitations, especially through view-invariant and memory-augmented extensions, may position RLA-class models as key components in next-generation scalable, sample-efficient, and robust visuomotor learning frameworks.
Reference:
"Learning Visual Feature-Based World Models via Residual Latent Action" (2605.07079)