Learning Visual Feature-Based World Models via Residual Latent Action

Published 8 May 2026 in cs.CV, cs.AI, cs.LG, and cs.RO | (2605.07079v1)

Abstract: World models predict future transitions from observations and actions. Existing works predominantly focus on image generation only. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces still remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as Residual Latent Action (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose RLA World Model (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm


Authors (6)

Summary

The paper introduces Residual Latent Action (RLA) to encode visual state transitions compactly, overcoming inefficiencies in pixel and direct feature regression models.
It presents the RLA World Model (RLA-WM), which uses an attention-based autoencoder and flow matching to achieve higher fidelity and computational efficiency than state-of-the-art methods.
Empirical studies show that RLA-WM improves visual prediction quality and policy success rates in robot learning, while reducing FLOPs by two to three orders of magnitude.

Residual Latent Action World Models: High-Fidelity, Efficient Feature-Based Dynamics Learning

Introduction

This work introduces a novel approach for visual world modeling that explicitly addresses inefficiencies and accuracy limitations inherent in both pixel-space video generation models and prior direct-regression feature-space models. The paper proposes the Residual Latent Action (RLA), a concise latent action representation derived from DINO token residuals, and leverages RLA in a new world model architecture, the RLA World Model (RLA-WM). Empirical results demonstrate that RLA-WM surpasses state-of-the-art alternatives, including both feature regression and video-diffusion methods, in fidelity, generalization, and computational efficiency. Theoretical and practical implications center on unlocking visual model-based policy learning from purely offline videos, including in the absence of action supervision.

Background: Feature-Based World Models and Limitations

Visual world models typically learn state transition dynamics from image observations and action sequences. Pixel-level generative models (e.g., diffusion or VQ-VAE in video space) are powerful but computationally prohibitive and prone to hallucinated transitions, constraining their viability for closed-loop robot learning and control applications. Feature-based models predict transitions in pre-trained visual representation spaces—such as DINO tokens—exploiting the structured, information-rich nature of features distilled via self-supervised learning or joint-embedding architectures.

DINO-WM and related approaches have shown that direct regression on DINO tokens produces fast, effective models for simple 2D manipulation. However, in complex 3D settings, these methods degrade, with regressed features becoming blurry or mode-collapsed when tasked with modeling multimodal, high-variance dynamics. When generative modeling is attempted in feature space (e.g., via diffusion or flow matching), the immense dimensionality of semantic representations (e.g., DINOv3-L yields sequence length × channel count ≫ pixel count) results in severe inefficiency and convergence difficulties.

Residual Latent Action: Derivation and Properties

The authors' central insight is that valid physical transitions reside on a low-dimensional manifold, even within a high-dimensional feature space. They formalize the transition between two DINO token states, $s_t$ and $s_{t+h}$, as a feature-space residual, which they encode into a compact vector $z$ using an attention-based autoencoder. This Residual Latent Action (RLA) serves as a direct, invertible representation of the dynamics required to map from $s_t$ to $s_{t+h}$.
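The encode/decode idea above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the paper's architecture: the class name `ResidualLatentAutoencoder`, the single-head cross-attention layout, the number of latent tokens, and all dimensions are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class ResidualLatentAutoencoder:
    """Toy, untrained cross-attention autoencoder over token residuals.

    Hypothetical layout: a few latent queries pool the residual tokens
    into a compact code z; the decoder lets current-state tokens attend
    to z to rebuild a per-token residual.
    """

    def __init__(self, dim, num_latents=4, latent_dim=8, seed=0):
        rng = np.random.default_rng(seed)

        def init(rows, cols):
            return rng.normal(size=(rows, cols)) / np.sqrt(rows)

        self.latent_queries = init(num_latents, dim)  # (K, D)
        self.enc_key = init(dim, dim)                 # (D, D)
        self.enc_value = init(dim, latent_dim)        # (D, d_z)
        self.dec_query = init(dim, latent_dim)        # (D, d_z)
        self.dec_value = init(latent_dim, dim)        # (d_z, D)

    def encode(self, s_t, s_future):
        residual = s_future - s_t                                   # (N, D)
        scores = self.latent_queries @ (residual @ self.enc_key).T  # (K, N)
        attn = softmax(scores / np.sqrt(residual.shape[-1]))
        return attn @ (residual @ self.enc_value)                   # (K, d_z)

    def decode(self, s_t, z):
        scores = (s_t @ self.dec_query) @ z.T                       # (N, K)
        attn = softmax(scores / np.sqrt(z.shape[-1]))
        residual_hat = attn @ (z @ self.dec_value)                  # (N, D)
        return s_t + residual_hat  # predicted future tokens

# Toy stand-ins for DINO tokens: 196 tokens with 32 channels each.
rng = np.random.default_rng(1)
s_t = rng.normal(size=(196, 32))
s_future = s_t + 0.1 * rng.normal(size=(196, 32))

model = ResidualLatentAutoencoder(dim=32)
z = model.encode(s_t, s_future)
s_hat = model.decode(s_t, z)
print(z.shape, s_hat.shape)
```

In this toy layout the code `z` holds far fewer numbers than the token grid it summarizes, and a zero residual maps to a zero code that decodes back to `s_t`. A trained model would fit these weights with a reconstruction loss on the future tokens.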

Key properties observed and empirically validated for RLA include:

Predictive sufficiency: The RLA, jointly with the current state $s_t$, enables a decoder to reconstruct the future state $s_{t+h}$ with high fidelity in a single feedforward pass, in contrast to prior approaches where latent variables function only as weak conditioning signals for iterative generative processes.
Generalizability: The learned latent space encodes physically coherent dynamics and generalizes to diverse objects, scenes, and embodiment variations, despite being trained across limited data distributions.
Temporal topology: The latent space is organized such that interpolation between a random code and a true RLA produces temporally intermediate states that correspond to plausible visual interpolants, indicating a smooth, physically meaningful embedding.
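The interpolation probe described under temporal topology can be illustrated with a short helper. This is a sketch under assumptions: the function name `interpolation_path` and the linear blending schedule are hypothetical choices, and in practice each interpolated code would be passed through the RLA decoder together with $s_t$ to inspect the resulting state.

```python
import numpy as np

def interpolation_path(z_random, z_true, num_points=5):
    """Linearly blend from a random code to a true RLA code.

    Returns a list of codes; alpha = 0 gives z_random, alpha = 1 gives z_true.
    """
    alphas = np.linspace(0.0, 1.0, num_points)
    return [(1.0 - a) * z_random + a * z_true for a in alphas]

rng = np.random.default_rng(0)
z_random = rng.normal(size=(4, 8))  # random code (toy shape)
z_true = rng.normal(size=(4, 8))    # stand-in for an encoded RLA
path = interpolation_path(z_random, z_true, num_points=5)
```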

RLA World Model (RLA-WM): Architecture and Learning

The RLA-WM leverages the compact structure of RLA to implement efficient and accurate feature-based dynamics modeling. Rather than directly predicting high-dimensional future DINO tokens, the model predicts the low-dimensional RLA code $z$ from the current state $s_t$ and the future action sequence $a_{t:t+h}$. Flow matching is performed within the RLA space: sampling starts from Gaussian noise and integrates an ODE whose velocity field is given by an action-conditioned latent dynamics network.

The predicted RLA code is then concatenated with $s_t$ and decoded to recover the future representation $\hat{s}_{t+h}$. The latent dynamics network and decoder use self-attention and linear projections throughout, significantly reducing computational expense compared to diffusion-based models. Model supervision is provided by mean-squared error in RLA, without requiring image-space reconstruction losses or auxiliary tasks.
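The sampling procedure described above can be sketched as follows. This is a minimal illustration that assumes the common linear-path (rectified) form of flow matching and a plain Euler integrator; the summary does not specify the paper's exact path, solver, or network, so `flow_matching_pair`, `sample_rla`, and the `velocity_fn` signature are hypothetical names.

```python
import numpy as np

def flow_matching_pair(z_true, rng):
    """Build one training example for linear-path flow matching.

    A network v(z_tau, tau, s_t, actions) would be trained with a
    mean-squared error against target_velocity (assumed objective).
    """
    z_noise = rng.normal(size=z_true.shape)
    tau = rng.uniform()
    z_tau = (1.0 - tau) * z_noise + tau * z_true  # point on the straight path
    target_velocity = z_true - z_noise            # constant along that path
    return z_tau, tau, target_velocity

def sample_rla(velocity_fn, s_t, actions, latent_shape, num_steps=10, rng=None):
    """Integrate dz/dtau = v(z, tau, s_t, actions) from tau = 0 to 1 with Euler steps."""
    if rng is None:
        rng = np.random.default_rng(0)
    z = rng.normal(size=latent_shape)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        z = z + dt * velocity_fn(z, k * dt, s_t, actions)
    return z

# Sanity check with an oracle field that points at a single known code.
z_star = np.array([1.0, -2.0, 0.5])

def oracle_velocity(z, tau, s_t, actions):
    return (z_star - z) / (1.0 - tau)

z_sampled = sample_rla(oracle_velocity, None, None, z_star.shape, num_steps=10)
```

With the oracle field the Euler loop lands on `z_star` at the final step. A learned velocity network would replace `oracle_velocity`, and the sampled code would then go through the RLA decoder.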

Empirical results on ManiSkill and IWS benchmarks substantiate:

RLA-WM achieves superior LPIPS, SSIM, and DINO-token L1 scores on both simulated and real-world manipulation tasks, outperforming direct regression (DINO-WM), video-diffusion (Vid2World), and generative feature-space baselines (RAE, FM-WM).
RLA-WM operates at two to three orders of magnitude lower FLOPs than video-diffusion alternatives, with only a minor overhead compared to direct regression, despite dramatically higher fidelity and long-horizon accuracy.
No evidence of hallucination or physical inconsistency is present in predicted rollouts, as confirmed by qualitative visualizations.

Implications for Robot Policy Learning

Two downstream applications of RLA-WM are proposed, both extending beyond capabilities unlocked by prior work:

Minimalist World Action Model from Actionless Video

The architecture integrates a linear branch into a standard ResNet-18 behavior cloning policy to predict RLAs given the current observation. This formulation enables learning from actionless demonstrations: RLA regression provides a supervisory signal even for trajectories without action labels, while actions, when present, are used for supervised action prediction. On all tasks, policies trained with RLA-augmented objectives yield notably higher success rates (roughly +8–12%) than action-only or state regression baselines, and outperform competitive latent action methods (UniVLA, AdaWorld, DINO-CLS) across the studied tasks.
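One way to read the objective described above is as an RLA regression term applied to every sample plus an action term that is masked out for actionless samples. The sketch below assumes mean-squared errors, a boolean `has_action` mask, and a weighting parameter `action_weight`; these names and the exact loss form are illustrative assumptions rather than the paper's definition.

```python
import numpy as np

def predict_rla(features, weight, bias):
    """Linear RLA branch on top of policy backbone features (hypothetical shapes)."""
    return features @ weight + bias

def world_action_loss(pred_rla, target_rla, pred_action, target_action,
                      has_action, action_weight=1.0):
    """RLA regression on all samples, action regression only where labels exist."""
    rla_loss = np.mean((pred_rla - target_rla) ** 2)
    per_sample = np.mean((pred_action - target_action) ** 2, axis=-1)  # (B,)
    # Actionless samples may carry NaN placeholders; zero them out via the mask.
    per_sample = np.where(has_action, per_sample, 0.0)
    action_loss = per_sample.sum() / max(int(has_action.sum()), 1)
    return rla_loss + action_weight * action_loss

# Toy batch: sample 0 has an action label, sample 1 is an actionless video frame.
pred_rla = np.zeros((2, 3))
target_rla = np.ones((2, 3))
pred_action = np.zeros((2, 2))
target_action = np.array([[2.0, 2.0], [np.nan, np.nan]])
has_action = np.array([True, False])
loss = world_action_loss(pred_rla, target_rla, pred_action, target_action, has_action)
```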

Visual RL Entirely within Learned World Models

A Proximal Policy Optimization (PPO) agent is trained entirely inside the learned RLA-WM. The reward function is defined as the negative DINO-token L1 distance to either a time-synchronized reference state or a terminal goal state from the offline dataset, eliminating any need for online interactions, simulator access, or handcrafted reward shaping. Over large-scale evaluations, world-model RL (WMRL) with RLA-WM consistently outperforms pure behavior cloning by an average of +1.1%, and produces optimal checkpoints reliably across seeds and rollouts, except in isolated task/robot combinations where data or embodiment differences dominate.
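The reward described above can be sketched as a small function. The name `video_aligned_reward`, the `mode` argument, the clamping of the step index, and the use of a mean (rather than a sum) over token entries are assumptions made for illustration.

```python
import numpy as np

def video_aligned_reward(pred_tokens, reference_tokens, step, mode="synchronized"):
    """Negative L1 distance between predicted tokens and a reference video state.

    pred_tokens:      (N, D) tokens predicted by the world model at this step.
    reference_tokens: (T, N, D) tokens of an offline demonstration video.
    mode:             "synchronized" compares with the reference at the same
                      step (clamped to the last frame); "goal" compares with
                      the final frame.
    """
    if mode == "synchronized":
        index = min(step, len(reference_tokens) - 1)
    elif mode == "goal":
        index = len(reference_tokens) - 1
    else:
        raise ValueError(f"unknown mode: {mode}")
    return -float(np.mean(np.abs(pred_tokens - reference_tokens[index])))

# Toy reference video with 3 frames whose tokens are all 0, 1, and 2.
reference = np.stack([np.full((4, 2), float(i)) for i in range(3)])
r_sync = video_aligned_reward(reference[1], reference, step=1)
r_goal = video_aligned_reward(reference[0], reference, step=0, mode="goal")
```

A PPO loop would call this once per imagined step, using world-model rollouts in place of environment transitions.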

Theoretical and Practical Implications

This work advances the argument for structured, compact, feature-based transition modeling in high-dimensional visual domains. By shifting the predictive target from absolute state features to latent residual codes, RLA-WM avoids the regression-to-the-mean degeneracies that hinder direct feature-space forecasting, and circumvents the inefficiency of explicit pixel or diffusion-based generative architectures. The tight integration of physics-informed representation (encoding only the meaningful dynamics) with data-efficient, reward-free policy optimization situates RLA-WM as a promising backbone for scalable, generalizable, and resource-efficient visuomotor agents.

The principal theoretical implication is that explicit modeling of state residuals (akin to velocity or displacement in physical systems) in feature space—versus predicting absolute state—significantly improves feature-based world model expressivity and compressibility. Practically, decoupling world modeling from action supervision and simulator interaction paves the way for leveraging vast, unlabelled video corpora for robot learning.

Limitations and Future Directions

The current formulation is subject to several limitations:

RLA encoding can be compromised by task-irrelevant visual background motion or occlusions, motivating the need for view-invariant modeling (potentially by projecting features into 3D space) and the incorporation of memory across larger frame sequences.
The RLA-WM currently only models visual transitions, not full proprioceptive state evolution, which may limit applicability in domains where joint space modeling is essential.
Benchmarked datasets are relatively small and simulated; extension to Internet-scale, real-world video data and complex scene diversity remains necessary for open-domain generalization.

Conclusion

The paper proposes Residual Latent Action as a concise, predictive encoding of visual state transitions, and leverages it within the RLA World Model to set new standards in visual feature-based world modeling. Strong quantitative results and practical demonstrations in policy optimization and imitation from actionless video highlight the architectural advantages. Addressing the outlined limitations—especially view-invariant and memory-augmented extensions—may position RLA-class models as key components in next-generation scalable, sample-efficient, and robust visuomotor learning frameworks.

Reference:

"Learning Visual Feature-Based World Models via Residual Latent Action" (2605.07079)
