
Action-Conditioned World Models

Updated 4 February 2026
  • Action-conditioned world models are predictive systems that combine current observations with action sequences to forecast future states and outcomes.
  • They utilize diverse architectures including latent embedding regression, diffusion-based models, and autoregressive transformers for multi-modal predictions.
  • These models facilitate robotic control, autonomous driving, and policy evaluation, though challenges like long-horizon drift and reward bias remain.

Action-conditioned world models are predictive models that incorporate action sequences to forecast future states, observations, or semantic outcomes. By explicitly modeling the environment's response to actions, these models enable planning, policy optimization, simulation, evaluation, and transfer for complex embodied agents, including robots and autonomous vehicles. The field encompasses a range of formulations, including pixel-level generative models, latent-space predictors, symbolic state-transition encoders, and vision-language architectures.

1. Definitional Scope and Formal Structure

An action-conditioned world model learns the environment’s dynamics as a conditional distribution over future states (or outputs), given current observations and a proposed action sequence. Canonical formulations can be illustrated as follows:

  • Autoregressive pixel-space/world simulation:

$p(o_{t+1} \mid o_t, a_t)$, where $o_t$ is the observation (e.g., image, proprioception) and $a_t$ is the action.

  • Latent-space predictive models:

E.g., in NORA-1.5, a world model $W_\theta$ is defined as $W_\theta(o_t, a_{t:t+N}) = P_\theta(J(o_t), a_{t:t+N})$, with $J$ a visual encoder (Hung et al., 18 Nov 2025).

  • Semantic/vision-language prediction:

$p_{wm}(A \mid S, a_{1:h}, Q)$, where answers $A$ to future-conditional questions $Q$ are predicted with respect to state $S$ under action sequence $a_{1:h}$ (Berg et al., 22 Oct 2025).

  • Symbolic/logical models:

STRIPS-style next-action validity and effect models condition future applicability and state on action sequences, enabling logical planning and verification (Núñez-Molina et al., 16 Sep 2025).

The general requirement is that the model provides a mechanism to "roll out" future trajectories under explicit action control, either for open-loop simulation or closed-loop planning and policy improvement.
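
To make this rollout requirement concrete, the following minimal sketch rolls a generic latent-space world model forward under an explicit action sequence. The `WorldModel` protocol and its `encode`/`predict` methods are illustrative placeholders, not the interface of any particular system discussed here.

```python
# Minimal open-loop rollout under explicit action control (illustrative sketch).
# WorldModel, encode, and predict are hypothetical names, not tied to any paper.
from typing import Protocol, Sequence

import torch


class WorldModel(Protocol):
    def encode(self, observation: torch.Tensor) -> torch.Tensor:
        """Map a raw observation o_t to a latent state z_t."""
        ...

    def predict(self, latent: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        """One-step dynamics p(z_{t+1} | z_t, a_t), here as a point estimate."""
        ...


def rollout(model: WorldModel,
            observation: torch.Tensor,
            actions: Sequence[torch.Tensor]) -> list[torch.Tensor]:
    """Roll the model forward under a proposed action sequence a_{t:t+N}."""
    latents = [model.encode(observation)]
    for action in actions:
        latents.append(model.predict(latents[-1], action))
    return latents  # predicted latent trajectory z_t, ..., z_{t+N}
```

A planner can score candidate action sequences by rolling each one out and comparing the resulting latents to a goal encoding, which is the pattern underlying the reward construction in Section 5.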

2. Model Architectures and Representations

Action-conditioned world models span several architectural paradigms:

  • Latent embedding regression: Used in NORA-1.5, where a V-JEPA2 encoder maps observations to embeddings, and a predictor transformer $P_\theta$ forecasts the next embedding conditioned on actions (Hung et al., 18 Nov 2025). Training minimizes an $L_1$ loss between predicted and ground-truth future embeddings.
  • Video diffusion models with action adapters:

AVID demonstrates the retrofitting of closed-source video diffusion models for action conditioning by inserting a lightweight U-Net adapter and learned mask. The adapter processes per-frame action embeddings via FiLM layers, and a learned mask interpolates between original backbone predictions and adapter outputs (Rigter et al., 2024); a minimal sketch of this conditioning pattern appears after this list.

  • Autoregressive vision-language-action transformers:

WorldVLA unifies image, action, and language tokens in a large autoregressive transformer, sharing an embedding space and employing discrete tokenization for each modality (Cen et al., 26 Jun 2025).

  • Latent-state dynamical systems:

Models like Joint-Embedding Predictive Architectures (JEPA) encode observations into latent states and predict future latents via a learned dynamics function, where actions enter as inputs to an MLP predictor (Destrade et al., 28 Dec 2025).

  • Latent action models:

In latent-action world models, actions are either inferred from data or learned as hidden variables. Dedicated inverse models estimate latent actions that best explain transitions, and a generative forward model uses these to simulate future states (Garrido et al., 8 Jan 2026, Alles et al., 10 Dec 2025, Gao et al., 24 Mar 2025).

  • Vision-language semantic predictors:

Semantic World Models use VLMs fine-tuned to answer natural-language questions about future outcomes conditioned on an action sequence (Berg et al., 22 Oct 2025).

  • MaskGIT-based multi-modal transformers:

ChronoDreamer leverages a spatial-temporal transformer trained with a masked token prediction objective (MaskGIT) for video, contact maps, and proprioceptive predictions, all conditioned on a history of actions (Zhou et al., 21 Dec 2025).

  • Symbolic transformers for discrete world models:

STRIPS-world learning with transformers relies on hard attention per proposition and stick-breaking aggregation, operating over sequences of action tokens to enforce logical precondition-effect semantics (Núñez-Molina et al., 16 Sep 2025).
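
As a concrete illustration of the FiLM-based action conditioning referenced in the video-diffusion item above, the snippet below derives per-channel scale and shift parameters from an action embedding, modulates intermediate features with them, and blends the adapter output with a frozen base prediction through a learned mask. It is a minimal sketch under assumed tensor shapes, not AVID's published architecture.

```python
import torch
import torch.nn as nn


class FiLMActionAdapter(nn.Module):
    """Illustrative FiLM-style adapter: modulate features with action embeddings
    and blend the result with a frozen base model's prediction via a learned mask.
    Names and sizes are assumptions, not the published AVID architecture."""

    def __init__(self, action_dim: int, channels: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(action_dim, 2 * channels)
        self.mix = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Learned blending mask (one logit per channel), applied via sigmoid.
        self.mask_logit = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, base_out: torch.Tensor, feats: torch.Tensor,
                action_emb: torch.Tensor) -> torch.Tensor:
        # FiLM: per-channel scale (gamma) and shift (beta) from the action embedding.
        gamma, beta = self.to_scale_shift(action_emb).chunk(2, dim=-1)
        gamma = gamma[..., None, None]  # (B, C, 1, 1)
        beta = beta[..., None, None]
        adapter_out = self.mix(feats * (1 + gamma) + beta)
        # Blend the frozen base prediction with the action-conditioned output.
        m = torch.sigmoid(self.mask_logit)
        return (1 - m) * base_out + m * adapter_out
```

The learned mask lets the adapter defer to the base model's prediction wherever action information adds little, so only action-relevant regions are rewritten.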

3. Training Objectives and Loss Functions

Common loss formulations include:

  • Regression or moment-matching on predicted future embeddings:

$L_{WM}(\theta) = \mathbb{E}\big[\,\|P_\theta(J(o_t), a_{t:t+N}) - J(o_{t+N})\|_1\,\big]$, as in NORA-1.5 (Hung et al., 18 Nov 2025); a minimal sketch of this objective follows the list.

  • Score matching in diffusion models:

$L(\theta) = \mathbb{E}\,\|\epsilon_{\text{final}}(z_i, a, i, z^0) - \epsilon\|^2$, as in AVID, where $\epsilon_{\text{final}}$ combines base and action-conditioned adapters (Rigter et al., 2024).

  • (Masked) cross-entropy for token-based multi-modal outputs:

Used in MaskGIT-style systems and multi-headed transformers to predict video, contact, and action tokens autoregressively (Zhou et al., 21 Dec 2025, Cen et al., 26 Jun 2025).

  • ELBO for latent variable world models:

Penalizing both reconstruction errors and KL divergence of latent state and action variables for both action-conditioned and action-free data (Alles et al., 10 Dec 2025, Garrido et al., 8 Jan 2026, Gao et al., 24 Mar 2025).

  • Auxiliary value shaping or value-geometry alignment:

JEPA-based planners augment the standard prediction loss with a constraint enforcing the negative goal-conditioned value function to be close to a distance in latent space, applied via expectile regression (Destrade et al., 28 Dec 2025).

  • Token-level and semantic matching metrics:

Instruction-Execution Consistency, Average Displacement Error, and semantic VQA accuracy measure the world model's fidelity to action instructions and semantic future states (Arai et al., 2024, Berg et al., 22 Oct 2025).
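
As referenced in the first item of this list, the embedding-regression objective can be written in a few lines. The sketch below assumes generic `encoder` and `predictor` callables standing in for $J$ and $P_\theta$; it mirrors the $L_1$ formulation quoted above rather than any released implementation.

```python
import torch


def world_model_l1_loss(encoder, predictor,
                        obs_t: torch.Tensor,
                        actions: torch.Tensor,
                        obs_future: torch.Tensor) -> torch.Tensor:
    """L_WM(theta) = E[ || P_theta(J(o_t), a_{t:t+N}) - J(o_{t+N}) ||_1 ].

    Illustrative only: `encoder` plays the role of J and `predictor` the role of
    P_theta; whether the encoder is frozen is a separate design choice."""
    target = encoder(obs_future).detach()        # J(o_{t+N}), fixed regression target
    predicted = predictor(encoder(obs_t), actions)
    return (predicted - target).abs().mean()
```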

4. Practical Applications and Evaluation Protocols

Action-conditioned world models are applied in a variety of contexts, including robotic manipulation and control, autonomous-driving simulation, and policy evaluation and selection; evaluation protocols typically combine action-fidelity metrics (e.g., IEC, ADE, FDE) with visual quality and downstream task success (Arai et al., 2024).

5. Reward Construction and Preference Optimization

World models are often utilized as surrogate reward functions for policy post-training and selection:

  • Goal-based reward via world model rollouts:

The forecasted future embedding is compared to either a subgoal or final goal embedding to yield a dense reward: $R_g(a_{t:t+N}, o_t) = -\|J(o_g) - \hat{y}\|_1$, where $J$ is the visual encoder and $\hat{y}$ the predicted embedding (Hung et al., 18 Nov 2025).

  • Blending with action deviation scores:

To improve reward robustness, action deviation from demonstrations forms an additional term: $R_a(a_{t:t+N}) = -\|a^*_{t:t+N} - a_{t:t+N}\|_1$, with the final reward a linear blend, e.g., $R_\text{tot} = R_g + 0.5\,R_a$ (Hung et al., 18 Nov 2025).

  • Dataset construction for preference optimization:

Action sequences are ranked according to $R_\text{tot}$, creating (winner, loser) pairs for fine-tuning the policy via Direct Preference Optimization (DPO). The DPO loss ensures the policy prefers actions with higher reward-model scores (Hung et al., 18 Nov 2025); a minimal sketch of this ranking step follows the list.

  • Semantic reward and VQA-based planning:

In VLM-based world models, planning proceeds by maximizing the probability of correct semantic answers (e.g., “has the block been stacked?”), using cross-entropy or value-weighted simulated rollouts (Berg et al., 22 Oct 2025).
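
Taken together, the goal-based reward, the action-deviation term, and the preference-pair ranking from this section amount to the short procedure sketched below. All function names are placeholders, and the 0.5 blending weight is simply the example coefficient quoted above.

```python
import torch


def goal_reward(encoder, predictor, obs_t, actions, goal_obs) -> torch.Tensor:
    """R_g = -|| J(o_g) - P_theta(J(o_t), a_{t:t+N}) ||_1 (dense goal-based reward)."""
    predicted = predictor(encoder(obs_t), actions)
    return -(encoder(goal_obs) - predicted).abs().sum()


def action_deviation_reward(actions, demo_actions) -> torch.Tensor:
    """R_a = -|| a*_{t:t+N} - a_{t:t+N} ||_1, deviation from a demonstration."""
    return -(demo_actions - actions).abs().sum()


def rank_for_dpo(encoder, predictor, obs_t, goal_obs, demo_actions,
                 candidates: list[torch.Tensor], blend: float = 0.5):
    """Score candidates with R_tot = R_g + blend * R_a and return
    (winner, loser) pairs for Direct Preference Optimization."""
    scored = sorted(
        candidates,
        key=lambda a: float(goal_reward(encoder, predictor, obs_t, a, goal_obs)
                            + blend * action_deviation_reward(a, demo_actions)),
        reverse=True,
    )
    # Pair the best-scored sequence against each lower-scored one.
    return [(scored[0], loser) for loser in scored[1:]]
```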

6. Challenges, Evaluation, and Future Directions

Despite their flexibility, action-conditioned world models face several limitations:

  • Fidelity and compounding error in long-horizon rollouts:

Even state-of-the-art models (e.g., Terra, Ctrl-World) exhibit drift from intended motions, especially for rare or complex action sequences (Arai et al., 2024, Guo et al., 11 Oct 2025). Blockwise or memory-augmented rollouts mitigate, but do not eliminate, these effects.

  • Reward bias and estimation challenges:

Value estimation in learned world model rollouts exhibits systematic underestimation for in-distribution actions and overestimation for out-of-distribution behaviors, limiting their use as ground-truth evaluators (Quevedo et al., 31 May 2025).

  • Transfer and adaptation:

Models like AdaWorld and LAWM highlight improved label efficiency by using latent-action representations to bridge action-labeled and action-free data, but adaptation to new domains and actions can still require finetuning (Alles et al., 10 Dec 2025, Gao et al., 24 Mar 2025).

  • Symbolic and logical generalization:

While symbolic world models and STRIPS-based transformers achieve perfect legal action sequence recovery in small domains (Núñez-Molina et al., 16 Sep 2025), scaling to high-dimensional or partial observability settings remains open.

  • Evaluation protocols:

ACT-Bench recommends separating action fidelity from visual or task performance, using open-source per-frame metric estimates (IEC, ADE, FDE) and public evaluation tools (Arai et al., 2024).
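
The displacement metrics mentioned above compare a generated trajectory against a reference trajectory per frame. The sketch below computes ADE and FDE under their standard definitions (mean and final Euclidean displacement); the (T, D) trajectory layout is an assumption made for illustration.

```python
import torch


def displacement_errors(pred_traj: torch.Tensor, ref_traj: torch.Tensor):
    """Average and Final Displacement Error between trajectories of shape (T, D).

    ADE: mean Euclidean distance over all timesteps.
    FDE: Euclidean distance at the final timestep.
    Standard definitions; the (T, D) layout is an assumption for this sketch."""
    dists = torch.linalg.norm(pred_traj - ref_traj, dim=-1)  # (T,)
    return dists.mean(), dists[-1]
```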

Open research directions include scalable latent action discovery, joint policy–world model co-training for robust planning, advances in model-based safety/rejection, physically robust simulation (contacts, deformables), and integrating multi-modal and semantic representations for richer planning and generalization.

