Necessity of Test-Time Future Imagination in World Action Models

Determine whether World Action Models must explicitly generate future visual observations at inference time to achieve strong action performance, or whether their gains arise primarily from the video prediction objective used during training.

Background

World Action Models (WAMs) combine future visual prediction with action modeling, and many recent systems follow an imagine-then-execute design that synthesizes future video before predicting actions. This approach can introduce substantial inference latency due to iterative video denoising.

The paper highlights a fundamental uncertainty about where WAM performance gains originate: explicit future imagination at test time, or the representation learning that video modeling induces during training. Fast-WAM is proposed to decouple these two factors. It keeps video co-training in the training stage but skips future generation at test time, which enables a controlled investigation of the question.
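The decoupling can be illustrated with a toy sketch. It is not the Fast-WAM architecture or code: the class ToyWAM, its arithmetic "heads", the video_weight parameter, and all function names are hypothetical stand-ins chosen only to show the control flow. The point is that the video objective appears in the training loss, while the inference path may or may not call the future predictor.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class ToyWAM:
    """Toy stand-in for a World Action Model (hypothetical, for illustration)."""

    video_head_calls: int = 0  # counts how often future prediction is run

    def encode(self, obs: Vector) -> Vector:
        # Shared representation consumed by both heads.
        return [2.0 * x for x in obs]

    def predict_action(self, feat: Vector) -> Vector:
        # Action head: here just the mean of the features.
        return [sum(feat) / len(feat)]

    def predict_future(self, feat: Vector) -> Vector:
        # Stands in for costly iterative video denoising.
        self.video_head_calls += 1
        return [x + 1.0 for x in feat]


def mse(pred: Vector, target: Vector) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def training_loss(model: ToyWAM, obs: Vector, target_action: Vector,
                  target_future: Vector, video_weight: float = 1.0) -> float:
    # Co-training: the action loss and the video prediction loss share one encoder.
    feat = model.encode(obs)
    action_loss = mse(model.predict_action(feat), target_action)
    video_loss = mse(model.predict_future(feat), target_future)
    return action_loss + video_weight * video_loss


def act_without_imagination(model: ToyWAM, obs: Vector) -> Vector:
    # Test-time path that skips future generation entirely.
    return model.predict_action(model.encode(obs))


def act_imagine_then_execute(model: ToyWAM, obs: Vector) -> Vector:
    # Test-time path that first synthesizes a future, then predicts actions from it.
    imagined = model.predict_future(model.encode(obs))
    return model.predict_action(imagined)
```

In this sketch, calling act_without_imagination leaves video_head_calls unchanged, whereas act_imagine_then_execute increments it once per call. Comparing the two paths on a model trained with the same training_loss is the kind of controlled comparison the question calls for, under the assumptions stated above.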

References

More fundamentally, it remains unclear whether explicit future imagination is actually necessary for strong action performance.

Fast-WAM: Do World Action Models Need Test-time Future Imagination? (2603.16666 - Yuan et al., 17 Mar 2026) in Section 1: Introduction