FlowVLA: Thinking in Motion with a Visual Chain of Thought (2508.18269v2)

Published 25 Aug 2025 in cs.RO

Abstract: Many Vision-Language-Action (VLA) models are built upon an internal world model trained via direct next-frame prediction ($v_t \rightarrow v_{t+1}$). This paradigm, however, presents a fundamental challenge: it conflates the task of predicting physical motion with that of rendering static appearance, forcing a single mechanism to handle both. This inherent coupling often leads to physically implausible forecasts and inefficient policy learning. To address this limitation, we introduce the Visual Chain of Thought (Visual CoT), a framework that disentangles these processes by compelling the model to first reason about motion dynamics before generating the future frame's visual appearance. We instantiate this principle by proposing FlowVLA, an autoregressive Transformer that explicitly materializes this reasoning process as "$v_t \rightarrow f_t \rightarrow v_{t+1}$", where $f_t$ is an intermediate optical flow prediction. By forcing the model to first commit to a motion plan ($f_t$), FlowVLA learns disentangled dynamics, resulting in more coherent visual predictions and significantly more efficient policy learning. Experiments on challenging robotics manipulation benchmarks demonstrate that FlowVLA achieves state-of-the-art performance with substantially improved sample efficiency, pointing toward a more principled foundation for world modeling in VLAs. Project page: https://irpn-lab.github.io/FlowVLA/

Summary

  • The paper introduces FlowVLA, which improves world model learning by explicitly separating motion (optical flow) and appearance prediction for more coherent outcomes.
  • It employs a two-stage training approach with a unified tokenization scheme within an autoregressive transformer, ensuring efficient prediction and policy fine-tuning.
  • FlowVLA achieves state-of-the-art results on robotics benchmarks, demonstrating faster convergence, robust generalization, and enhanced physical plausibility in dynamic environments.

FlowVLA: Visual Chain of Thought for World Model Learning

Introduction

FlowVLA introduces a novel paradigm for world model learning in Vision-Language-Action (VLA) systems by explicitly disentangling appearance and motion through a Visual Chain of Thought (Visual CoT) framework. The central hypothesis is that next-frame prediction, the dominant approach in VLA world models, conflates static scene understanding with dynamic reasoning, resulting in physically implausible predictions and inefficient policy learning. FlowVLA addresses this by enforcing a structured reasoning process: the model first predicts an intermediate optical flow representation (motion), then conditions on this to forecast the next visual frame (appearance). This approach is implemented within a unified autoregressive Transformer architecture, leveraging shared tokenization for both modalities.

Figure 1: The FlowVLA two-stage training paradigm, illustrating world model pre-training with Visual CoT and subsequent policy fine-tuning for robot control.

FlowVLA Architecture and Training Paradigm

Two-Stage Training

FlowVLA adopts a two-stage training paradigm:

  1. World Model Pre-training: The model learns general physical dynamics from large-scale, action-free video data by executing the Visual CoT reasoning process.
  2. Policy Fine-tuning: The pre-trained world model is adapted for control, fine-tuned to generate robot actions from visual observations and language instructions; a minimal sketch of both stages follows Figure 2 below.

Figure 2: FlowVLA framework architecture, showing unified tokenization and interleaved prediction of appearance and motion tokens in Stage 1, and policy fine-tuning in Stage 2.
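
To make the staging concrete, here is a minimal, self-contained PyTorch sketch of the two-stage recipe. Everything in it is illustrative (toy model sizes, random tensors, an assumed regression-style action head); the paper's actual architecture, tokenizer interface, and action representation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, N_ACTIONS = 1024, 256, 7  # toy sizes, not the paper's

class TinyWorldModel(nn.Module):
    """Toy stand-in for the shared autoregressive backbone."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.token_head = nn.Linear(DIM, VOCAB)       # Stage 1: next-token logits
        self.action_head = nn.Linear(DIM, N_ACTIONS)  # Stage 2: action readout

    def forward(self, tokens):
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.backbone(self.embed(tokens), mask=mask)

model = TinyWorldModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stage 1: world-model pre-training on tokenized, action-free video.
# `seq` stands in for an interleaved v_t / f_t token stream from the VQ-GAN.
seq = torch.randint(0, VOCAB, (2, 64))
logits = model.token_head(model(seq[:, :-1]))
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward(); opt.step(); opt.zero_grad()

# Stage 2: policy fine-tuning; the pre-trained backbone is reused and a
# lightweight head maps the final hidden state to a robot action vector.
obs = torch.randint(0, VOCAB, (2, 64))   # tokenized observation (+ language)
target = torch.randn(2, N_ACTIONS)       # demonstration actions
pred = model.action_head(model(obs)[:, -1])
loss = F.mse_loss(pred, target)
loss.backward(); opt.step(); opt.zero_grad()
```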

Unified Tokenization of Appearance and Motion

A key technical contribution is the unified tokenization scheme for both appearance (RGB frames) and motion (optical flow). Optical flow fields are encoded as RGB-like images using a polar coordinate mapping, allowing both modalities to be processed by a shared VQ-GAN tokenizer. This design maintains architectural simplicity and parameter efficiency, enabling the autoregressive Transformer to learn interleaved sequences of appearance and motion tokens.
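The paper does not spell out the exact flow-to-image encoding here, but a standard polar mapping looks like the following NumPy/OpenCV sketch; the function name and normalization details are illustrative, not the authors' code.

```python
import numpy as np
import cv2  # used only for the polar decomposition and HSV -> RGB conversion

def flow_to_rgb(flow: np.ndarray) -> np.ndarray:
    """Render an optical flow field (H, W, 2) as an RGB-like image.

    Direction maps to hue and magnitude to value (the standard polar
    flow-visualization convention), so the result can be tokenized by
    the same VQ-GAN as ordinary frames. FlowVLA's exact mapping and
    normalization may differ from this sketch.
    """
    dx = flow[..., 0].astype(np.float32)
    dy = flow[..., 1].astype(np.float32)
    magnitude, angle = cv2.cartToPolar(dx, dy)                 # polar coordinates
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = (angle * 180 / np.pi / 2).astype(np.uint8)   # hue <- direction
    hsv[..., 1] = 255                                          # full saturation
    mag = cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX)
    hsv[..., 2] = mag.astype(np.uint8)                         # value <- magnitude
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)

rgb = flow_to_rgb(np.random.randn(64, 64, 2).astype(np.float32))
```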

Autoregressive Visual Chain of Thought

The model is trained to predict sequences of the form $v_t \rightarrow f_t \rightarrow v_{t+1}$, where $v_t$ is the current frame, $f_t$ is the predicted optical flow, and $v_{t+1}$ is the next frame. The training objective is a sum of cross-entropy losses over both flow and frame tokens, enforcing explicit reasoning about dynamics before appearance prediction.
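
A minimal PyTorch-style sketch of this objective, assuming frames and flow images have already been tokenized by the shared VQ-GAN; the tensor shapes and the model's call signature are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def visual_cot_loss(model, frame_tokens, flow_tokens):
    """Cross-entropy over an interleaved v_1 f_1 v_2 f_2 ... v_T sequence.

    frame_tokens: (B, T, N) VQ-GAN token ids for frames v_1..v_T
    flow_tokens:  (B, T-1, N) VQ-GAN token ids for flows f_1..f_{T-1}
    `model(seq)` is assumed to return next-token logits (B, L, vocab).
    """
    B, T, N = frame_tokens.shape
    chunks = []
    for t in range(T - 1):
        chunks.append(frame_tokens[:, t])    # v_t
        chunks.append(flow_tokens[:, t])     # f_t (motion committed first)
    chunks.append(frame_tokens[:, T - 1])    # final frame v_T
    seq = torch.cat(chunks, dim=1)           # (B, (2T - 1) * N)

    logits = model(seq[:, :-1])              # predict each next token
    targets = seq[:, 1:]
    # Flow and frame tokens are supervised identically: one summed
    # cross-entropy over every position in the interleaved sequence.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```

With the toy backbone sketched earlier, `model` could be as simple as `lambda s: m.token_head(m(s))`.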

Experimental Results

Robotics Benchmarks

FlowVLA is evaluated on LIBERO and SimplerEnv, two challenging robotics manipulation benchmarks. On LIBERO, FlowVLA achieves state-of-the-art success rates across all suites, with the most pronounced gains on long-horizon tasks, indicating superior physical reasoning and planning capabilities. On SimplerEnv, which introduces significant visual domain shifts, FlowVLA demonstrates robust generalization, outperforming prior models by a substantial margin.

World Modeling Capabilities

Qualitative analysis on the Bridge V2 dataset reveals that standard next-frame prediction baselines suffer from two critical failure modes: physical implausibility (e.g., disappearing manipulators, erratic object motion) and semantic misalignment (predicted actions not matching language instructions). FlowVLA, by reasoning about motion first, produces physically coherent and semantically aligned predictions.

Figure 3: Example of FlowVLA maintaining physical plausibility in the task "Put the rectangular on top of the rectangular block next to it."

Figure 4: Example of FlowVLA achieving semantic alignment in the task "Put the toy into left of table."

Convergence Speed and Sample Efficiency

FlowVLA exhibits dramatically improved sample efficiency during policy fine-tuning. In both full and low-data regimes, FlowVLA converges faster and achieves higher peak performance compared to baselines. The efficiency gap widens under data scarcity, validating the inductive bias introduced by explicit motion reasoning.

Figure 5: Training efficiency comparison, showing FlowVLA's superior convergence speed and final success rate when trained on 100% of data.

Ablation Studies

Ablation experiments confirm the necessity of the Visual CoT structure, direct supervision of the flow prediction, and the interleaved sequence format. Removing any of these components leads to significant drops in success rate, with the interleaved causal chain ($v_t \rightarrow f_t \rightarrow v_{t+1}$) being essential for effective planning and action generation.

Theoretical and Practical Implications

FlowVLA's explicit separation of appearance and motion learning addresses a fundamental limitation in world model design for robotics. By enforcing a reasoning-first approach, the model acquires a more causally grounded understanding of physical dynamics, which translates to improved long-horizon prediction, robust generalization, and efficient policy adaptation. The unified tokenization and autoregressive architecture facilitate scalable training and deployment without introducing modality-specific components.

Practically, FlowVLA's sample efficiency and robustness to domain shifts make it well-suited for real-world robotic applications, where data collection is expensive and environments are variable. The framework also provides a foundation for integrating low-level physical reasoning with higher-level semantic and geometric planning, potentially enabling more capable and generalist embodied agents.

Future Directions

Potential future developments include:

  • Extending the Visual CoT framework to incorporate additional intermediate representations (e.g., depth, contact maps) for richer physical reasoning.
  • Integrating FlowVLA with hierarchical planning systems that combine pixel-level dynamics with abstract task decomposition.
  • Exploring transfer learning and continual adaptation in open-ended environments, leveraging the model's robust world understanding.
  • Investigating the limits of unified tokenization for multimodal reasoning beyond vision and motion, such as tactile or auditory signals.

Conclusion

FlowVLA establishes Visual Chain of Thought as a principled approach for world model learning in VLA systems. By decomposing prediction into motion and appearance, FlowVLA achieves coherent dynamics, state-of-the-art performance, and improved sample efficiency. The results underscore the importance of explicit motion reasoning for bridging perception and control, with significant implications for the design of generalist robotic agents.
