FlowVLA: Unified Vision-Language-Action Model
- FlowVLA is a unified sequence modeling framework that decouples appearance and motion using an intermediate optical flow representation, ensuring physically plausible predictions.
- It employs a single autoregressive Transformer with a VQ-GAN tokenization pipeline to interleave visual and motion tokens, resulting in enhanced sample efficiency.
- FlowVLA achieves state-of-the-art performance on robotics benchmarks, demonstrating faster convergence and improved robustness to domain shifts.
FlowVLA is a unified sequence modeling framework for world modeling in Vision-Language-Action (VLA) systems, predicated on the “Visual Chain of Thought” (Visual CoT) paradigm. Rather than solely predicting future frames in a visual sequence, FlowVLA first infers an intermediate optical flow representation encapsulating motion dynamics, then predicts the subsequent visual state. This approach leverages a single autoregressive Transformer and a discrete VQ-GAN tokenization pipeline to achieve disentangled and physically plausible modeling of visual environments, yielding superior performance and sample efficiency on robotics manipulation benchmarks (Zhong et al., 25 Aug 2025).
1. Model Architecture and Visual Chain of Thought Reasoning
FlowVLA’s architecture consists of a decoder-only autoregressive Transformer that processes sequences of discrete tokens derived from both appearance (RGB images) and motion (optical flow images). Each time step in the sequence is represented as the triplet

$$(v_t,\; f_t,\; v_{t+1})$$

Here, $v_t$ is the current frame, $f_t$ is the model-predicted optical flow representing pixel-level displacement and directionality, and $v_{t+1}$ is the future frame prediction. Language instructions ($l$) may also be prepended in tasks requiring multimodal input, leading to input sequences such as $(l, v_1, f_1, v_2, f_2, v_3, \dots)$.
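As a concrete illustration of this interleaved layout, the following Python sketch assembles a token sequence from pre-tokenized frames and flow images; the function name and list-based representation are illustrative assumptions rather than the released implementation.

```python
from typing import List

def build_sequence(lang_tokens: List[int],
                   frame_tokens: List[List[int]],
                   flow_tokens: List[List[int]]) -> List[int]:
    """Interleave language, frame, and flow token IDs as (l, v_1, f_1, v_2, f_2, ...).

    frame_tokens[t] holds the VQ-GAN codes for frame v_t, and flow_tokens[t]
    holds the codes for the flow image f_t linking v_t to v_{t+1}.
    """
    seq = list(lang_tokens)
    for t, v_codes in enumerate(frame_tokens):
        seq.extend(v_codes)              # appearance tokens for v_t
        if t < len(flow_tokens):
            seq.extend(flow_tokens[t])   # motion tokens for f_t (the "visual thought")
    return seq
```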
The model is supervised with the following objective:
$$\mathcal{L} = -\sum_t \Big[ \log P_\theta\!\left(f_t \mid c_{<f_t}\right) + \lambda \log P_\theta\!\left(v_{t+1} \mid c_{<v_{t+1}}\right) \Big]$$

where $c_{<\cdot}$ comprises all tokens prior to the corresponding prediction. The loss weight $\lambda$ balances the two cross-entropy terms and is set to 1.0 in experiments. This structure ensures that the future frame $v_{t+1}$ is only predicted after the model has internally reasoned about motion through $f_t$.
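A minimal sketch of how such a weighted objective could be computed over a teacher-forced sequence, assuming boolean masks that mark flow and next-frame token positions; the tensor layout and names are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def visual_cot_loss(logits: torch.Tensor,
                    targets: torch.Tensor,
                    flow_mask: torch.Tensor,
                    frame_mask: torch.Tensor,
                    lam: float = 1.0) -> torch.Tensor:
    """Weighted next-token objective: cross-entropy over flow (f_t) positions
    plus lam * cross-entropy over next-frame (v_{t+1}) positions.

    logits:     (B, T, V) next-token predictions from the Transformer
    targets:    (B, T)    ground-truth token IDs (teacher forcing)
    flow_mask:  (B, T)    True where the target is a flow token
    frame_mask: (B, T)    True where the target is a next-frame token
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).view_as(targets)
    flow_loss = per_token[flow_mask].mean()
    frame_loss = per_token[frame_mask].mean()
    return flow_loss + lam * frame_loss
```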
2. Tokenization, Sequence Construction, and Visual-Motion Disentanglement
Visual inputs (frames and optical flow images) are discretized using a vector quantized generative adversarial network (VQ-GAN), which maps high-dimensional pixel data into shared discrete token sequences. The appearance and motion tokens are interleaved in the autoregressive modeling stage. The optical flow image is constructed by projecting the flow field into HSV color space: direction is mapped to hue, and speed normalized to saturation/value, allowing for low-entropy RGB encoding compatible with standard tokenizer vocabularies.
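The sketch below shows one common way to realize this HSV projection (flow direction mapped to hue, normalized magnitude mapped to value at full saturation) using OpenCV; the exact normalization and channel conventions used by FlowVLA are not specified here, so treat these details as assumptions.

```python
import numpy as np
import cv2  # OpenCV, used only for polar conversion and HSV-to-RGB

def flow_to_rgb(flow: np.ndarray) -> np.ndarray:
    """Encode a dense float flow field of shape (H, W, 2) as an RGB image.

    Direction becomes hue and normalized speed becomes value, yielding a
    low-entropy RGB image that the same VQ-GAN tokenizer can consume.
    """
    fx, fy = flow[..., 0], flow[..., 1]
    magnitude, angle = cv2.cartToPolar(fx, fy, angleInDegrees=True)
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = (angle / 2).astype(np.uint8)   # hue: direction (OpenCV hue range is [0, 180))
    hsv[..., 1] = 255                            # full saturation
    hsv[..., 2] = cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)  # value: speed
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
```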
This explicit intermediate representation enables disentanglement of static appearance from motion dynamics, promoting physically plausible transitions that avoid conflating pixel texture and positional changes.
3. Unified Transformer Modeling and Training
The autoregressive Transformer receives token sequences that may include language instructions ($l$), appearance ($v_t$), and motion ($f_t$) information. Training uses teacher forcing, optimizing the joint likelihood across appearance and motion tokens. During inference, the chain of thought is enforced: the model first predicts $f_t$ for each $v_t$, then uses $f_t$ as input to generate $v_{t+1}$.
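A simplified sketch of this two-stage decoding loop, assuming a model that returns next-token logits and fixed per-image token counts; greedy decoding and the helper names are illustrative assumptions.

```python
import torch

@torch.no_grad()
def visual_cot_rollout(model, prefix_ids: torch.Tensor,
                       n_flow_tokens: int, n_frame_tokens: int):
    """One Visual CoT step: sample the flow tokens f_t first, then condition on
    them to sample the next-frame tokens v_{t+1}.

    `model(ids)` is assumed to return next-token logits of shape (B, T, V);
    greedy decoding is used for brevity.
    """
    def generate(ids: torch.Tensor, n: int):
        sampled = []
        for _ in range(n):
            next_id = model(ids)[:, -1].argmax(dim=-1, keepdim=True)  # (B, 1)
            ids = torch.cat([ids, next_id], dim=1)
            sampled.append(next_id)
        return ids, torch.cat(sampled, dim=1)

    ids, flow_ids = generate(prefix_ids, n_flow_tokens)   # "think": predict f_t
    _, frame_ids = generate(ids, n_frame_tokens)          # "answer": predict v_{t+1}
    return flow_ids, frame_ids
```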
This approach differs markedly from prior world modeling systems that predict next frames directly, often entangling appearance and dynamics and producing blurred or inconsistent predictions, especially on physically demanding robotics tasks. By modeling the chain $v_t \rightarrow f_t \rightarrow v_{t+1}$, FlowVLA incorporates physical constraints through its architecture, yielding higher-fidelity predictions.
4. Optical Flow Representation and Its Role in Physical Reasoning
The optical flow serves as the "internal thought" step within the framework. Its representation, obtained by encoding the flow field as an RGB image, ensures that pixel-wise motion vectors are explicitly predicted. This intermediate step anchors future frame generation in the underlying scene dynamics. Empirically, the intermediate flow predictions lead to perceptually sharper and more physically consistent visual rollouts, particularly in environments involving robotic manipulation and nontrivial object trajectories.
A plausible implication is that such explicit reasoning over motion may improve transfer and generalization in environments where textures and object appearances are variable but underlying dynamics are shared.
5. Performance on Benchmarks and Sample Efficiency
In experimental evaluations on robotics manipulation suites such as LIBERO and SimplerEnv, FlowVLA achieves state-of-the-art success rates and robustness:
| Benchmark | Task Success Rate (SOTA) | Sample Efficiency Improvement |
|---|---|---|
| LIBERO | Yes | ~3× faster convergence |
| SimplerEnv | Yes | Higher robustness to domain shift |
FlowVLA demonstrates faster convergence, attaining peak performance after approximately one-third the training iterations required by baseline models. Its explicit motion representation provides enhanced resilience to domain shifts including lighting variation, viewpoint changes, and visual distractors.
This suggests that separation of appearance and dynamics in sequence modeling leads not only to increased sample efficiency but also to improved robustness in transfer settings.
6. Applications in World Modeling and Policy Learning
The model’s chain-of-thought formulation provides a more principled foundation for world modeling, facilitating downstream tasks such as:
- Model-based reinforcement learning and efficient policy training.
- Planning and control in dynamic visual environments (robotics, autonomous navigation).
- Simulation and future prediction for video-based agents, where physical plausibility is critical.
After pre-training, FlowVLA is fine-tuned for policy learning by mapping observations to discrete action tokens, with the world model serving as a robust internal simulation of environment responses.
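The action tokenization scheme is not detailed here; as a hedged illustration, the sketch below shows the common approach of uniformly binning each continuous action dimension into discrete tokens that the same Transformer vocabulary can emit. The bin count and ranges are assumptions for illustration, not values reported for FlowVLA.

```python
import numpy as np

NUM_BINS = 256  # assumed bin count per action dimension

def actions_to_tokens(actions: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Discretize continuous actions of shape (T, D) into token IDs in [0, NUM_BINS)."""
    normalized = (actions - low) / (high - low)
    return np.clip((normalized * NUM_BINS).astype(np.int64), 0, NUM_BINS - 1)

def tokens_to_actions(tokens: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Map predicted action tokens back to (approximate) continuous actions via bin centers."""
    centers = (tokens.astype(np.float64) + 0.5) / NUM_BINS
    return low + centers * (high - low)
```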
7. Future Research Directions
Potential future developments for FlowVLA, as outlined in the paper, include:
- Scaling transformer architecture and visual-token vocabularies alongside dataset size for broader applicability.
- Refining flow supervision with multi-scale or more advanced optical flow estimation to enhance dynamic fidelity.
- Integration of additional sensory modalities or high-level reasoning objectives to expand the chain-of-thought beyond pure motion and appearance.
- Extending the framework to other domains where world modeling demands strong physical inductive biases, such as autonomous vehicles or complex 3D simulation.
The chain-of-thought principle implemented in FlowVLA presents a generic mechanism for improving physical reasoning and sample efficiency in sequence modeling for perception, planning, and control across a range of embodied AI applications.