VITA: Vision-to-Action Flow Matching Policy (2507.13231v1)

Published 17 Jul 2025 in cs.CV, cs.AI, and cs.RO

Abstract: We present VITA, a Vision-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Traditional flow matching and diffusion policies sample from standard source distributions (e.g., Gaussian noise) and require additional conditioning mechanisms like cross-attention to condition action generation on visual information, creating time and space overheads. VITA proposes a novel paradigm that treats latent images as the flow source, learning an inherent mapping from vision to action while eliminating separate conditioning modules and preserving generative modeling capabilities. Learning flows between fundamentally different modalities like vision and action is challenging due to sparse action data lacking semantic structures and dimensional mismatches between high-dimensional visual representations and raw actions. We address this by creating a structured action latent space via an autoencoder as the flow matching target, up-sampling raw actions to match visual representation shapes. Crucially, we supervise flow matching with both encoder targets and final action outputs through flow latent decoding, which backpropagates action reconstruction loss through sequential flow matching ODE solving steps for effective end-to-end learning. Implemented as simple MLP layers, VITA is evaluated on challenging bi-manual manipulation tasks on the ALOHA platform, including 5 simulation and 2 real-world tasks. Despite its simplicity, MLP-only VITA outperforms or matches state-of-the-art generative policies while reducing inference latency by 50-130% compared to conventional flow matching policies requiring different conditioning mechanisms or complex architectures. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks like those in ALOHA benchmarks.

Summary

The paper introduces a flow matching framework that maps latent visual representations directly to latent actions, using the visual latent rather than Gaussian noise as the flow source.
It combines a vision encoder, an action autoencoder, and an MLP-based flow network, and reports inference latency reductions of 50-130% relative to conventional flow matching policies.
Experimental results validate VITA's competitive success rates across simulated and real-world tasks, establishing its potential for advanced robotic control.

VITA: Vision-to-Action Flow Matching Policy

The paper introduces VITA, a framework for visuomotor control that uses flow matching to map latent visual representations to latent actions. Latent images serve as the flow source, which removes the separate conditioning modules (such as cross-attention) that conventional policies need, along with the time and space overhead they add. The key components are an action autoencoder that matches action and visual latent dimensions and a flow network built only from MLP layers. Experiments on simulated and real-world tasks show success rates that match or exceed state-of-the-art generative policies at lower inference latency.

Flow Matching from Vision to Action

VITA's core idea is to adapt flow matching so that it transforms visual latents directly into action latents, without noise injection (Figure 1). Traditional flow matching and diffusion policies sample from Gaussian noise and need extra conditioning modules to bridge the vision-action gap. VITA instead uses latent visual representations as the source distribution. Because the flow source and target must have the same shape, VITA learns action latents whose dimensions match those of the visual latents. At inference, the policy solves the learned ODE starting from the visual latent and decodes the result into actions.

Figure 1: An overview of VITA, a noise-free, conditioning-free policy learning framework, achieving strong performance and inference efficiency across both simulation and real-world visuomotor tasks.
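
The vision-to-action flow described above can be written out as a short set of equations. This is a sketch only: it assumes the common linear interpolation path between source and target, which this summary does not specify, and the symbols E_v (vision encoder), E_a and D_a (action encoder and decoder), and v_\theta (velocity network) are notation introduced here for illustration.

```latex
\begin{aligned}
z_0 &= E_v(o), \qquad z_1 = E_a(a), \\
z_t &= (1 - t)\, z_0 + t\, z_1, \qquad t \in [0, 1], \\
\mathcal{L}_{\mathrm{FM}} &= \mathbb{E}_{t,\,(o,a)} \left\| v_\theta(z_t, t) - (z_1 - z_0) \right\|^2, \\
\frac{d z_t}{d t} &= v_\theta(z_t, t), \quad z_{t=0} = z_0, \qquad \hat{a} = D_a\!\left(z_{t=1}\right).
\end{aligned}
```

The first three lines describe training (regress the velocity that carries the visual latent z_0 to the action latent z_1), and the last line describes inference (integrate the learned velocity from the visual latent, then decode). No noise sample appears anywhere, which is the point of using the visual latent as the source.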

Efficient Policy Architecture

The VITA architecture (Figure 2) has three components: a vision encoder, an action encoder-decoder (autoencoder), and a flow matching network. The vision encoder maps observations to a 512-dimensional latent representation. The action autoencoder up-samples raw actions to the same latent shape, ensuring compatibility with the visual latent. The MLP-based flow network then learns a velocity field that transports the visual latent to the action latent when the corresponding ODE is solved.

Figure 2: An overview of the VITA architecture, facilitating efficient representation translation through streamlined neural architecture components.
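
The inference path through these three components can be illustrated with a small, dependency-free sketch. Everything here is a stand-in rather than the paper's implementation: `encode_action` and `decode_action` are hypothetical fixed up-sampling and averaging functions in place of the learned autoencoder, `make_toy_velocity` is a closed-form field in place of the learned MLP, and the fixed-step Euler solver and the 4-dimensional latents are assumptions chosen for readability.

```python
from typing import Callable, List

Vector = List[float]


def encode_action(action: Vector, factor: int) -> Vector:
    """Toy up-sampling encoder: repeat each action value `factor` times."""
    return [a for a in action for _ in range(factor)]


def decode_action(latent: Vector, factor: int) -> Vector:
    """Toy decoder: average each group of `factor` latent values."""
    return [sum(latent[i:i + factor]) / factor
            for i in range(0, len(latent), factor)]


def solve_flow(z0: Vector,
               velocity: Callable[[Vector, float], Vector],
               num_steps: int = 4) -> Vector:
    """Fixed-step Euler integration of dz/dt = velocity(z, t) for t in [0, 1]."""
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


def make_toy_velocity(target_latent: Vector) -> Callable[[Vector, float], Vector]:
    """Closed-form stand-in for the learned MLP.

    It cheats by knowing the target latent; a trained network would have to
    predict the velocity from (z, t) alone.
    """
    def velocity(z: Vector, t: float) -> Vector:
        return [(zt - zi) / (1.0 - t) for zt, zi in zip(target_latent, z)]
    return velocity


# Toy sizes: a 2-D action and a 4-D latent (the summary reports 512-D latents).
vision_latent = [0.0, 1.0, -1.0, 2.0]   # would come from the vision encoder
expert_action = [0.5, -0.5]
action_latent = encode_action(expert_action, factor=2)  # same shape as vision_latent

predicted_latent = solve_flow(vision_latent, make_toy_velocity(action_latent))
predicted_action = decode_action(predicted_latent, factor=2)
print(predicted_action)
```

The sketch shows why the action autoencoder matters: the flow can only run from the visual latent to the action latent because up-sampling gives both the same shape, and the decoder then maps the flowed latent back to a raw action.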

Implementation Details and Performance

VITA's implementation emphasizes simplicity and efficiency: it combines lightweight encoders with an MLP flow network to keep computational overhead low. The paper reports that this reduces policy inference latency by 50% to 130% compared to conventional flow matching policies that rely on separate conditioning mechanisms or more complex architectures (Table 1). VITA matches or outperforms state-of-the-art generative policies on simulated ALOHA tasks and performs competitively in real-world settings, supporting the effectiveness of its end-to-end flow latent decoding.

Experimental Evaluation

VITA was evaluated against state-of-the-art models on five simulation tasks and two real-world tasks. In simulation, VITA's success rates match or exceed those of diffusion and autoregressive baselines (Table 2). The results indicate that its compact architecture can learn effective policies both in simulation and on real hardware (Figures 3 and 4).

Figure 3: An illustration of five simulation AV-ALOHA tasks, CubeTransfer, SlotInsertion, HookPackage, PourTestTube, and ThreadNeedle.

Figure 4: An illustration of two challenging real-world AV-ALOHA tasks, HiddenPick, and TransferFromBox. The pictures are taken from autonomous rollouts by the VITA policy.

Ablation and Analysis of Flow Latent Decoding

A key innovation in VITA is flow latent decoding: the action reconstruction loss is backpropagated through the sequential ODE solving steps of flow matching. This allows end-to-end joint optimization of the vision and action latent spaces. The ablation (Figure 5) shows that the flow latent decoding loss is necessary for successful policy training, with a dramatic performance drop without it. With it, the latent predicted from vision is coupled directly to executable actions.

Figure 5: Ablation study on flow latent decoding, comparing different objective functions for supervising the predicted action latent.
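
To make the supervision signals concrete, the sketch below computes three loss terms on a toy example: the flow matching (velocity regression) loss, a latent loss against the encoder target, and an action reconstruction loss on the decoded, ODE-solved latent. It is a minimal illustration under stated assumptions, not the paper's objective: the velocity field is a constant vector `theta`, the encoder and decoder are hypothetical fixed up-sampling and averaging functions, and no gradients are computed. In an autodiff framework, the last two terms would send gradients back through every Euler step of `rollout`.

```python
from typing import Dict, List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def encode_action(action: Vector, factor: int = 2) -> Vector:
    """Hypothetical up-sampling encoder: repeat each action value."""
    return [a for a in action for _ in range(factor)]


def decode_action(latent: Vector, factor: int = 2) -> Vector:
    """Hypothetical decoder: average each group of latent values."""
    return [sum(latent[i:i + factor]) / factor
            for i in range(0, len(latent), factor)]


def rollout(z0: Vector, theta: Vector, num_steps: int = 4) -> Vector:
    """Euler-solve dz/dt = theta (a toy constant velocity field) from t=0 to t=1."""
    z = list(z0)
    dt = 1.0 / num_steps
    for _ in range(num_steps):
        z = [zi + dt * th for zi, th in zip(z, theta)]
    return z


def flow_decoding_losses(z0: Vector, action: Vector, theta: Vector) -> Dict[str, float]:
    """Return the three supervision terms for one (vision latent, action) pair."""
    z1 = encode_action(action)                       # encoder target
    target_velocity = [b - a for a, b in zip(z0, z1)]
    z1_hat = rollout(z0, theta)                      # ODE-solved action latent
    return {
        # Velocity regression; the toy field is constant, so v(z_t, t) == theta.
        "flow_matching": mse(theta, target_velocity),
        # Supervise the ODE-solved latent with the encoder target.
        "latent": mse(z1_hat, z1),
        # Supervise the decoded ODE-solved latent with the raw action.
        "reconstruction": mse(decode_action(z1_hat), action),
    }


vision_latent = [0.0, 1.0, -1.0, 2.0]
expert_action = [0.5, -0.5]

ideal_theta = [b - a for a, b in zip(vision_latent, encode_action(expert_action))]
good = flow_decoding_losses(vision_latent, expert_action, ideal_theta)
bad = flow_decoding_losses(vision_latent, expert_action, [0.0, 0.0, 0.0, 0.0])
print(good)
print(bad)
```

With the ideal straight-line velocity all three terms vanish, while an untrained (zero) field is penalized by each of them. The reconstruction term is the only one measured in raw action space, which is the signal that flow latent decoding adds.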

Conclusion

VITA illustrates a substantial advancement in efficient, generative visuomotor policies. By effectively leveraging latent space matching and eliminating the need for complex conditioning modules, it sets a precedent for high-performance policy deployment with minimal inference latency. The framework's approaches can serve as a blueprint for future developments in both simulated and real-world robot manipulation tasks.
