Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
The paper under review examines the challenges of fine-tuning Vision-Language-Action (VLA) models when adapting them to novel robotic setups. Although these models have demonstrated strong semantic generalization and the ability to follow complex language instructions, their usefulness in new settings remains limited without a well-designed fine-tuning procedure. This research identifies the design decisions that matter most for effective fine-tuning, highlighting strategies that improve both inference efficiency and task success rates.
Key Contributions
The authors propose an Optimized Fine-Tuning (OFT) recipe for adapting VLA models, using OpenVLA as the representative base model. The recipe integrates three components (a code sketch illustrating them follows the list):
- Parallel Decoding: replaces autoregressive token-by-token generation with a single forward pass that predicts all action dimensions simultaneously, accelerating action generation.
- Action Chunking: predicts a chunk of future actions per model query, which are then executed in sequence, further reducing latency and improving control smoothness.
- Continuous Action Representation with L1 Regression: replaces discrete action tokens with continuous outputs trained via L1 regression, enabling fine-grained action specification.
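To make these components concrete, the following is a minimal PyTorch sketch of an OFT-style action head: a single forward pass maps a pooled backbone feature to a full chunk of continuous actions, supervised with an L1 loss. The class name, the pooling assumption, and the dimensions (hidden_dim, chunk_len, action_dim) are illustrative and not the authors' implementation.

```python
import torch
import torch.nn as nn

class ContinuousActionHead(nn.Module):
    """Maps a pooled VLM feature to a full chunk of continuous actions
    in one forward pass (parallel decoding + action chunking).
    All dimensions here are illustrative."""

    def __init__(self, hidden_dim: int = 4096, chunk_len: int = 8, action_dim: int = 7):
        super().__init__()
        self.chunk_len = chunk_len
        self.action_dim = action_dim
        # A small MLP predicts all chunk_len * action_dim values at once,
        # instead of decoding one discretized action token at a time.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, chunk_len * action_dim),
        )

    def forward(self, pooled_hidden: torch.Tensor) -> torch.Tensor:
        # pooled_hidden: (batch, hidden_dim) summary of the backbone's output tokens.
        out = self.mlp(pooled_hidden)
        return out.view(-1, self.chunk_len, self.action_dim)

def l1_action_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Continuous actions are supervised directly with L1 regression,
    # avoiding the quantization error of discrete action tokens.
    return torch.abs(pred - target).mean()

# One training step on a dummy batch.
head = ContinuousActionHead()
features = torch.randn(4, 4096)   # stand-in for pooled VLM features
targets = torch.randn(4, 8, 7)    # ground-truth action chunk
loss = l1_action_loss(head(features), targets)
loss.backward()
```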
Empirical Analysis
Empirical evaluations were conducted on the LIBERO simulation benchmark and on real-world tasks with the ALOHA bimanual robot. The results show substantial gains in both computational throughput and real-time responsiveness: OpenVLA-OFT achieved a 97.1% average success rate across the LIBERO tasks, surpassing prior results by a substantial margin, while parallel decoding with action chunking raised action-generation throughput by up to 26 times over the autoregressive baseline.
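The throughput gain from chunking follows directly from the deployment loop: the expensive VLA forward pass runs once per chunk rather than once per control step. The sketch below illustrates this with hypothetical policy, get_obs, and send_action callables; it is not the paper's actual runtime code.

```python
import torch

def run_control_loop(policy, get_obs, send_action, total_steps: int = 400, chunk_len: int = 8):
    """Hypothetical deployment loop: one policy query yields chunk_len actions,
    so the expensive VLA forward pass runs roughly total_steps / chunk_len times
    instead of once per control step."""
    step = 0
    while step < total_steps:
        obs = get_obs()                    # latest camera images + robot state
        with torch.no_grad():
            action_chunk = policy(obs)     # shape: (chunk_len, action_dim)
        for action in action_chunk:        # execute the whole chunk before re-querying
            send_action(action)
            step += 1
            if step >= total_steps:
                break
```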
On ALOHA, the fine-tuned model proved effective on bimanual manipulation tasks requiring high-frequency control. Notably, OFT+, which augments the recipe with feature-wise linear modulation (FiLM) for stronger language grounding, outperformed both state-of-the-art imitation learning policies and recent VLAs such as RDT-1B and π0. FiLM markedly improved accuracy on language-dependent tasks by preventing the policy from latching onto non-semantic visual features, ensuring genuine reliance on the language instruction.
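For reference, FiLM conditions visual features on the language embedding through a per-channel scale and shift. The block below is a minimal, self-contained sketch with illustrative dimensions; the paper's actual integration into the vision backbone may differ.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise linear modulation: the language embedding produces a
    per-channel scale (gamma) and shift (beta) applied to visual features,
    encouraging the policy to condition on the instruction rather than
    latch onto visual cues alone. Dimensions are illustrative."""

    def __init__(self, lang_dim: int = 512, vis_channels: int = 256):
        super().__init__()
        self.to_gamma = nn.Linear(lang_dim, vis_channels)
        self.to_beta = nn.Linear(lang_dim, vis_channels)

    def forward(self, vis_feats: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        # vis_feats: (batch, num_patches, vis_channels); lang_emb: (batch, lang_dim)
        gamma = self.to_gamma(lang_emb).unsqueeze(1)  # (batch, 1, vis_channels)
        beta = self.to_beta(lang_emb).unsqueeze(1)
        return gamma * vis_feats + beta

# Example: modulate dummy patch features with a language embedding.
film = FiLMLayer()
patches = torch.randn(2, 196, 256)
instruction = torch.randn(2, 512)
modulated = film(patches, instruction)  # same shape as patches
```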
Implications and Future Prospects
The findings support the central claim that design details in fine-tuning VLAs often determine deployment success under novel conditions. The flexible input-output architecture adapts to different hardware configurations without sacrificing efficiency, and the optimized recipe lets VLA models operate effectively under compute and latency constraints, making them more practical for diverse robotic applications.
Future research could integrate these fine-tuning techniques earlier, in the pretraining phase, and examine OFT's principles under more data-diverse pretraining settings. The implications extend to a broader range of robotic control applications, including deployment in more fluid and dynamic environments.
Overall, the work shows that a carefully designed fine-tuning recipe yields models that are not only faster and more accurate but also more adaptable and efficient across tasks. The improvements in language grounding and real-time inference speed position VLAs for broader industrial adoption and real-world application. OFT thus serves as a robust blueprint for refining the intersection of vision, language, and robotic action.