VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model (2509.09372v1)

Published 11 Sep 2025 in cs.RO

Abstract: Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art level performance, but also offers the fastest inference speed reported to date. Furthermore, thanks to the proposed advanced bridging paradigm, VLA-Adapter enables the training of a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying the VLA model. Project page: https://vla-adapter.github.io/.

Summary

  • The paper demonstrates a lightweight VLA paradigm that achieves SOTA performance with only a 0.5B-parameter backbone and no robotic data pre-training.
  • The model introduces a novel Bridge Attention mechanism that integrates raw and ActionQuery features to enhance action generation.
  • Efficient training on consumer GPUs, rapid inference, and robust performance on simulated benchmarks and real-world tasks highlight its practical impact.

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Introduction and Motivation

The VLA-Adapter framework addresses a central challenge in Vision-Language-Action (VLA) models: efficiently bridging high-dimensional vision-language (VL) representations to the action (A) space, especially under constraints on model size and pre-training resources. Existing VLA models typically rely on large-scale Vision-Language Models (VLMs) pre-trained on extensive robotic datasets, incurring significant computational and memory costs. VLA-Adapter proposes a lightweight paradigm that achieves state-of-the-art (SOTA) performance with a 0.5B-parameter backbone, eliminating the need for robotic data pre-training and enabling rapid training on consumer-grade hardware.

Bridging Paradigms: Systematic Analysis

The paper provides a systematic analysis of bridging paradigms from VL to A, categorizing prior approaches into two main types: (1) direct use of raw VLM features (from final or intermediate layers), and (2) learnable queries (ActionQuery) as interfaces between the VLM and Policy networks (Figure 1).

Figure 1: Existing representative bridge paradigms from VL to A.

Key findings from ablation studies on the LIBERO-Long benchmark reveal:

  • Middle-layer raw features ($\mathcal{C}_t^{\mathcal{R}}$) outperform deep-layer features for action generation, as they retain richer multimodal detail.
  • Deep-layer ActionQuery features ($\mathcal{C}_t^{\mathcal{AQ}}$) are more effective than shallow-layer ones, due to the aggregation of multimodal information during training.
  • Multi-layer features (all layers) consistently outperform single-layer features, providing universality and obviating the need for layer selection.

These insights motivate the VLA-Adapter design, which leverages both all-layer raw and ActionQuery features as conditions for the Policy network.
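
As a concrete illustration of how such conditions might be gathered, the sketch below (PyTorch) appends learnable ActionQuery tokens to the VL token sequence and collects the per-layer raw features $\mathcal{C}_t^{\mathcal{R}}$ and ActionQuery features $\mathcal{C}_t^{\mathcal{AQ}}$ from a transformer backbone. The class name, the `output_hidden_states` interface, and the zero initialization of the queries are illustrative assumptions, not the released implementation.

```python
# Sketch: collecting per-layer conditions from a VLM backbone (assumed interface).
import torch
import torch.nn as nn

class ConditionExtractor(nn.Module):
    def __init__(self, vlm: nn.Module, hidden_dim: int, num_action_queries: int = 64):
        super().__init__()
        self.vlm = vlm  # placeholder: any transformer VLM exposing all hidden states
        # Learnable ActionQuery tokens appended to the VL token sequence.
        self.action_queries = nn.Parameter(torch.zeros(num_action_queries, hidden_dim))

    def forward(self, vl_tokens: torch.Tensor):
        # vl_tokens: (B, N, D) embedded vision-language tokens
        B = vl_tokens.size(0)
        queries = self.action_queries.unsqueeze(0).expand(B, -1, -1)
        tokens = torch.cat([vl_tokens, queries], dim=1)
        # Assumed HuggingFace-style call returning the hidden states of every layer.
        outputs = self.vlm(inputs_embeds=tokens, output_hidden_states=True)
        n_q = queries.size(1)
        # Per-layer raw features (C_t^R) and ActionQuery features (C_t^AQ).
        raw_feats = [h[:, :-n_q, :] for h in outputs.hidden_states]
        query_feats = [h[:, -n_q:, :] for h in outputs.hidden_states]
        return raw_feats, query_feats
```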

VLA-Adapter Architecture

The VLA-Adapter framework consists of a compact VLM backbone (default: Prismatic VLM on Qwen2.5-0.5B), a Policy network with Bridge Attention, and a flexible conditioning mechanism (Figure 2).

Figure 2: The proposed VLA framework. Key components include effective condition exploration and Bridge Attention design.

Bridge Attention Mechanism

The Policy network employs a novel Bridge Attention module at each layer, integrating both raw and ActionQuery features with the action latent. The architecture comprises:

  • Two cross-attention blocks: one for raw features ($\mathcal{C}_t^{\mathcal{R}}$), modulated by a learnable ratio $g$ (initialized to zero, $\tanh$-activated), and one for ActionQuery features ($\mathcal{C}_t^{\mathcal{AQ}}$) concatenated with the proprioceptive state.
  • One self-attention block: operating on the action latent.
  • Concatenation and a residual feed-forward network (FFN): producing the next-layer action latent (Figure 3).

    Figure 3: The Policy with Bridge Attention. Only 97M Policy parameters for Qwen2.5-0.5B backbone.

This design enables selective and learnable injection of multimodal information into the action space, maximizing the utility of both raw and ActionQuery features.
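
A minimal PyTorch sketch of one Bridge Attention layer is given below, following the description above: a gated cross-attention over raw features, a fully injected cross-attention over ActionQuery features concatenated with the proprioceptive state, self-attention over the action latent, and a residual FFN over the concatenated outputs. Module choices (nn.MultiheadAttention, the token layout of the proprioceptive state, the head count) are assumptions for illustration rather than the authors' implementation.

```python
# Sketch: one Bridge Attention layer (illustrative, not the released code).
import torch
import torch.nn as nn

class BridgeAttentionLayer(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_raw = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_query = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized learnable ratio g (tanh-activated): raw-feature
        # injection starts "off" and its degree is learned during training.
        self.g = nn.Parameter(torch.zeros(1))
        self.ffn = nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, action_latent, raw_feat, query_feat, proprio):
        # action_latent: (B, T, D); raw_feat: (B, N, D);
        # query_feat: (B, Q, D); proprio: (B, P, D) proprioceptive-state tokens
        raw_out, _ = self.cross_raw(action_latent, raw_feat, raw_feat)
        raw_out = torch.tanh(self.g) * raw_out            # gated raw-feature injection
        kv = torch.cat([query_feat, proprio], dim=1)      # ActionQuery + state, fully injected
        query_out, _ = self.cross_query(action_latent, kv, kv)
        self_out, _ = self.self_attn(action_latent, action_latent, action_latent)
        fused = torch.cat([raw_out, query_out, self_out], dim=-1)
        return self.norm(action_latent + self.ffn(fused))  # residual FFN -> next-layer latent
```

Stacking one such layer per VLM layer would let each Policy layer attend to the corresponding layer's conditions, consistent with the all-layer conditioning described above.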

Training and Implementation Details

VLA-Adapter is trained end-to-end from scratch, using the AdamW optimizer and LoRA for efficient fine-tuning. The objective minimizes the $L_1$ distance between predicted and ground-truth action trajectories. Hyperparameters are chosen for stability and efficiency (batch size 16, learning rate $1\times 10^{-4}$, cosine annealing, 150k steps).

The framework supports both L1-based and DiT-based (Diffusion Transformer) Policy architectures. Empirical results favor the L1-based Policy for superior performance and inference speed in the fine-tuning regime.
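
The following simplified loop illustrates the stated training setup (AdamW, cosine annealing, L1 trajectory loss, batch size 16, 150k steps) for the L1-based Policy; the policy and dataloader objects are placeholders, and LoRA adapter configuration is omitted for brevity.

```python
# Sketch: L1-objective fine-tuning loop (placeholders for policy/dataloader).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(policy, dataloader, total_steps: int = 150_000, device: str = "cuda"):
    policy.to(device)
    optimizer = AdamW(policy.parameters(), lr=1e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    step = 0
    while step < total_steps:
        for obs, instruction, proprio, gt_actions in dataloader:  # batch size 16
            pred_actions = policy(obs.to(device), instruction, proprio.to(device))
            # L1 distance between predicted and ground-truth action trajectories.
            loss = torch.nn.functional.l1_loss(pred_actions, gt_actions.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
            step += 1
            if step >= total_steps:
                break
```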

Experimental Results

LIBERO Benchmark

VLA-Adapter achieves SOTA-level performance on the LIBERO benchmark, outperforming or matching large-scale models (OpenVLA-OFT, UnifiedVLA) with only a 0.5B-parameter backbone. Notably, it surpasses VLA-OS by 29.0% on LIBERO-Long and demonstrates robust performance across all task suites (Figure 4).

Figure 4: Comparison on real-world tasks.

CALVIN ABC→D Generalization

On the CALVIN ABC→D zero-shot generalization benchmark, VLA-Adapter attains the highest average task completion length (4.42), exceeding all baselines, including those with significantly larger backbones and those trained from scratch (Figure 5).

Figure 5: Example and task completion results on CALVIN ABC→D.

Real-World Robotic Tasks

VLA-Adapter is validated on a 6-DOF Synria Alicia-D robot with randomized object positions, demonstrating strong generalization and execution capabilities in both short-horizon and long-horizon manipulation tasks (Figure 6).

Figure 6: Real-world system Synria Alicia-D and the task examples.

Figure 7: Execution example on the real-world tasks.

Efficiency and Resource Requirements

VLA-Adapter achieves the fastest inference speed among all compared methods (219.2 Hz throughput, 0.0365 s latency), with low VRAM usage and rapid training (8 hours on a single consumer-grade GPU). It remains effective even when the VLM backbone is frozen, outperforming SmolVLA and OpenVLA-OFT in this regime (Figure 8).

Figure 8: Execution example when the backbone is frozen.

Ablation Studies

Comprehensive ablations confirm:

  • The optimal ActionQuery count is 64, balancing multimodal aggregation and computational efficiency.
  • Jointly using all-layer raw and ActionQuery features yields the best performance.
  • A learnable injection degree for raw features is critical, whereas ActionQuery features should be fully injected (Figure 9).

    Figure 9: Comparison across different numbers of ActionQuery tokens. The blue line shows the result of using only the last-layer ActionQuery; the red star shows the full VLA-Adapter.

Implications and Future Directions

VLA-Adapter demonstrates that high-performance VLA models can be realized without large-scale VLMs or robotic data pre-training, significantly lowering the barrier to deployment in resource-constrained settings. The paradigm enables rapid prototyping, efficient fine-tuning, and real-world applicability.

Theoretically, the systematic analysis of bridging paradigms provides a foundation for future research on multimodal representation transfer and action policy design. Practically, the framework is extensible to other embodied AI domains, including mobile robotics and humanoid control.

Future work may explore:

  • Enhanced generalization via improved condition representations or hybrid pre-training strategies.
  • Integration of reinforcement learning for more complex policy optimization.
  • Application to broader real-world scenarios and hardware platforms.

Conclusion

VLA-Adapter introduces an effective, resource-efficient paradigm for bridging vision-language representations to action in VLA models. By leveraging both raw and ActionQuery features with a learnable Bridge Attention mechanism, it achieves SOTA performance with minimal computational overhead. The framework's scalability, efficiency, and robustness position it as a strong candidate for future embodied AI research and deployment.
