SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics (2506.01844v1)

Published 2 Jun 2025 in cs.LG and cs.RO

Abstract: Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive--often with billions of parameters--leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack decoupling perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of both simulated as well as real-world robotic benchmarks and release all code, pretrained models, and training data.

Summary

  • The paper introduces a lightweight vision-language-action model that halves computational costs while achieving up to 78.3% success on SO100 tasks.
  • It employs techniques like layer skipping, Flow Matching, and asynchronous inference to enhance training efficiency and real-time responsiveness.
  • Pretraining on 23k episodes from community datasets enables robust performance across both simulated benchmarks and real-world robotic tasks.

This paper introduces SmolVLA (2506.01844), a Vision-Language-Action (VLA) model designed to address the high computational costs and limited accessibility of existing large VLA models for robotics. The authors propose SmolVLA as a small, efficient, and open-source solution trainable on a single GPU and deployable on consumer-grade hardware. A key aspect is its training on publicly available, community-contributed datasets, leveraging affordable robotic platforms like the SO-100.

The core problem SmolVLA aims to solve is democratizing VLA training and deployment. Current state-of-the-art VLAs are often massive, requiring significant computational resources and relying on large, often proprietary, datasets. This limits their use to well-funded labs or companies. SmolVLA seeks to enable broader participation by providing a lightweight model and efficient training/inference recipes.

Key Technical Contributions:

  1. Lightweight Architecture: SmolVLA consists of a compact, pretrained Vision-Language Model (VLM) and an Action Expert trained with Flow Matching.
    • Efficient VLM: It uses SmolVLM-2 as the backbone, which is optimized for multi-image input. To further reduce costs, it avoids image tiling and limits visual tokens to 64 per frame using pixel shuffling.
    • Faster Inference via Layer Skipping: Instead of using the full VLM, the Action Expert is conditioned on features from only the first N of the VLM's L layers (empirically N = L/2 works well), effectively halving the computational cost of the VLM and Action Expert (see the layer-skipping sketch after this list).
    • Efficient Action Expert: The Action Expert uses a Transformer architecture with interleaved cross-attention (attending to VLM features) and causal self-attention (attending to past action tokens). It uses a reduced hidden size (0.75x VLM dimension) for efficiency. It is trained using a Flow Matching objective to predict action chunks.
  2. Pretraining on Community Data: SmolVLA is trained end-to-end on approximately 23k episodes (about 10.6M frames) from 481 publicly available community datasets, primarily from SO-100 robots. This dataset is significantly smaller than those used by larger VLAs.
    • Data Challenges: Using diverse community data introduces challenges like inconsistent task annotations and camera naming conventions.
    • Data Processing: They address task annotation noise by using an off-the-shelf VLM (Qwen2.5-VL-3B-Instruct) to auto-generate concise task descriptions. Camera viewpoints are manually mapped to standardized views (top, wrist, side) and consistently named.
  3. Asynchronous Inference: To improve responsiveness and efficiency in deployment, especially on resource-constrained robots or when using remote compute, the authors introduce an asynchronous inference stack.
    • Decoupling: This stack decouples observation processing and action prediction (PolicyServer) from action execution (RobotClient).
    • Action Queue and Threshold: The RobotClient consumes actions from a queue. When the queue size drops below a threshold g, the client triggers the PolicyServer to process a new observation and predict a new action chunk (see the asynchronous-inference sketch after this list).
    • Observation Filtering: A similarity filter is applied to new observations (based on joint-space distance) to avoid redundant processing of near-duplicate states, preventing the robot from stalling with constant queue updates.
    • Benefits: This reduces idle time (lags) during computation, allows leveraging more powerful remote hardware, and increases the frequency at which new observations are processed, leading to faster and more responsive control.
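
The sketch below illustrates the layer-skipping and interleaved-attention ideas from item 1: VLM features are taken from an intermediate hidden state (the first half of the layers) and a compact expert block attends to them with cross-attention before applying causal self-attention over action tokens. Module names, shapes, and the HuggingFace-style output_hidden_states interface are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): condition a compact action expert on
# features from the first half of a frozen VLM's layers, using interleaved
# cross-attention and causal self-attention.
import torch
import torch.nn as nn


class ActionExpertBlock(nn.Module):
    """One interleaved block: cross-attention to VLM features, causal
    self-attention over action tokens, then a feed-forward layer."""

    def __init__(self, d_model: int, vlm_dim: int, n_heads: int = 8):
        super().__init__()
        # The expert can be narrower than the VLM (e.g. 0.75x its width),
        # so VLM features are projected into the expert's dimension.
        self.kv_proj = nn.Linear(vlm_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, action_tokens: torch.Tensor, vlm_feats: torch.Tensor) -> torch.Tensor:
        kv = self.kv_proj(vlm_feats)
        # Cross-attention: action tokens attend to the (layer-skipped) VLM features.
        x = action_tokens + self.cross_attn(action_tokens, kv, kv)[0]
        # Causal self-attention: mask out future action tokens.
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        x = x + self.self_attn(x, x, x, attn_mask=causal)[0]
        return x + self.ff(x)


def vlm_features_first_half(vlm, inputs):
    """Run the frozen VLM and return the hidden state after N = L/2 layers
    (assumes a HuggingFace-style model exposing output_hidden_states)."""
    with torch.no_grad():  # the VLM backbone stays frozen
        out = vlm(**inputs, output_hidden_states=True)
    n_layers = len(out.hidden_states) - 1  # index 0 holds the input embeddings
    return out.hidden_states[n_layers // 2]
```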
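
For item 3, a simplified single-process sketch of the asynchronous inference loop: actions are executed from a queue, and when the queue length drops below the threshold g a background thread requests a new chunk, unless the robot has barely moved since the last request (the joint-space similarity filter). Class names, the threshold value, and the distance metric are illustrative assumptions rather than the released PolicyServer/RobotClient interface.

```python
# Simplified sketch of asynchronous inference: actions are executed from a
# queue while new chunks are requested in the background. Names, the threshold
# value, and the distance metric are illustrative assumptions.
import threading
from collections import deque

import numpy as np


class AsyncActionClient:
    def __init__(self, policy, queue_threshold: int = 10, joint_eps: float = 0.05):
        self.policy = policy              # callable: observation -> list of actions (a chunk)
        self.queue = deque()              # pending actions awaiting execution
        self.threshold = queue_threshold  # refill trigger g
        self.joint_eps = joint_eps        # joint-space similarity threshold
        self.last_sent_joints = None
        self.lock = threading.Lock()
        self.request_in_flight = False

    def _request_chunk(self, observation):
        chunk = self.policy(observation)  # in practice this runs on a remote PolicyServer
        with self.lock:
            self.queue.extend(chunk)
            self.request_in_flight = False

    def step(self, observation, joints: np.ndarray):
        with self.lock:
            queue_low = len(self.queue) < self.threshold and not self.request_in_flight
        # Observation filter: skip near-duplicate states to avoid redundant requests.
        moved_enough = (
            self.last_sent_joints is None
            or np.linalg.norm(joints - self.last_sent_joints) > self.joint_eps
        )
        if queue_low and moved_enough:
            with self.lock:
                self.request_in_flight = True
            self.last_sent_joints = joints.copy()
            threading.Thread(
                target=self._request_chunk, args=(observation,), daemon=True
            ).start()
        # Execute the next queued action, if any (otherwise hold the current pose).
        with self.lock:
            return self.queue.popleft() if self.queue else None
```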

Implementation Details:

  • SmolVLA is built using LeRobot, a PyTorch-based framework.
  • Pretraining involves 200,000 steps with a batch size of 256 across community datasets. Fine-tuning uses smaller batch sizes (64), for 100k steps in simulation and 200k steps on real-world tasks.
  • The VLM backbone is frozen, and only the Action Expert is trained.
  • The main SmolVLA model has 450 million parameters, with ~100 million in the Action Expert.
  • Optimizations like bfloat16 precision and torch.compile() are used for training efficiency; multi-GPU training is supported via Hugging Face Accelerate (see the sketch after this list).
  • Pretraining required ~30k GPU hours but can be done on a single GPU due to the model size.
  • Action chunks have size n = 50 for real-world evaluation (synchronous inference: a new chunk is predicted only after the previous chunk is fully executed); in simulation, a new chunk is predicted after each executed action (an asynchronous-like, per-step setting). Flow Matching inference uses 10 steps.
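
For concreteness, a minimal sketch of the efficiency settings listed above (bfloat16 autocast and torch.compile()) applied to a generic training step. The optimizer, learning rate, and loss are illustrative assumptions rather than the LeRobot training code.

```python
# Minimal sketch: bfloat16 autocast plus torch.compile on the trainable
# action expert. Optimizer, learning rate, and loss are assumptions.
import torch
import torch.nn.functional as F


def make_training_step(action_expert: torch.nn.Module, lr: float = 1e-4):
    action_expert = torch.compile(action_expert)  # graph capture / kernel fusion
    optimizer = torch.optim.AdamW(action_expert.parameters(), lr=lr)

    def step(vlm_feats, noisy_actions, targets):
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            pred = action_expert(noisy_actions, vlm_feats)
            loss = F.mse_loss(pred, targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        return loss.detach()

    return step
```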

Experimental Evaluation:

  • Simulated Benchmarks: Evaluated on LIBERO (40 tasks) and Meta-World (50 tasks), using multi-task training.
    • SmolVLA (0.45B and 2.25B variants) achieved competitive or superior success rates compared to baselines like Diffusion Policy, Octo, OpenVLA, and π0, despite not being pretrained on robotics data like some baselines. The larger 2.25B variant performs best in simulation.
  • Real-World Tasks (SO100): Evaluated on Pick-Place, Stacking, and Sorting tasks.
    • SmolVLA (0.45B), pretrained on community data and multi-task finetuned, significantly outperformed ACT (single-task trained) and π0 (multi-task finetuned) in average success rate.
  • Real-World Generalization (SO101): Evaluated on a Pick-Place-Lego task on a different robot (SO101) not seen during pretraining.
    • SmolVLA outperformed ACT in both in-distribution and out-of-distribution (novel object positions) settings.
  • Effect of Pretraining: Pretraining on community datasets substantially improved real-world performance on SO100 tasks (from 51.7% avg SR without pretraining to 78.3% with pretraining). Multi-task fine-tuning also provided benefits.
  • Asynchronous Inference Evaluation: Compared sync vs. async inference on SO100 real-world tasks.
    • Async inference maintained comparable success rates.
    • Async inference significantly reduced task completion time (~30% faster).
    • In a fixed time limit, async inference allowed the robot to complete more tasks (19 vs 9 Pick-Place cycles in 60 seconds).
    • Qualitatively, async inference led to faster reactions and improved robustness.

Ablation Studies:

  • Attention Mechanism (VLM to Action Expert): Interleaving cross-attention (CA) and causal self-attention (SA) yielded the best performance on LIBERO compared to CA-only or SA-only.
  • Attention Mask (Action Tokens): Causal self-attention (masking future tokens) in the Action Expert outperformed bidirectional attention, suggesting preventing future information leakage is beneficial. Pure CA (no interaction between action tokens) was surprisingly competitive.
  • VLM Layers: Using features from the first N = L/2 VLM layers provided a good trade-off between performance and speed. Skipping every second layer was competitive but worse than using the first half. Using a smaller VLM backbone from scratch performed worse than skipping layers in a larger VLM.
  • Action Expert Capacity: Reducing the hidden dimension to 0.75x VLM dimension offered a good balance. Larger capacity generally led to better success rates.
  • Training Objective: Flow Matching significantly outperformed a standard L1 regression loss for the Action Expert (see the sketch after this list).
  • State Input: Feeding sensorimotor states to the VLM (as tokens) resulted in better performance than feeding them directly to the Action Expert.
  • Action Chunk Size (n): Chunk sizes between 10 and 50 provided a good balance. Very small (n = 1) or very large (n = 100) sizes reduced performance.
  • Action Execution Steps (between observations): Updating observations more frequently (e.g., every 1 or 10 actions executed from the chunk) significantly improved success rate compared to executing the full chunk before updating.
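
For the training-objective ablation, a minimal sketch of a conditional flow-matching loss over action chunks: interpolate linearly between Gaussian noise and the ground-truth chunk and regress the velocity of that path. The time sampling and the extra time argument to the model are assumptions, not the paper's exact recipe.

```python
# Minimal flow-matching loss sketch for action chunks. Time sampling and
# conditioning details are assumptions rather than the paper's exact recipe.
import torch
import torch.nn.functional as F


def flow_matching_loss(velocity_model, vlm_feats, actions):
    """actions: (batch, chunk_len, action_dim) ground-truth action chunk."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.size(0), 1, 1, device=actions.device)  # per-sample time in [0, 1]
    x_t = (1 - t) * noise + t * actions        # point on the straight noise -> data path
    target_velocity = actions - noise          # constant velocity of that path
    pred_velocity = velocity_model(x_t, vlm_feats, t)  # model is also conditioned on t
    return F.mse_loss(pred_velocity, target_velocity)
```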

Discussion and Limitations:

SmolVLA demonstrates that competitive VLA performance can be achieved with a much smaller model and efficient training on accessible community data. The asynchronous inference stack offers practical benefits for real-world deployment speed and responsiveness.

Identified limitations include:

  • Limited dataset diversity (primarily SO100), hindering cross-embodiment generalization.
  • Relatively small dataset size compared to other large VLAs.
  • Scope for further scaling the model while maintaining efficiency.
  • The choice of VLM backbone (SmolVLM-2, trained on document tasks) may not be optimal for robotics.
  • Potential benefits of jointly training on robotics and broader multimodal data.
  • Challenges in scaling to longer-horizon and more complex tasks.
  • The current reliance on imitation learning, suggesting reinforcement learning could be beneficial for some tasks.

The authors release their code, pretrained models, training data, and robot hardware details to promote open research and reproducibility.
