SmolVLA is presented as a small, efficient, and community-driven Vision-Language-Action (VLA) model designed to make robotics more accessible and affordable. It addresses the limitations of existing large VLAs, which are typically massive (billions of parameters), costly to train, and resource-intensive for real-world deployment. SmolVLA aims to significantly reduce training and inference costs while maintaining competitive performance, making it suitable for training on a single consumer-grade GPU and deployment on consumer GPUs or even CPUs.
The core of SmolVLA consists of two main components: a compact pretrained Vision-Language Model (VLM) for perception and an Action Expert trained with Flow Matching to predict actions. Given multiple images and a language instruction, the model outputs a chunk of low-level actions.
The VLM component utilizes SmolVLM-2 [marafioti2025smolvlm], a VLM optimized for multi-image inputs. To enhance efficiency, SmolVLA incorporates several key architectural choices:
- Layer Skipping: Instead of using features from the final layer of the VLM's LLM, the action expert consumes features from an intermediate layer N, set to half the total depth (N = L/2). This effectively halves the computational cost of both the VLM and the action expert that conditions on its output, and is motivated by findings that the best features for downstream tasks are not always found in the last layer.
- Visual Token Reduction: The model uses a minimal number of visual tokens (64 per frame) from the global image view, processed via a pixel shuffle operation, avoiding computationally expensive image tiling during inference.
- Small Pretrained VLM: SmolVLA leverages a relatively small pretrained VLM backbone, contributing to its overall compact size.
- Interleaved Attention in Action Expert: The action expert uses a Transformer architecture with interleaved cross-attention (CA) and causal self-attention (SA) layers. CA allows action tokens to attend to VLM features, while causal SA allows action tokens within a chunk to attend to past action tokens, promoting smoother action sequences.
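To make the interleaved-attention design concrete, here is a minimal PyTorch sketch under assumed dimensions and layer counts (the authors' exact block structure, normalization, and sizes differ): even-indexed layers cross-attend from action tokens to VLM features taken at layer N = L/2, and odd-indexed layers apply causal self-attention over the action chunk.

```python
# Minimal sketch of an interleaved-attention action expert; sizes are illustrative.
import torch
import torch.nn as nn

class InterleavedActionExpert(nn.Module):
    def __init__(self, d_model=720, n_layers=8, n_heads=8):
        super().__init__()
        # One attention module per layer; whether it acts as cross- or
        # self-attention is decided in forward() by the layer index.
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layers))
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_layers)
        )

    def forward(self, action_tokens, vlm_features):
        # action_tokens: (B, n_actions, d_model); vlm_features: (B, n_ctx, d_model),
        # taken from an intermediate VLM layer (N = L/2 in the paper).
        B, n, _ = action_tokens.shape
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool,
                                       device=action_tokens.device), diagonal=1)
        x = action_tokens
        for i, (attn, norm, ffn) in enumerate(zip(self.attns, self.norms, self.ffns)):
            h = norm(x)
            if i % 2 == 0:
                # Cross-attention: action queries attend to VLM features.
                h, _ = attn(h, vlm_features, vlm_features)
            else:
                # Causal self-attention over the action chunk.
                h, _ = attn(h, h, h, attn_mask=causal)
            x = x + h
            x = x + ffn(x)
        return x
```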
Sensorimotor states and actions are processed through linear projection layers to match the dimensions of the VLM and the action expert, respectively. The action expert is trained with a Flow Matching objective: given ground-truth actions $A_t$, noisy actions $A_t^\tau$, and VLM features $o_t$, the model $v_\theta$ regresses the target vector field $u(A_t^\tau \mid A_t) = \epsilon - A_t$, where $\epsilon$ is the sampled noise, by minimizing $\mathcal{L}^\tau(\theta) = \mathbb{E}_{p(A_t \mid o_t),\, q(A_t^\tau \mid A_t)}\big[\lVert v_\theta(A_t^\tau, o_t) - u(A_t^\tau \mid A_t) \rVert^2\big]$.
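A compact training-step sketch of this objective, assuming a π0-style linear interpolation between noise and actions and a uniform time distribution (the authors' exact noise schedule and time sampling may differ); `expert` stands in for an action-expert module such as the one sketched above:

```python
import torch
import torch.nn as nn

# Illustrative sizes, not the paper's exact values.
ACTION_DIM, D_MODEL = 7, 720
in_proj = nn.Linear(ACTION_DIM, D_MODEL)    # embed noisy actions for the expert
out_head = nn.Linear(D_MODEL, ACTION_DIM)   # map expert outputs to a vector field

def flow_matching_loss(expert, actions, vlm_features):
    """actions: (B, n, ACTION_DIM) ground-truth chunk A_t;
    vlm_features: (B, n_ctx, D_MODEL) conditioning features o_t."""
    eps = torch.randn_like(actions)                                   # sampled noise
    tau = torch.rand(actions.shape[0], 1, 1, device=actions.device)   # flow time (assumed uniform)
    noisy = tau * actions + (1.0 - tau) * eps                         # A_t^tau (assumed linear path)
    target = eps - actions                                            # u(A_t^tau | A_t)
    pred = out_head(expert(in_proj(noisy), vlm_features))             # v_theta(A_t^tau, o_t)
    return ((pred - target) ** 2).mean()
```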
A key aspect of SmolVLA's implementation is its pretraining on community-driven datasets from platforms like Hugging Face. This addresses the scarcity and heterogeneity of robotics data compared to vision and language data. The dataset, consisting of fewer than 30k episodes (around 10.6M frames), is significantly smaller than those used by prior state-of-the-art VLAs. Challenges related to data heterogeneity and noisy task annotations are addressed by using an off-the-shelf VLM (Qwen2.5-VL-3B-Instruct) to auto-generate concise task descriptions and manually normalizing camera viewpoint names.
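As an illustration of the relabeling step, the sketch below rewrites a noisy task annotation with an instruction-following VLM. The `vlm_generate` callable is a hypothetical stand-in for whatever inference call is used to query Qwen2.5-VL-3B-Instruct, and the prompt wording and episode schema are ours, not the authors'.

```python
# Hedged sketch: re-annotate a noisy task description with an off-the-shelf VLM.
PROMPT = (
    "Here is the current task description: {instruction}. "
    "Rewrite it as a short, clear command for a robot."
)

def relabel_episode(episode, vlm_generate):
    # episode: hypothetical dict with "task" and "frames" keys.
    raw = episode.get("task", "")
    frames = episode["frames"][:1]          # e.g. condition on a sample frame
    new_task = vlm_generate(images=frames, prompt=PROMPT.format(instruction=raw))
    episode["task"] = new_task.strip()
    return episode
```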
To further improve responsiveness and efficiency, SmolVLA introduces an asynchronous inference stack. Unlike traditional synchronous inference where the robot waits for a new action chunk prediction after executing the previous one (leading to idle time), asynchronous inference decouples action execution from observation processing and action prediction. A RobotClient continuously consumes actions from a queue while sending new observations to a PolicyServer (potentially remote) for new chunk prediction. A new chunk is requested when the current action queue falls below a certain threshold (g) relative to the chunk size (n). An observation similarity filter is used to avoid redundant server calls. This allows the robot to continue executing actions while the next chunk is being computed, reducing latency and enabling higher control rates, especially when inference is run on a more powerful remote server.
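The following sketch captures that control flow under assumed interfaces (`robot`, `policy_server.predict_chunk`, and the observation-similarity check are hypothetical, not the released API): a background thread requests a new chunk whenever the queue drops below g · n, while the main loop keeps executing queued actions.

```python
# Minimal sketch of asynchronous inference with a queue-depth trigger.
import threading
from collections import deque

CHUNK_SIZE = 50   # n: actions per predicted chunk
THRESHOLD = 0.7   # g: request a new chunk when the queue falls below g * n

class AsyncClient:
    def __init__(self, robot, policy_server):
        self.robot, self.server = robot, policy_server
        self.queue = deque()
        self.lock = threading.Lock()
        self.pending = False
        self.last_obs = None

    def _similar(self, obs):
        # Skip redundant server calls when the observation barely changed
        # (placeholder check; the real filter compares states within a tolerance).
        return self.last_obs is not None and obs.close_to(self.last_obs)

    def _request_chunk(self, obs):
        chunk = self.server.predict_chunk(obs)   # remote, possibly slow
        with self.lock:
            self.queue.extend(chunk)             # merge with remaining actions
            self.pending = False

    def step(self):
        obs = self.robot.get_observation()
        with self.lock:
            low = len(self.queue) < THRESHOLD * CHUNK_SIZE
            should_request = low and not self.pending and not self._similar(obs)
            if should_request:
                self.pending, self.last_obs = True, obs
            action = self.queue.popleft() if self.queue else None
        if should_request:
            threading.Thread(target=self._request_chunk, args=(obs,)).start()
        if action is not None:
            self.robot.execute(action)           # keep acting while the server computes
```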
For practical implementation and training:
- The model is trained end-to-end with imitation learning on community datasets and then evaluated in simulated (LIBERO [liu2023libero], Meta-World [yu2020metaworld]) and real-world settings using low-cost robots (SO100 and SO101 [cadene2024lerobot]).
- Training uses PyTorch with bfloat16 precision, `torch.compile()`, and Hugging Face's `accelerate` for multi-GPU support. Pretraining was done on 4 GPUs but can run on a single one due to the model's size. The action expert is primarily trained while the VLM is frozen.
- The main model size is 450 million parameters, with about 100 million in the action expert.
- Hyperparameters include a global batch size of 256 for pretraining, cosine learning rate schedule, AdamW optimizer, image resolution of 512x512, and action chunk size of 50.
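Collected as a configuration sketch (field names are illustrative; values not stated above, such as the learning rate, are omitted rather than guessed):

```python
# Illustrative pretraining configuration mirroring the hyperparameters listed above.
pretrain_config = dict(
    global_batch_size=256,
    optimizer="adamw",
    lr_schedule="cosine",
    image_resolution=(512, 512),
    action_chunk_size=50,
    precision="bfloat16",
    num_gpus=4,            # also runs on a single consumer GPU
    freeze_vlm=True,       # the VLM backbone stays frozen; only the expert is trained
)
```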
Experimental results show that SmolVLA achieves competitive success rates on both simulation and real-world benchmarks compared to larger models like π0 [black2024pi_0] and ACT [zhao2023learningact], despite using significantly fewer parameters and less pretraining data. For instance, on Meta-World, SmolVLA (0.45B) without robotics pretraining performs competitively with π0 (3.5B) pretrained on robotics data. Real-world evaluations on SO100 tasks show SmolVLA outperforming ACT and π0. The evaluation on SO101 demonstrates generalization capabilities to a different robot embodiment. The ablation studies provide practical guidance, showing the benefits of:
- Interleaving cross and causal self-attention in the expert.
- Using a causal attention mask for action tokens.
- Leveraging features from early VLM layers (N=L/2).
- Increasing action expert capacity (up to 0.75x VLM dimension).
- Using Flow Matching over L1 regression.
- Feeding sensorimotor states to the VLM rather than directly to the expert.
- Selecting an appropriate action chunk size (between 10 and 50).
- Updating observations more frequently (e.g., every 1 or 10 steps) during synchronous inference for better responsiveness at the cost of compute.
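For contrast with the asynchronous stack, a bare-bones synchronous loop illustrates how the chunk size and the observation-refresh interval trade responsiveness against compute; `robot` and `policy` are hypothetical interfaces.

```python
# Sketch of a synchronous control loop: re-predicting more often improves
# responsiveness but costs more compute, and the robot idles during prediction.
def run_synchronous(robot, policy, chunk_size=50, obs_every=10, max_steps=1000):
    actions, step = [], 0
    while step < max_steps:
        if step % obs_every == 0 or not actions:
            obs = robot.get_observation()
            actions = list(policy.predict_chunk(obs))[:chunk_size]  # robot idles here
        robot.execute(actions.pop(0))
        step += 1
```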
The asynchronous inference evaluation shows that, while maintaining similar success rates, asynchronous inference significantly reduces task-completion time and allows the robot to complete more tasks within a fixed time limit than synchronous inference, demonstrating improved efficiency and responsiveness in practice.
The authors open-source their code, models, training data, and robot hardware designs to promote further research on and adoption of affordable, efficient robotics. Limitations and directions for future work include the need for larger and more diverse datasets, scaling the model further without sacrificing efficiency, and exploring alternative VLM backbones or joint training on multimodal and robotics data. Future work could also apply SmolVLA to more complex, longer-horizon tasks and combine reinforcement learning with imitation learning.