SmolVLA (Efficient Vision-Language-Action Model)

Last updated: June 9, 2025

SmolVLA: Efficient Community-Driven Vision-Language-Action Models for Affordable Robotics

SmolVLA introduces a compact, open-source vision-language-action (VLA) model that addresses key scalability and accessibility challenges in robotics. By combining a lightweight architecture, efficient inference, and training on open, community-contributed datasets, SmolVLA matches or surpasses the performance of much larger models while greatly reducing the computational and hardware requirements necessary for robot policy learning and deployment (Shukor et al., 2 Jun 2025).

Significance and Background

Vision-language models (VLMs) pretrained on large-scale multimodal datasets have become foundational for robotic policy learning, providing strong perceptual representations for grounding language instructions in sensor data. However, most existing VLA models, adapted from VLMs, are exceedingly large—often exceeding several billion parameters—resulting in high training and inference costs and limiting deployment to high-end research computing environments. These approaches typically rely on large, sometimes proprietary, datasets housed within major academic or industrial research groups.

SmolVLA directly addresses these limitations by introducing a small-scale VLA architecture—less than 0.5B parameters—that can be trained and deployed on consumer GPUs or CPUs. The training corpus consists solely of open, community-contributed robotics datasets. SmolVLA also employs an asynchronous inference stack, further improving real-time responsiveness and efficiency (Shukor et al., 2 Jun 2025, Sections 3–5).

Foundational Concepts

Architectural Design

SmolVLA is composed of two primary modules (Shukor et al., 2 Jun 2025, Section 3):

  • Pretrained Vision-Language Model (VLM): Responsible for perception, the VLM processes tokenized natural language instructions along with RGB images (up to three camera views, each reduced to 64 visual tokens via spatial compression). Critically, only the early layers of the VLM are used—this "layer skipping" strategy significantly reduces computational burden while preserving downstream performance.
  • Action Expert: A transformer-based module dedicated to predicting continuous robot action trajectories. The action expert receives VLM features, the current sensorimotor state, and instruction tokens as input. Its architecture alternates cross-attention to perception features with causal self-attention among action tokens, providing both multimodal context and autoregressive consistency. Actions are output in chunks rather than single time-steps, supporting efficient batching and reducing policy lag (a minimal sketch of this interleaved attention pattern follows this list).
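
To make the interleaving concrete, the PyTorch sketch below shows one action-expert block that cross-attends to perception features and then applies causal self-attention over a chunk of action tokens. All module names, dimensions, and the chunk size are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the interleaved attention pattern described above.
# Names, dimensions, and chunk size are assumptions for exposition.
import torch
import torch.nn as nn


class ActionExpertBlock(nn.Module):
    """One block: cross-attention to VLM features, then causal self-attention
    over the action tokens, followed by a small feed-forward layer."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, actions: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Cross-attention: action tokens query the frozen VLM's perception features.
        x = actions + self.cross_attn(self.norm1(actions), context, context)[0]
        # Causal self-attention keeps each action token from attending to future steps.
        T = x.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        q = self.norm2(x)
        x = x + self.self_attn(q, q, q, attn_mask=causal)[0]
        return x + self.ff(self.norm3(x))


# Example: predict a chunk of 50 actions from perception features.
chunk, dim = 50, 512
action_tokens = torch.zeros(1, chunk, dim)        # learned queries in practice
vlm_features = torch.randn(1, 3 * 64 + 16, dim)   # 3 views x 64 visual tokens + text/state tokens
block = ActionExpertBlock(dim)
out = block(action_tokens, vlm_features)          # (1, 50, 512), projected to joint commands downstream
print(out.shape)
```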

These architectural choices enable SmolVLA to perform real-time inference on commodity hardware and allow for rapid fine-tuning on additional data as needed (Shukor et al., 2 Jun 2025, Sections 3.2–3.3).

Asynchronous Inference Stack

SmolVLA’s inference stack decouples perception and action prediction from physical execution using a queue-based streaming system (Shukor et al., 2 Jun 2025, Section 5.4). Actions are generated in chunks and streamed to the robot. As the action queue depletes below a threshold, a new observation is captured and the next chunk is inferred, often before the previous chunk is fully consumed. This design, which leverages observed similarity in robot joint-space to avoid redundant updates, maintains continuous action availability and improves system responsiveness, reducing latency by approximately 30% and doubling effective throughput compared to synchronous loops.
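
A rough Python sketch of this queue-based loop is shown below. The policy interface (`predict_chunk`), the refill threshold, and the joint-space similarity test are assumptions made for exposition; the paper's actual stack may differ in detail.

```python
# Illustrative sketch of threshold-triggered, chunked control with asynchronous
# refills. `policy`, `observe`, and `send_to_robot` are hypothetical callables.
import collections
import threading

import numpy as np

CHUNK_THRESHOLD = 10      # refill when fewer than this many actions remain queued
SIMILARITY_EPS = 1e-2     # skip inference if the joint state has barely changed

action_queue = collections.deque()
last_obs_state = None


def maybe_refill(policy, observe):
    """Capture a fresh observation and infer the next chunk in the background."""
    global last_obs_state
    obs = observe()                                   # images + joint state
    if last_obs_state is not None and np.linalg.norm(obs["state"] - last_obs_state) < SIMILARITY_EPS:
        return                                        # near-identical observation: reuse queued actions
    last_obs_state = obs["state"]

    def _infer():
        chunk = policy.predict_chunk(obs)             # e.g. an (n, dof) array of actions
        action_queue.extend(chunk)

    threading.Thread(target=_infer, daemon=True).start()


def control_step(policy, observe, send_to_robot):
    """One tick of the execution loop: act from the queue, refill asynchronously."""
    if len(action_queue) < CHUNK_THRESHOLD:
        maybe_refill(policy, observe)
    if action_queue:                                  # actions remain available while inference runs
        send_to_robot(action_queue.popleft())
```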

Key Developments and Empirical Findings

Training and Data

SmolVLA is trained exclusively on approximately 23,000 episodes sourced from 481 open, community-contributed robotics datasets—primarily manipulation tasks performed on SO-100 robots (Shukor et al., 2 Jun 2025, Section 4). Data curation involved standardizing task labels, normalizing camera views, and automatically annotating missing or noisy instructions with a compact pretrained VLM. Only the action expert is trained; the VLM parameters remain frozen throughout.

The entire pretraining regime (200k steps, batch size 256, bfloat16) is optimized for consumer hardware, requiring about 30,000 GPU hours—an order of magnitude less computational expense than prior large-scale VLAs (Shukor et al., 2 Jun 2025, Section 5.3).
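
The minimal PyTorch sketch below illustrates the frozen-backbone setup: every VLM parameter has gradients disabled and the optimizer sees only the action expert. The tiny linear stand-ins and the placeholder L2 loss are assumptions for illustration; they are not the paper's actual architecture or training objective.

```python
# Hedged sketch of "frozen VLM, trainable action expert" training.
import torch
import torch.nn as nn

vlm = nn.Linear(128, 128)              # stand-in for the pretrained VLM backbone
action_expert = nn.Linear(128, 7)      # stand-in for the action expert head

# Freeze every VLM parameter; gradients flow only through the action expert.
for p in vlm.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in action_expert.parameters() if p.requires_grad), lr=1e-4
)

# One illustrative bfloat16 step (the paper reports ~200k steps at batch size 256).
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    features = vlm(torch.randn(256, 128))
    pred_actions = action_expert(features)

# Placeholder L2 loss; the paper's actual objective over action chunks differs.
loss = torch.nn.functional.mse_loss(pred_actions.float(), torch.zeros(256, 7))
loss.backward()
optimizer.step()
```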

Performance and Benchmarking

Simulation Benchmarks

In simulation environments (MetaWorld, LIBERO), SmolVLA (0.45B parameters) achieves:

  • 57.3% average success rate on MetaWorld tasks.
  • 87.3% average success rate on LIBERO tasks.

These results are on par with or better than models such as π₀ (3.5B parameters; 47–50% on MetaWorld, 86% on LIBERO) and OpenVLA (7B parameters; 76.5–88% on LIBERO), despite SmolVLA's much smaller size and lower compute requirements (Shukor et al., 2 Jun 2025, Table 1).

Real-World Benchmarks

In real-world experiments with SO-100 and SO-101 robots—across tasks like pick-place, stacking, and sorting—SmolVLA (0.45B) achieves a 78.3% average multi-task success rate, outperforming π₀ (61.7%) and ACT (80M parameters, 48.3%) under similar conditions (Shukor et al., 2 Jun 2025, Section 5.5).

Efficiency

  • Memory and Computation: SmolVLA uses 6–10× less memory and is 40% more compute-efficient than leading alternatives at similar or higher performance levels (Shukor et al., 2 Jun 2025).
  • Practicality: The model runs in real time on CPUs and commodity GPUs.
  • Async Inference: The asynchronous stack reduces average task time by ~30% and doubles effective throughput while maintaining or increasing success rates (Shukor et al., 2 Jun 2025, Fig. 2 and Table 5).

Ablation and Efficiency Analysis

Ablation studies establish that:

  • Layer skipping, aggressive visual token reduction, and action expert design are essential for high performance at low cost.
  • Using VLM features from earlier layers, rather than the deepest ones, is more efficient and can improve task performance for control (Shukor et al., 2 Jun 2025, Section 5.6); a minimal layer-truncation sketch follows this list.
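
A minimal sketch of the layer-skipping idea, assuming a generic transformer encoder as a stand-in for the pretrained VLM: only the first few blocks are executed, and their intermediate output is handed to the action expert.

```python
# Sketch of "layer skipping": run only the first N transformer layers and use
# those intermediate features. The encoder here is a stand-in, not the real VLM.
import torch
import torch.nn as nn

depth, keep_layers, dim = 12, 6, 256   # keep roughly the first half of the stack
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
    for _ in range(depth)
)


def encode(tokens: torch.Tensor) -> torch.Tensor:
    """Stop after `keep_layers` blocks instead of running the full depth."""
    x = tokens
    for layer in layers[:keep_layers]:
        x = layer(x)
    return x                           # intermediate features fed to the action expert


features = encode(torch.randn(1, 3 * 64, dim))   # e.g. 3 views x 64 visual tokens
print(features.shape)
```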

Comparative Table

| Aspect | SmolVLA (0.45B) | π₀ (3.5B) | OpenVLA (7B) |
|---|---|---|---|
| Training episodes | ~23k (community) | 1M+ (proprietary) | 1M+ |
| Inference hardware | Consumer GPU/CPU | High-end GPUs/TPUs | GPUs/TPUs |
| Real-robot success rate | 78.3% (SO-100/101) | 61.7% | n/a |
| Simulation success rate (MetaWorld / LIBERO) | 57–68% / 87% | 47–50% / 86% | 76.5–88% |
| Memory/compute | ~1/10th of SOTA | Very high | High |
| Notable features | Interleaved cross-/causal attention; async inference | — | — |

Current Applications and State of the Art

SmolVLA is tested on a range of manipulation tasks using both simulated environments and physical robots. The model accepts natural language instructions, arbitrary camera inputs, and sensor states, producing efficient and robust policies suitable for rapid research prototyping and real-world deployment—without retraining the perception backbone.

All code, pretrained models, and datasets are publicly released, supporting reproducibility and further community-led advancement (Shukor et al., 2 Jun 2025).

Comparative Assessment

SmolVLA distinguishes itself from prior VLAs by:

  • Enabling practical robot control on consumer hardware.
  • Reducing control latency and improving throughput via asynchronous chunked inference.
  • Leveraging open, heterogeneous, community-driven datasets, moving away from proprietary data.
  • Achieving competitive or superior task completion at a fraction of the scale and computational footprint of larger models.

Emerging Trends and Future Directions

The SmolVLA framework highlights several key trends in robotics and AI:

  • Architectural Compression and Efficiency: Smaller VLA models, when designed with judicious feature selection (layer skipping, token reduction) and efficient attention mechanisms, can match the performance of much larger systems.
  • Community-First Data Collection: Using open and diverse real-world datasets improves policy robustness and coverage while reducing reliance on closed data.
  • Asynchronous Control Models: Decoupling action prediction from execution is becoming a best practice for maximizing real-world productivity and control responsiveness.
  • Democratization: SmolVLA’s public release and hardware accessibility promote broader participation in robot learning and application.

Limitations persist, particularly in the evaluation of long-horizon, dexterous, or multi-agent tasks. The reliance on a frozen VLM also raises questions about adaptation to drastically novel environments—a challenge shared across competing approaches (Shukor et al., 2 Jun 2025).

Conclusion

SmolVLA demonstrates that small, efficiently engineered VLA models can offer high-quality robot learning and control with affordable resource requirements. By combining a compact architecture, rigorous data curation, and asynchronous inference design, SmolVLA achieves performance on par with or exceeding that of established billion-parameter models, delivering broader accessibility for research, education, and real-world deployment. The model, codebase, and datasets are released for community use, providing a strong foundation for collaborative advancement in multimodal robot learning (Shukor et al., 2 Jun 2025).


Speculative Note

Broader application of SmolVLA’s design—integrating open community data, minimal-sized architectures, and asynchronous decision pipelines—may influence other AI fields that require scalable, real-time multimodal inference under practical hardware constraints. Further advances are likely to depend on both technical innovation in perception-action reasoning and continued growth of open-data collaborations.

