SmolVLM: Redefining small and efficient multimodal models (2504.05299v1)

Published 7 Apr 2025 in cs.AI and cs.CV

Abstract: Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.

Authors (17)

Andrés Marafioti
Orr Zohar
Miquel Farré
Merve Noyan
Elie Bakouch
Pedro Cuenca
Cyril Zakka
Loubna Ben Allal
Anton Lozhkov
Nouamane Tazi
Vaibhav Srivastav
Joshua Lochner
Hugo Larcher
Mathieu Morlon
Lewis Tunstall
Leandro von Werra
Thomas Wolf

Summary

The paper introduces SmolVLM, a family of compact Vision-Language Models (VLMs) engineered to avoid the computational cost of larger counterparts such as Flamingo and Idefics. The research focuses on multimodal models that keep performance high while reducing the compute and memory needed for inference, making them better suited to deployment on mobile and edge devices.

Architectural Innovations and Design Choices

SmolVLM models are designed around architectural optimizations that minimize GPU memory usage during inference. The smallest variant, SmolVLM-256M, runs in less than 1GB of GPU memory yet outperforms the 300-times larger Idefics-80B, showing how much efficiency is available to small-scale VLMs. The largest SmolVLM model, with 2.2 billion parameters, rivals state-of-the-art VLMs that consume twice as much GPU memory.
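To put the memory figures in context, the following is a rough back-of-the-envelope sketch, not a measurement from the paper. It estimates only the memory taken by the model weights, assuming the nominal parameter counts in the model names and a 16-bit weight format; activations, the KV cache, and framework overhead come on top. The helper name and the dtype table are illustrative.

```python
# Rough lower bound on inference memory: bytes needed to hold the weights.
# Assumption: nominal parameter counts and 16-bit weights; activations and
# the KV cache are NOT included, so real usage is higher.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str = "bf16") -> float:
    """Return the weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal sizes taken from the model names in the text.
MODELS = {
    "SmolVLM-256M": 256_000_000,
    "SmolVLM-2.2B": 2_200_000_000,
}

for name, num_params in MODELS.items():
    print(f"{name}: {weight_memory_gb(num_params):.2f} GB of weights in bf16")
```

Under these assumptions the 256M model needs about half a gigabyte for weights alone, which is consistent with the reported total of under 1GB.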

The paper details several design strategies:

Balanced Compute Allocation: By systematically varying how parameters are split between the vision encoder and the language model (LM), the paper shows that compact LMs pair best with smaller vision encoders, so that neither component takes a disproportionate share of the compute budget.
Tokenization and Compression: The researchers apply aggressive token compression through pixel shuffle, which rearranges spatial features into channels and so sharply reduces the number of visual tokens. The compression ratio has to be chosen with care, because tasks that depend on spatial fidelity, such as OCR, are sensitive to it.
Extended Context Length: Extending the context window to up to 16k tokens lets the models process larger and more complex visual inputs, improving performance on both image and video tasks.
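The token-compression idea can be illustrated with a minimal sketch. It implements the space-to-depth rearrangement on a grid of visual tokens in plain Python, assuming a shuffle ratio r that divides the grid size; the function names and the toy grid are illustrative and are not taken from the SmolVLM code. Each r x r block of neighbouring tokens is merged into a single token whose feature vector is r*r times longer, so the token count drops by a factor of r*r.

```python
from typing import List

# A grid of visual tokens indexed as grid[row][col][channel].
Grid = List[List[List[float]]]


def pixel_shuffle(grid: Grid, r: int) -> Grid:
    """Merge each r x r block of tokens into one token with r*r times the channels."""
    height, width = len(grid), len(grid[0])
    if height % r or width % r:
        raise ValueError("grid height and width must be divisible by r")
    out: Grid = []
    for top in range(0, height, r):
        row = []
        for left in range(0, width, r):
            merged: List[float] = []
            # Concatenate the features of the block in row-major order.
            for dy in range(r):
                for dx in range(r):
                    merged.extend(grid[top + dy][left + dx])
            row.append(merged)
        out.append(row)
    return out


def token_count(grid: Grid) -> int:
    """Number of tokens the language model would see for this grid."""
    return len(grid) * len(grid[0])


# Toy 4x4 grid with 2 channels per token (sizes chosen only for illustration).
toy = [[[float(y * 4 + x), 0.0] for x in range(4)] for y in range(4)]
shuffled = pixel_shuffle(toy, r=2)
print(token_count(toy), "->", token_count(shuffled))   # 16 -> 4 tokens
print(len(toy[0][0]), "->", len(shuffled[0][0]))       # 2 -> 8 channels
```

No information is discarded by the rearrangement itself; the trade-off is that each remaining token has to carry the content of a larger image region, which is why fine-grained tasks can suffer at high ratios.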

Instruction Tuning and Training Data

Instruction tuning for SmolVLM requires care in tokenization and prompt structuring. The research finds that learned positional tokens outperform string-based position markers, especially in smaller models, improving OCR accuracy and training stability. Structured prompts, combined with masking strategies during supervised fine-tuning, also proved beneficial for task generalization.
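As an illustration of what masking during supervised fine-tuning can look like, here is a minimal sketch. It assumes one common scheme, in which only the assistant's tokens contribute to the loss and all prompt tokens are replaced with an ignore label; the summary does not spell out the exact scheme used for SmolVLM, so the roles, token ids, and function name below are hypothetical. The value -100 is the default ignore_index of PyTorch's cross-entropy loss.

```python
from typing import List, Tuple

# Label value skipped by the loss; -100 is the default ignore_index of
# PyTorch's cross-entropy loss.
IGNORE_INDEX = -100


def build_labels(segments: List[Tuple[str, List[int]]]) -> Tuple[List[int], List[int]]:
    """Concatenate conversation segments and mask every non-assistant token.

    Assumption: only tokens in "assistant" segments are supervised.
    """
    input_ids: List[int] = []
    labels: List[int] = []
    for role, ids in segments:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)                         # supervised tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))   # masked prompt tokens
    return input_ids, labels


# Hypothetical token ids for a single-turn conversation.
conversation = [
    ("system", [1, 2]),
    ("user", [10, 11, 12]),
    ("assistant", [20, 21]),
]
input_ids, labels = build_labels(conversation)
print(input_ids)
print(labels)
```

With this setup the model still conditions on the full prompt but is only trained to reproduce the answer, which discourages it from memorizing repeated question templates.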

The models were trained on a mixture of diverse datasets, including synthetic data for both image and video tasks, balanced so that the added visual training does not compromise the model's linguistic capabilities.

Evaluation and Performance

SmolVLM was evaluated across a range of benchmarks and compared with other efficient state-of-the-art models. It maintains strong performance while using significantly less memory. For instance, SmolVLM-256M achieves scores competitive with much larger models on demanding benchmarks such as Video-MME and AI2D.

The models' ability to generalize to video tasks is particularly noteworthy: they reach competitive scores with minimal computational overhead, which makes them well suited to real-time, on-device applications.

Implications and Future Directions

This research makes it practical to deploy multimodal models in environments where computational resources are limited. The open-source release of the SmolVLM models and their associated datasets supports further research in the field.

Looking forward, the principles established in this work could steer future model development toward computational efficiency without sacrificing performance. As AI systems become increasingly integrated into everyday devices, models like SmolVLM show that strong capabilities are achievable at a fraction of the cost, which could change how these technologies are applied in mobile and edge environments. Beyond addressing current limitations, the work offers a reference point for efficient model design and opens avenues for further exploration in multimodal and real-time applications.