- The paper presents a progressive training framework and modular architecture that enable efficient multimodal understanding with moderate parameter scales.
- The model leverages a high-quality data curation pipeline and synthetic VQA generation to enhance visual grounding and diverse reasoning tasks.
- Empirical results show that SAIL-VL2 achieves state-of-the-art performance across 106 benchmarks, challenging the need for massive parameter counts.
SAIL-VL2: An Open-Suite Vision-Language Foundation Model for Efficient Multimodal Understanding and Reasoning
Introduction
SAIL-VL2 represents a significant advance in the design and training of large vision-language models (LVMs), targeting efficient yet high-performing multimodal understanding and reasoning at moderate parameter scales (2B, 8B, and MoE variants). The model suite is distinguished by three core innovations: a large-scale, quality-focused data curation pipeline; a progressive, multi-stage training framework; and architectural advances spanning both dense and sparse Mixture-of-Experts (MoE) LLM backbones. SAIL-VL2 achieves state-of-the-art results across 106 benchmarks, including challenging reasoning tasks such as MMMU and MathVista, and leads the OpenCompass leaderboard for open-source models under 4B parameters.
Model Architecture
SAIL-VL2 adopts a modular architecture comprising a vision encoder (SAIL-ViT), a lightweight vision-language adapter, and a flexible LLM backbone (Qwen3 series, both dense and MoE). The vision encoder is based on a progressively trained ViT, designed to align visual features with the LLM's representation space. The adapter, a two-layer MLP, projects visual embeddings into the language domain, facilitating joint multimodal processing.
Figure 2: SAIL-VL2 framework: SAIL-ViT encodes visual inputs, the adapter projects them into the LLM space, and the LLM processes both modalities for unified reasoning.
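To make the modular layout concrete, the sketch below wires a stand-in vision encoder, a two-layer MLP adapter, and a stand-in language backbone into a single forward pass. All module choices and dimensions here are illustrative placeholders, not the released SAIL-VL2 implementation.

```python
import torch
import torch.nn as nn

class VLForwardSketch(nn.Module):
    """Minimal sketch of the modular layout: vision encoder -> MLP adapter -> LLM.
    The stand-in encoder/backbone and all dimensions are placeholders."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        # Stand-in for SAIL-ViT: any module mapping pixels to patch features works here.
        self.vision_encoder = nn.Conv2d(3, vision_dim, kernel_size=14, stride=14)
        # Two-layer MLP adapter projecting visual embeddings into the LLM token space.
        self.adapter = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Stand-in for the Qwen3 backbone (dense or MoE): a tiny transformer.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        feats = self.vision_encoder(pixel_values)        # (B, vision_dim, H/14, W/14)
        patches = feats.flatten(2).transpose(1, 2)       # (B, N_patches, vision_dim)
        visual_tokens = self.adapter(patches)            # (B, N_patches, llm_dim)
        # Visual tokens are prepended to the text embeddings for joint processing.
        tokens = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(tokens)

# Example: one 448x448 image plus 16 text-token embeddings.
model = VLForwardSketch()
out = model(torch.randn(1, 3, 448, 448), torch.randn(1, 16, 2048))
print(out.shape)  # torch.Size([1, 1040, 2048]) -> 1024 visual + 16 text tokens
```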
The architecture supports both fixed-resolution and arbitrary-resolution visual inputs, with SAIL-ViT-AnyRes employing interpolation-based positional embeddings for flexible input handling. The LLM backbone is instantiated with Qwen3-Instruct (dense) or Qwen3-MoE, with the latter leveraging sparse expert activation for parameter-efficient scaling.
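A common way to support arbitrary resolutions, consistent with the interpolation-based positional embeddings mentioned above, is to resize the encoder's learned position grid to the incoming patch grid. The function below is a generic sketch of that trick; the grid sizes, embedding dimension, and interpolation mode are assumptions rather than SAIL-ViT-AnyRes specifics.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_hw: tuple) -> torch.Tensor:
    """Resize a (1, H*W, D) grid of learned positional embeddings to a new patch grid.

    Generic arbitrary-resolution ViT trick, not the exact SAIL-ViT-AnyRes code.
    """
    _, n, d = pos_embed.shape
    old = int(n ** 0.5)                                            # assume a square source grid
    grid = pos_embed.reshape(1, old, old, d).permute(0, 3, 1, 2)   # (1, D, old, old)
    grid = F.interpolate(grid, size=new_hw, mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], d)

# Example: embeddings trained for a 24x24 patch grid, reused for a 32x20 grid.
pos = torch.randn(1, 24 * 24, 1024)
print(interpolate_pos_embed(pos, (32, 20)).shape)  # torch.Size([1, 640, 1024])
```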
Data Curation and Pre-Training
A central innovation in SAIL-VL2 is the data curation pipeline, which systematically scores and filters multimodal corpora for quality and diversity. SAIL-Caption2, the primary caption dataset, is refined using judge models trained to assess Visual Information Richness (VIR) and Image-Text Alignment (ITA), resulting in a high-quality corpus of 250M general and 1.69M chart captions. Synthetic VQA data is generated by transforming captions into QA pairs using LLMs, further expanding the training distribution.
Figure 1: SAIL-VL2 data construction pipeline: open-source and synthetic data are curated, filtered, and organized for different training stages.
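A simplified view of the quality-filtering step: each image-caption pair is scored by judge models for Visual Information Richness (VIR) and Image-Text Alignment (ITA), and only pairs clearing both thresholds survive. The judges and thresholds below are toy stand-ins for the paper's trained scorers.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class CaptionSample:
    image_path: str
    caption: str

def filter_caption_corpus(
    samples: Iterable[CaptionSample],
    vir_judge: Callable[[CaptionSample], float],  # stand-in for the trained VIR judge model
    ita_judge: Callable[[CaptionSample], float],  # stand-in for the trained ITA judge model
    vir_threshold: float = 0.5,                   # illustrative threshold, not from the paper
    ita_threshold: float = 0.5,
) -> List[CaptionSample]:
    """Keep only samples whose judge scores clear both quality thresholds."""
    kept = []
    for sample in samples:
        if vir_judge(sample) >= vir_threshold and ita_judge(sample) >= ita_threshold:
            kept.append(sample)
    return kept

# Example with trivial stand-in judges (real judges would be learned scorers).
corpus = [CaptionSample("img_0.jpg", "A detailed street scene with signs and vehicles."),
          CaptionSample("img_1.jpg", "image")]
toy_judge = lambda s: min(len(s.caption) / 50.0, 1.0)  # toy proxy: longer captions score higher
print(len(filter_caption_corpus(corpus, toy_judge, toy_judge)))  # 1
```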
Pre-training proceeds in two stages: basic multimodal pre-training (alignment and captioning/OCR tasks) and multi-task pre-training (instruction tuning, VQA, math reasoning). AdaLRS, a dynamic learning rate scheduler, is employed to optimize convergence during the initial stage. Data resampling at both dataset and linguistic levels mitigates distributional bias and enhances diversity.
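The summary above does not spell out AdaLRS's update rule, so the sketch below only illustrates the general idea of a loss-guided learning-rate adjustment: scale the learning rate up while the loss keeps improving and back off otherwise. The window size and scale factors are invented for illustration and are not AdaLRS's actual parameters.

```python
from collections import deque

class LossGuidedLRScaler:
    """Toy loss-guided LR adjustment. Conveys the idea of adapting the learning rate
    from observed loss trends; not the AdaLRS algorithm itself."""

    def __init__(self, optimizer, window: int = 100, up: float = 1.1, down: float = 0.5):
        self.optimizer = optimizer          # any torch-style optimizer with param_groups
        self.window = window
        self.up, self.down = up, down
        self.losses = deque(maxlen=2 * window)

    def step(self, loss_value: float) -> None:
        self.losses.append(loss_value)
        if len(self.losses) < 2 * self.window:
            return
        recent = sum(list(self.losses)[-self.window:]) / self.window
        earlier = sum(list(self.losses)[:self.window]) / self.window
        # If loss is still dropping, try a slightly larger LR; otherwise back off.
        factor = self.up if recent < earlier else self.down
        for group in self.optimizer.param_groups:
            group["lr"] *= factor
        self.losses.clear()

# Usage: scaler = LossGuidedLRScaler(optimizer); call scaler.step(loss.item()) each step.
```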
Scaling experiments demonstrate monotonic performance improvements with increased data volume, particularly when synthetic VQA data is included, underscoring the importance of large, diverse corpora for robust multimodal alignment.
Figure 3: Scaling curves for SAIL-VL2-2B during multi-task pre-training, showing consistent gains across benchmarks as data volume increases.
Progressive Training and Post-Training Paradigms
The training pipeline is highly progressive, beginning with vision encoder adaptation (adapter-only tuning), followed by fine-grained alignment (vision encoder and adapter), and culminating in joint multimodal reasoning (all parameters unfrozen). This staged approach enables stepwise knowledge injection and robust cross-modal alignment.
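The staged schedule can be expressed as a mapping from training stage to trainable modules. Which components are unfrozen at each step follows the description above; the module attribute names assume the hypothetical `VLForwardSketch` layout from the architecture sketch.

```python
import torch

# Trainable modules per progressive stage, per the description above.
STAGES = {
    "adapter_warmup":     {"vision_encoder": False, "adapter": True, "llm": False},
    "fine_grained_align": {"vision_encoder": True,  "adapter": True, "llm": False},
    "joint_multimodal":   {"vision_encoder": True,  "adapter": True, "llm": True},
}

def configure_stage(model: torch.nn.Module, stage: str) -> None:
    """Freeze or unfreeze model components according to the current training stage."""
    for name, trainable in STAGES[stage].items():
        module = getattr(model, name)  # assumes attributes named as in the sketch above
        for p in module.parameters():
            p.requires_grad_(trainable)

# Example with the VLForwardSketch module from the architecture sketch:
# configure_stage(model, "adapter_warmup")  # only the adapter receives gradients
```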
Post-training consists of supervised fine-tuning (SFT) on curated instruction datasets (SAIL-Instruction2), followed by a "thinking-fusion" paradigm that combines SFT with reinforcement learning (RL) using both verifiable and mixed reward systems. Chain-of-Thought (CoT) data and RL with verifiable rewards are used to explicitly incentivize step-by-step reasoning, while think-fusion SFT and RL with mixed rewards further enhance logical coherence and output structure.
Figure 4: Analysis of instruction data quality and scale: SAIL-Instruction2 consistently yields superior SFT results, validating the data curation pipeline.
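For the RL-with-verifiable-rewards stage, a typical recipe on math-style VQA is to parse the model's final answer, compare it against the ground-truth label, and optionally add a small bonus for well-formed chain-of-thought output. The `<think>`/`<answer>` tag convention and reward values below are illustrative assumptions, not the paper's exact reward design.

```python
import re

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Toy verifiable reward: 1.0 for a correct final answer, plus a small format bonus.

    Assumes a <think>...</think><answer>...</answer> output convention, which is a
    common CoT-RL format but an assumption here, not the paper's specification.
    """
    well_formed = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.S))
    match = re.search(r"<answer>(.*?)</answer>", response, re.S)
    predicted = match.group(1).strip() if match else response.strip()
    correct = predicted == ground_truth.strip()
    return (1.0 if correct else 0.0) + (0.1 if well_formed else 0.0)

print(verifiable_reward("<think>2+2=4</think><answer>4</answer>", "4"))  # 1.1
```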
Architectural and Training Infrastructure
SAIL-VL2 incorporates several infrastructure optimizations to maximize training efficiency:
- Stream Packing: Online packing of variable-length language and vision tokens minimizes padding, maximizes GPU utilization, and balances workloads across devices. Visual packing ensures balanced computation for the vision encoder, especially in AnyRes settings (see the packing sketch after this list).
- MoE Infrastructure: For MoE models, kernel fusion and hardware-adapted distributed frameworks (Megatron on Ascend, DeepSpeed ZeRO-2 on NVIDIA) address computational and communication bottlenecks, enabling efficient scaling to 30B+ parameters with sparse activation.
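As referenced in the first bullet, the core trick behind stream packing is bin-packing variable-length token sequences so that padding is minimized. The sketch below is an offline, greedy first-fit stand-in; the production system packs online, mixes language and vision tokens, and balances load across devices.

```python
from typing import List

def pack_sequences(lengths: List[int], capacity: int) -> List[List[int]]:
    """Greedy first-fit packing of sequence lengths into bins of at most `capacity` tokens.

    A minimal offline stand-in for stream packing, not the production implementation.
    """
    bins: List[List[int]] = []
    loads: List[int] = []
    # Sort longest-first so large samples claim bins before small fillers.
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for b, load in enumerate(loads):
            if load + lengths[idx] <= capacity:
                bins[b].append(idx)
                loads[b] += lengths[idx]
                break
        else:
            bins.append([idx])
            loads.append(lengths[idx])
    return bins

# Example: pack token counts into 4096-token bins; padding = capacity - bin load.
lengths = [3800, 2100, 1900, 900, 700, 300]
print(pack_sequences(lengths, 4096))  # [[0], [1, 2], [3, 4, 5]] -> 3 bins instead of 6
```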
Empirical Results
SAIL-VL2 demonstrates strong empirical performance across a wide spectrum of benchmarks. As noted above, the suite achieves state-of-the-art results across 106 benchmarks, spanning general multimodal understanding as well as reasoning tasks such as MMMU and MathVista, and it leads the OpenCompass leaderboard among open-source models under 4B parameters.
Theoretical and Practical Implications
The SAIL-VL2 framework demonstrates that high-quality data curation, progressive training, and efficient architectural design can yield compact LVMs with performance rivaling or surpassing much larger models. The staged training paradigm, particularly the integration of CoT and RL-based post-training, is shown to be critical for advanced reasoning capabilities. The MoE variant provides a practical path for scaling without prohibitive computational costs, maintaining high performance with sparse activation.
The empirical results challenge the prevailing assumption that only massive parameter counts can deliver state-of-the-art multimodal reasoning, highlighting the importance of data quality, alignment strategies, and targeted post-training.
Future Directions
Potential future developments include:
- Further architectural efficiency via more advanced MoE routing and expert specialization.
- Expansion of the data curation pipeline to cover additional modalities (e.g., audio, 3D).
- Enhanced RL paradigms for more nuanced reward modeling and self-improvement.
- Broader deployment in resource-constrained environments, leveraging the efficiency of SAIL-VL2's design.
Conclusion
SAIL-VL2 establishes a new standard for efficient, high-performing open-source LVMs. Through innovations in data curation, progressive training, and architectural design, it achieves state-of-the-art results across a comprehensive suite of benchmarks, including challenging reasoning tasks. The model suite demonstrates that, with careful design, compact LVMs can deliver robust multimodal understanding and reasoning, providing a scalable and extensible foundation for the open-source community and future research in multimodal AI.