- The paper introduces SAIL-VL, a 2B-parameter vision-language model trained on data from a meticulously curated pipeline for enhanced visual understanding.
- The paper demonstrates that pretraining on 131 billion tokens significantly boosts performance, with gains that follow a logarithmic scaling law.
- The paper outlines comprehensive guidelines for data curation and instruction tuning, resulting in top-tier scores on 19 benchmarks.
Scalable Vision LLM Training via High-Quality Data Curation
The paper introduces SAIL-VL, an open-source vision language model (VLM) that achieves state-of-the-art (SOTA) performance at the 2-billion-parameter (2B) scale. The central contribution is the careful curation of high-quality data, which drives the model's strong results. The authors present a comprehensive pipeline for constructing large-scale, high-quality visual understanding datasets, enabling the model to rank at the top across multiple benchmarks.
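The summary does not spell out the pipeline's filtering rules, but curation passes of this kind typically combine model-based quality scoring with rule-based gates and deduplication. The sketch below is a minimal, hypothetical illustration; the `CaptionSample` fields, thresholds, and gates are assumptions, not the authors' actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class CaptionSample:
    image_id: str
    caption: str
    quality_score: float  # e.g., output of a caption-scoring model, in [0, 1]

def curate(samples, min_score=0.8, min_chars=20):
    """Apply simple quality gates and per-image deduplication (illustrative)."""
    seen_images = set()
    kept = []
    for s in samples:
        if s.quality_score < min_score:  # drop low-quality captions
            continue
        if len(s.caption) < min_chars:   # drop trivially short captions
            continue
        if s.image_id in seen_images:    # keep one caption per image
            continue
        seen_images.add(s.image_id)
        kept.append(s)
    return kept

raw = [
    CaptionSample("img1", "A detailed street scene with cyclists and shops.", 0.93),
    CaptionSample("img1", "A street.", 0.91),                               # too short
    CaptionSample("img2", "An image of something, maybe outdoors.", 0.42),  # low score
]
print([s.image_id for s in curate(raw)])  # -> ['img1']
```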
Key Contributions
- Data Pipeline Implementation: The authors build a scalable pipeline for constructing visual understanding data, yielding the SAIL-Caption dataset, which surpasses previously available caption datasets in both quantity and quality.
- Demonstration of Data Scaling Laws: SAIL-VL is pretrained on 131 billion tokens, showing that even compact models benefit from larger data scales. Performance grows logarithmically with data size (a worked illustration follows this list).
- Guidelines for Instruction Tuning: The paper distills general strategies for curating and scaling instruction datasets, which support robust model performance across a variety of tasks.
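To make the logarithmic relationship concrete, the snippet below fits score ~ a + b * ln(tokens) to a few checkpoint scores. The token counts and scores are illustrative placeholders, not the paper's measurements; only the fitting procedure reflects the claimed scaling behavior.

```python
import numpy as np

# Hypothetical (token count, benchmark score) pairs; the paper reports
# logarithmic gains, but these specific numbers are illustrative.
tokens = np.array([1e9, 8e9, 32e9, 131e9])
scores = np.array([52.0, 58.1, 62.0, 66.2])

# Fit score ~= a + b * ln(tokens); a logarithmic law is linear in ln-space.
b, a = np.polyfit(np.log(tokens), scores, deg=1)

# Under this fit, each doubling of pretraining tokens adds b * ln(2) points.
print(f"score ~= {a:.1f} + {b:.2f} * ln(tokens)")
print(f"gain per doubling ~= {b * np.log(2):.2f} points")
```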
Methodology Overview
SAIL-VL is trained in several stages:
- Pretraining Stages: These stages build foundational visual understanding capabilities by aligning visual and language representations on large-scale datasets.
- Supervised Fine-Tuning (SFT): This phase applies multiple fine-tuning stages that successively instill knowledge, instruction following, and preference learning from meticulously curated datasets (a sketch of such a staged schedule follows this list).
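The exact stage definitions and data mixes belong to the paper; the following is only a minimal sketch of how such a curriculum, alignment, then broad pretraining, then staged SFT ending in preference learning, might be encoded. The stage names, data mixes, unfrozen modules ("projector", "llm"), and objectives are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str        # stage label
    data_mix: dict   # dataset name -> sampling weight
    trainable: list  # modules unfrozen in this stage
    objective: str   # training objective

# Illustrative curriculum, not the paper's exact recipe: data quality and
# task specialization increase from one stage to the next.
SCHEDULE = [
    Stage("alignment", {"captions": 1.0}, ["projector"], "next-token"),
    Stage("pretraining", {"captions": 0.7, "interleaved": 0.3}, ["projector", "llm"], "next-token"),
    Stage("sft-knowledge", {"vqa": 0.5, "captions": 0.5}, ["projector", "llm"], "next-token"),
    Stage("sft-instruct", {"instructions": 1.0}, ["projector", "llm"], "next-token"),
    Stage("sft-preference", {"preference_pairs": 1.0}, ["llm"], "preference"),
]

for s in SCHEDULE:
    print(f"{s.name}: train {s.trainable} on {list(s.data_mix)} via {s.objective}")
```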
Across these stages the data budget grows roughly exponentially while data quality becomes progressively more refined, letting the model first acquire broad capabilities and then sharpen them. SAIL-VL completes this curriculum with top-tier benchmark results, outperforming VLMs of comparable size.
SAIL-VL achieves the highest average score across 19 benchmarks and leads the OpenCompass leaderboard among VLMs of similar parameter size. This performance highlights the value of detailed, high-quality data in both the pretraining and instruction-tuning stages.
Implications and Future Work
The work has promising implications for applications that demand robust visual comprehension and strong instruction following. Extending these methodologies to larger models is a natural next step, as is a broader study of scaling laws and the effects of high-quality data; such work could improve both the generalization ability and the interpretability of VLMs across a wider range of tasks.
Overall, this research underscores the importance of scalable, high-quality data as a foundation for training highly effective VLMs, with potential implications across numerous fields within artificial intelligence.