- The paper introduces SAIL-VL, a 2B-parameter vision-language model trained on data from a meticulously curated pipeline for enhanced visual understanding.
- The paper demonstrates that pretraining on 131 billion tokens significantly boosts performance, with gains that follow a logarithmic scaling law.
- The paper outlines comprehensive guidelines for data curation and instruction tuning, resulting in top-tier scores on 19 benchmarks.
Scalable Vision LLM Training via High-Quality Data Curation
The paper introduces SAIL-VL, an open-source vision language model (VLM) that achieves state-of-the-art (SOTA) performance at the 2-billion-parameter (2B) scale. The central contribution is the careful curation of high-quality data, which drives the model's strong results. The authors present a comprehensive pipeline for constructing large-scale, high-quality visual understanding datasets, enabling the model to rank at the top across multiple benchmarks.
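The summary does not spell out the pipeline's filtering rules, but curation passes of this kind typically combine model-based quality scoring with rule-based gates and deduplication. The sketch below is a minimal, hypothetical illustration; the `CaptionSample` fields, thresholds, and gates are assumptions, not the authors' actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class CaptionSample:
    image_id: str
    caption: str
    quality_score: float  # e.g., output of a caption-scoring model, in [0, 1]

def curate(samples, min_score=0.8, min_chars=20):
    """Apply simple quality gates and per-image deduplication (illustrative)."""
    seen_images = set()
    kept = []
    for s in samples:
        if s.quality_score < min_score:  # drop low-quality captions
            continue
        if len(s.caption) < min_chars:   # drop trivially short captions
            continue
        if s.image_id in seen_images:    # keep one caption per image
            continue
        seen_images.add(s.image_id)
        kept.append(s)
    return kept

raw = [
    CaptionSample("img1", "A detailed street scene with cyclists and shops.", 0.93),
    CaptionSample("img1", "A street.", 0.91),                               # too short
    CaptionSample("img2", "An image of something, maybe outdoors.", 0.42),  # low score
]
print([s.image_id for s in curate(raw)])  # -> ['img1']
```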
Key Contributions
- Data Pipeline Implementation: The authors build a scalable pipeline for constructing visual understanding data, yielding the SAIL-Caption dataset, which surpasses previously available caption datasets in both quantity and quality.
- Demonstration of Data Scaling Laws: SAIL-VL is pretrained on 131 billion tokens, showing that even compact models benefit from larger data scales. Performance grows logarithmically with data size (a worked illustration follows this list).
- Guidelines for Instruction Tuning: The paper distills general strategies for curating and scaling instruction datasets, which support robust model performance across a variety of tasks.
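To make the logarithmic relationship concrete, the snippet below fits score ~ a + b * ln(tokens) to a few checkpoint scores. The token counts and scores are illustrative placeholders, not the paper's measurements; only the fitting procedure reflects the claimed scaling behavior.

```python
import numpy as np

# Hypothetical (token count, benchmark score) pairs; the paper reports
# logarithmic gains, but these specific numbers are illustrative.
tokens = np.array([1e9, 8e9, 32e9, 131e9])
scores = np.array([52.0, 58.1, 62.0, 66.2])

# Fit score ~= a + b * ln(tokens); a logarithmic law is linear in ln-space.
b, a = np.polyfit(np.log(tokens), scores, deg=1)

# Under this fit, each doubling of pretraining tokens adds b * ln(2) points.
print(f"score ~= {a:.1f} + {b:.2f} * ln(tokens)")
print(f"gain per doubling ~= {b * np.log(2):.2f} points")
```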
Methodology Overview
SAIL-VL is trained in several stages:
- Pretraining Stages: These stages build foundational visual understanding capabilities by aligning visual and language representations on large-scale datasets.
- Supervised Fine-Tuning (SFT): This phase applies multiple fine-tuning stages that successively instill knowledge, instruction following, and preference learning from meticulously curated datasets (a sketch of such a staged schedule follows this list).
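The exact stage definitions and data mixes belong to the paper; the following is only a minimal sketch of how such a curriculum, alignment, then broad pretraining, then staged SFT ending in preference learning, might be encoded. The stage names, data mixes, unfrozen modules ("projector", "llm"), and objectives are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str        # stage label
    data_mix: dict   # dataset name -> sampling weight
    trainable: list  # modules unfrozen in this stage
    objective: str   # training objective

# Illustrative curriculum, not the paper's exact recipe: data quality and
# task specialization increase from one stage to the next.
SCHEDULE = [
    Stage("alignment", {"captions": 1.0}, ["projector"], "next-token"),
    Stage("pretraining", {"captions": 0.7, "interleaved": 0.3}, ["projector", "llm"], "next-token"),
    Stage("sft-knowledge", {"vqa": 0.5, "captions": 0.5}, ["projector", "llm"], "next-token"),
    Stage("sft-instruct", {"instructions": 1.0}, ["projector", "llm"], "next-token"),
    Stage("sft-preference", {"preference_pairs": 1.0}, ["llm"], "preference"),
]

for s in SCHEDULE:
    print(f"{s.name}: train {s.trainable} on {list(s.data_mix)} via {s.objective}")
```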
Across these stages the data budget grows roughly exponentially while data quality becomes progressively more refined, letting the model first acquire broad capabilities and then sharpen them. SAIL-VL completes this curriculum with top-tier benchmark results, outperforming VLMs of comparable size.
SAIL-VL achieves the highest average score across 19 benchmarks and leads the OpenCompass leaderboard among VLMs of similar parameter size. This performance highlights the value of detailed, high-quality data in both the pretraining and instruction-tuning stages.
Implications and Future Work
The work has promising implications for applications that demand robust visual comprehension and strong instruction following. Extending these methodologies to larger models is a natural next step, as is a broader study of scaling laws and the effects of high-quality data; such work could improve both the generalization ability and the interpretability of VLMs across a wider range of tasks.
Overall, this research underscores the importance of scalable, high-quality data as a foundation for training highly effective VLMs, with potential implications across numerous fields within artificial intelligence.