An Examination of "Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-LLMs"
This paper presents a comprehensive exploration of post-training data strategies aimed at enhancing the performance of open-source vision-LLMs (VLMs). The authors introduce a suite of strategies, encapsulated within a novel VLM series named Eagle 2, with the Eagle2-9B model achieving noteworthy results on several multimodal benchmarks. The research underscores a data-centric approach to refining VLM capabilities, aiming to close the gap with proprietary models such as GPT-4V.
Data Strategy and Architecture
The authors emphasize the significance of a meticulous data strategy, an aspect often overshadowed by model architecture and algorithmic sophistication. They propose systematic data strategies that include:
- Extensive Data Collection: The strategy involves gathering diverse data from over 180 sources, emphasizing both diversity and quality, and curating multimodal data that matches the model's target capabilities.
- Data Filtering and Selection: The authors advocate rigorous filtering to exclude low-quality samples and refine the dataset into high-quality subsets suitable for training. This includes the use of similarity scores to gauge the overlap and uniqueness of newly added data (see the similarity-filtering sketch after this list).
- Data Augmentation: Techniques such as rule-based QA generation and chain-of-thought explanations are employed to increase data utility and broaden coverage of real-world scenarios.
- Model Architecture Incorporation: The integration of a "tiled mixture of vision encoders" (MoVE) is posited as a significant improvement. This combines tiled high-resolution image inputs with multiple vision encoders to bolster performance across tasks (a toy sketch follows the list).
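To make the similarity-based selection above concrete, the sketch below deduplicates newly collected samples against an already-selected pool using cosine similarity over embeddings. The embedding dimensionality, the 0.9 threshold, and the function names are illustrative assumptions rather than the paper's exact pipeline.

```python
import numpy as np

def cosine_sim_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between row vectors of a and b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

def filter_new_samples(candidate_embs: np.ndarray,
                       pool_embs: np.ndarray,
                       threshold: float = 0.9) -> np.ndarray:
    """Keep only candidates whose maximum similarity to the existing pool
    stays below `threshold`, i.e. sufficiently novel samples.
    The 0.9 threshold is an illustrative assumption, not the paper's value."""
    sims = cosine_sim_matrix(candidate_embs, pool_embs)
    max_sim = sims.max(axis=1)
    return np.where(max_sim < threshold)[0]

# Example usage with random stand-in embeddings:
rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 512))       # embeddings of data already selected
candidates = rng.normal(size=(200, 512))  # embeddings of a newly collected source
keep_idx = filter_new_samples(candidates, pool)
print(f"{len(keep_idx)} of {len(candidates)} candidates retained")
```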
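The tiling-plus-multiple-encoders idea can be illustrated with a toy module that splits a high-resolution image into fixed-size tiles and concatenates the features produced by two stand-in encoders. The tile size, encoder definitions, and fusion by concatenation here are assumptions for illustration; the actual Eagle 2 backbones and fusion scheme are described in the paper.

```python
import torch
import torch.nn as nn

def tile_image(img: torch.Tensor, tile: int = 448) -> torch.Tensor:
    """Split a (C, H, W) image into non-overlapping (C, tile, tile) patches.
    Assumes H and W are multiples of `tile` for brevity."""
    c, h, w = img.shape
    tiles = img.unfold(1, tile, tile).unfold(2, tile, tile)  # (C, nH, nW, tile, tile)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, tile, tile)

class TiledMixtureOfEncoders(nn.Module):
    """Toy stand-in for a tiled mixture of vision encoders: every tile is
    passed through each encoder and the resulting features are concatenated."""
    def __init__(self, encoders: list[nn.Module]):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        tiles = tile_image(img)                        # (num_tiles, C, tile, tile)
        feats = [enc(tiles) for enc in self.encoders]  # one feature map per encoder
        return torch.cat([f.flatten(1) for f in feats], dim=-1)

# Stand-in "encoders"; in practice these would be pretrained vision backbones.
enc_a = nn.Sequential(nn.Conv2d(3, 8, 16, stride=16), nn.AdaptiveAvgPool2d(1))
enc_b = nn.Sequential(nn.Conv2d(3, 8, 32, stride=32), nn.AdaptiveAvgPool2d(1))
model = TiledMixtureOfEncoders([enc_a, enc_b])
features = model(torch.randn(3, 896, 896))  # a 2x2 grid of 448x448 tiles
print(features.shape)                       # (4 tiles, 16 concatenated features)
```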
Training Approach
The work introduces a structured three-stage training approach (a configuration sketch follows the list):
- Stage-1 leverages initial training data to align the visual and language components of the model.
- Stage-1.5 focuses on extensive pre-training using large datasets to augment the model’s foundational knowledge, incorporating datasets across multiple categories to ensure robust multimodal capabilities.
- Stage-2 involves fine-tuning with carefully selected high-quality data to address specific tasks, optimizing performance on targeted benchmarks.
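A minimal sketch of how such a staged schedule could be expressed as configuration is shown below. The stage names mirror the paper, but the dataset labels, learning rates, and which modules are trainable in each stage are illustrative assumptions, not the authors' recipe.

```python
from dataclasses import dataclass

@dataclass
class StageConfig:
    """Illustrative per-stage training configuration; the concrete datasets,
    learning rates, and trainable modules are assumptions, not the paper's."""
    name: str
    datasets: list[str]
    trainable_modules: list[str]
    learning_rate: float

STAGES = [
    # Stage-1: align vision and language, e.g. by training only a connector.
    StageConfig("stage-1", ["alignment_captions"], ["mlp_connector"], 1e-3),
    # Stage-1.5: large-scale pre-training over diverse multimodal categories.
    StageConfig("stage-1.5", ["web_docs", "charts", "ocr", "vqa"],
                ["mlp_connector", "llm", "vision_encoder"], 2e-5),
    # Stage-2: fine-tuning on the curated high-quality instruction subset.
    StageConfig("stage-2", ["curated_sft_mix"],
                ["mlp_connector", "llm", "vision_encoder"], 1e-5),
]

def run_schedule(stages: list[StageConfig]) -> None:
    for cfg in stages:
        # A real loop would freeze everything outside cfg.trainable_modules,
        # build a dataloader over cfg.datasets, and train for the stage budget.
        print(f"{cfg.name}: train {cfg.trainable_modules} "
              f"on {cfg.datasets} at lr={cfg.learning_rate}")

run_schedule(STAGES)
```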
The implication of such a stratified training process is more efficient training: performance improves without a disproportionate increase in computation requirements.
Results and Implications
Eagle2-9B exemplifies the success of these strategies, matching or surpassing leading VLMs with significantly larger parameter counts, such as InternVL2-26B and LLaVA-OneVision-72B, across several benchmarks. This demonstrates the effectiveness of a data-centric strategy in bridging the gap between open-source and proprietary models, highlighting data handling and preparation as pivotal factors in VLM development.
The results indicate that with appropriate data strategies, substantial gains in model capability and performance can be realized even with models of moderate size. The presented methods provide a guideline for researchers aiming to optimize VLMs by focusing not just on architectural improvements but on robust data handling methodologies.
Future Directions
This research opens avenues for exploring:
- The extension of these data strategies to other multimodal and multilingual domains.
- Further optimization of data-centric methods to streamline model training and enhance efficiency.
- The development of more sophisticated data augmentation and filtering techniques that may uncover latent capabilities in current VLM architectures.
Overall, the paper provides pivotal insights into the impact of strategic data management on model development. It underscores the role of comprehensive data curation and utilization in enhancing VLM capabilities and presents a clear path forward for building competitive open-source VLMs.