Overview of "Should VLMs be Pre-trained with Image Data?"
The paper "Should VLMs be Pre-trained with Image Data?" by Keh et al. aims to analyze the implications of integrating visual data into the pre-training phase of Vision-LLMs (VLMs) that are initially based on LLMs. This paper positions itself within the current landscape of multimodal models, seeking to determine if incorporating image data earlier in the pre-training process yields better performance than traditional sequential training methods.
Methodology
The researchers trained over 300 models to explore how the timing of introducing visual data into training affects performance on downstream tasks. The experiments varied several factors: model size, data scale, the image-to-text data ratio, and the fraction of LLM pre-training completed before vision data was introduced.
The experimental setup followed a structured three-stage training process. First, models underwent text-only pre-training up to a chosen completion percentage. Image data was then introduced for a mixed image-text pre-training phase, followed by a final instruction fine-tuning stage to adapt the model to downstream tasks. A sketch of this pipeline appears below.
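The following minimal Python sketch illustrates the three-stage schedule. It is not the authors' code: the stage functions are placeholders, and the step budget and ratios are assumed defaults chosen to match the settings the paper reports as working well.

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    total_steps: int = 100_000       # assumed total pre-training budget
    text_only_fraction: float = 0.8  # share of the budget spent text-only
    image_token_ratio: float = 0.15  # image-token share in the mixed phase

def train_text_only(model, steps):
    """Placeholder: run `steps` of next-token training on text data."""
    return model

def train_mixed(model, steps, image_token_ratio):
    """Placeholder: continue training on an image-text mixture."""
    return model

def instruction_tune(model):
    """Placeholder: supervised fine-tuning on instruction data."""
    return model

def run_pipeline(model, cfg: PipelineConfig):
    # Stage 1: text-only pre-training up to the chosen completion point.
    switch_step = int(cfg.total_steps * cfg.text_only_fraction)
    model = train_text_only(model, steps=switch_step)

    # Stage 2: mixed image-text pre-training for the remaining budget.
    model = train_mixed(
        model,
        steps=cfg.total_steps - switch_step,
        image_token_ratio=cfg.image_token_ratio,
    )

    # Stage 3: instruction fine-tuning, kept as a separate stage
    # (mixing it into pre-training hurt performance in the paper's tests).
    return instruction_tune(model)
```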
Key Findings
- Timing of Image Data Introduction: Introducing image data during the cooldown phase, around 80% of the way through text pre-training, yielded significant gains on vision-language tasks while preserving quality on text-only tasks (see the sketch after this list).
- Optimal Image-Text Ratio: At the 1B-parameter scale, allocating 10% to 20% of tokens to image data during the mixed pre-training phase gave the most robust performance across both modalities. This ratio depends on model size, however, with smaller models benefiting from larger fractions of visual data.
- Fine-Tuning Regimen: Mixing instruction tuning data into the image-text pre-training phase degraded model performance. The paper instead recommends keeping instruction tuning as a distinct fine-tuning stage, run after image-text pre-training completes.
- Training from Scratch: Training models from scratch on image data offered no advantage, suggesting that a model needs foundational text-only proficiency before visual data can help.
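To make the first two findings concrete, here is a small, hypothetical sketch of a data sampler that switches from text-only to mixed batches at the recommended point. The step budget and sampling logic are illustrative assumptions; drawing image-text examples with probability equal to the token ratio is only a per-example proxy for the paper's per-token mixture.

```python
import random

TEXT_PRETRAIN_STEPS = 100_000   # assumed total text pre-training budget
IMAGE_INTRO_FRACTION = 0.8      # introduce images ~80% of the way through
IMAGE_TOKEN_RATIO = 0.15        # within the 10-20% band found best at 1B scale

IMAGE_INTRO_STEP = int(TEXT_PRETRAIN_STEPS * IMAGE_INTRO_FRACTION)  # 80_000

def sample_example(step, text_pool, image_text_pool):
    """Text-only before the switch point; mixed image-text after it."""
    if step < IMAGE_INTRO_STEP:
        return random.choice(text_pool)
    # Past the switch point, draw from the image-text stream with
    # probability IMAGE_TOKEN_RATIO, so roughly 15% of examples
    # (and, approximately, tokens) carry image content.
    pool = image_text_pool if random.random() < IMAGE_TOKEN_RATIO else text_pool
    return random.choice(pool)
```

For instance, a call at step 90,000 would sample from the mixed distribution, since that step lies past the 80% switch point.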
Implications
This work has significant ramifications for the development of VLMs. The finding that image data should be introduced well before the completion of language pre-training is critical for optimizing training pipelines. It challenges conventional approaches that strictly separate language and image pre-training phases.
Future Prospects
The paper opens several avenues for future research, including testing whether the benefits of introducing images earlier in training hold for much larger multimodal architectures. Investigating different visual encoders and pre-training datasets could also extend the performance gains reported here.
Conclusion
"Should VLMs be Pre-trained with Image Data?" sheds light on the critical role and timing of visual input in VLM pre-training protocols. By advocating for an integrated pre-training process that blends text and image data earlier, this paper contributes essential insights that could refine future strategies in developing more capable and efficient vision-language systems.