Examining Open-Qwen2VL: A Study in Compute-Efficient Multimodal LLM Pre-Training
This essay examines the methodology and findings of the Open-Qwen2VL paper, which introduces a compute-efficient, fully open-source multimodal LLM (MLLM). The work addresses key challenges in pre-training multimodal LLMs and demonstrates an approach that allows the model to outperform existing state-of-the-art partially open multimodal models.
Methodological Approach
Open-Qwen2VL is a 2B-parameter model pre-trained on 29 million image-text pairs. Notably, its resource-efficient training consumed only 442 A100-40G GPU hours, a substantial reduction relative to conventional pre-training regimens. Pre-training covered roughly 5 billion packed multimodal tokens, only about 0.36% of the multimodal tokens used to train comparable models such as Qwen2-VL.
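As a quick check of that ratio (taking Qwen2-VL's multimodal pre-training corpus as roughly 1.4 trillion tokens, the figure implied by the 0.36% claim), the arithmetic works out as:

\[
\frac{5 \times 10^{9}\ \text{tokens}}{1.4 \times 10^{12}\ \text{tokens}} \approx 0.0036 \approx 0.36\%
\]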
Data Selection and Filtering Techniques
A cornerstone of Open-Qwen2VL's efficiency is its data selection and filtering pipeline. The dataset was curated with both conventional CLIP-based filtering and an MLLM-based filtering approach named MLM-Filter. This dual-filter strategy was instrumental in supplying high-quality, high-relevance data for efficient pre-training.
In terms of dataset composition, the model was trained on a combination of widely used image-text caption datasets, including CC3M, CC12M, and SBU, each filtered with either CLIP-based or MLLM-based techniques. A notable finding is that integrating even a small proportion of data curated by the MLLM-based filter improved model performance, suggesting that MLLM filters contribute beneficial stylistic and semantic diversity to the training mix.
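To make the filtering idea concrete, the following is a minimal sketch of score-threshold filtering over image-text pairs. The scorer names (clip_score, mlm_filter_score), the threshold values, and the usage lines are illustrative assumptions, not the paper's actual implementation; the paper applies different filters and cutoffs to different source datasets.

```python
# Minimal sketch of score-threshold filtering for image-text pairs (illustrative).
# A scorer is any function mapping (image_path, caption) to a quality score,
# e.g. a CLIP cosine similarity or an MLM-Filter-style quality rating.
from typing import Callable, Iterable


def filter_pairs(
    pairs: Iterable[tuple[str, str]],        # (image_path, caption) pairs
    score_fn: Callable[[str, str], float],   # quality scorer for one pair
    threshold: float,                        # keep pairs scoring at or above this
) -> list[tuple[str, str]]:
    """Keep only the pairs whose quality score meets the threshold."""
    return [(img, cap) for img, cap in pairs if score_fn(img, cap) >= threshold]


# Hypothetical usage with placeholder scorers and thresholds:
# clip_kept = filter_pairs(ccs_pairs, clip_score, threshold=0.30)
# mlm_kept  = filter_pairs(datacomp_pairs, mlm_filter_score, threshold=85.0)
# training_pairs = clip_kept + mlm_kept
```

The design point the paper makes is less about the mechanics of thresholding than about which scorer is used: blending in a subset selected by an MLLM-based scorer, rather than relying on CLIP similarity alone, is what yielded the reported gains.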
Architectural Innovations
The model architecture incorporates several strategies aimed at maximizing pre-training efficiency. Notably, it uses a dynamic visual token representation built around an adaptive average-pooling visual projector, which down-samples the number of visual tokens per image during pre-training and restores the full token count during fine-tuning. This reduces computational demands without compromising the model's capacity for high-resolution image understanding after fine-tuning.
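Below is a minimal sketch of what such an adaptive-pooling projector can look like in PyTorch. The specific token counts (729 tokens pooled down to 144), the MLP projection, and the module and dimension names are assumptions for illustration rather than the paper's exact implementation.

```python
# Sketch of an adaptive average-pooling visual projector (illustrative assumption).
# Visual tokens from the vision encoder are treated as a square grid (e.g., 27x27 = 729);
# during pre-training they are pooled down to a smaller grid (e.g., 12x12 = 144),
# and pooling is skipped at fine-tuning time to restore the full token count.
import math

import torch
import torch.nn as nn


class AdaptivePoolingProjector(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int, out_tokens: int = 144):
        super().__init__()
        self.out_side = math.isqrt(out_tokens)        # e.g., 12 for 144 tokens
        self.pool = nn.AdaptiveAvgPool2d(self.out_side)
        self.proj = nn.Sequential(                    # map into the LLM embedding space
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor, downsample: bool = True) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, vision_dim), tokens laid out on a square grid
        if downsample:
            b, n, d = visual_tokens.shape
            side = math.isqrt(n)                      # e.g., 27 for 729 tokens
            grid = visual_tokens.transpose(1, 2).reshape(b, d, side, side)
            grid = self.pool(grid)                    # (b, d, out_side, out_side)
            visual_tokens = grid.flatten(2).transpose(1, 2)
        return self.proj(visual_tokens)


# Hypothetical usage:
# projector = AdaptivePoolingProjector(vision_dim=1152, llm_dim=1536)
# pretrain_tokens = projector(features, downsample=True)    # reduced token count
# finetune_tokens = projector(features, downsample=False)   # full token count
```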
In addition, multimodal sequence packing is used to group variable-length image-text examples into fixed-length training sequences, minimizing wasted padding tokens and improving GPU utilization. This technique is pivotal for efficient processing and learning across widely varying sequence lengths.
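A rough sketch of the packing step, using a greedy first-fit-decreasing bin-packing pass over tokenized example lengths, is shown below. The 4096-token capacity, the representation of examples by length alone, and the usage lines are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of multimodal sequence packing via first-fit-decreasing bin packing (illustrative).
# Each training example is represented only by its total token length (image + text tokens).
def pack_sequences(example_lengths: list[int], max_len: int = 4096) -> list[list[int]]:
    """Group example indices into bins so each bin's total length fits within max_len.

    Longer examples are placed first (first-fit decreasing), which keeps padding waste low.
    Examples longer than max_len simply get a bin of their own in this sketch.
    """
    order = sorted(range(len(example_lengths)), key=lambda i: example_lengths[i], reverse=True)
    bins: list[list[int]] = []        # bins[b] holds the example indices packed together
    bin_loads: list[int] = []         # bin_loads[b] holds the current token count of bin b
    for idx in order:
        length = example_lengths[idx]
        for b, load in enumerate(bin_loads):
            if load + length <= max_len:      # first bin with enough room
                bins[b].append(idx)
                bin_loads[b] += length
                break
        else:                                 # no existing bin fits: open a new one
            bins.append([idx])
            bin_loads.append(length)
    return bins


# Hypothetical usage:
# packed = pack_sequences([1200, 300, 2500, 900, 4000, 150])
# Each group is then concatenated into a single training sequence of at most 4096 tokens.
```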
Performance and Openness
Empirical evaluations detailed in the paper show that Open-Qwen2VL achieves competitive, and in several cases superior, performance across multiple benchmarks, including MMBench and MathVista. That this performance is reached under a markedly smaller computational budget underscores the efficiency of the approach.
The paper emphasizes that Open-Qwen2VL's training framework and methodological openness give the wider research community an accessible path to reproducing and extending this work. Moreover, the release of the complete codebase, together with the data filtering and sequence packing scripts, sets a new standard for "fully open" multimodal LLMs, encouraging collaborative improvement and fostering innovation outside well-funded corporate environments.
Implications and Future Work
The implications of Open-Qwen2VL are both practical and theoretical. By demonstrating that careful data filtering and an efficient pre-training architecture can yield high-performing models on limited computational resources, the paper challenges the notion that cutting-edge LLM development is the exclusive domain of large technology firms. It advocates for more inclusive advancement in AI, in which academic labs and smaller research institutions can make significant contributions.
The paper also highlights several avenues for future work, including potential improvements in multimodal data filtering strategies and the exploration of other architectural modifications that could further reduce computational requirements. Such developments could further democratize access to AI technology, enabling a broader spectrum of applications in fields demanding visual-textual comprehension, from enhanced educational tools to more sophisticated image-based searches.
In summary, Open-Qwen2VL marks a significant stride toward efficiency and accessibility in multimodal LLMs. Its realization offers a template for similarly resource-efficient efforts built on careful data processing and deliberate model architecture choices.