Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (2412.05271v3)

Published 6 Dec 2024 in cs.CV

Abstract: We introduce InternVL 2.5, an advanced multimodal LLM (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, LLMs, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. A HuggingFace demo is available at https://huggingface.co/spaces/OpenGVLab/InternVL

The paper presents InternVL 2.5, a multimodal LLM (MLLM) series that builds on InternVL 2.0, and reports improvements in training methodology, data quality, and test-time strategy. InternVL 2.5 retains the core "ViT-MLP-LLM" architecture of its predecessor while optimizing individual components, such as the vision encoder and the scale of the training data. With these changes it narrows the performance gap with commercial closed-source models such as GPT-4o and Claude-3.5-Sonnet.
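
To make the "ViT-MLP-LLM" dataflow concrete, the following is a minimal sketch in plain Python. It is not the InternVL implementation: the function names, the toy dimensions, and the single linear layer standing in for the MLP projector are all assumptions chosen for illustration. It only shows the order of the three stages and how visual tokens end up in the same sequence as text tokens.

```python
from typing import List

Vector = List[float]


def vision_encoder(patches: List[Vector]) -> List[Vector]:
    """Stand-in for the ViT: returns one feature vector per image patch.

    Hypothetical: a real encoder applies transformer layers; this copy
    only preserves the shape contract (patches in, features out).
    """
    return [list(p) for p in patches]


def mlp_projector(features: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Map visual features into the LLM embedding width.

    `weight` holds one row per output dimension. A single linear layer is
    used here for brevity; it is an assumption, not the paper's projector.
    """
    return [[sum(x * w for x, w in zip(f, row)) for row in weight] for f in features]


def build_llm_input(visual_tokens: List[Vector], text_tokens: List[Vector]) -> List[Vector]:
    """Concatenate projected visual tokens with text embeddings for the LLM."""
    return visual_tokens + text_tokens


# Toy example: 2 patches with 2-dim features, projected to a 3-dim LLM space.
patches = [[1.0, 2.0], [3.0, 4.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[0.0, 0.0, 0.0]]  # one hypothetical text-token embedding

visual_tokens = mlp_projector(vision_encoder(patches), weight)
sequence = build_llm_input(visual_tokens, text_tokens)
print(len(sequence), len(sequence[0]))  # sequence length and embedding width
```

The point of the projector stage is that the vision encoder and the LLM can be scaled independently, since only the projector has to match both widths.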

Key Findings

Vision Encoders and Training Tokens: Scaling the vision encoder substantially reduces how much training data the model needs. For instance, InternVL2.5-78B, which uses a 6-billion-parameter vision encoder, achieved superior performance while using only a tenth of the training tokens consumed by comparable models such as Qwen2-VL-72B. This suggests a cost-efficient route to scaling MLLMs without sacrificing performance.
Dataset Quality and Chain-of-Thought Reasoning: The upgrade to InternVL 2.5 involved not only a larger dataset but also rigorous curation for data quality. This curation led to marked improvements, particularly on Chain-of-Thought (CoT) reasoning tasks, underscoring that data quality matters alongside scale.
Test-Time Scaling: The work shows that test-time scaling is particularly beneficial for difficult multimodal question-answering tasks. With it, InternVL2.5-78B reached 70.1% on the MMMU benchmark; the abstract attributes a 3.7-point improvement to CoT reasoning.
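
The summary does not spell out how extra test-time compute is spent beyond CoT prompting. One common option, shown here purely as an assumption, is to sample several CoT responses and take a majority vote over their final answers. The sketch below uses a fake sampler in place of a real model; `extract_answer`, `majority_vote`, `make_fake_sampler`, and the "Answer: X" output format are hypothetical names and conventions, not part of InternVL.

```python
import itertools
import re
from collections import Counter
from typing import Callable, List


def extract_answer(cot_output: str) -> str:
    """Pull the final option letter from a CoT response.

    Assumes the (hypothetical) convention that the response ends with
    a line like 'Answer: B'. Returns '' when no answer is found.
    """
    match = re.search(r"Answer:\s*([A-Z])", cot_output)
    return match.group(1) if match else ""


def majority_vote(sample_fn: Callable[[str], str], question: str, n_samples: int = 5) -> str:
    """Sample n CoT responses and return the most frequent final answer."""
    answers = [extract_answer(sample_fn(question)) for _ in range(n_samples)]
    answers = [a for a in answers if a]  # drop responses with no parsable answer
    if not answers:
        return ""
    return Counter(answers).most_common(1)[0][0]


def make_fake_sampler(outputs: List[str]) -> Callable[[str], str]:
    """Stand-in for a model call: cycles through canned responses."""
    cycle = itertools.cycle(outputs)
    return lambda question: next(cycle)


# Toy usage: two of three sampled reasoning chains agree on option B.
sampler = make_fake_sampler([
    "The chart peaks in 2019, so ... Answer: B",
    "Reading the legend differently ... Answer: C",
    "Comparing both axes ... Answer: B",
])
print(majority_vote(sampler, "Which year has the highest value?", n_samples=3))
```

Raising `n_samples` trades more inference compute for a more stable final answer, which is the basic idea behind test-time scaling.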

Implications and Future Developments

The paper's findings bear on the development of both open-source and proprietary MLLMs. InternVL 2.5 sets a reference point for open-source models on multimodal tasks and narrows the gap with proprietary commercial models, contributing insights and technologies to the open-source community.

Practically, the model's open-source nature facilitates transparency and accessibility, promoting further research and adaptation across different computational frameworks and applications. Theoretically, the research reaffirms the correlation between model scaling strategies, data quality, and performance metrics, offering a structured pathway for developing models with higher efficiency and performance in handling multimodal data.

The results suggest several pathways for future research. Continued optimization of data curation and exploration of even larger model architectures could push the boundaries further. Additionally, strengthening multilingual capabilities and performance in diverse real-world scenarios could broaden the applicability and reliability of MLLMs like InternVL 2.5. As the field evolves, these models may benefit from integration with emerging methodologies such as preference optimization or advanced feedback systems to further refine their reasoning and contextual understanding.

The release of InternVL 2.5 to the open-source community is poised to serve as a robust tool for advancing multimodal AI systems, catalyzing both academic research and practical applications. As researchers harness the insights from this paper, future developments will likely explore new frontiers in scalable, efficient, and high-performing multimodal AI systems that can rival and perhaps one day surpass their commercial counterparts.