Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
Abstract: We introduce InternVL 2.5, an advanced multimodal LLM (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, LLMs, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. A HuggingFace demo is available at https://huggingface.co/spaces/OpenGVLab/InternVL
Explain it Like I'm 14
Overview
This paper introduces InternVL 2.5, a powerful “multimodal” AI model. Multimodal means it can understand both words and visuals—like reading text, looking at pictures, and even watching videos. The goal is to push the performance of open-source AI models closer to top commercial systems (like GPT-4o and Claude 3.5), while keeping everything free and transparent for the research community.
Objectives
The paper asks simple but important questions:
- What happens to performance when we make different parts of the model bigger (like the vision part or the language part)?
- How much does data size and data quality matter?
- Can we get better results by using smarter strategies at test time (like asking the model to “think step by step”)?
Methods and Approach
How the model is built
Think of the model as a team:
- A camera (the “vision encoder”) that turns images into useful features the model can understand.
- An adapter (the MLP projector) that helps the camera talk to the brain.
- A brain (the large language model, or LLM) that reads text, reasons, and produces answers.
InternVL 2.5 keeps the same core design as earlier versions: a ViT (Vision Transformer) for images, a 2-layer MLP projector, and a strong LLM (such as InternLM 2.5 or Qwen 2.5). It also supports single images, multiple images, and videos.
To handle big, detailed images, the model splits each image into tiles (like cutting a poster into squares) so it can see fine details without exploding memory. A 448×448 tile becomes 256 “visual tokens” after a compacting step (called pixel unshuffle), which helps the model process high-resolution inputs efficiently.
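The tile arithmetic can be checked with a small sketch. It assumes a ViT patch size of 14 and a pixel unshuffle that merges each 2×2 group of patch tokens; neither number is stated in this summary, but together they reproduce the 256 tokens per 448×448 tile quoted above. The grid-picking helper `choose_grid` and its `max_tiles` limit are hypothetical illustrations, not the authors' exact rule.

```python
TILE_SIZE = 448   # from the text: each tile is 448x448 pixels
PATCH_SIZE = 14   # assumption: ViT patch size (not stated in this summary)
UNSHUFFLE = 2     # assumption: pixel unshuffle merges each 2x2 group of patch tokens


def tokens_per_tile(tile_size=TILE_SIZE, patch_size=PATCH_SIZE, factor=UNSHUFFLE):
    """Visual tokens produced by one tile after pixel unshuffle."""
    patches_per_side = tile_size // patch_size    # 448 // 14 = 32
    merged_per_side = patches_per_side // factor  # 32 // 2 = 16
    return merged_per_side ** 2                   # 16 * 16 = 256


def choose_grid(width, height, max_tiles=12):
    """Hypothetical rule: pick the (cols, rows) grid whose aspect ratio is
    closest to the image's, preferring fewer tiles when there is a tie."""
    target = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    return min(candidates, key=lambda g: (abs(g[0] / g[1] - target), g[0] * g[1]))


def visual_tokens(width, height, max_tiles=12):
    """Total visual tokens for an image under the assumptions above."""
    cols, rows = choose_grid(width, height, max_tiles)
    return cols * rows * tokens_per_tile()
```

Under these assumptions, a wide 896×448 image maps to a 2×1 grid, so it costs two tiles' worth of tokens instead of being squeezed into one low-detail square.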
How they train the model (in three steps)
The training process is done in stages to keep everything stable and efficient:
- Stage 1: Warm up the adapter (MLP). Only the adapter learns, while the camera and brain stay frozen. This helps the visual features align with the LLM.
- Stage 1.5 (optional): Improve the camera. Now both the camera and adapter learn together, so the model sees visuals better—especially tricky stuff like charts or text in different languages.
- Stage 2: Teach the whole team. The camera, adapter, and brain all learn together using high-quality, carefully filtered instruction data, so the model gives reliable answers to real-world questions.
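One way to picture the three stages is as a table of which parts are allowed to learn. The sketch below only encodes what the list above says; the class, the stage names, and the helper are hypothetical and not taken from the InternVL codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str          # hypothetical label for the stage
    train_vit: bool    # does the "camera" (vision encoder) learn?
    train_mlp: bool    # does the adapter (MLP projector) learn?
    train_llm: bool    # does the "brain" (LLM) learn?


STAGES = [
    Stage("stage1_mlp_warmup", train_vit=False, train_mlp=True, train_llm=False),
    Stage("stage1.5_vit_incremental", train_vit=True, train_mlp=True, train_llm=False),
    Stage("stage2_full_instruction_tuning", train_vit=True, train_mlp=True, train_llm=True),
]


def trainable_modules(stage):
    """Return the names of the components that are unfrozen in this stage."""
    flags = {"vit": stage.train_vit, "mlp": stage.train_mlp, "llm": stage.train_llm}
    return [name for name, enabled in flags.items() if enabled]
```

In a real training loop, the returned names would decide which parameter groups have gradients enabled; everything else stays frozen.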
There’s also a smart “progressive scaling” idea: train the camera with a smaller brain first (cheaper!), then plug the improved camera into a bigger brain later without retraining the camera. This saves lots of compute and time.
Handling big and varied data
To make training fast and robust:
- Dynamic high-resolution tiling: Choose how many tiles to use based on the image type (many tiles for detailed documents, fewer for simple photos, and just 1 tile per video frame).
- Data packing: Combine smaller samples together to fill the model’s input space without wasting memory.
- Real-world robustness: Random JPEG compression simulates the noisy, compressed images you often see online.
- Balanced training: A loss “reweighting” trick prevents the model from favoring either very short or very long answers during training.
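Two of these ideas are easy to sketch in a few lines. The packing function below uses a generic first-fit-decreasing heuristic, which is an assumption and may differ from the authors' packing strategy. The weighting function shows one plausible form of loss reweighting: each token in a response of length n gets weight 1 / n^power, where power 0 treats every token equally (long answers dominate), power 1 treats every sample equally (short answers dominate per token), and 0.5 sits in between. The exponent 0.5 is an illustrative assumption, not a value stated in this summary.

```python
def pack_samples(lengths, max_len):
    """Greedy first-fit-decreasing packing (assumed heuristic).
    Returns groups of sample indices whose total length fits in max_len."""
    bins = []  # each entry: [remaining_capacity, [sample indices]]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        n = lengths[idx]
        if n > max_len:
            raise ValueError(f"sample {idx} is longer than max_len")
        for b in bins:
            if b[0] >= n:          # first bin with enough room
                b[0] -= n
                b[1].append(idx)
                break
        else:                      # no bin had room: open a new one
            bins.append([max_len - n, [idx]])
    return [b[1] for b in bins]


def token_weights(response_lengths, power=0.5):
    """Per-token loss weight for each response, normalized so that the
    weights over all tokens in the batch sum to 1."""
    raw = [1.0 / (n ** power) for n in response_lengths]
    total = sum(w * n for w, n in zip(raw, response_lengths))
    return [w / total for w in raw]
```

For two responses of 4 and 16 tokens, power 0 gives the long answer 80% of the loss, power 1 gives each answer 50%, and power 0.5 gives the long answer about 67%, which is the kind of middle ground the reweighting aims for.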
Cleaning up bad data
Even a tiny amount of bad training data can cause weird behavior, like the model repeating the same sentence over and over—especially in long, step-by-step answers. The team built a filtering pipeline to:
- Score and remove low-quality text samples using an LLM and rules (like catching abnormal length or repeated lines).
- Detect and remove repetitive patterns in multimodal datasets.
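The rule-based part of such a filter might look like the sketch below. It is a minimal illustration of "abnormal length or repeated lines" only; the function names and both thresholds are hypothetical, and the LLM-based scoring step is not shown.

```python
from collections import Counter


def repeated_line_ratio(text):
    """Fraction of non-empty lines that occur more than once in the text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(lines)


def looks_degenerate(text, max_chars=8000, max_repeat_ratio=0.3):
    """Flag a sample that is abnormally long or mostly repeated lines.
    Both thresholds are illustrative guesses, not values from the paper."""
    return len(text) > max_chars or repeated_line_ratio(text) > max_repeat_ratio
```

A sample such as "step\nstep\nstep\nanswer" has three of its four lines repeated, so it would be flagged and dropped before training.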
Test-time tricks
For hard questions, the model does better if it:
- Uses Chain-of-Thought (CoT), which means “think step by step” before answering.
- Combines CoT with majority voting (generate several answers and pick the most common one).
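The two tricks combine naturally, as in this minimal sketch. Here `generate` stands for any function that takes a prompt string and returns the model's reply; it is a placeholder, not a real InternVL API, and the prompt suffix and "last line is the final answer" convention are assumptions made for illustration.

```python
from collections import Counter

# Assumed prompt suffix that asks for step-by-step reasoning.
COT_SUFFIX = "\nThink step by step, then give the final answer on the last line."


def majority_vote(answers):
    """Return the most common answer after light normalization, or None."""
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return None
    return Counter(normalized).most_common(1)[0][0]


def answer_with_voting(generate, question, n_samples=5):
    """Sample several CoT replies and vote on their final lines.
    `generate` is a hypothetical callable: prompt string -> reply string."""
    finals = []
    for _ in range(n_samples):
        reply = generate(question + COT_SUFFIX)
        lines = reply.strip().splitlines()
        finals.append(lines[-1] if lines else "")
    return majority_vote(finals)
```

Voting only helps if the samples differ, so in practice `generate` would need to sample with some randomness rather than always returning the same reply.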
Main Findings
Here are the key results and why they matter:
- Bigger vision encoders reduce the need for enormous training data. A strong 6B-parameter vision encoder let InternVL 2.5 train on about one-tenth as many tokens as a similar-sized model with a smaller vision encoder, which cuts training cost substantially.
- Data quality is crucial. Doubling the dataset size helped, but strict filtering helped more—especially for step-by-step reasoning tasks (like MMMU and OlympiadBench).
- Test-time scaling works. Asking the model to “think step by step” boosted scores on difficult benchmarks; adding majority voting helped even more.
- Top-tier performance. InternVL 2.5 rivals commercial models (GPT-4o and Claude 3.5) on many multimodal tasks and is the first open-source model to surpass 70% on the MMMU benchmark (with CoT).
- Broad skills. It was tested across many areas: science and math reasoning, reading documents, understanding multiple images and videos, grounding text in images, multilingual tasks, and pure language understanding.
Implications and Impact
InternVL 2.5 shows that open-source multimodal AI can be:
- Competitive: It can approach or match commercial systems on many tasks.
- Efficient: Smart training and scaling strategies can cut costs drastically.
- Reliable: Careful data filtering and test-time reasoning make answers more stable and accurate.
This helps researchers and developers build strong, transparent AI systems that understand both text and visuals—useful for education, research, accessibility tools, document analysis, and more. It also sets a new standard for the open-source community: with better models, cleaner data, and smarter testing, open systems can keep pushing the boundaries of what AI can do.