InternVL2-8B: Open-Source MLLM for Reasoning
- InternVL2-8B is an open-source 8B-parameter multimodal model that enhances cross-modal chain-of-thought reasoning via Mixed Preference Optimization.
- It leverages an automated MMPR pipeline to construct high-quality multimodal preference data, overcoming data scarcity and improving output fidelity.
- Empirical evaluations show an 8.7-point accuracy gain on MathVista (from 58.3% to 67.0%) with reduced hallucinations, narrowing the performance gap with larger models.
InternVL2-8B is an open-source multimodal LLM (MLLM) designed to advance high-quality cross-modal reasoning with an 8B-parameter architecture. It introduces two core innovations: a scalable automated preference-data construction pipeline yielding the MMPR dataset, and a training paradigm, Mixed Preference Optimization (MPO), that significantly enhances chain-of-thought (CoT) and multimodal reasoning over previous supervised fine-tuning approaches. Empirical evaluations show that InternVL2-8B-MPO not only outperforms its baseline and narrows the performance gap with much larger models but also reduces hallucinations, establishing a new standard for compact MLLMs.
1. Model Overview and Objectives
InternVL2-8B is constructed to address the limitations inherent in prior open-source MLLMs: specifically, the reasoning performance degradation induced by distribution shifts between pre-training and supervised fine-tuning stages. Its goal is to match or surpass the cross-modal reasoning abilities, particularly in CoT tasks, of models with much larger parameter counts (e.g., InternVL2-76B), while retaining efficiency and open accessibility.
The architecture utilizes a standard vision-language backbone. The central novelty arises in data engineering (MMPR pipeline) and a tailored training objective (MPO), both of which are designed to boost model alignment with high-quality, contextually grounded outputs.
2. Automated Multimodal Preference Data Construction
InternVL2-8B’s improved reasoning capabilities are enabled by an automated pipeline for constructing the Multimodal Preference Reasoning (MMPR) dataset. The pipeline is engineered to overcome the scarcity and annotation cost of high-quality multimodal reasoning data by generating preference pairs at scale.
A single MMPR data sample comprises an image $I$, an instruction $q$, a "chosen" response $y_c$, and a "rejected" response $y_r$, with $y_c$ deemed higher quality. The construction methodology is bifurcated as follows:
- Tasks with Clear Ground Truth: An initial model samples multiple candidate responses for the prompt $(I, q)$. Responses whose final answer matches the ground truth form the chosen set ($y_c$); incorrect responses, or those lacking a clear final answer, supply the rejected responses ($y_r$).
- Open-Ended Instructions (No Ground Truth): The Dropout Next Token Prediction (DropoutNTP) technique is used. The model first produces a full answer $y$ with access to the image; this answer is then truncated (the second half of its tokens is removed), and the model is prompted to complete the remainder, this time without access to the image. The full original answer becomes $y_c$, while the typically hallucinated image-free completion is set as $y_r$.
This hybrid process, combining correctness filtering and DropoutNTP, yields approximately 3 million diverse samples with a multimodal preference structure, covering both fact-based and open-domain reasoning instructions; a minimal sketch of both branches follows the table below.
| Pipeline Branch | Source of $y_c$ (chosen) | Source of $y_r$ (rejected) |
|---|---|---|
| Clear ground truth | Sampled response matching the ground truth | Sampled response that is incorrect or lacks a clear final answer |
| DropoutNTP (open-ended) | Full output generated with the image | Truncated output completed without the image |
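As a concrete illustration, the following is a minimal Python sketch of the pair-construction logic under stated assumptions: the `generate` and `extract_answer` callables, the sampling budget `n_samples`, and the prompt concatenation format are placeholders for illustration, not the authors' actual API.

```python
from typing import Callable, Optional, Tuple


def build_preference_pair(
    generate: Callable[[Optional[object], str], str],   # (image or None, prompt) -> response
    extract_answer: Callable[[str], str],                # pulls the final answer out of a CoT response
    image: object,
    instruction: str,
    ground_truth: Optional[str] = None,
    n_samples: int = 8,                                  # hypothetical sampling budget
) -> Optional[Tuple[str, str]]:
    """Return one (chosen, rejected) pair, or None if no valid pair can be formed."""
    if ground_truth is not None:
        # Branch 1: tasks with a clear ground truth -- correctness filtering.
        candidates = [generate(image, instruction) for _ in range(n_samples)]
        chosen = [c for c in candidates if extract_answer(c) == ground_truth]
        rejected = [c for c in candidates if extract_answer(c) != ground_truth]
        return (chosen[0], rejected[0]) if chosen and rejected else None

    # Branch 2: open-ended instructions -- DropoutNTP.
    full_answer = generate(image, instruction)             # y_c: generated with the image
    tokens = full_answer.split()
    prefix = " ".join(tokens[: len(tokens) // 2])           # drop the second half of the answer
    completion = generate(None, f"{instruction}\n{prefix}")  # continue WITHOUT the image
    return full_answer, f"{prefix} {completion}"            # (y_c, y_r): completion tends to hallucinate
```

A production pipeline would typically retain several pairs per instruction and filter degenerate candidates; the sketch returns a single pair for clarity.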
3. Mixed Preference Optimization (MPO) Training Objective
Mixed Preference Optimization (MPO) is a training paradigm that extends conventional supervised fine-tuning by incorporating preference-based losses, combining the following three objectives in a weighted sum:
- Preference Loss ($\mathcal{L}_p$): Derived from Direct Preference Optimization (DPO) and based on the Bradley-Terry model, this loss enforces a relative ranking between chosen and rejected responses with reference to the initial model $\pi_0$:

$$\mathcal{L}_p = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_c \mid I, q)}{\pi_0(y_c \mid I, q)} - \beta \log \frac{\pi_\theta(y_r \mid I, q)}{\pi_0(y_r \mid I, q)}\right)$$

where $\pi_\theta$ is the current model, $\beta$ is a penalty coefficient, and $\sigma$ is the logistic sigmoid.
- Quality Loss ($\mathcal{L}_q$): Inspired by Binary Classifier Optimization (BCO), this term supervises the model to score each response against a binary label, treating chosen responses as positive (1) and rejected responses as negative (0), thus directly biasing the model toward higher-quality completions.
- Generation Loss ($\mathcal{L}_g$): The canonical supervised fine-tuning likelihood loss over the chosen response, promoting faithful instruction following and complete, correct generation.
The composite loss function is

$$\mathcal{L} = w_p \mathcal{L}_p + w_q \mathcal{L}_q + w_g \mathcal{L}_g$$

where the weights $w_p$, $w_q$, and $w_g$ are fixed to empirically chosen values.
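The following PyTorch sketch shows how the three terms can be combined, assuming per-response summed log-probabilities have already been computed under the policy and a frozen reference model. The tensor names, the default values of `beta` and the weights, and the simplified BCO term (the reward-shift component is omitted) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F


def mpo_loss(
    policy_logp_c: torch.Tensor,   # sum of log pi_theta(y_c | I, q), shape (batch,)
    policy_logp_r: torch.Tensor,   # sum of log pi_theta(y_r | I, q)
    ref_logp_c: torch.Tensor,      # sum of log pi_0(y_c | I, q) under the frozen reference
    ref_logp_r: torch.Tensor,      # sum of log pi_0(y_r | I, q)
    len_c: torch.Tensor,           # token counts of the chosen responses
    beta: float = 0.1,             # illustrative penalty coefficient
    w_p: float = 1.0, w_q: float = 1.0, w_g: float = 1.0,  # illustrative weights
) -> torch.Tensor:
    # Implicit rewards relative to the reference model (DPO-style).
    reward_c = beta * (policy_logp_c - ref_logp_c)
    reward_r = beta * (policy_logp_r - ref_logp_r)

    # Preference loss: Bradley-Terry ranking of chosen over rejected.
    loss_p = -F.logsigmoid(reward_c - reward_r).mean()

    # Quality loss: BCO-style binary classification of each response on its own
    # (label 1 for chosen, 0 for rejected); the reward-shift term is omitted here.
    loss_q = 0.5 * (-F.logsigmoid(reward_c) - F.logsigmoid(-reward_r)).mean()

    # Generation loss: length-normalized negative log-likelihood of the chosen response.
    loss_g = -(policy_logp_c / len_c).mean()

    return w_p * loss_p + w_q * loss_q + w_g * loss_g
```

The generation term reuses the chosen-response log-probabilities already needed for the preference term, so the main extra cost over plain SFT is the forward pass through the frozen reference model.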
Through this joint optimization, InternVL2-8B is trained to both internalize nuanced preference signals and maintain high output quality, substantially improving both accuracy and hallucination control.
4. Empirical Results and Comparative Performance
The combination of MMPR-based data and MPO yields marked improvements on standard multimodal benchmarks. The model variant InternVL2-8B-MPO achieves 67.0% accuracy on MathVista, a dedicated multimodal mathematical reasoning benchmark, compared to 58.3% for its baseline. This 8.7-point gain not only narrows but, on some tasks, closes the gap with the much larger InternVL2-76B (roughly 10× the parameter count), underscoring that the gains in reasoning ability and hallucination reduction stem from the new data pipeline and training objective rather than from scaling.
| Model | MathVista Accuracy | Improvement over InternVL2-8B |
|---|---|---|
| InternVL2-8B (baseline) | 58.3% | — |
| InternVL2-8B-MPO | 67.0% | +8.7 pts |
| InternVL2-76B | ≈67% | ≈ parity with 8B-MPO |
Applying this paradigm to other MLLM tasks and datasets shows similar gains in performance and generalization robustness.
5. Hallucination Control and Generalization
A central benefit of the MPO approach is the active suppression of hallucinations. The rejection-based components of the loss function (particularly the DropoutNTP data and preference loss) present the model with explicit negative completion examples, often corresponding to plausible but incorrect or context-inappropriate outputs. This guides InternVL2-8B to distinguish contextually sound responses, improving both factual precision and the likelihood of correct, multi-step chain-of-thought reasoning.
This mechanism enhances not only CoT robustness but also generalization across modalities and instructions, as demonstrated by consistent improvements on diverse benchmarks beyond MathVista.
6. Implementation Considerations and Resource Implications
Key considerations for deploying InternVL2-8B-MPO include the need for substantial data throughput (≈3M multimodal preference pairs) and a more involved training schedule, since each optimization step combines multiple loss terms with a forward pass through a frozen reference model. The computational requirements remain practical for an 8B-parameter model, making InternVL2-8B-MPO a viable solution for research and real-world applications on moderate-scale hardware.
Empirical results further indicate that the design of the MMPR construction pipeline and the precise MPO loss weighting critically affect final performance, suggesting opportunities for downstream customization (e.g., task-specific loss calibration or further data augmentation).
7. Impact and Position in the Multimodal Modeling Landscape
InternVL2-8B and its MPO-enhanced variant represent a methodological inflection point among lightweight, open-source MLLMs. Through the synergistic integration of automated data curation (MMPR) and an advanced learning objective (MPO), the model advances the state of multimodal reasoning, particularly for tasks requiring detailed multi-step inference and robust hallucination rejection, without requiring prohibitively large model sizes. These innovations establish a new baseline for efficient, reasoning-capable vision-language systems and provide a foundation for further research in both algorithmic and data-centric approaches (Wang et al., 15 Nov 2024).