InternVL 2.5: Advanced Multimodal LLM

Updated 1 January 2026
  • InternVL 2.5 is an advanced open-source multimodal LLM series that extends InternVL 2.0 by improving model scaling, training data quality, and inference techniques.
  • It employs a ViT–MLP–LLM pipeline with Chain-of-Thought decoding and majority-voting ensembling to enhance performance in visual and language tasks.
  • The model achieves competitive benchmarks, surpassing 70% accuracy on MMMU and excelling in tasks like DocVQA and multi-image comprehension through systematic architectural refinements.

InternVL 2.5 is an advanced open-source multimodal LLM (MLLM) series that extends the InternVL 2.0 architecture. It introduces notable improvements across model scaling, training data quality, and inference strategies. InternVL 2.5 systematically investigates scaling laws in vision encoders, LLM backbones, dataset sizes, and test-time procedures. Through extensive evaluation spanning multi-disciplinary reasoning, document and video understanding, real-world image-text comprehension, hallucination detection, visual grounding, and multilingual tasks, InternVL 2.5 demonstrates competitive performance, surpassing 70% accuracy on the MMMU benchmark with innovations such as Chain-of-Thought (CoT) reasoning and majority-voting ensembles. Related work such as InternLM-XComposer-2.5 is sometimes treated as a variant or companion implementation of InternVL 2.5, sharing similar architectural foundations and long-context visual-language capabilities (Chen et al., 2024, Zhang et al., 2024).

1. Model Architecture and Scaling

InternVL 2.5 preserves the “ViT–MLP–LLM” pipeline of InternVL 2.0:

  • Vision Encoder: The primary configuration is InternViT-6B-448px-V2.5, a 45-layer transformer (hidden size 3200, 25 heads, $N_{\mathrm{param,vision}} = 5.5\times10^9$), with a smaller 300M variant for efficiency.
  • MLP Projector: A 2-layer MLP trained from scratch bridges the vision and language modalities.
  • LLM Backbone: Multiple scale variants are available:

| Variant | Backbone | $N_{\mathrm{param,LM}}$ (B) |
|-------------------|---------------------|-----------------------------|
| InternVL 2.5-1B | Qwen 2.5-0.5B | 0.5 |
| InternVL 2.5-2B | InternLM 2.5-1.8B | 1.8 |
| InternVL 2.5-4B | Qwen 2.5-3B | 3.0 |
| InternVL 2.5-8B | InternLM 2.5-7B | 7.0 |
| InternVL 2.5-26B | InternLM 2.5-20B | 20.0 |
| InternVL 2.5-38B | Qwen 2.5-32B | 32.0 |
| InternVL 2.5-78B | Qwen 2.5-72B | 72.0 |

The largest configuration combines $N_{\mathrm{param,vision}} = 5.5$ B and $N_{\mathrm{param,LM}} = 72$ B, for a total of $\approx 78.4$ B parameters.
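The data flow of this pipeline can be summarized with a short sketch. The PyTorch snippet below illustrates how per-tile ViT patch features are token-merged via pixel shuffle, projected by the 2-layer MLP, and joined with the text embeddings; the dimensions, the simple concatenation at the end, and the module itself are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class ViTMLPLLMSketch(nn.Module):
    """Minimal sketch of the ViT-MLP-LLM wiring. The pixel-shuffle token merge and the
    2-layer MLP projector follow the general InternVL recipe, but all details here are
    illustrative assumptions, not the released implementation."""

    def __init__(self, vit_dim=3200, llm_dim=4096, downsample=0.5):
        super().__init__()
        self.downsample = downsample
        merged_dim = int(vit_dim / downsample ** 2)        # 2x2 token merge -> 4x channel width
        self.projector = nn.Sequential(                    # the 2-layer MLP projector
            nn.LayerNorm(merged_dim),
            nn.Linear(merged_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def pixel_shuffle(self, x):
        # Trade spatial tokens for channels: (B, 1024, C) per 448x448 tile -> (B, 256, 4C).
        b, n, c = x.shape
        h = w = int(n ** 0.5)
        s = self.downsample
        x = x.view(b, h, int(w * s), int(c / s))
        x = x.permute(0, 2, 1, 3).contiguous()
        x = x.view(b, int(h * s), int(w * s), int(c / (s * s)))
        return x.flatten(1, 2)

    def forward(self, vit_features, text_embeds):
        # vit_features: patch embeddings produced by the InternViT encoder (stand-in here).
        vis = self.projector(self.pixel_shuffle(vit_features))   # (B, 256, llm_dim)
        # In the full model the projected visual tokens are spliced into the text embedding
        # sequence at image placeholder positions before the LLM backbone runs.
        return torch.cat([vis, text_embeds], dim=1)

# Shape check with dummy tensors: one 448x448 tile -> 1024 patch tokens of width 3200.
sketch = ViTMLPLLMSketch()
fused = sketch(torch.randn(1, 1024, 3200), torch.randn(1, 32, 4096))
print(fused.shape)  # torch.Size([1, 288, 4096])
```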

Scaling Law: An empirical fit to OpenCompass scores follows $S(N) \approx \alpha \cdot (N/10^9)^{\beta} + \gamma$, with $\beta \in [0.22, 0.28]$, $\alpha \in [4.5, 5.0]$, and $\gamma \approx 40$.
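A quick numerical illustration of this fitted form follows; the constants used are mid-range assumptions within the reported intervals, not the exact fitted values.

```python
# Illustrative evaluation of the fitted scaling form S(N) ~ alpha * (N/1e9)**beta + gamma.
# The constants below are mid-range assumptions, not the exact fitted values.
alpha, beta, gamma = 4.75, 0.25, 40.0

def predicted_score(n_params: float) -> float:
    """Predicted OpenCompass score for a model with n_params total parameters."""
    return alpha * (n_params / 1e9) ** beta + gamma

# With beta ~ 0.25, doubling N multiplies the (S - gamma) term by 2**beta ~ 1.19:
# steady but diminishing returns as the family scales from 1B to 78B parameters.
gain = predicted_score(78e9) - predicted_score(8e9)
print(f"predicted gain from 8B to 78B: {gain:.1f} points")
```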

InternLM-XComposer-2.5 (IXC-2.5), often discussed interchangeably with InternVL 2.5, uses InternLM2-7B as its backbone, CLIP ViT-L/14 (up-trained to $560\times560$ input resolution), and Partial LoRA adapters (rank 256), with dynamic image partitioning and RoPE context extrapolation up to 96K tokens (Zhang et al., 2024).
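The Partial LoRA idea, a low-rank update applied only at visual-token positions while text tokens pass through the frozen base weight, can be sketched as follows; the layer dimensions and the module below are illustrative assumptions, not the released IXC-2.5 code.

```python
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    """Linear layer whose LoRA update is applied only at visual-token positions.
    Illustrative sketch of the Partial-LoRA idea, not the released IXC-2.5 module."""
    def __init__(self, in_features, out_features, rank=256):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)    # frozen pretrained weight
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)                   # start as a no-op update

    def forward(self, x, visual_mask):
        # x: (B, T, in_features); visual_mask: (B, T) bool, True where the token is visual.
        out = self.base(x)
        lora_update = self.lora_b(self.lora_a(x))
        return out + lora_update * visual_mask.unsqueeze(-1).to(out.dtype)

layer = PartialLoRALinear(4096, 4096, rank=256)
tokens = torch.randn(1, 10, 4096)
mask = torch.tensor([[True] * 4 + [False] * 6])              # first 4 tokens are image tokens
print(layer(tokens, mask).shape)                             # torch.Size([1, 10, 4096])
```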

2. Training Data and Quality Improvements

Dataset Composition:

  • InternVL 2.5 fine-tuning mixture totals 16.3M samples (up from 7.3M in v2.0).
  • Token modality split:
    • Single-image: 45.9%
    • Multi-image: 9.4%
    • Video: 39.8%
    • Text: 4.9%

High-Quality Sources:

  • Multi-image: BLINK, Mantis-Eval, MMIU, MuirBench, MMT-Bench, MIRB
  • Video: Video-MME, MVBench, MMBench-Video, MLVU, LongVideoBench, CG-Bench
  • Document/OCR: CharXiv, VCR, improved OCR corpora
  • Dialog/text: expanded UltraFeedback, UltraChat

Data Quality Protocols:

  • Repetitive-sample removal (~10K degenerate cases)
  • LLM-based scoring ($\geq 7/10$) for filtering text tasks
  • Heuristic exclusion of abnormal lengths/zero sequences
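
These protocols can be approximated with simple filters, as in the sketch below; the n-gram repetition heuristic, the 0-10 judge scale, and the length bounds are illustrative assumptions rather than the paper's exact criteria.

```python
from collections import Counter

def has_degenerate_repetition(text: str, ngram: int = 8, max_repeats: int = 4) -> bool:
    """Flag responses that repeat the same n-gram many times (a common degenerate/CoT-loop case)."""
    words = text.split()
    counts = Counter(tuple(words[i:i + ngram]) for i in range(len(words) - ngram + 1))
    return any(c >= max_repeats for c in counts.values())

def keep_sample(response: str, llm_score: float) -> bool:
    """Combine the three filters: repetition removal, LLM score >= 7/10, abnormal-length exclusion."""
    if has_degenerate_repetition(response):
        return False
    if llm_score < 7.0:                                  # LLM-as-judge quality score on a 0-10 scale
        return False
    if len(response) == 0 or len(response) > 32_000:     # illustrative length bounds
        return False
    return True

print(keep_sample("The answer is B because the chart shows a peak in 2019.", llm_score=8.5))  # True
```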

Performance Law: On MMMU, roughly doubling the fine-tuning data volume $D$ (in millions of samples) from 7.3M to 16.3M increases accuracy by $\Delta A \approx +3.2\%$, following $A_{\mathrm{MMMU}} \approx A_0 + k\,\log_{10}(D)$ with $k \approx 4.3$.

In IXC-2.5, pre-training employs general image-text and OCR datasets, batch sizes of 4096, cosine learning rate decay, and SFT on 24K interleaved image-text contexts (Zhang et al., 2024).

3. Inference and Test-Time Scaling Strategies

InternVL 2.5 enhances test-time performance by:

  • Chain-of-Thought (CoT) Decoding: Prompting with “Let me think step by step...” template prior to answer extraction.
  • Majority-Voting Ensembling: Aggregating $K=3$ CoT reasoning runs per query.
  • Multi-Crop/Tile Evaluation: For high-resolution images, predictions are composed from multiple tiles.

| Approach | Reasoning Passes | Accuracy (%) | Gain vs Direct (%) |
|--------------|------------------|--------------|--------------------|
| Direct | 1 | 66.4 | – |
| +CoT | 1 | 70.1 | +3.7 |
| +CoT+Voting | 3 | 72.3 | +5.9 |

Accuracy saturates at $K=3$ reasoning passes. CoT deadlocks are reduced by ~70% on OlympiadBench due to the data quality protocols.
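
A minimal sketch of the CoT-plus-voting procedure is shown below; the `generate` callable, the prompt wording beyond the quoted prefix, and the answer-extraction regex are assumptions for illustration, not the evaluation harness used in the paper.

```python
import re
from collections import Counter
from typing import Callable

COT_PREFIX = "Let me think step by step."

def cot_majority_vote(generate: Callable[[str, float], str],
                      question: str, k: int = 3, temperature: float = 0.7) -> str:
    """Run k Chain-of-Thought decodes and return the most common extracted answer.
    `generate(prompt, temperature)` is a stand-in for the model's text-generation call."""
    answers = []
    for _ in range(k):
        completion = generate(f"{question}\n{COT_PREFIX}", temperature)
        # Extract the final answer after a "the answer is" marker, if one is present.
        match = re.search(r"answer is[:\s]*([A-D]\b|\S+)", completion, flags=re.IGNORECASE)
        answers.append(match.group(1).rstrip(".") if match else completion.strip())
    return Counter(answers).most_common(1)[0][0]

# Toy usage with a deterministic stand-in model.
fake_model = lambda prompt, temperature: "The area doubles, so the answer is B."
print(cot_majority_vote(fake_model, "Which option is correct?"))   # B
```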

IXC-2.5 introduces RoPE extrapolation via dynamic NTK scaling for context lengths up to 96K tokens, supporting dense video frame and multi-image concatenation without retraining (Zhang et al., 2024).
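
A common community formulation of NTK-aware RoPE base rescaling is sketched below; the exact rule and the head dimension used by IXC-2.5 are not given here, so both should be read as assumptions.

```python
def ntk_scaled_rope_base(base: float, head_dim: int,
                         train_ctx: int, target_ctx: int) -> float:
    """NTK-aware scaling: enlarge the RoPE frequency base so that positions up to
    target_ctx fall inside the rotation range seen during training up to train_ctx.
    This is the widely used community formula, assumed here rather than taken from the paper."""
    scale = max(target_ctx / train_ctx, 1.0)
    return base * scale ** (head_dim / (head_dim - 2))

# Example: extrapolating from a 24K-token training context to a 96K-token inference context.
new_base = ntk_scaled_rope_base(base=10_000.0, head_dim=128, train_ctx=24_576, target_ctx=96_000)
print(f"rescaled RoPE base: {new_base:,.0f}")   # roughly 4x the original base
```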

4. Evaluation Protocols and Benchmark Results

InternVL 2.5-78B is extensively benchmarked against closed- and open-source models across domains:

| Task/Benchmark | InternVL 2.5-78B | GPT-4o | Claude-3.5-Sonnet | Qwen2-VL-72B |
|-------------------------------|------------------|--------|-------------------|--------------|
| MMMU (%) | 70.1 | 69.1 | 68.3 | 64.5 |
| DocVQA (ANLS) | 95.1 | 92.8 | 95.2 | 96.5 |
| TextVQA (%) | 83.4 | 77.4 | 74.1 | 85.5 |
| Video-MME w/ subtitles (%) | 74.0 | 77.2 | – | 77.8 |
| WildVision win-rate (%) | 71.4 | 80.6 | 71.7 | – |
| HallusionBench score | 57.4 | – | 55.5 | 58.1 |
| RefCOCO+ avg (%) | 92.3 | – | – | 91.1 |
| MMMB avg (multilingual) (%) | 86.3 | 74.8 | – | 86.8 |
| OpenCompass 17-task avg (%) | 72.9 (8B model) | 69.5 | – | – |

InternVL 2.5 is the first open-source MLLM to exceed 70% on MMMU.

IXC-2.5 demonstrates competitive results on 28 benchmarks, outperforming the open-source state of the art on 16 of them and matching GPT-4V/Gemini Pro on 16 key tasks, including MVBench, MLVU, DocVQA, MMDU, MathVista, and Design2Code (Zhang et al., 2024).

5. Chain-of-Thought Enhancement and Error Mitigation

The integration of CoT reasoning in InternVL 2.5 is formalized via a prompting template:

  • “Let me think step by step: [CoT reasoning] Therefore, the answer is …”

Direct test-time comparison ($A_{\mathrm{direct}}$ vs. $A_{\mathrm{CoT}}$) yields $\Delta_{\mathrm{CoT}} = +3.7\%$ on MMMU validation for the 78B model (70.1% vs. 66.4%).

CoT in supervised fine-tuning and reward modeling is also a core component in IXC-2.5’s article composition optimizer pipeline.

Improved quality protocols—such as LLM-based filtering, heuristic sample exclusion, and repetitive/deadlock case removal—reduce incidence of CoT deadlocks by approximately 70% on OlympiadBench (Chen et al., 2024).

6. Applications, Open-Source Impact, and Future Directions

InternVL 2.5 advances multimodal AI via:

  • Wide Application Coverage: Document understanding, VQA, multi-image/video input, multi-turn dialogue, webpage code generation, high-quality text-image article composition.
  • Open-Source Release: All checkpoints, demos, and code are available on HuggingFace (https://huggingface.co/OpenGVLab/InternVL) and GitHub (https://github.com/InternLM/InternLM-XComposer).
  • Community Integration: Adoption in OpenCompass, VLMEvalKit, third-party toolchains.
  • Future Research:
    • RLHF-style post-training on higher-quality instruction data
    • Larger vision encoder backbones ($N_{\mathrm{param,vision}} > 6$ B)
    • Support for longer contexts (>16K tokens) and higher-resolution inputs (>4K px)
    • Preference-based open-ended generation for enhanced user experience

InternVL 2.5 establishes the open-source state of the art on MMMU, closing the gap to closed-source commercial MLLMs. A plausible implication is that systematically scaling every axis (model, data, and inference) sets new standards for transparent multimodal AI system development (Chen et al., 2024, Zhang et al., 2024).
