InternVL: Open-Source Vision–Language Models
- InternVL is a suite of open-source vision–language models that combines high-capacity vision encoders with language middleware for scalable multimodal reasoning.
- It employs a two-phase pretraining strategy with contrastive, image–text matching, and generative losses to attain state-of-the-art performance on diverse benchmarks.
- The InternVL series enhances efficiency and scalability through innovations such as dynamic token compression, progressive training, and native multimodal pretraining.
InternVL is a family of open-source large-scale vision–language foundation models that integrate high-capacity vision encoders with LLMs to address a broad spectrum of visual–linguistic and multimodal reasoning tasks. Initially introduced with a 6B-parameter Vision Transformer (ViT) encoder and a robust language middleware (QLLaMA), InternVL pioneered scalable cross-modal alignment, progressive training strategies, and broad task generalization. Successive versions—including InternVL 1.5, 2.5, X, and InternVL3—establish the series as a reference point for open-source multimodal research, covering domains from image, video, scientific, and document understanding to 3D object retrieval, autonomous driving, and biomedical report correction. Below, key facets of the InternVL series are documented, spanning architectural principles, advancements in efficiency, benchmark results, open challenges, and domains of deployment.
1. Model Architecture, Core Innovations, and Training Paradigms
InternVL is defined by the modular integration of a large vision encoder—originally InternViT-6B, a vanilla ViT scaled up to 6B parameters—and a language middleware (QLLaMA). In the initial architecture, raw image inputs are mapped through InternViT-6B to global image embeddings, while textual prompts are encoded by QLLaMA into aligned text representations. A cross-modal "adapter" (with learnable queries and cross-attention) enables bidirectional transfer between modalities, as sketched below.
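The adapter can be pictured as a small bank of learnable query embeddings that cross-attend into the image features, in the spirit of a Q-Former. The PyTorch sketch below illustrates this pattern; the class name, query count, and dimensions are illustrative assumptions, not InternVL's exact implementation.

```python
import torch
import torch.nn as nn

class QueryAdapter(nn.Module):
    """Q-Former-style adapter: learnable queries cross-attend into image features.

    Hyperparameters (num_queries, dims, heads) are illustrative assumptions,
    not InternVL's published configuration.
    """
    def __init__(self, num_queries=96, vision_dim=3200, text_dim=4096, num_heads=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, text_dim) * 0.02)
        self.vision_proj = nn.Linear(vision_dim, text_dim)   # map ViT features into adapter width
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, image_feats):                  # image_feats: (B, N_patches, vision_dim)
        kv = self.vision_proj(image_feats)           # (B, N_patches, text_dim)
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)          # queries pull information from image tokens
        return self.norm(out + q)                    # (B, num_queries, text_dim), consumed by the LLM side
```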
Two-phase pretraining is employed:
- Stage 1: A CLIP-style symmetric contrastive loss on web-scale noisy image–text pairs aligns vision and language representations (a representative form of this objective is given after the list).
- Stage 2: Generative training, with frozen vision encoder and QLLaMA, refines cross-modal alignment using much cleaner image–text corpora and a blend of losses (contrastive, image–text matching, text generation).
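The Stage-1 objective is the standard CLIP-style symmetric InfoNCE loss. In conventional notation, with batch size $N$, image embeddings $v_i$, text embeddings $t_i$, cosine similarity $\mathrm{sim}(\cdot,\cdot)$, and temperature $\tau$, a representative form (notation assumed here, not copied from the paper) is:

```latex
\mathcal{L}_{\mathrm{ITC}} = -\frac{1}{2N}\sum_{i=1}^{N}\Bigg[
  \log\frac{\exp\big(\mathrm{sim}(v_i, t_i)/\tau\big)}{\sum_{j=1}^{N}\exp\big(\mathrm{sim}(v_i, t_j)/\tau\big)}
  + \log\frac{\exp\big(\mathrm{sim}(v_i, t_i)/\tau\big)}{\sum_{j=1}^{N}\exp\big(\mathrm{sim}(v_j, t_i)/\tau\big)}
\Bigg]
```

The two terms score each image against all texts in the batch and each text against all images, so mismatches are penalized in both directions.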
In later generations, this architecture is systematically enhanced:
- InternVL 1.5: Features continuous learning for the vision encoder, dynamic high-resolution tiling (partitioning images into tiles to support inputs up to 4K resolution), and a high-quality bilingual English/Chinese dataset, substantially boosting OCR robustness and scene understanding.
- InternVL-X: Introduces three compression modules: PVTC (dual-path local/global projector), LVTC (layer-wise visual token compression and expansion), and RVTC (adaptive token allocation by area or length). These allow training and inference on high-resolution visual inputs using only 20% or fewer tokens, boosting efficiency while improving average benchmark accuracy by ~2.3% over previous versions.
- InternVL3: Embodies truly native multimodal pretraining (no post-hoc alignment). Variable Visual Position Encoding (V2PE) supports very long interleaved visual sequences by assigning fine-grained positional increments to visual tokens (sketched below this list). Supervised fine-tuning and Mixed Preference Optimization (MPO), whose loss blends preference, quality, and generation terms, further alleviate distribution shifts and bolster chain-of-thought (CoT) capabilities.
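As referenced above, the core idea of V2PE can be sketched simply: text tokens advance the position counter by a full step, while visual tokens advance it by a smaller fractional step, so long interleaved image/video sequences consume far less of the positional range. The step size and token-type convention below are illustrative assumptions, not InternVL3's exact schedule.

```python
from typing import Sequence
import torch

def v2pe_positions(token_types: Sequence[str], visual_step: float = 0.25) -> torch.Tensor:
    """Assign position ids: text tokens advance by 1.0, visual tokens by a smaller
    fractional step, compressing long visual sequences into a short positional range.
    The 0.25 default is an illustrative assumption."""
    pos, positions = 0.0, []
    for t in token_types:
        positions.append(pos)
        pos += 1.0 if t == "text" else visual_step
    return torch.tensor(positions)

# Example: 4 text tokens followed by 8 visual tokens
print(v2pe_positions(["text"] * 4 + ["visual"] * 8))
# text tokens occupy positions 0..3; visual tokens advance in 0.25 steps from 4.0
```

Because rotary position encodings are computed from sinusoids of the position value, real-valued (fractional) positions of this kind can be used without altering the attention mechanism itself.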
2. Data Scale, Cross-Modal Alignment, and Training Methodologies
InternVL’s training sets are exceptionally broad: 6+ billion image–text pairs drawn from LAION-en, LAION-multi, the synthetic LAION-COCO, COYO, and academic caption collections (CC3M, CC12M, SBU). Rigorous filtering (for caption quality, duplication, and safety) yields ~5B pretraining instances, with the generative stage distilled to ~1B high-quality examples.
Alignment between image and text is achieved by:
- Deep scaling of both vision (InternViT-6B) and language (QLLaMA or larger LLMs).
- Multi-stage objectives (contrastive, image–text matching, generative) that bridge "representation gaps"; a schematic form of the combined objective is given after this list.
- Progressive transfer to downstream LLMs via middleware, supporting both discriminative (contrastive) and generative (captioning, dialogue) workflows.
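A schematic way to write the blended Stage-2 objective, with image–text contrastive (ITC), image–text matching (ITM), and generative (language-modeling) terms, is given below; the weights $\lambda_i$ are placeholders for illustration rather than values reported by the authors.

```latex
\mathcal{L}_{\mathrm{stage2}}
  = \lambda_{1}\,\mathcal{L}_{\mathrm{ITC}}
  + \lambda_{2}\,\mathcal{L}_{\mathrm{ITM}}
  + \lambda_{3}\,\mathcal{L}_{\mathrm{gen}}
```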
In InternVL3, all parameters are jointly optimized for both language-only and vision–language tasks, avoiding the pitfalls of freezing or staged multimodal alignment.
3. Efficiency, Parameter Scaling, and Transferability
The series addresses efficiency via several fronts:
- Mini-InternVL distills InternViT-6B into a 300M-parameter visual backbone and uses pixel unshuffle for a 4× reduction in visual tokens (sketched after this list), yielding 1–4B-parameter variants that preserve up to 90% of the flagship model’s task performance and facilitate deployment on edge devices.
- InternVL-X goes further by integrating PVTC, LVTC, and RVTC. Local/global dual queries (PVTC) maintain information density, while adaptive slicing (RVTC) dynamically assigns token budgets. Layer-wise compression (LVTC) limits quadratic transformer costs to regions and depths necessary for high-level reasoning.
- Standardized VQA-conversational data templates and full-parameter fine-tuning enable rapid transfer across domains (autonomous driving, remote sensing, medical imaging).
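As noted in the first item above, pixel unshuffle trades spatial resolution for channel depth: each 2×2 block of neighboring visual tokens is folded into a single token, cutting the token count by 4×. A minimal sketch, with an assumed 32×32 ViT token grid and feature width, is shown below.

```python
import torch

def pixel_unshuffle_tokens(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Merge each scale x scale block of visual tokens into one token by concatenating
    along the channel dimension, e.g. (B, 32, 32, C) -> (B, 16, 16, 4C)."""
    b, h, w, c = x.shape
    x = x.reshape(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5)                     # (B, H/s, W/s, s, s, C)
    return x.reshape(b, h // scale, w // scale, scale * scale * c)

feats = torch.randn(1, 32, 32, 1024)    # 1024 visual tokens from a ViT grid (assumed shape)
merged = pixel_unshuffle_tokens(feats)  # (1, 16, 16, 4096): 256 tokens remain
print(merged.shape)
```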
4. Task Performance and Domain Benchmarks
InternVL models have established competitive or state-of-the-art performance on an array of public benchmarks:
- Visual–linguistic: State-of-the-art on 32+ tasks (ImageNet-1K, ADE20K, Flickr30k, COCO, NoCaps, MME, POPE) (Chen et al., 2023).
- Multitask AGI: InternVL-Chat-V1.2-34B yields ~63.4% on MMT-Bench, surpassing major closed models (GPT-4V, GeminiProVision) (Ying et al., 24 Apr 2024).
- Video understanding: InternVL-Chat-V1.5-20B matches video-specialized models on short clips (61.2–62.4% accuracy, Video-MME), but performance declines (to ~46–47%) on longer videos due to limited temporal context (Fu et al., 31 May 2024).
- Autonomous driving: Full-model fine-tuning on multiview nuScenes images achieves a 0.6002 leaderboard score on DriveLM (Li et al., 10 Dec 2024). Semi-supervised pipelines leveraging template-driven pseudo-answers and Self-Consistency Refinement elevate performance from 44.85% (with 5% labeled data) to 54.27% (Wang et al., 13 Mar 2025).
- 3D object retrieval: When paired with CLIP in the TeDA framework, InternVL-generated object descriptions boost 3D retrieval performance by up to 8.4% in mean average precision compared to purely visual methods (Wang et al., 5 May 2025).
- Medical domain: On CorBenchX (chest X-ray report correction), InternVL3-8B attains robust but non-leading correction scores (BLEU 0.768), lagging behind methods augmented by multi-step reinforcement learning (Zou et al., 17 May 2025).
- Scientific charts/reasoning: On ClimateViz, InternVL 2.5 achieves up to 77.8% accuracy with structured (chart+table+text) inputs; explanation-augmented outputs yield modest improvements, but human parity remains elusive (Su et al., 10 Jun 2025).
- Physics reasoning: On intuitive physics tasks (GRASP, IntPhys 2), even the 78B model only approaches 54% accuracy, with diagnostic probes revealing that vision encoders capture physical plausibility cues that are partially lost in language integration, highlighting vision–language misalignment (Ballout et al., 22 Jul 2025).
- Scientific VQA: InternVL3 achieves a 26.5% LLM-judge score on SFE, outperforming previous open models on comparative reasoning (L3) but underperforming in signal perception and attribute understanding (Zhou et al., 12 Jun 2025).
5. Limitations, Bias, and Ethical Considerations
Notable limitations include:
- Cursive and low-resource text OCR: InternVL yields negative accuracy and BLEU scores on Pashto OCR, far underperforming both open and closed-source models (Haq et al., 15 May 2025).
- Domain bias: The VLA bias assessment (Girrbach et al., 25 Oct 2024) shows InternVL variants systematically reproduce human and societal gender stereotypes, especially in occupation. Debiasing via full or LoRA fine-tuning can halve or nearly eliminate bias but at the cost of general performance.
- Scientific perception and interpretability: In realistic scientific scenarios (Chemistry/Physics VQA, chart-based fact-checking), InternVL and peers remain substantially behind both closed models (Gemini, GPT-4o) and human experts, especially in nuanced perception and canonical explanation generation.
- Reasoning bottlenecks: Diagnostic studies on physical and temporal reasoning indicate that InternVL’s vision encoders adequately represent key cues, but language modules and vision–language alignment act as consistent bottlenecks.
6. Architectural and Community Impact with Open Science Commitments
InternVL’s series exemplifies several trends catalyzing progress in open-source multimodal AI:
- Full release of code, weights, and datasets (e.g., InternVL3's stated plans to open-source its training data and models) has strengthened transparency, reproducibility, and broad benchmarking.
- Modular design (ViT–MLP–LLM), native multimodal pretraining (as in InternVL3), and technical innovations such as V2PE and MPO push scalable, context-length-agnostic architectures.
- Systematic efficiency improvements (Mini-InternVL, InternVL-X) have made high-performance multimodal reasoning accessible on commodity or edge hardware.
- The model’s adoption in external scientific workflows (autonomous driving, radiology, scientific chart verification) and its integration with hybrid symbolic pipelines (scene graph augmentation for video QA (Ma et al., 15 Sep 2025)) demonstrate wide adaptability.
- Chain-of-thought prompting and test-time scaling introduce new axes for practical performance enhancement.
7. Prospects and Future Research Directions
Key open problems and directions for future investigation include:
- Vision–language alignment: Systematic refinement of connector modules, possibly via layer-wise diagnostics and bridging architectures, is required to preserve physical/temporal cues for reasoning-intensive tasks.
- Bias mitigation: Advanced fine-tuning and data-level interventions are needed to balance debiasing and robust downstream task performance, especially in high-stakes applications.
- Scientific and statistical reasoning: Enhanced model inductive biases and pretraining tasks targeting scientific chart, physics, and domain-specific VQA are likely required to close the gap with both proprietary models and expert human reasoning.
- Hybrid architectures: Combining continuous VLMs with discrete, interpretable symbolic layers (e.g., scene graphs (Ma et al., 15 Sep 2025)) and structured explanation outputs will be essential for trustworthy, explainable AI systems.
- Multi-modal scaling: Further extension to include video, 3D, audio, and extended context-length inputs, leveraging hierarchical and adaptive tokenization.
- Open science and reproducibility: Ongoing release of model weights, data filtering pipelines, and detailed recipes is vital for accelerating research adoption and external validation.
InternVL’s trajectory illustrates the evolving convergence of large-scale vision and language modeling, balancing efficiency, generalization, and broad applicability. While setting new standards for open-source multimodal AI, the series simultaneously exposes enduring research challenges in alignment, bias, scientific reasoning, and interpretability, mapping a pathway for future advances.