InternVL3: Open-Source Multimodal LLM
- InternVL3 is a family of open-source multimodal large language models that integrate vision and language through native, single-stage pretraining.
- It employs an innovative ViT-MLP-LLM architecture with variable visual position encoding to efficiently align and process extended multimodal inputs.
- Advanced techniques like mixed preference optimization and supervised fine-tuning boost its performance, setting new benchmarks against proprietary systems.
InternVL3 is a family of open-source multimodal LLMs (MLLMs) that unifies vision and language modeling within a native pre-training paradigm. Diverging from the incremental multimodal adaptation strategies seen in earlier models, InternVL3’s architecture, training objectives, and empirical optimization methods address the limitations of modular and post-hoc approaches. Noteworthy for its single-stage multimodal pretraining, scalable backbone, advanced preference optimization, and variable visual position encoding (V2PE), InternVL3 sets a state-of-the-art benchmark for open-source vision–LLMs, achieving competitive performance against leading proprietary systems across a range of multimodal, reasoning, and language tasks.
1. Model Architecture and Multimodal Pre-Training
InternVL3 implements a “ViT-MLP-LLM” backbone. The vision encoder (InternViT, 300M or 6B parameters) extracts patch embeddings from visual input, which are projected into the language embedding space by a randomly initialized two-layer MLP and interleaved with language tokens. The language model component is an off-the-shelf pre-trained LLM such as Qwen2.5 or InternLM3. A pixel-unshuffle operation compresses each high-resolution image tile (e.g., 448×448 pixels) into 256 visual tokens. All modules (ViT, MLP, and LLM) are trainable from the outset of pre-training, enabling cohesive multimodal learning.
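To make the token bookkeeping concrete, here is a minimal sketch of the pixel-unshuffle plus two-layer MLP bridge, assuming an InternViT patch size of 14 and illustrative hidden dimensions (not the released implementation):

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Sketch of the ViT-MLP bridge: pixel-unshuffle the ViT patch grid by a
    factor of 2, then map the merged patches into the LLM embedding space
    with a two-layer MLP. Dimensions are illustrative assumptions."""

    def __init__(self, vit_dim=1024, llm_dim=4096, downsample=2):
        super().__init__()
        self.downsample = downsample
        self.mlp = nn.Sequential(
            nn.LayerNorm(vit_dim * downsample**2),
            nn.Linear(vit_dim * downsample**2, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patches):              # patches: (B, H, W, C), e.g. (B, 32, 32, 1024)
        b, h, w, c = patches.shape
        d = self.downsample
        # Merge each d x d neighbourhood of ViT patches into one token (pixel unshuffle).
        x = patches.reshape(b, h // d, d, w // d, d, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // d) * (w // d), c * d * d)
        return self.mlp(x)                   # (B, 256, llm_dim) for a 32x32 patch grid

# A 448x448 tile with patch size 14 gives a 32x32 grid: 1024 patches -> 256 visual tokens.
tokens = VisualProjector()(torch.randn(1, 32, 32, 1024))
print(tokens.shape)  # torch.Size([1, 256, 4096])
```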
A training sample integrates both text and visual tokens, with position indices provided by the V2PE scheme (see Section 2). The unified training procedure exposes InternVL3 to 50B pure-text and 150B multimodal tokens in a single pre-training stage, with no intermediary text-only adaptation. The core autoregressive loss is
$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} w \,\log p_\theta\!\left(x_i \mid x_1, \ldots, x_{i-1}\right), \qquad w = \frac{1}{N^{0.5}},$$

with $N$ the number of tokens in the sample; the square-averaged weight $1/N^{0.5}$ balances the sequence-length bias that pure token- or sample-averaging would introduce, and the loss is computed only on text tokens, with visual tokens serving as conditioning context. The optimization goal is

$$\min_{\theta}\; \mathbb{E}_{x \sim \mathcal{D}}\!\left[\mathcal{L}(\theta; x)\right],$$

taken over the mixed text and multimodal pre-training corpus $\mathcal{D}$.
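As a rough illustration of the square-averaged weighting (a sketch under the definitions above, not the training code; shapes and the loss mask are assumptions):

```python
import torch
import torch.nn.functional as F

def square_averaged_loss(logits, labels, loss_mask):
    """Next-token loss with a 1/N**0.5 per-sample weight.

    logits:    (B, T, V) model outputs
    labels:    (B, T)    target token ids
    loss_mask: (B, T)    1 for text tokens that contribute to the loss,
                         0 for visual/context tokens.
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, reduction="none"
    ) * loss_mask                                   # (B, T)
    n = loss_mask.sum(dim=1).clamp(min=1)           # supervised tokens per sample
    per_sample = per_token.sum(dim=1) / n.sqrt()    # square averaging: weight 1 / N**0.5
    return per_sample.mean()

loss = square_averaged_loss(
    torch.randn(2, 8, 32000), torch.randint(0, 32000, (2, 8)), torch.ones(2, 8)
)
```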
This architecture allows InternVL3 to natively align cross-modal representations, resulting in improved multimodal reasoning proficiency (Zhu et al., 14 Apr 2025).
2. Variable Visual Position Encoding (V2PE) and Long-Context Processing
Integrating dense visual tokens alongside language tokens can rapidly exhaust a model’s context window due to fixed-step positional encoding. InternVL3 addresses this with V2PE, a recursive scheme that assigns each token $x_i$ a position index

$$p_i = \begin{cases} p_{i-1} + 1, & x_i \text{ is a text token},\\ p_{i-1} + \delta, & x_i \text{ is a visual token},\end{cases}$$

with the fractional increment $\delta$ drawn from $\{1, \tfrac{1}{2}, \tfrac{1}{4}, \ldots, \tfrac{1}{256}\}$, chosen per image. This compresses the position space of visual tokens and preserves room for long text sequences within the same window. During inference, $\delta$ is adaptively tuned to fit window constraints. Visual position embeddings are computed as sinusoidal functions over the resulting fractional indices $p_i$. V2PE is applied directly in the model's self-attention and fusion layers, allowing robust processing of extended multimodal contexts (Zhu et al., 14 Apr 2025).
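A minimal sketch of the index recursion (the token-type flags and the per-image choice of $\delta$ are simplified assumptions):

```python
from typing import List

def v2pe_positions(is_visual: List[bool], delta: float = 0.25) -> List[float]:
    """Assign V2PE-style position indices: step +1 for text tokens,
    +delta (a fraction such as 1/4 or 1/256) for visual tokens."""
    positions, p = [], 0.0
    for visual in is_visual:
        p += delta if visual else 1.0
        positions.append(p)
    return positions

# 3 text tokens, 8 visual tokens, 2 text tokens:
# the visual run advances the position index by only 8 * 0.25 = 2.
print(v2pe_positions([False] * 3 + [True] * 8 + [False] * 2))
```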
3. Post-Training, Optimization, and Test-Time Strategies
InternVL3 applies advanced post-training strategies to augment base capabilities:
- Supervised Fine-Tuning (SFT): InternVL3 uses 21.7M curated, high-quality multimodal instruction samples, leveraging data augmentation (random JPEG, multimodal packing) to improve generalization.
- Mixed Preference Optimization (MPO): This combines three loss terms—preference (DPO), quality (BCO), and generation (next-token LM)—to optimize both preference alignment and task accuracy. For each model scale, gains range from +2.9 to +4.5 points in overall multimodal reasoning (Zhu et al., 14 Apr 2025).
- Test-Time Best-of-N and VisualPRM Critic: At inference, InternVL3 can generate candidate responses, scoring them with a pretrained Visual Process Reward Model (VisualPRM-8B). The final prediction is chosen as the highest-scoring chain. This procedure significantly improves mathematical and reasoning accuracy, e.g., up to +6 points on MathVerse for 38B-scale models.
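A schematic of the Best-of-N selection loop (the `generate` and `score_chain` callables stand in for the policy model and the VisualPRM-8B critic; the names are hypothetical):

```python
from typing import Callable, List, Tuple

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score_chain: Callable[[str, str], float],
              n: int = 8) -> Tuple[str, float]:
    """Sample n candidate reasoning chains and keep the one the
    process reward model scores highest."""
    candidates: List[Tuple[str, float]] = []
    for _ in range(n):
        chain = generate(prompt)                 # one sampled response chain
        candidates.append((chain, score_chain(prompt, chain)))
    return max(candidates, key=lambda c: c[1])   # (best chain, its score)
```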
Training and inference are orchestrated using the InternEVO framework, which extends ZeRO-style sharding to allow fully decoupled data, tensor, sequence, and pipeline parallelism. This supports context lengths up to 32,000 tokens and yields faster training than previous InternVL pipelines.
4. Empirical Performance Across Benchmarks
InternVL3 establishes competitive or best-in-class results across standard multimodal and language understanding benchmarks:
| Model | MMMU (%) | Overall multimodal avg. (%) | Notes |
|---|---|---|---|
| InternVL3-78B (open) | 72.2 | 54.6 | Best open-source |
| GPT-4o | 70.7 | 47.9 | Closed-source |
| Claude 3.5 Sonnet | 75.0 | 53.9 | Closed-source |
| Gemini 2.5 Pro | 69.9 | 58.5 | Closed-source; leads on the overall average |
Additionally, InternVL3-8B outperforms Qwen2.5-7B on several language benchmarks, e.g., reaching 88.4% on GSM8K (zero-shot), 89.0% on HumanEval (pass@1), and an overall language average of 78.9% (Zhu et al., 14 Apr 2025).
In scientific visual QA, InternVL3-8B attains 0.740 ROUGE-1/ROUGE-L and 0.983 BERTScore on the SciVQA 2025 test split, outperforming next-best single models (Bespoke, Qwen2.5-VL) by margins of +0.031 ROUGE-1 F1 and +0.004 BERTScore. Error analysis identifies strengths in cross-chart grounding and numerical precision, with weaknesses in subfigure scaling misalignments and logical errors on multi-hop questions (Movva et al., 8 Jul 2025).
5. Data, Preprocessing, and Prompt Engineering
Input preprocessing involves resizing figures to a maximum dimension of 1024 px, center-cropping or padding to a fixed target size, and contrast normalization for textual/numerical clarity. Prompt construction concatenates captions, metadata, and questions with special tokens:
```
<FIG_CAP> [caption] </FIG_CAP> <TYPE> [figure_type] </TYPE> <Q> [question] </Q>
```
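A rough rendering of the figure preprocessing described above (the 1024 px cap comes from the text; the square target size and contrast factor are assumptions):

```python
from PIL import Image, ImageEnhance

def preprocess_figure(path: str, max_dim: int = 1024, target: int = 448) -> Image.Image:
    """Resize so the longer side is at most max_dim, pad to a square canvas,
    and boost contrast slightly for text/number legibility."""
    img = Image.open(path).convert("RGB")

    # Cap the longer side at max_dim, preserving aspect ratio.
    scale = min(max_dim / max(img.size), 1.0)
    img = img.resize((round(img.width * scale), round(img.height * scale)))

    # Pad to a square canvas, then resize to the model's input resolution.
    canvas = Image.new("RGB", (max(img.size),) * 2, (255, 255, 255))
    canvas.paste(img, ((canvas.width - img.width) // 2, (canvas.height - img.height) // 2))
    canvas = canvas.resize((target, target))

    # Mild contrast normalization (the factor is an arbitrary illustrative choice).
    return ImageEnhance.Contrast(canvas).enhance(1.2)
```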
InternVL3’s high SciVQA performance depends heavily on prompt optimization and two-stage Chain-of-Thought (CoT) prompting: an “Initial Analysis” followed by structured answer extraction, using XML tags such as `<reasoning>...</reasoning>` and `<answer>...</answer>`. This approach enforces concise, context-minimal output (a numerical value or a single word/phrase) and elicits improved intermediate reasoning and verifiability (Movva et al., 8 Jul 2025).
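A sketch of how such a prompt might be assembled (the template follows the special tokens and XML tags shown above; the function and field names are placeholders):

```python
def build_scivqa_prompt(caption: str, figure_type: str, question: str) -> str:
    """Assemble the figure-QA prompt with the special tokens, then append
    the two-stage CoT instruction asking for <reasoning> and <answer> tags."""
    context = (
        f"<FIG_CAP> {caption} </FIG_CAP> "
        f"<TYPE> {figure_type} </TYPE> "
        f"<Q> {question} </Q>"
    )
    instruction = (
        "First give an initial analysis of the figure inside "
        "<reasoning>...</reasoning>, then output only the final numerical "
        "value or single word/phrase inside <answer>...</answer>."
    )
    return f"{context}\n{instruction}"

print(build_scivqa_prompt("Accuracy vs. model size.", "line chart",
                          "Which model size reaches 80% accuracy?"))
```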
6. Subsequent Extensions: InternVL3.5 and Beyond
InternVL3.5 introduces significant architectural and algorithmic extensions:
- Cascade RL: A two-stage reinforcement learning recipe (offline MPO followed by online GSPO) that yields roughly a +16% gain in overall reasoning accuracy over InternVL3, with monotonic improvements at every model scale.
- Visual Resolution Router (ViR): Enables dynamic per-patch compression (4× or 16×), mediated by a lightweight router head trained via cross-entropy, achieving ~50% reduction in visual tokens with under 1% performance loss.
- Decoupled Vision–Language Deployment (DvD): Partitioning inference between vision and language GPU servers increases throughput up to 4.05×, reducing serialization bottlenecks.
- Agentic/GUI Capabilities: InternVL3.5 supports embodied and GUI-interactive tasks, with advances in performance on ScreenSpot-v2, OSWorld-G, WebArena-Lite-v2, VSI-Bench, and more, notably closing the gap with GPT-5 and related proprietary models (Wang et al., 25 Aug 2025).
| Model | Multimodal avg., 35 benchmarks (%) | Reasoning avg., 9 (%) | Text avg., 8 (%) | Agentic avg., 6 (%) | Notes |
|---|---|---|---|---|---|
| InternVL3.5-241B-A28B (MoE) | 74.1 | 66.9 | 85.3 | 66.2 | Open source; matches or approaches GPT-5 |
| InternVL3-78B | 67.9 | 54.6 | 79.4 | ≈60.6 | |
| GPT-5 | 74.0 | 74.1 | 91.3 | 77.5 | Closed-source |
The introduction of Cascade RL, ViR, and DvD collectively enables both substantial reasoning improvement and real-world inference speedups, ushering in a new phase of openness and efficacy for MLLMs.
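To make the ViR idea concrete, here is a minimal sketch of per-patch compression routing; the router head, dimensions, and decision rule are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class VisualResolutionRouter(nn.Module):
    """Sketch of ViR-style routing: a small head predicts, per visual patch,
    whether 4x compression suffices or 16x can be used, trading tokens for detail."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.head = nn.Linear(dim, 2)   # logits over {4x, 16x} compression

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (num_patches, dim); returns the chosen compression rate per patch.
        logits = self.head(patch_feats)
        choice = logits.argmax(dim=-1)            # 0 -> keep 4x, 1 -> compress 16x
        return torch.where(choice == 0,
                           torch.full_like(choice, 4),
                           torch.full_like(choice, 16))

router = VisualResolutionRouter()
rates = router(torch.randn(256, 1024))
# Patches routed to 16x keep far fewer tokens; compressing low-detail regions
# more aggressively is what produces the reported ~50% average token reduction.
print((rates == 16).float().mean())
```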
7. Limitations and Failure Modes
Despite strong empirical results, InternVL3 exhibits several notable failure modes:
- Visual Misinterpretations: Particularly when subfigures differ in scale or when visual elements (text, marks) overlap.
- Numerical Misalignments: Manifesting as rounding errors or off-by-one mistakes under degraded visual resolution.
- Flawed Multi-Hop Reasoning: Errors when intermediate steps require domain-specific or non-visual reasoning.
Current model variants are not fine-tuned for specific downstream domains such as SciVQA; performance relies primarily on inference-time prompt engineering. This suggests opportunities for further gains through domain-adaptive fine-tuning or higher-resolution visual processing.
A plausible implication is that, as model and infrastructure scaling continue, practitioner emphasis may shift toward optimizing context adaptation (e.g., V2PE strategies), advanced preference-based learning, and efficient multimodal deployment frameworks.