InternVL3: Open-Source Multimodal LLM
- InternVL3 is a family of open-source multimodal large language models that integrate vision and language through native, single-stage pretraining.
- It employs an innovative ViT-MLP-LLM architecture with variable visual position encoding to efficiently align and process extended multimodal inputs.
- Advanced techniques like mixed preference optimization and supervised fine-tuning boost its performance, setting new benchmarks against proprietary systems.
InternVL3 is a family of open-source multimodal LLMs (MLLMs) that unifies vision and language modeling within a native pre-training paradigm. Diverging from the incremental multimodal adaptation strategies seen in earlier models, InternVL3’s architecture, training objectives, and empirical optimization methods address the limitations of modular and post-hoc approaches. Noteworthy for its single-stage multimodal pretraining, scalable backbone, advanced preference optimization, and variable visual position encoding (V2PE), InternVL3 sets a state-of-the-art benchmark for open-source vision–LLMs, achieving competitive performance against leading proprietary systems across a range of multimodal, reasoning, and language tasks.
1. Model Architecture and Multimodal Pre-Training
InternVL3 implements a “ViT-MLP-LLM” backbone. The vision encoder (InternViT, 300M or 6B parameters) extracts patch embeddings from visual input, which are projected into the language embedding space by a randomly initialized two-layer MLP and interleaved with language tokens. The language model component is an off-the-shelf pre-trained LLM such as Qwen2.5 or InternLM3. A pixel-unshuffle operation compresses each high-resolution image tile (e.g., 448×448 pixels) into 256 visual tokens. All modules (ViT, MLP, and LLM) are trainable from the outset of pre-training, enabling cohesive multimodal learning.
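To make the token bookkeeping concrete, here is a minimal sketch of the pixel-unshuffle plus two-layer MLP bridge, assuming an InternViT patch size of 14 and illustrative hidden dimensions (not the released implementation):

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Sketch of the ViT-MLP bridge: pixel-unshuffle the ViT patch grid by a
    factor of 2, then map the merged patches into the LLM embedding space
    with a two-layer MLP. Dimensions are illustrative assumptions."""

    def __init__(self, vit_dim=1024, llm_dim=4096, downsample=2):
        super().__init__()
        self.downsample = downsample
        self.mlp = nn.Sequential(
            nn.LayerNorm(vit_dim * downsample**2),
            nn.Linear(vit_dim * downsample**2, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patches):              # patches: (B, H, W, C), e.g. (B, 32, 32, 1024)
        b, h, w, c = patches.shape
        d = self.downsample
        # Merge each d x d neighbourhood of ViT patches into one token (pixel unshuffle).
        x = patches.reshape(b, h // d, d, w // d, d, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // d) * (w // d), c * d * d)
        return self.mlp(x)                   # (B, 256, llm_dim) for a 32x32 patch grid

# A 448x448 tile with patch size 14 gives a 32x32 grid: 1024 patches -> 256 visual tokens.
tokens = VisualProjector()(torch.randn(1, 32, 32, 1024))
print(tokens.shape)  # torch.Size([1, 256, 4096])
```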
A training sample integrates both text and visual tokens, with position indices provided by the V2PE scheme (see Section 2). The unified training procedure exposes InternVL3 to 50B pure-text and 150B multimodal tokens in a single pre-training stage, with no intermediary text-only adaptation. The core autoregressive loss is
$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} w \,\log p_\theta\!\left(x_i \mid x_1, \ldots, x_{i-1}\right), \qquad w = \frac{1}{N^{0.5}},$$

with $N$ the number of tokens in the sample; the square-averaged weight $1/N^{0.5}$ balances the sequence-length bias that pure token- or sample-averaging would introduce, and the loss is computed only on text tokens, with visual tokens serving as conditioning context. The optimization goal is

$$\min_{\theta}\; \mathbb{E}_{x \sim \mathcal{D}}\!\left[\mathcal{L}(\theta; x)\right],$$

taken over the mixed text and multimodal pre-training corpus $\mathcal{D}$.
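As a rough illustration of the square-averaged weighting (a sketch under the definitions above, not the training code; shapes and the loss mask are assumptions):

```python
import torch
import torch.nn.functional as F

def square_averaged_loss(logits, labels, loss_mask):
    """Next-token loss with a 1/N**0.5 per-sample weight.

    logits:    (B, T, V) model outputs
    labels:    (B, T)    target token ids
    loss_mask: (B, T)    1 for text tokens that contribute to the loss,
                         0 for visual/context tokens.
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, reduction="none"
    ) * loss_mask                                   # (B, T)
    n = loss_mask.sum(dim=1).clamp(min=1)           # supervised tokens per sample
    per_sample = per_token.sum(dim=1) / n.sqrt()    # square averaging: weight 1 / N**0.5
    return per_sample.mean()

loss = square_averaged_loss(
    torch.randn(2, 8, 32000), torch.randint(0, 32000, (2, 8)), torch.ones(2, 8)
)
```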
This architecture allows InternVL3 to natively align cross-modal representations, resulting in improved multimodal reasoning proficiency (Zhu et al., 14 Apr 2025).
2. Variable Visual Position Encoding (V2PE) and Long-Context Processing
Integrating dense visual tokens alongside language tokens can rapidly exhaust a model’s context window due to fixed-step positional encoding. InternVL3 addresses this with V2PE, a recursive scheme that assigns each token $x_i$ a position index

$$p_i = \begin{cases} p_{i-1} + 1, & x_i \text{ is a text token},\\ p_{i-1} + \delta, & x_i \text{ is a visual token},\end{cases}$$

with the fractional increment $\delta$ drawn from $\{1, \tfrac{1}{2}, \tfrac{1}{4}, \ldots, \tfrac{1}{256}\}$, chosen per image. This compresses the position space of visual tokens and preserves room for long text sequences within the same window. During inference, $\delta$ is adaptively tuned to fit window constraints. Visual position embeddings are computed as sinusoidal functions over the resulting fractional indices $p_i$. V2PE is applied directly in the model's self-attention and fusion layers, allowing robust processing of extended multimodal contexts (Zhu et al., 14 Apr 2025).
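A minimal sketch of the index recursion (the token-type flags and the per-image choice of $\delta$ are simplified assumptions):

```python
from typing import List

def v2pe_positions(is_visual: List[bool], delta: float = 0.25) -> List[float]:
    """Assign V2PE-style position indices: step +1 for text tokens,
    +delta (a fraction such as 1/4 or 1/256) for visual tokens."""
    positions, p = [], 0.0
    for visual in is_visual:
        p += delta if visual else 1.0
        positions.append(p)
    return positions

# 3 text tokens, 8 visual tokens, 2 text tokens:
# the visual run advances the position index by only 8 * 0.25 = 2.
print(v2pe_positions([False] * 3 + [True] * 8 + [False] * 2))
```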
3. Post-Training, Optimization, and Test-Time Strategies
InternVL3 applies advanced post-training strategies to augment base capabilities:
- Supervised Fine-Tuning (SFT): InternVL3 uses 21.7M curated, high-quality multimodal instruction samples, leveraging data augmentation (random JPEG, multimodal packing) to improve generalization.
- Mixed Preference Optimization (MPO): This combines three loss terms—preference (DPO), quality (BCO), and generation (next-token LM)—to optimize both preference alignment and task accuracy. For each model scale, gains range from +2.9 to +4.5 points in overall multimodal reasoning (Zhu et al., 14 Apr 2025).
- Test-Time Best-of-N and VisualPRM Critic: At inference, InternVL3 can generate candidate responses, scoring them with a pretrained Visual Process Reward Model (VisualPRM-8B). The final prediction is chosen as the highest-scoring chain. This procedure significantly improves mathematical and reasoning accuracy, e.g., up to +6 points on MathVerse for 38B-scale models.
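A schematic of the Best-of-N selection loop (the `generate` and `score_chain` callables stand in for the policy model and the VisualPRM-8B critic; the names are hypothetical):

```python
from typing import Callable, List, Tuple

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score_chain: Callable[[str, str], float],
              n: int = 8) -> Tuple[str, float]:
    """Sample n candidate reasoning chains and keep the one the
    process reward model scores highest."""
    candidates: List[Tuple[str, float]] = []
    for _ in range(n):
        chain = generate(prompt)                 # one sampled response chain
        candidates.append((chain, score_chain(prompt, chain)))
    return max(candidates, key=lambda c: c[1])   # (best chain, its score)
```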
Training and inference are orchestrated using the InternEVO framework, which extends ZeRO-style sharding to allow fully decoupled data, tensor, sequence, and pipeline parallelism. This supports context lengths up to 32,000 tokens and yields faster training than previous InternVL pipelines.
4. Empirical Performance Across Benchmarks
InternVL3 establishes competitive or best-in-class results across standard multimodal and language understanding benchmarks:
| Model | MMMU (%) | Overall multimodal avg. (%) | Notes |
|---|---|---|---|
| InternVL3-78B (open) | 72.2 | 54.6 | Best open-source |
| GPT-4o | 70.7 | 47.9 | Closed-source |
| Claude 3.5 Sonnet | 75.0 | 53.9 | Closed-source |
| Gemini 2.5 Pro | 69.9 | 58.5 | Closed-source; leads on the overall average |
Additionally, InternVL3-8B outperforms Qwen2.5-7B on several language benchmarks, e.g., reaching 88.4% on GSM8K (zero-shot), 89.0% on HumanEval (pass@1), and an overall language average of 78.9% (Zhu et al., 14 Apr 2025).
In scientific visual QA, InternVL3-8B attains 0.740 ROUGE-1/ROUGE-L and 0.983 BERTScore on the SciVQA 2025 test split, outperforming next-best single models (Bespoke, Qwen2.5-VL) by margins of +0.031 ROUGE-1 F1 and +0.004 BERTScore. Error analysis identifies strengths in cross-chart grounding and numerical precision, with weaknesses in subfigure scaling misalignments and logical errors on multi-hop questions (Movva et al., 8 Jul 2025).
5. Data, Preprocessing, and Prompt Engineering
Input preprocessing involves resizing figures to a maximum dimension of 1024 px, center-cropping or padding to a fixed target size, and contrast normalization for textual/numerical clarity. Prompt construction concatenates captions, metadata, and questions with special tokens:
```
<FIG_CAP> [caption] </FIG_CAP> <TYPE> [figure_type] </TYPE> <Q> [question] </Q>
```
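A rough rendering of the figure preprocessing described above (the 1024 px cap comes from the text; the square target size and contrast factor are assumptions):

```python
from PIL import Image, ImageEnhance

def preprocess_figure(path: str, max_dim: int = 1024, target: int = 448) -> Image.Image:
    """Resize so the longer side is at most max_dim, pad to a square canvas,
    and boost contrast slightly for text/number legibility."""
    img = Image.open(path).convert("RGB")

    # Cap the longer side at max_dim, preserving aspect ratio.
    scale = min(max_dim / max(img.size), 1.0)
    img = img.resize((round(img.width * scale), round(img.height * scale)))

    # Pad to a square canvas, then resize to the model's input resolution.
    canvas = Image.new("RGB", (max(img.size),) * 2, (255, 255, 255))
    canvas.paste(img, ((canvas.width - img.width) // 2, (canvas.height - img.height) // 2))
    canvas = canvas.resize((target, target))

    # Mild contrast normalization (the factor is an arbitrary illustrative choice).
    return ImageEnhance.Contrast(canvas).enhance(1.2)
```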
InternVL3’s high SciVQA performance depends heavily on prompt optimization and two-stage Chain-of-Thought (CoT) prompting: an “Initial Analysis” followed by structured answer extraction, using XML tags such as `<reasoning>...</reasoning>` and `<answer>...</answer>`. This approach enforces concise, context-minimal output (a numerical value or a single word/phrase) and elicits improved intermediate reasoning and verifiability (Movva et al., 8 Jul 2025).
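A sketch of how such a prompt might be assembled (the template follows the special tokens and XML tags shown above; the function and field names are placeholders):

```python
def build_scivqa_prompt(caption: str, figure_type: str, question: str) -> str:
    """Assemble the figure-QA prompt with the special tokens, then append
    the two-stage CoT instruction asking for <reasoning> and <answer> tags."""
    context = (
        f"<FIG_CAP> {caption} </FIG_CAP> "
        f"<TYPE> {figure_type} </TYPE> "
        f"<Q> {question} </Q>"
    )
    instruction = (
        "First give an initial analysis of the figure inside "
        "<reasoning>...</reasoning>, then output only the final numerical "
        "value or single word/phrase inside <answer>...</answer>."
    )
    return f"{context}\n{instruction}"

print(build_scivqa_prompt("Accuracy vs. model size.", "line chart",
                          "Which model size reaches 80% accuracy?"))
```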
6. Subsequent Extensions: InternVL3.5 and Beyond
InternVL3.5 introduces significant architectural and algorithmic extensions:
- Cascade RL: A two-stage reinforcement learning recipe (offline MPO followed by online GSPO) that yields roughly a +16% gain in overall reasoning accuracy over InternVL3, with monotonic improvements at every model scale.
- Visual Resolution Router (ViR): Enables dynamic per-patch compression (4× or 16×), mediated by a lightweight router head trained via cross-entropy, achieving ~50% reduction in visual tokens with under 1% performance loss.
- Decoupled Vision–Language Deployment (DvD): Partitioning inference between vision and language GPU servers increases throughput up to 4.05×, reducing serialization bottlenecks.
- Agentic/GUI Capabilities: InternVL3.5 supports embodied and GUI-interactive tasks, with advances in performance on ScreenSpot-v2, OSWorld-G, WebArena-Lite-v2, VSI-Bench, and more, notably closing the gap with GPT-5 and related proprietary models (Wang et al., 25 Aug 2025).
| Model | Multimodal avg., 35 benchmarks (%) | Reasoning avg., 9 (%) | Text avg., 8 (%) | Agentic avg., 6 (%) | Notes |
|---|---|---|---|---|---|
| InternVL3.5-241B-A28B (MoE) | 74.1 | 66.9 | 85.3 | 66.2 | Open source; matches or approaches GPT-5 |
| InternVL3-78B | 67.9 | 54.6 | 79.4 | ≈60.6 | |
| GPT-5 | 74.0 | 74.1 | 91.3 | 77.5 | Closed-source |
The introduction of Cascade RL, ViR, and DvD collectively enables both substantial reasoning improvement and real-world inference speedups, ushering in a new phase of openness and efficacy for MLLMs.
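To make the ViR idea concrete, here is a minimal sketch of per-patch compression routing; the router head, dimensions, and decision rule are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class VisualResolutionRouter(nn.Module):
    """Sketch of ViR-style routing: a small head predicts, per visual patch,
    whether 4x compression suffices or 16x can be used, trading tokens for detail."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.head = nn.Linear(dim, 2)   # logits over {4x, 16x} compression

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (num_patches, dim); returns the chosen compression rate per patch.
        logits = self.head(patch_feats)
        choice = logits.argmax(dim=-1)            # 0 -> keep 4x, 1 -> compress 16x
        return torch.where(choice == 0,
                           torch.full_like(choice, 4),
                           torch.full_like(choice, 16))

router = VisualResolutionRouter()
rates = router(torch.randn(256, 1024))
# Patches routed to 16x keep far fewer tokens; compressing low-detail regions
# more aggressively is what produces the reported ~50% average token reduction.
print((rates == 16).float().mean())
```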
7. Limitations and Failure Modes
Despite strong empirical results, InternVL3 exhibits several notable failure modes:
- Visual Misinterpretations: Particularly when subfigures differ in scale or when visual elements (text, marks) overlap.
- Numerical Misalignments: Manifesting as rounding errors or off-by-one mistakes under degraded visual resolution.
- Flawed Multi-Hop Reasoning: Errors when intermediate steps require domain-specific or non-visual reasoning.
Current model variants are not fine-tuned for specific downstream domains such as SciVQA; performance relies primarily on inference-time prompt engineering. This suggests opportunities for further gains through domain-adaptive fine-tuning or higher-resolution visual processing.
A plausible implication is that, as model and infrastructure scaling continue, practitioner emphasis may shift toward optimizing context adaptation (e.g., V2PE strategies), advanced preference-based learning, and efficient multimodal deployment frameworks.