
InternVL3: Open-Source Multimodal LLM

Updated 11 November 2025
  • InternVL3 is a family of open-source multimodal large language models that integrate vision and language through native, single-stage pretraining.
  • It employs an innovative ViT-MLP-LLM architecture with variable visual position encoding to efficiently align and process extended multimodal inputs.
  • Advanced techniques like mixed preference optimization and supervised fine-tuning boost its performance, setting new benchmarks against proprietary systems.

InternVL3 is a family of open-source multimodal LLMs (MLLMs) that unifies vision and language modeling within a native pre-training paradigm. Diverging from the incremental multimodal adaptation strategies of earlier models, InternVL3's architecture, training objectives, and optimization methods address the limitations of modular and post-hoc approaches. Notable for its single-stage multimodal pretraining, scalable backbone, advanced preference optimization, and variable visual position encoding (V2PE), InternVL3 establishes a state-of-the-art benchmark among open-source vision–language models, achieving competitive performance against leading proprietary systems across a range of multimodal, reasoning, and language tasks.

1. Model Architecture and Multimodal Pre-Training

InternVL3 implements a “ViT-MLP-LLM” backbone. The vision encoder (InternViT, 300M or 6B parameters) extracts patch embeddings from visual input, which are projected and merged with language tokens by a randomly initialized two-layer MLP. The language model is a pre-trained LLM such as Qwen2.5 or InternLM3. A pixel-unshuffle operation compresses each high-resolution 448×448 tile to 256 visual tokens. All modules (ViT, MLP, and LLM) are trainable from the outset of pre-training, enabling cohesive multimodal learning.
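
The connector and token-budget arithmetic can be made concrete with a short PyTorch sketch. It assumes a ViT patch size of 14 and a 2×2 pixel-unshuffle (so 448×448 → 1024 patches → 256 tokens); the class name and hidden widths are illustrative, not the released configuration.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Sketch of the pixel-unshuffle + two-layer MLP connector.

    Assumes ViT patch size 14 and a 2x2 pixel-unshuffle, so a 448x448
    tile yields (448/14)^2 = 1024 patches, reduced to 1024/4 = 256
    visual tokens whose channel width grows 4x before projection.
    """

    def __init__(self, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(vit_dim * 4),
            nn.Linear(vit_dim * 4, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, 1024, vit_dim) from a 448x448 tile (32x32 grid)
        b, n, c = patches.shape
        h = w = int(n ** 0.5)                      # 32 x 32
        x = patches.view(b, h, w, c)
        # 2x2 pixel-unshuffle: fold each 2x2 neighbourhood into channels
        x = x.view(b, h // 2, 2, w // 2, 2, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // 2) * (w // 2), 4 * c)
        return self.mlp(x)                         # (batch, 256, llm_dim)

tokens = VisualProjector()(torch.randn(1, 1024, 1024))
print(tokens.shape)  # torch.Size([1, 256, 4096])
```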

A training sample $\mathbf{x}=(x_1, \dots, x_L)$ interleaves text and visual tokens, with position indices $p_i$ provided by the V2PE scheme (see Section 2). The unified training procedure exposes InternVL3 to roughly 50B pure-text and 150B multimodal tokens in a single pre-training stage, with no intermediary text-only adaptation. The core autoregressive loss is

$$\mathcal{L}_{\mathrm{full}}(\theta) = -\sum_{i=2}^{L} w_i \, \log p_\theta(x_i \mid x_{<i}),$$

with $w_i = 1/\ell^{0.5}$ for sample length $\ell$, which counteracts sequence-length bias. The optimization objective is

$$\theta^* = \arg\min_\theta \, \mathbb{E}_{\mathbf{x}\sim\mathcal{D}_{\mathrm{multi}}} \left[\mathcal{L}_{\mathrm{full}}(\theta)\right].$$

This architecture allows InternVL3 to natively align cross-modal representations, resulting in improved multimodal reasoning proficiency (Zhu et al., 14 Apr 2025).
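
A minimal PyTorch sketch of the square-root length weighting follows, assuming padded targets are marked with -100 and that the batch is aggregated by a simple mean (the exact aggregation is an assumption, not taken from the paper).

```python
import torch
import torch.nn.functional as F

def length_weighted_lm_loss(logits: torch.Tensor,
                            targets: torch.Tensor,
                            lengths: torch.Tensor) -> torch.Tensor:
    """Square-root length-weighted autoregressive loss (sketch).

    logits:  (batch, seq_len, vocab) next-token predictions
    targets: (batch, seq_len) token ids, -100 on padded positions
    lengths: (batch,) number of supervised tokens per sample (ell)
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, ignore_index=-100, reduction="none"
    )                                               # (batch, seq_len)
    per_sample = per_token.sum(dim=1)               # sum_i -log p(x_i | x_<i)
    weights = lengths.clamp(min=1).float().rsqrt()  # w = 1 / ell**0.5
    return (weights * per_sample).mean()            # illustrative batch mean
```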

2. Variable Visual Position Encoding (V2PE) and Long-Context Processing

Integrating dense visual tokens alongside language tokens can rapidly exhaust a model’s context window under fixed-step positional encoding. InternVL3 addresses this with V2PE, a recursive scheme that assigns each token $x_i$ a position index

$$p_1 = 0, \qquad p_i = p_{i-1} + \begin{cases} 1 & \text{if } x_i \text{ is textual} \\ \delta & \text{if } x_i \text{ is visual}, \end{cases}$$

with $\delta < 1$ drawn from $\Delta = \{1, \tfrac{1}{2}, \tfrac{1}{4}, \dots, \tfrac{1}{256}\}$, chosen per image. This compresses the position space of visual tokens and preserves room for long text sequences within the same window. During inference, $\delta$ is adaptively tuned to fit window constraints. Visual position embeddings are computed as sinusoidal functions over $p_i$. V2PE enters the model’s self-attention and fusion layers directly, allowing robust processing of extended multimodal contexts (Zhu et al., 14 Apr 2025).
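
The recursion is simple enough to show directly. In the sketch below, delta = 1/4 is an arbitrary illustrative choice; during training it would be sampled from Δ once per image.

```python
from typing import List, Sequence

def v2pe_positions(is_visual: Sequence[bool], delta: float = 0.25) -> List[float]:
    """Assign V2PE position indices: text advances by 1, visual by delta < 1."""
    positions, p = [], 0.0
    for i, visual in enumerate(is_visual):
        if i > 0:
            p += delta if visual else 1.0
        positions.append(p)
    return positions

# 4 text tokens, 8 visual tokens, 2 text tokens; delta = 1/4 for illustration
mask = [False] * 4 + [True] * 8 + [False] * 2
print(v2pe_positions(mask))
# [0.0, 1.0, 2.0, 3.0, 3.25, 3.5, ..., 4.75, 5.0, 6.0, 7.0]
```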

3. Post-Training, Optimization, and Test-Time Strategies

InternVL3 applies advanced post-training strategies to augment base capabilities:

  • Supervised Fine-Tuning (SFT): InternVL3 uses ~21.7M curated, high-quality multimodal instruction samples, leveraging data augmentation (random JPEG compression, multimodal data packing) to improve generalization.
  • Mixed Preference Optimization (MPO): This combines three loss terms—preference (DPO), quality (BCO), and generation (next-token LM)—to optimize both preference alignment and task accuracy. For each model scale, gains range from +2.9 to +4.5 points in overall multimodal reasoning (Zhu et al., 14 Apr 2025).
  • Test-Time Best-of-N and VisualPRM Critic: At inference, InternVL3 can generate $N$ candidate responses and score them with a pretrained Visual Process Reward Model (VisualPRM-8B); the highest-scoring chain is returned as the final prediction (see the sketch after this list). This procedure significantly improves mathematical and reasoning accuracy, e.g., up to +6 points on MathVerse for 38B-scale models.
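
A minimal sketch of the Best-of-N selection loop follows; generate_fn and score_fn are hypothetical stand-ins for sampling from the policy model and scoring with VisualPRM-8B, and neither name comes from the released code.

```python
from typing import Callable

def best_of_n(prompt: str,
              generate_fn: Callable[[str], str],
              score_fn: Callable[[str, str], float],
              n: int = 8) -> str:
    """Test-time Best-of-N with a process reward model as critic (sketch)."""
    candidates = [generate_fn(prompt) for _ in range(n)]          # N sampled chains
    scored = [(score_fn(prompt, c), c) for c in candidates]       # critic scores
    return max(scored, key=lambda t: t[0])[1]                     # keep the best chain
```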

Training and inference are orchestrated with the InternEVO framework, which extends ZeRO-style sharding to allow fully decoupled data, tensor, sequence, and pipeline parallelism. This supports context lengths up to 32,000 tokens and yields 50–200% faster training compared to previous InternVL pipelines.

4. Empirical Performance Across Benchmarks

InternVL3 establishes competitive or best-in-class results across standard multimodal and language understanding benchmarks:

| Model | MMMU (%) | Overall Multi-Bench (%) | Notes |
|---|---|---|---|
| InternVL3-78B (open) | 72.2 | 54.6 | Best open-source |
| ChatGPT-4o | 70.7 | 47.9 | Closed-source |
| Claude 3.5 Sonnet | 75.0 | 53.9 | Closed-source |
| Gemini 2.5 Pro | 69.9 | 58.5 | Closed-source edge |

Additionally, InternVL3-8B outperforms Qwen2.5-7B on a range of language benchmarks, e.g., reaching 88.4% on GSM8K (zero-shot), 89.0% on HumanEval (pass@1), and an overall average of 78.9% (Zhu et al., 14 Apr 2025).

In scientific visual QA, InternVL3-8B attains 0.740 ROUGE-1/ROUGE-L and 0.983 BERTScore on the SciVQA 2025 test split, outperforming next-best single models (Bespoke, Qwen2.5-VL) by margins of +0.031 ROUGE-1 F1 and +0.004 BERTScore. Error analysis identifies strengths in cross-chart grounding and numerical precision, with weaknesses in subfigure scaling misalignments and logical errors on multi-hop questions (Movva et al., 8 Jul 2025).

5. Data, Preprocessing, and Prompt Engineering

Input preprocessing involves resizing figures to a maximum dimension of 1024 px, applying center-cropping/padding to 1024×1024, and contrast normalization for textual/numerical clarity. Prompt construction concatenates captions, metadata, and questions with special tokens:

<FIG_CAP> [caption] </FIG_CAP> <TYPE> [figure_type] </TYPE> <Q> [question] </Q>

Tokenization uses a SentencePiece vocabulary shared across modalities. Alignment between visual patches and textual tokens is maintained with cross-modal position embeddings.
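
A rough sketch of this preprocessing and prompt assembly, assuming Pillow for image handling: autocontrast stands in for the unspecified contrast-normalization step, and the function names are illustrative.

```python
from PIL import Image, ImageOps

def preprocess_figure(path: str, target: int = 1024) -> Image.Image:
    """Resize the longer side to `target`, center-pad to a square canvas,
    and apply autocontrast as a stand-in for contrast normalization."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((target, target), Image.LANCZOS)      # longest side -> 1024 px
    canvas = Image.new("RGB", (target, target), (255, 255, 255))
    canvas.paste(img, ((target - img.width) // 2,       # center the figure
                       (target - img.height) // 2))
    return ImageOps.autocontrast(canvas)

def build_prompt(caption: str, figure_type: str, question: str) -> str:
    """Assemble the special-token prompt template shown above."""
    return (f"<FIG_CAP> {caption} </FIG_CAP> "
            f"<TYPE> {figure_type} </TYPE> "
            f"<Q> {question} </Q>")
```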

InternVL3’s high SciVQA performance depends heavily on prompt optimization and two-stage Chain-of-Thought (CoT) prompting: an “Initial Analysis” followed by structured answer extraction, using XML tags such as <reasoning>...</reasoning> and <answer>...</answer>. This approach enforces concise, context-minimal output (a numerical value or a single word/phrase) and elicits improved intermediate reasoning and verifiability (Movva et al., 8 Jul 2025). A sketch of this prompting pattern follows.
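
In the sketch below, the instruction wording is paraphrased rather than the exact prompt used in the paper; only the answer-extraction step is shown in full.

```python
import re
from typing import Optional

# Paraphrased two-stage CoT instruction appended to the figure/question prompt
COT_SUFFIX = (
    "First give an Initial Analysis inside <reasoning>...</reasoning>, "
    "then give only the final numerical value or single word/phrase "
    "inside <answer>...</answer>."
)

def extract_answer(model_output: str) -> Optional[str]:
    """Return the tagged final answer, or None if the model omitted the tags."""
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    return match.group(1).strip() if match else None

# Usage: full_prompt = build_prompt(caption, figure_type, question) + " " + COT_SUFFIX
```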

6. Subsequent Extensions: InternVL3.5 and Beyond

InternVL3.5 introduces significant architectural and algorithmic extensions:

  • Cascade RL: Two-stage reinforcement learning (offline MPO, then online GSPO) adds +16% reasoning accuracy compared to InternVL3, with monotonic performance improvements at all model scales.
  • Visual Resolution Router (ViR): Enables dynamic per-patch compression (4× or 16×), mediated by a lightweight router head trained via cross-entropy, achieving roughly a 50% reduction in visual tokens with under 1% performance loss (a schematic sketch appears at the end of this section).
  • Decoupled Vision–Language Deployment (DvD): Partitioning inference between vision and language GPU servers increases throughput up to 4.05×, reducing serialization bottlenecks.
  • Agentic/GUI Capabilities: InternVL3.5 supports embodied and GUI-interactive tasks, with advances in performance on ScreenSpot-v2, OSWorld-G, WebArena-Lite-v2, VSI-Bench, and more, notably closing the gap with GPT-5 and related proprietary models (Wang et al., 25 Aug 2025).
| Model | Multimodal (35 bench.) | Reasoning (9) | Text (8) | Agentic (6) | Notes |
|---|---|---|---|---|---|
| InternVL3.5-241B-A28B (MoE) | 74.1 | 66.9 | 85.3 | 66.2 | Open, ties or nears GPT-5 |
| InternVL3-78B | 67.9 | 54.6 | 79.4 | ≈60.6 | |
| GPT-5 | 74.0 | 74.1 | 91.3 | 77.5 | Closed-source |

The introduction of Cascade RL, ViR, and DvD collectively enables both substantial reasoning improvement and real-world inference speedups, ushering in a new phase of openness and efficacy for MLLMs.
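
As a rough illustration of the ViR idea referenced above, the following hypothetical router head scores each visual token group and chooses between two compression ratios; the dimensions and the two-class setup are assumptions based on the description, not the released design.

```python
import torch
import torch.nn as nn

class ResolutionRouter(nn.Module):
    """Hypothetical ViR-style router: picks 4x vs. 16x compression per token group."""

    def __init__(self, dim: int = 4096, hidden: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, 2)
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, n_groups, dim) -> logits over {4x, 16x}
        return self.head(patch_feats)

router = ResolutionRouter()
logits = router(torch.randn(2, 64, 4096))
labels = torch.randint(0, 2, (2, 64))             # 0 = 4x, 1 = 16x routing targets
loss = nn.functional.cross_entropy(logits.transpose(1, 2), labels)
```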

7. Limitations and Failure Modes

Despite strong empirical results, InternVL3 exhibits several notable failure modes:

  • Visual Misinterpretations: Particularly when subfigures differ in scale or when visual elements (text, marks) overlap.
  • Numerical Misalignments: Manifesting as rounding errors or off-by-one mistakes under degraded visual resolution.
  • Flawed Multi-Hop Reasoning: Errors when intermediate steps require domain-specific or non-visual reasoning.

Current model variants are not fine-tuned for specific downstream domains such as SciVQA; performance relies primarily on inference-time prompt engineering. This suggests opportunities for further gains through domain-adaptive fine-tuning or higher-resolution visual processing.

A plausible implication is that, as model and infrastructure scaling continue, practitioner emphasis may shift toward optimizing context adaptation (e.g., V2PE strategies), advanced preference-based learning, and efficient multimodal deployment frameworks.
