InternVL-3.5: Open-Source Multimodal LLM
- InternVL-3.5 is an open-source multimodal LLM family that integrates Cascade RL, Visual Resolution Router, and Decoupled Vision-Language Deployment for efficient and scalable performance.
- It features a diverse range of model variants from 1B to 241B parameters using custom vision encoders and innovative token compression techniques to boost inference speed and reasoning accuracy.
- The advanced training regime, including offline MPO and online GSPO phases, enables significant performance gains, narrowing the gap with leading commercial models such as GPT-5.
InternVL-3.5 is a family of open-source multimodal LLMs (MLLMs) designed to advance versatility, reasoning capability, and inference efficiency within the InternVL series. It systematically integrates architectural innovations and novel training regimes to achieve state-of-the-art performance among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, substantially narrowing the performance gap with leading commercial models such as GPT-5. The suite spans dense and Mixture-of-Experts (MoE) variants from 1B to 241B parameters and introduces Cascade Reinforcement Learning (Cascade RL), Visual Resolution Router (ViR), and Decoupled Vision-Language Deployment (DvD) as key advances (Wang et al., 25 Aug 2025).
1. Architecture and Model Variants
InternVL-3.5 follows the “ViT–MLP–LLM” paradigm, coupling a custom vision encoder (InternViT-300M or InternViT-6B) with a language backbone from the Qwen3 series or GPT-OSS. Each image patch is initially represented as 1024 visual tokens, which are compressed to 256 tokens per patch and passed through the MLP projector before reaching the language model. Nine primary scales are released (dense: 1B, 2B, 4B, 8B, 14B, 38B; MoE: 20B-A4B, 30B-A3B, 241B-A28B).
InternVL3.5-Flash introduces the Visual Resolution Router (ViR), assigning each patch a compression rate (4× or 16×) based on a learned patch classifier. This allows efficient token utilization by routing semantically simple patches through higher compression without compromising accuracy.
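The token savings from routing can be illustrated with a little arithmetic. The sketch below assumes each patch starts as 1024 visual tokens, so a 4× rate leaves 256 tokens and a 16× rate leaves 64; the function name and the example routing pattern are hypothetical and chosen only for illustration.

```python
TOKENS_BEFORE_COMPRESSION = 1024  # assumed visual tokens per patch before compression


def tokens_after_routing(compression_factors):
    """Total visual tokens sent to the LLM, given one factor (4 or 16) per patch."""
    total = 0
    for factor in compression_factors:
        if factor not in (4, 16):
            raise ValueError(f"unsupported compression factor: {factor}")
        total += TOKENS_BEFORE_COMPRESSION // factor
    return total


# Six patches, all kept at the default 4x rate: 6 * 256 tokens.
baseline = tokens_after_routing([4] * 6)

# Same image, but the router sends four "simple" patches through 16x compression.
routed = tokens_after_routing([4, 4, 16, 16, 16, 16])

reduction = 1 - routed / baseline
print(baseline, routed, f"{reduction:.0%}")
```

In this made-up example, routing four of six patches to the higher rate halves the visual token count; the real proportion depends on what the learned router decides per image.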
A defining deployment feature is Decoupled Vision-Language Deployment (DvD), which assigns the vision (ViT+MLP+ViR) and language (LLM) stacks to different GPU servers. This arrangement uses an asynchronous three-stage pipeline (vision encoding, feature transmission, LLM prefill and decode), which mitigates computational blocking and maximizes hardware utilization.
2. Cascade Reinforcement Learning (Cascade RL)
InternVL-3.5’s Cascade RL implements a two-stage, coarse-to-fine RL scheme for robust and efficient reasoning improvements beyond supervised fine-tuning:
- Offline RL warm-up with Mixed Preference Optimization (MPO): This stage uses a weighted sum of a DPO preference loss ($\mathcal{L}_p$), a BCO quality loss ($\mathcal{L}_q$), and the standard next-token generation loss ($\mathcal{L}_g$):
$$\mathcal{L}_{\text{MPO}} = w_p \mathcal{L}_p + w_q \mathcal{L}_q + w_g \mathcal{L}_g$$
This stage produces stable, high-quality outputs independent of live environment interaction.
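- Online RL fine-tuning with Group Sequence Policy Optimization (GSPO): For each query $x$, the model samples $G$ responses $\{y_i\}_{i=1}^{G}$ from the current policy $\pi_{\theta_{\text{old}}}$ and calculates a group-normalized advantage from their rewards $r(x, y_i)$,
$$\hat{A}_i = \frac{r(x, y_i) - \operatorname{mean}\left(\{r(x, y_j)\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{r(x, y_j)\}_{j=1}^{G}\right)},$$
and minimizes
$$\mathcal{L}_{\text{GSPO}}(\theta) = -\,\mathbb{E}_{x,\{y_i\}}\left[\frac{1}{G}\sum_{i=1}^{G}\min\left(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\left(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_i\right)\right],$$
with the importance ratio, taken as the length-normalized (geometric-mean) sequence likelihood ratio:
$$s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}\right)^{1/|y_i|}.$$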
The staged approach ensures training stability and higher asymptotic performance at reduced GPU cost compared to pure online RL.
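To make the online stage concrete, here is a minimal pure-Python sketch of a GSPO-style loss for a single query. It assumes a length-normalized sequence-level importance ratio, a population standard deviation for advantage normalization, and a small `eps` guard against division by zero; all function names are hypothetical and this is not the authors' implementation.

```python
import math
from statistics import mean, pstdev


def normalized_advantages(rewards, eps=1e-8):
    """Group-normalized advantage: (r_i - mean) / (std + eps) over one query's rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_ratio(logp_new, logp_old):
    """Length-normalized importance ratio: exp of the mean per-token log-prob difference."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


def gspo_loss(rewards, logps_new, logps_old, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over the G rollouts of one query."""
    advantages = normalized_advantages(rewards)
    terms = []
    for adv, lp_new, lp_old in zip(advantages, logps_new, logps_old):
        s = sequence_ratio(lp_new, lp_old)
        s_clipped = min(max(s, 1 - clip_eps), 1 + clip_eps)
        terms.append(min(s * adv, s_clipped * adv))
    return -sum(terms) / len(terms)


# Two rollouts with per-token log-probs under the new and old policies.
loss = gspo_loss(
    rewards=[1.0, 0.0],
    logps_new=[[-1.0, -1.0], [-2.0, -2.0]],
    logps_old=[[-2.0, -2.0], [-2.0, -2.0]],
)
print(loss)
```

The first rollout's ratio is far above `1 + clip_eps`, so its contribution is capped. This capping is what limits how far a single update can move the policy.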
3. Visual Resolution Router (ViR) and Visual Consistency Learning
ViR enables adaptive visual token compression, implemented via a two-step Visual Consistency Learning (ViCO) protocol:
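- Consistency Training: Minimizes the KL divergence between the outputs of a reference model conditioned on high-resolution tokens and those of the trained model conditioned on randomly compressed tokens:
$$\mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}}\left[\frac{1}{N}\sum_{i=1}^{N} \mathrm{KL}\left(\pi_{\theta_{\text{ref}}}(y_i \mid y_{<i}, I)\ \big\|\ \pi_\theta(y_i \mid y_{<i}, I_\xi)\right)\right],$$
where $I_\xi$ denotes the image compressed at a rate $\xi$ sampled uniformly from $\{1/4, 1/16\}$.
- Router Training: The router, a binary classifier, is trained (with the primary model parameters frozen) on labels derived from the consistency loss ratio between the two compression rates:
$$r_i = \frac{\mathcal{L}_{\text{ViCO}}(y_i \mid I_{1/16})}{\mathcal{L}_{\text{ViCO}}(y_i \mid I_{1/4})}, \qquad y_i^{\text{router}} = \begin{cases} 0, & r_i < \tau \quad (\text{compress at } 16\times) \\ 1, & r_i \ge \tau \quad (\text{keep } 4\times) \end{cases}$$
where the threshold $\tau$ is dynamically tuned.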
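The router-labelling step can be sketched as follows. This assumes the consistency metric is the ratio of the loss under 16× compression to the loss under 4× compression, that the threshold is a percentile over a sliding window of recent ratios, and that label 1 means "keep the lower compression rate"; the class name, window size, and percentile are hypothetical.

```python
import math
from collections import deque


def percentile(values, k):
    """k-th percentile (0-100) using the nearest-rank method on a sorted copy."""
    ordered = sorted(values)
    rank = max(1, math.ceil(k / 100 * len(ordered)))
    return ordered[rank - 1]


class RouterLabeler:
    """Turns per-patch loss ratios into binary router targets with a dynamic threshold."""

    def __init__(self, window=1000, k=50):
        self.history = deque(maxlen=window)  # sliding window of recent loss ratios
        self.k = k

    def label_batch(self, losses_16x, losses_4x):
        ratios = [hi / lo for hi, lo in zip(losses_16x, losses_4x)]
        self.history.extend(ratios)
        tau = percentile(self.history, self.k)
        # 1: compression hurts this patch, keep 4x; 0: safe to compress at 16x.
        return [1 if r >= tau else 0 for r in ratios]


labeler = RouterLabeler(window=100, k=50)
labels = labeler.label_batch(
    losses_16x=[2.0, 2.2, 3.0, 6.0],
    losses_4x=[2.0, 2.0, 2.0, 2.0],
)
print(labels)
```

A percentile threshold keeps the share of patches sent to each rate roughly stable even as absolute loss values drift during training, which is one plausible reason for tuning it dynamically.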
ViR achieves ≈50% token reduction with negligible (<0.5%) loss in reasoning or understanding accuracy. Coupling ViR with DvD leads to a reported 4.05× end-to-end inference speedup (Wang et al., 25 Aug 2025).
4. Decoupled Vision-Language Deployment (DvD)
DvD addresses inefficiencies in traditional MLLM architectures that execute vision and language stacks sequentially on a single device, leading to resource under-utilization. DvD segregates the vision and language computation onto dedicated GPU servers, operating asynchronously:
- Vision server: Processes batched images through ViT+MLP(+ViR), yielding BF16-embedded features.
- Language server: Receives embeddings, prepends them to context, proceeds with autoregressive decoding.
The three-stage (vision→transmit→LLM) pipeline hides latency by overlapping vision and language computation, delivering up to 2× higher throughput with DvD alone and up to 3.5–4.0× when combined with ViR on 8×A100 GPUs (Table 6) (Wang et al., 25 Aug 2025).
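The overlap between stages can be illustrated with a toy asynchronous pipeline. This sketch runs in a single process with `asyncio` queues standing in for the two GPU servers and the network link; the sleep durations, stage names, and string "features" are placeholders, not the actual DvD implementation.

```python
import asyncio


async def vision_stage(images, to_transmit):
    for image_id in images:
        await asyncio.sleep(0.01)  # stand-in for ViT+MLP(+ViR) encoding
        await to_transmit.put((image_id, f"features[{image_id}]"))
    await to_transmit.put(None)  # sentinel: no more work


async def transmit_stage(to_transmit, to_llm):
    while (item := await to_transmit.get()) is not None:
        await asyncio.sleep(0.005)  # stand-in for feature transfer between servers
        await to_llm.put(item)
    await to_llm.put(None)


async def llm_stage(to_llm, results):
    while (item := await to_llm.get()) is not None:
        image_id, features = item
        await asyncio.sleep(0.01)  # stand-in for LLM prefill and decode
        results.append((image_id, f"answer from {features}"))


async def run_pipeline(images):
    to_transmit, to_llm, results = asyncio.Queue(), asyncio.Queue(), []
    # All three stages run concurrently, so image k+1 is encoded while image k is decoded.
    await asyncio.gather(
        vision_stage(images, to_transmit),
        transmit_stage(to_transmit, to_llm),
        llm_stage(to_llm, results),
    )
    return results


results = asyncio.run(run_pipeline(["img0", "img1", "img2"]))
print(results)
```

Because each stage only waits on its input queue, no stage idles while another is busy with a different request, which is the effect the decoupled deployment aims for.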
5. Empirical Performance
InternVL-3.5 establishes state-of-the-art results among open-source MLLMs and closely approaches commercial leaders on multimodal, reasoning, and agentic tasks. Table highlights (8B, unless noted):
| Capability / Task | InternVL3.5 | Previous Best / Comparison | Change |
|---|---|---|---|
| MMMU reasoning | 73.4 % | InternVL3 62.7 % | +10.7 pts |
| MathVista | 78.4 % | InternVL3 71.6 % | +6.8 pts |
| Overall reasoning accuracy | 60.3 % | InternVL3 44.3 % | +16.0 pts |
| Inference speed (896px) | 10.97 req/s | InternVL3 2.71 req/s | ×4.05 |
| GUI grounding (241B) | 92.9 % | Seed1.5-VL 95.2 % | – |
| Embodied (VSI-Bench, 241B) | 69.5 % | GPT-5 65.7 % | – |
The top-tier model (InternVL3.5-241B-A28B) is roughly on par with GPT-5 in overall general multimodal performance (74.1% vs 74.0%) but still trails it on reasoning (66.9% vs 74.1%). Incremental advances are also reported on text tasks (+6.7 points at 0.6B scale, +2.3 at 241B) (Wang et al., 25 Aug 2025).
6. Training Regime and Technical Infrastructure
Pre-training uses 116M samples (~250B tokens) at a text:multimodal ratio of 1:2.5 (32k max sequence length, square-averaged loss). Supervised fine-tuning employs 56M samples (~130B tokens) and a 1:3.5 ratio, with large context windows and a curriculum of “Thinking” multi-step reasoning, GUI, embodied, and SVG datasets.
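The source does not define the "square-averaged loss". One common reading, assumed here, is that each token's loss is weighted by 1/sqrt(n), where n is the number of loss-bearing tokens in its sample. That weighting sits between plain token averaging and per-sample averaging. The function below is a hypothetical sketch of that reading only.

```python
import math


def square_averaged_loss(token_losses_per_sample):
    """Weighted mean of token losses with per-token weight 1/sqrt(sample length)."""
    weighted_sum = 0.0
    weight_total = 0.0
    for losses in token_losses_per_sample:
        w = 1.0 / math.sqrt(len(losses))  # assumed square-averaging weight
        weighted_sum += w * sum(losses)
        weight_total += w * len(losses)
    return weighted_sum / weight_total


# One long sample (4 tokens, loss 1.0 each) and one short sample (1 token, loss 3.0).
batch = [[1.0, 1.0, 1.0, 1.0], [3.0]]
print(square_averaged_loss(batch))
```

For this batch, plain token averaging would give 1.4 and per-sample averaging 2.0; the square-averaged value falls between them, so long responses neither dominate the gradient nor get discounted to the weight of a single short one.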
Cascade RL is implemented as offline MPO (200k MMPR-v1.2 preference pairs, with per-model weighting) and online GSPO (70k MMPR-Tiny queries, ε=0.2, G≈8 rollouts/query). ViCO employs uniformly sampled compression factors and router thresholds dynamically set via sliding-window percentiles.
Scalable training and inference are supported by FSDP sharding, FP8 arithmetic, FlashAttention-3, Triton MoE kernels, and asynchronous RDMA/TCP for DvD.
7. Distinctive Capabilities and Implications
InternVL-3.5 expands MLLM capabilities to include fine-grained GUI interaction (e.g., ScreenSpot-v2, WindowsAgentArena), SVG document understanding (SGP-Bench), and embodied agentic reasoning (VSI-Bench). Its advances in reasoning, efficiency, and hardware scaling place it among the strongest open-source results to date for multimodal intelligence. A plausible implication is that the combination of Cascade RL, ViR, and DvD will serve as a blueprint for future multimodal model design as the field pursues scaling, efficiency, and real-world agency together (Wang et al., 25 Aug 2025).