
InternVL 3.5: Advancing Multimodal AI

Updated 26 August 2025
  • The paper introduces a Cascade Reinforcement Learning framework that optimizes multimodal reasoning through staged offline and online RL with combined losses.
  • InternVL 3.5 employs a Visual Resolution Router that dynamically compresses image tokens by roughly 50% while maintaining benchmark performance.
  • The Decoupled Vision-Language Deployment strategy separates processing on dedicated GPUs to achieve up to 4.05× inference speedup and efficient resource use.

InternVL 3.5 is an open-source family of multimodal models that advances the state of vision-language understanding, reasoning, and practical deployment. Building on the InternVL lineage, InternVL 3.5 introduces several key architectural and algorithmic innovations designed to improve reasoning capability, versatility across tasks, and computational efficiency, while maintaining competitive performance with leading commercial models.

1. Cascade Reinforcement Learning Framework

A central contribution in InternVL 3.5 is the Cascade Reinforcement Learning (Cascade RL) training framework, which enhances the model’s reasoning capability for downstream multimodal tasks. Cascade RL operates in two stages:

  • Offline RL: The model undergoes reinforcement optimization using pre-collected offline rollouts. The objective combines preference, quality, and generation losses:

\mathcal{L}_{\rm MPO} = w_p\,\mathcal{L}_p + w_q\,\mathcal{L}_q + w_g\,\mathcal{L}_g

where $w_p$, $w_q$, and $w_g$ are balancing weights for the preference, quality, and generation losses. This phase yields stable convergence and robust reasoning behavior.

  • Online RL (GSPO algorithm): After offline RL, the model refines its policy using online rollouts and normalized rewards. For each query $x$, candidate responses $y_i$ are assigned normalized rewards:

\hat{A}_i = \frac{r(x, y_i) - \mathrm{mean}(\{r(x, y_j)\})}{\mathrm{std}(\{r(x, y_j)\})}

The GSPO loss to be minimized is:

\mathcal{L}_{\rm GSPO}(\theta) = \mathbb{E}_{x,\{y_i\}}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\bigl(s_i(\theta)\hat{A}_i,\ \mathrm{clip}(s_i(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_i\bigr) \right]

where $s_i(\theta)$ are importance weights derived from policy ratios. This staged RL approach enables InternVL 3.5 to prune low-quality outputs and substantially improve reasoning performance on benchmarks such as MMMU and MathVista.
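The two objectives above can be illustrated with a minimal numerical sketch. Everything here is a stand-in: scalar per-sample losses, illustrative loss weights, and a fixed clip range are assumed for demonstration and are not the paper's actual settings.

```python
import math

def mpo_loss(l_pref, l_quality, l_gen, w_p=0.5, w_q=0.25, w_g=0.25):
    """Offline-RL MPO loss: weighted sum of preference, quality,
    and generation losses. Weight values here are illustrative."""
    return w_p * l_pref + w_q * l_quality + w_g * l_gen

def normalized_advantages(rewards):
    """Group-normalize rewards r(x, y_i) into advantages A_hat_i
    by subtracting the group mean and dividing by the group std."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def gspo_objective(ratios, advantages, eps=0.2):
    """Clipped surrogate: average over the group of
    min(s_i * A_i, clip(s_i, 1-eps, 1+eps) * A_i)."""
    def clipped(s):
        return max(min(s, 1 + eps), 1 - eps)
    terms = [min(s * a, clipped(s) * a) for s, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# Example: combine offline losses, then score four sampled responses.
loss = mpo_loss(l_pref=0.8, l_quality=0.4, l_gen=1.2)
adv = normalized_advantages([1.0, 0.0, 0.5, 0.5])
obj = gspo_objective([1.1, 0.9, 1.0, 1.0], adv)
```

Note how the clip term caps the contribution of any response whose importance weight drifts far from 1, the same stabilizing mechanism used in PPO-style objectives.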

2. Visual Resolution Router (ViR) for Dynamic Token Compression

To address the computational overhead in multimodal models, InternVL 3.5 introduces the Visual Resolution Router (ViR) module. ViR dynamically determines the compression rate for each image patch:

  • Routing Decision: For each patch, ViR evaluates whether semantic content allows for high compression (down to 64 tokens) or requires lower compression (256 tokens).
  • Impact: This adaptive strategy reduces token counts by roughly 50% without measurable loss in benchmark performance across document and OCR domains.
  • Efficiency Gains: ViR directly accelerates inference, allowing InternVL 3.5 to process high-resolution inputs with nearly linear speedup relative to token reduction.
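The routing logic can be sketched as below. In the actual model the router is a trained module; the scalar "semantic score" and fixed threshold here are simplifying assumptions used only to show how per-patch budgets of 64 or 256 tokens combine into a reduced total.

```python
def route_patch(semantic_score, threshold=0.5):
    """Pick a token budget for one image patch.

    `semantic_score` stands in for ViR's learned routing signal
    (roughly, how much fine detail or text the patch contains);
    the real router is trained, not a fixed threshold.
    """
    return 256 if semantic_score >= threshold else 64

def total_tokens(scores, threshold=0.5):
    """Total token count for an image given per-patch scores."""
    return sum(route_patch(s, threshold) for s in scores)

# Example: one detail-rich patch, two simple patches.
budget = total_tokens([0.9, 0.1, 0.2])   # 256 + 64 + 64 = 384
baseline = 256 * 3                        # uncompressed budget: 768
```

In this toy example the routed budget is half the uncompressed one, mirroring the roughly 50% token reduction reported for ViR; the exact ratio in practice depends on how many patches the router deems compressible.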

3. Decoupled Vision-Language Deployment (DvD) Strategy

InternVL 3.5 implements a Decoupled Vision-Language Deployment (DvD) method to optimize hardware resource utilization:

  • Separation of Modules: The vision encoder and associated modules (pixel shuffle, ViR) run on dedicated "vision server" GPUs, and the LLM operates on separate "language server" GPUs.
  • Batching and Parallelization: Visual embeddings are computed in parallel and transmitted in compressed format (BF16) to the language server, which asynchronously processes autoregressive decoding.
  • Latency and Throughput: By overlapping vision and language computations, DvD alone achieves nearly 2× throughput, and up to 4.05× speedup when paired with ViR.
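The decoupling can be sketched as a producer-consumer pipeline: a "vision server" thread encodes images while a "language server" thread consumes embeddings concurrently, so the two stages overlap instead of running back to back. The encode/decode functions are placeholders for the real servers, and a single in-process queue stands in for the compressed BF16 transfer between GPUs.

```python
import queue
import threading

def encode_image(img):
    """Placeholder for the vision server (ViT + pixel shuffle + ViR)."""
    return f"emb({img})"

def decode_with(emb):
    """Placeholder for the language server's autoregressive decoding."""
    return f"text<-{emb}"

def run_pipeline(images):
    """Run vision encoding and language decoding in overlapping threads."""
    embeddings = queue.Queue()
    outputs = []

    def vision_server():
        for img in images:
            embeddings.put(encode_image(img))
        embeddings.put(None)  # sentinel: no more work

    def language_server():
        while (emb := embeddings.get()) is not None:
            outputs.append(decode_with(emb))

    v = threading.Thread(target=vision_server)
    l = threading.Thread(target=language_server)
    v.start(); l.start()
    v.join(); l.join()
    return outputs
```

With a single producer and single consumer the queue preserves request order, while the language server can start decoding the first image before the vision server has finished encoding the rest, which is the source of DvD's throughput gain.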

4. Reasoning Performance and Inference Efficiency

InternVL 3.5 demonstrates state-of-the-art multimodal reasoning metrics and superior inference speed:

| Model/Variant | Reasoning Improvement over InternVL3 | Inference Speedup |
|---|---|---|
| InternVL3.5-8B | +16.0% (MMMU/MathVista) | Up to 4.05× |
| InternVL3.5-241B-A28B | +16.0% (general tasks) | Up to 4.05× |

The combination of Cascade RL, ViR, and DvD delivers substantial quantitative gains not only in overall performance (e.g., >16% on reasoning tasks) but also in computational efficiency, supporting real-time deployment at scale.

5. Novel Capabilities: GUI Interaction and Embodied Agency

Beyond traditional vision-language tasks, InternVL 3.5 is architected and trained for new domains:

  • GUI Grounding and Interaction: Specialized training enables grounding of graphical user interface elements, supporting automated control and instruction-following in interactive environments.
  • Embodied Agency: The model demonstrates robust spatial reasoning in dynamic and multi-modal contexts, paving the way for applications in robotic perception and interactive agents.

This extension of capabilities is realized through targeted datasets and model adaptation, expanding the utility of InternVL 3.5 to embodied, agentic tasks.

6. Comparative Analysis with Leading Commercial Models

The largest InternVL3.5 configuration (241B-A28B) positions itself among the most competitive open-source MLLMs:

  • Benchmark Proximity: On comprehensive tasks, the model achieves scores within 3.9% of GPT-5—bridging the gap between open-source and state-of-the-art commercial systems.
  • Task Breadth: Performance covers general multimodal understanding, advanced reasoning, pure language tasks, and agentic interaction, reflecting broad applicability.

A plausible implication is that open-source models with modular training and deployment strategies, such as InternVL3.5, are now approaching parity with leading proprietary MLLMs.

7. Open Source Release and Research Directions

InternVL 3.5 is publicly released with all supporting code and model weights, promoting transparency and reproducibility. Its combination of efficient RL-based training, dynamic token compression (ViR), and hardware-optimized deployment (DvD) suggests promising future directions:

  • Fine-grained token and computation routing for further speedups
  • Extension to multimodal control and interactive agency domains
  • Continued narrowing of the performance gap with closed-source systems via algorithmic innovation and data scaling

InternVL 3.5 marks an advance in the design, reasoning capacity, and efficiency of large multimodal models and sets new standards for open-source AI frameworks.
