
InternVL 3.5: Advancing Multimodal AI

Updated 26 August 2025
  • The paper introduces a Cascade Reinforcement Learning framework that optimizes multimodal reasoning through staged offline and online RL with combined losses.
  • InternVL 3.5 employs a Visual Resolution Router that dynamically compresses image tokens by roughly 50% while maintaining benchmark performance.
  • The Decoupled Vision-Language Deployment strategy separates processing on dedicated GPUs to achieve up to 4.05× inference speedup and efficient resource use.

InternVL 3.5 is an open-source family of multimodal models that advances the state of vision-language understanding, reasoning, and practical deployment. Building on the InternVL lineage, InternVL 3.5 introduces several key architectural and algorithmic innovations designed to improve reasoning capability, versatility across tasks, and computational efficiency, while maintaining competitive performance with leading commercial models.

1. Cascade Reinforcement Learning Framework

A central contribution in InternVL 3.5 is the Cascade Reinforcement Learning (Cascade RL) training framework, which enhances the model’s reasoning capability for downstream multimodal tasks. Cascade RL operates in two stages:

  • Offline RL: The model undergoes reinforcement optimization using pre-collected offline rollouts. The objective combines preference, quality, and generation losses:

\mathcal{L}_{\rm MPO} = w_p\,\mathcal{L}_p + w_q\,\mathcal{L}_q + w_g\,\mathcal{L}_g

where $w_p$, $w_q$, and $w_g$ are balancing weights for the preference, quality, and generation losses. This phase yields stable convergence and robust reasoning behavior.

  • Online RL (GSPO algorithm): After offline RL, the model refines its policy using online rollouts and normalized rewards. For each query $x$, candidate responses $y_i$ are assigned normalized rewards:

\hat{A}_i = \frac{r(x, y_i) - \mathrm{mean}(\{r(x, y_j)\})}{\mathrm{std}(\{r(x, y_j)\})}

The GSPO loss to be minimized is:

\mathcal{L}_{\rm GSPO}(\theta) = \mathbb{E}_{x,\{y_i\}}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\bigl(s_i(\theta)\hat{A}_i,\ \mathrm{clip}(s_i(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_i\bigr) \right]

where $s_i(\theta)$ are importance weights derived from policy ratios. This staged RL approach enables InternVL 3.5 to prune low-quality outputs and substantially improve reasoning performance on benchmarks such as MMMU and MathVista.
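The two objectives above can be illustrated with a minimal numerical sketch. Everything here is a stand-in: scalar per-sample losses, illustrative loss weights, and a fixed clip range are assumed for demonstration and are not the paper's actual settings.

```python
import math

def mpo_loss(l_pref, l_quality, l_gen, w_p=0.5, w_q=0.25, w_g=0.25):
    """Offline-RL MPO loss: weighted sum of preference, quality,
    and generation losses. Weight values here are illustrative."""
    return w_p * l_pref + w_q * l_quality + w_g * l_gen

def normalized_advantages(rewards):
    """Group-normalize rewards r(x, y_i) into advantages A_hat_i
    by subtracting the group mean and dividing by the group std."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def gspo_objective(ratios, advantages, eps=0.2):
    """Clipped surrogate: average over the group of
    min(s_i * A_i, clip(s_i, 1-eps, 1+eps) * A_i)."""
    def clipped(s):
        return max(min(s, 1 + eps), 1 - eps)
    terms = [min(s * a, clipped(s) * a) for s, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# Example: combine offline losses, then score four sampled responses.
loss = mpo_loss(l_pref=0.8, l_quality=0.4, l_gen=1.2)
adv = normalized_advantages([1.0, 0.0, 0.5, 0.5])
obj = gspo_objective([1.1, 0.9, 1.0, 1.0], adv)
```

Note how the clip term caps the contribution of any response whose importance weight drifts far from 1, the same stabilizing mechanism used in PPO-style objectives.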

2. Visual Resolution Router (ViR) for Dynamic Token Compression

To address the computational overhead in multimodal models, InternVL 3.5 introduces the Visual Resolution Router (ViR) module. ViR dynamically determines the compression rate for each image patch:

  • Routing Decision: For each patch, ViR evaluates whether semantic content allows for high compression (down to 64 tokens) or requires lower compression (256 tokens).
  • Impact: This adaptive strategy reduces token counts by roughly 50% without measurable loss in benchmark performance across document and OCR domains.
  • Efficiency Gains: ViR directly accelerates inference, allowing InternVL 3.5 to process high-resolution inputs with nearly linear speedup relative to token reduction.
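The routing logic can be sketched as below. In the actual model the router is a trained module; the scalar "semantic score" and fixed threshold here are simplifying assumptions used only to show how per-patch budgets of 64 or 256 tokens combine into a reduced total.

```python
def route_patch(semantic_score, threshold=0.5):
    """Pick a token budget for one image patch.

    `semantic_score` stands in for ViR's learned routing signal
    (roughly, how much fine detail or text the patch contains);
    the real router is trained, not a fixed threshold.
    """
    return 256 if semantic_score >= threshold else 64

def total_tokens(scores, threshold=0.5):
    """Total token count for an image given per-patch scores."""
    return sum(route_patch(s, threshold) for s in scores)

# Example: one detail-rich patch, two simple patches.
budget = total_tokens([0.9, 0.1, 0.2])   # 256 + 64 + 64 = 384
baseline = 256 * 3                        # uncompressed budget: 768
```

In this toy example the routed budget is half the uncompressed one, mirroring the roughly 50% token reduction reported for ViR; the exact ratio in practice depends on how many patches the router deems compressible.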

3. Decoupled Vision-Language Deployment (DvD) Strategy

InternVL 3.5 implements a Decoupled Vision-Language Deployment (DvD) method to optimize hardware resource utilization:

  • Separation of Modules: The vision encoder and associated modules (pixel shuffle, ViR) run on dedicated "vision server" GPUs, and the LLM operates on separate "language server" GPUs.
  • Batching and Parallelization: Visual embeddings are computed in parallel and transmitted in compressed format (BF16) to the language server, which asynchronously processes autoregressive decoding.
  • Latency and Throughput: By overlapping vision and language computations, DvD alone achieves nearly 2× throughput, and up to 4.05× speedup when paired with ViR.
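The decoupling can be sketched as a producer-consumer pipeline: a "vision server" thread encodes images while a "language server" thread consumes embeddings concurrently, so the two stages overlap instead of running back to back. The encode/decode functions are placeholders for the real servers, and a single in-process queue stands in for the compressed BF16 transfer between GPUs.

```python
import queue
import threading

def encode_image(img):
    """Placeholder for the vision server (ViT + pixel shuffle + ViR)."""
    return f"emb({img})"

def decode_with(emb):
    """Placeholder for the language server's autoregressive decoding."""
    return f"text<-{emb}"

def run_pipeline(images):
    """Run vision encoding and language decoding in overlapping threads."""
    embeddings = queue.Queue()
    outputs = []

    def vision_server():
        for img in images:
            embeddings.put(encode_image(img))
        embeddings.put(None)  # sentinel: no more work

    def language_server():
        while (emb := embeddings.get()) is not None:
            outputs.append(decode_with(emb))

    v = threading.Thread(target=vision_server)
    l = threading.Thread(target=language_server)
    v.start(); l.start()
    v.join(); l.join()
    return outputs
```

With a single producer and single consumer the queue preserves request order, while the language server can start decoding the first image before the vision server has finished encoding the rest, which is the source of DvD's throughput gain.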

4. Reasoning Performance and Inference Efficiency

InternVL 3.5 demonstrates state-of-the-art multimodal reasoning metrics and superior inference speed:

| Model/Variant | Reasoning Improvement over InternVL3 | Inference Speedup |
|---|---|---|
| InternVL3.5-8B | +16.0% (MMMU/MathVista) | Up to 4.05× |
| InternVL3.5-241B-A28B | +16.0% (general tasks) | Up to 4.05× |

The combination of Cascade RL, ViR, and DvD delivers substantial quantitative gains not only in overall performance (e.g., >16% on reasoning tasks) but also in computational efficiency, supporting real-time deployment at scale.

5. Novel Capabilities: GUI Interaction and Embodied Agency

Beyond traditional vision-language tasks, InternVL 3.5 is architected and trained for new domains:

  • GUI Grounding and Interaction: Specialized training enables grounding of graphical user interface elements, supporting automated control and instruction-following in interactive environments.
  • Embodied Agency: The model demonstrates robust spatial reasoning in dynamic and multi-modal contexts, paving the way for applications in robotic perception and interactive agents.

This extension of capabilities is realized through targeted datasets and model adaptation, expanding the utility of InternVL 3.5 to embodied, agentic tasks.

6. Comparative Analysis with Leading Commercial Models

The largest InternVL3.5 configuration (241B-A28B) positions itself among the most competitive open-source MLLMs:

  • Benchmark Proximity: On comprehensive tasks, the model achieves scores within 3.9% of GPT-5—bridging the gap between open-source and state-of-the-art commercial systems.
  • Task Breadth: Performance covers general multimodal understanding, advanced reasoning, pure language tasks, and agentic interaction, reflecting broad applicability.

A plausible implication is that open-source models with modular training and deployment strategies, such as InternVL3.5, are now approaching parity with leading proprietary MLLMs.

7. Open Source Release and Research Directions

InternVL 3.5 is publicly released with all supporting code and model weights, promoting transparency and reproducibility. Its combination of efficient RL-based training, dynamic token compression (ViR), and hardware-optimized deployment (DvD) suggests promising future directions:

  • Fine-grained token and computation routing for further speedups
  • Extension to multimodal control and interactive agency domains
  • Continued narrowing of the performance gap with closed-source systems via algorithmic innovation and data scaling

InternVL 3.5 marks an advance in the design, reasoning capacity, and efficiency of large multimodal models and sets new standards for open-source AI frameworks.
