InternVL3.5: Open-Source Multimodal AI
- InternVL3.5 is a family of open-source multimodal large language models offering enhanced reasoning accuracy, efficiency, and versatility through innovations in training, architecture, and deployment.
- It employs a Cascade Reinforcement Learning framework with a two-stage (offline MPO and online GSPO) training regimen, achieving up to a 16% gain in reasoning tasks.
- Its Visual Resolution Router and Decoupled Vision-Language Deployment boost inference speed by up to 4.05× and throughput by up to 2× while supporting diverse agentic capabilities.
InternVL3.5 is a family of open-source multimodal LLMs (MLLMs) designed to advance versatility, reasoning accuracy, and inference efficiency within the InternVL series. Combining architectural, algorithmic, and systems-level innovations, InternVL3.5 achieves substantial gains over prior open-source models and demonstrates capabilities competitive with leading proprietary models.
1. Cascade Reinforcement Learning Framework
The Cascade Reinforcement Learning (Cascade RL) framework is central to InternVL3.5’s improvements in reasoning ability. Cascade RL employs a two-stage, coarse-to-fine training regimen:
- Stage 1: Offline Reinforcement Learning leverages mixed preference optimization (MPO), where the overall training objective is a weighted sum of preference loss, quality loss, and generation loss. This approach efficiently generates high-quality rollouts and stabilizes initial convergence.
- Stage 2: Online Reinforcement Learning uses GSPO, which iteratively refines model outputs by applying advantages normalized across groups of candidate responses (as sketched below). This online alignment sharpens reasoning through refinement of the model's own freshly generated outputs.
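The group-relative advantage computation at the heart of the online stage can be illustrated with a short sketch. This is a minimal, illustrative rendering of the normalization step that GSPO-style online RL builds on, not the released training code; all names are placeholders.

```python
import numpy as np

def group_normalized_advantages(rewards):
    """Group-relative advantages for one prompt's candidate responses.

    Each candidate's reward is normalized against the group mean and
    standard deviation, so candidates better than their siblings get
    positive advantages and worse ones get negative advantages.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: four sampled responses to one prompt, scored by a verifier.
print(group_normalized_advantages([1.0, 0.0, 1.0, 0.5]))
```

These advantages then weight the policy update, shifting the model's output distribution toward its own higher-scoring rollouts.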
This framework combines rapid filtering of suboptimal outputs in the offline stage with iterative refinement of the output distribution in the online stage, yielding gains of up to +16.0% on downstream reasoning benchmarks such as MMMU and MathVista, even at smaller parameter scales. Cascade RL is particularly effective at boosting reasoning in settings where conventional instruction-only or MPO-only finetuning proves insufficient.
2. Efficient Visual Token Processing via Visual Resolution Routing
To address the computational demands of high-resolution visual inputs, InternVL3.5 integrates the Visual Resolution Router (ViR):
- Dynamic Token Compression: ViR adaptively selects the number of visual tokens for each image patch according to semantic importance. Less informative patches are compressed aggressively (to as few as 64 tokens), while richer regions retain a more detailed representation (up to 256 tokens).
- Two-Stage Training: Consistency training first ensures that model outputs remain stable across compression levels; a binary router classifier, trained with cross-entropy loss on labels derived from per-patch loss changes, then decides how aggressively each patch is compressed.
This approach preserves nearly 100% task performance while reducing visual token counts by up to 50%, enabling a 4.05× speedup in inference relative to the preceding InternVL3.
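A minimal sketch of what such a router could look like is given below, assuming a lightweight binary head over pooled per-patch features. The class name, dimensions, and token budgets (64 vs. 256) follow the description above, but this is an illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class VisualResolutionRouter(nn.Module):
    """Illustrative per-patch compression router (hypothetical)."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        # Binary head trained with cross-entropy against labels derived
        # from per-patch loss changes under compression.
        self.classifier = nn.Linear(dim, 2)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (num_patches, dim) pooled per-patch features.
        logits = self.classifier(patch_embeddings)
        compress = logits.argmax(dim=-1)  # 1 -> safe to compress
        return torch.where(compress == 1,
                           torch.full_like(compress, 64),
                           torch.full_like(compress, 256))

router = VisualResolutionRouter()
tokens_per_patch = router(torch.randn(12, 1024))  # e.g., 12 patches
```

At inference time, the per-patch budgets decide how many visual tokens each patch contributes to the LLM's context, which is where the token-count reduction comes from.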
3. Decoupled Vision-Language Deployment Architecture
InternVL3.5 introduces Decoupled Vision-Language Deployment (DvD), a systems-level strategy that separates the vision encoder and LLM onto different hardware resources (e.g., separate GPUs or servers):
- Vision Encoder: Processes image batches in a parallelizable, low-latency fashion, generating compact visual embeddings.
- LLM: Receives embeddings via high-bandwidth transports (e.g., RDMA) and performs autoregressive decoding independently of vision-processing latency.
This decoupling yields up to 2× throughput improvement for complex visual reasoning tasks, particularly in scenarios involving high-resolution or multi-image inputs. The separation of batch-parallel vision inference and sequential language generation reduces queuing delays and optimizes resource utilization.
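The pipeline shape can be sketched with two workers and a queue standing in for the network transport; the encode/decode stubs and timings below are placeholders, not InternVL3.5's actual serving stack.

```python
import queue
import threading
import time

embedding_queue: queue.Queue = queue.Queue(maxsize=8)

def encode_images(batch):          # placeholder for the ViT forward pass
    time.sleep(0.01)               # batch-parallel, low-latency encoding
    return f"embeddings({batch})"

def decode_with_llm(embeddings):   # placeholder for autoregressive decoding
    time.sleep(0.05)               # sequential generation dominates latency
    return f"answer from {embeddings}"

def vision_worker(batches):
    for batch in batches:
        embedding_queue.put(encode_images(batch))  # ship embeddings onward

def language_worker(n):
    for _ in range(n):
        print(decode_with_llm(embedding_queue.get()))

batches = [f"batch{i}" for i in range(4)]
t = threading.Thread(target=vision_worker, args=(batches,))
t.start()
language_worker(len(batches))
t.join()
```

Because the vision worker keeps the queue filled while the language worker decodes, neither stage idles waiting for the other, which is the source of the throughput gain.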
4. Model Versatility and Novel Agentic Capabilities
InternVL3.5 extends functionality beyond conventional image-text tasks:
- GUI Interaction: Model variants are trained on data targeting graphical user interface elements, enabling them to comprehend and generate GUI instructions for automated interface navigation and manipulation.
- Embodied Agency: Model objectives and datasets accommodate agentic reasoning tasks, such as spatial navigation within virtual environments or desktop software.
- SVG Manipulation: Support for scalable vector graphics understanding and generation enables new applications in web automation and programmatic graphical design.
A plausible implication is that these capabilities position InternVL3.5 for robotics, digital assistants, and web agent workflows.
5. Performance Metrics and Benchmark Results
Quantitative assessments demonstrate significant improvements:
| Model Variant | Parameters | Reasoning Gain vs. Prior | Inference Speedup | Benchmark Examples | Notable Results |
|---|---|---|---|---|---|
| InternVL3.5 (1B–241B family) | 1B–241B | +10–16% | 4.05× | MMMU, MathVista | SOTA scores among open-source models; gap with GPT-5 narrowed |
| InternVL3.5-241B-A28B | 241B total, 28B active | Competitive with GPT-5 | n/a | Multimodal/general/agentic | State-of-the-art on text, reasoning, GUI, and embodied tasks |
InternVL3.5 consistently outperforms InternVL3, both in reasoning accuracy and computational efficiency (Wang et al., 25 Aug 2025). Performance evaluations on the MMMU and MathVista benchmarks demonstrate narrowed gaps with leading proprietary architectures.
6. Architecture and Release
InternVL3.5 encompasses dense and mixture-of-experts (MoE) architectures, ranging from 1B to 241B parameters. The largest configuration, InternVL3.5-241B-A28B, pairs an expansive vision encoder with large language-modeling capacity, yielding top-tier results across a spectrum of tasks.
All models, code, training recipes (including Cascade RL and ViR), and deployment instructions are publicly released, enabling full reproducibility and community extension. The open-source commitment is intended to facilitate both benchmarking and further advances in multimodal AI systems.
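For readers who want to try the released checkpoints, a loading sketch following the convention of earlier InternVL releases on Hugging Face is shown below. The repo id is an assumed placeholder; consult the official release page for exact names and usage.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL3_5-8B"  # hypothetical repo id; check the release page
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```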
7. Context and Implications within the InternVL Series
InternVL3.5 synthesizes elements from previous InternVL releases: the token efficiency strategies discovered in InternVL-X (Lu et al., 27 Mar 2025), the native multimodal pre-training paradigm of InternVL3 (Zhu et al., 14 Apr 2025), and fine-tuning heuristics from the broader literature. Its innovations in reinforcement learning, adaptive visual token compression, and distributed deployment collectively set new standards for open-source MLLMs.
This suggests ongoing research will further explore modularity in token routing, dynamic agentic reasoning, and scalable systems design for multimodal models. InternVL3.5’s capabilities and public resources provide a foundation for these developments in both academic and applied machine intelligence.