Mono-InternVL-1.5: Unified Multimodal Transformer
- Mono-InternVL-1.5 is a unified multimodal large language model that fully integrates visual and textual processing within a single Transformer architecture.
- It employs delta tuning and a progressive pre-training scheme (EViP and EViP++) to enhance learning stability and reduce training costs.
- Innovations like multimodal mixture-of-experts and fused CUDA kernels significantly boost inference speed and overall efficiency compared to modular models.
Mono-InternVL-1.5 is a monolithic multimodal LLM (MLLM) designed to tightly integrate visual encoding and language modeling within a single Transformer-based architecture. It represents a significant advance in monolithic MLLM design, focusing on efficiency, stability of learning, and cost-effectiveness, while maintaining competitive performance with state-of-the-art modular counterparts. The model was developed as an evolution of Mono-InternVL, introducing novel architectural refinements, a restructured pre-training regime, and substantial system-level optimizations (Luo et al., 16 Jul 2025).
1. Architectural Principles and Monolithic Integration
Mono-InternVL-1.5 departs from the traditional modular MLLM approach—where a visual encoder and LLM are trained independently and later fused—by embedding both visual and language processing fully within a unified Transformer architecture. The visual input $I$ is converted into visual tokens via a lightweight patch embedding layer:
$$x_v = \mathrm{PatchEmbed}(I) + p_v,$$
where $\mathrm{PatchEmbed}(\cdot)$ divides the image into fixed-size patches and $p_v$ is a learnable positional encoding.
Textual tokens, $x_t$, are derived using the original LLM tokenizer. The model then concatenates the visual and textual tokens into a single multimodal sequence, $x = [x_v, x_t]$, which is processed jointly through the Transformer layers.
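A minimal PyTorch sketch of this tokenization path is shown below, assuming a convolutional patch embedding and a simple visual-then-text concatenation; the module names, patch size, and hidden dimension are illustrative placeholders rather than the released implementation.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Lightweight patch embedding: split the image into patches and project
    each patch into the LLM hidden space (illustrative sketch)."""
    def __init__(self, patch_size=14, in_chans=3, hidden_size=2048, max_patches=4096):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, hidden_size,
                              kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.zeros(1, max_patches, hidden_size))  # learnable positions

    def forward(self, image):                    # image: (B, 3, H, W)
        x = self.proj(image)                     # (B, D, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)         # (B, N_v, D) visual tokens
        return x + self.pos[:, : x.size(1)]      # add learnable positional encoding

def build_multimodal_sequence(patch_embed, text_embed, image, text_ids):
    """Concatenate visual and textual tokens into one sequence for the shared
    Transformer; the visual-then-text ordering is an assumption of this sketch."""
    x_v = patch_embed(image)                     # visual tokens from patch embedding
    x_t = text_embed(text_ids)                   # textual tokens from the LLM embedding table
    return torch.cat([x_v, x_t], dim=1)          # joint multimodal sequence
```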
A key architectural innovation is the multimodal mixture-of-experts (MoE) structure embedded within each Transformer layer. The MoE statically routes visual tokens to visual experts (newly added parameterized neural modules), and text tokens to textual experts (the pre-existing feedforward networks of the LLM):
$$\mathrm{MoE}(x_i) = \begin{cases} \mathrm{FFN}_v(x_i), & \text{if } x_i \in x_v, \\ \mathrm{FFN}_t(x_i), & \text{if } x_i \in x_t, \end{cases}$$
where $\mathrm{FFN}_v$ denotes the visual expert and $\mathrm{FFN}_t$ the original textual feedforward network of the LLM.
In Mono-InternVL-1.5, additional modality-specific attention experts are incorporated into the multi-head attention module, further enhancing modality alignment.
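A minimal sketch of this static, token-type-aware routing is given below, assuming a boolean mask marking visual tokens; the class and argument names (and the intermediate size) are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class MultimodalMoEFFN(nn.Module):
    """Static multimodal MoE at the FFN level: visual tokens are routed to a
    newly added visual expert, text tokens to the original (frozen) LLM FFN."""
    def __init__(self, text_ffn: nn.Module, hidden_size=2048, inter_size=8192):
        super().__init__()
        self.text_ffn = text_ffn                   # pre-existing LLM feedforward network
        self.visual_ffn = nn.Sequential(           # new visual expert (per the paper,
            nn.Linear(hidden_size, inter_size),    # initialized from the LLM weights)
            nn.GELU(),
            nn.Linear(inter_size, hidden_size),
        )

    def forward(self, x, visual_mask):
        # x: (B, L, D); visual_mask: (B, L) bool, True where the token is visual
        out = torch.empty_like(x)
        out[visual_mask] = self.visual_ffn(x[visual_mask])    # visual tokens -> visual expert
        out[~visual_mask] = self.text_ffn(x[~visual_mask])    # text tokens -> textual expert
        return out
```

The same static routing pattern extends to the modality-specific attention experts added in the multi-head attention module.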
2. Pre-training Methodology: EViP and EViP++
The learning stability of monolithic architectures is often threatened by catastrophic forgetting and optimization instability when adapting LLMs for visual modalities. Mono-InternVL-1.5 addresses this via delta tuning and a progressive endogenous visual pre-training scheme.
Delta Tuning
Delta tuning entails augmenting the pre-trained LLM with new visual-specific parameters (visual experts in attention and FFN modules) while freezing the core LLM weights. Only the newly introduced visual parameters and patch embedding layers are updated during pre-training, thus preventing the erosion of language modeling ability.
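A hedged sketch of how such parameter freezing could be set up in PyTorch follows; the name substrings used to identify the visual modules are assumptions for this sketch.

```python
def apply_delta_tuning(model):
    """Freeze the pre-trained LLM weights and train only the newly introduced
    visual parameters (patch embedding and visual experts)."""
    # Substrings marking the new visual modules are illustrative assumptions.
    trainable_markers = ("patch_embed", "visual_ffn", "visual_attn")
    for name, param in model.named_parameters():
        param.requires_grad = any(marker in name for marker in trainable_markers)

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable / 1e6:.1f}M / {total / 1e6:.1f}M parameters")
```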
EViP and EViP++
Pre-training proceeds in a staged fashion:
- Concept Learning (S1.1):
- Trains on ~922 million noisy image-text pairs (e.g., from LAION-2B and COYO-700M).
- Only the patch embedding and visual experts are updated, with limited patch resolution.
- Semantic Learning (S1.2):
- Employs 258 million synthetic high-level image-caption pairs, generated by a strong teacher model (e.g., InternVL2-8B).
- Encourages absorption of world knowledge and semantics, with a higher patch count (i.e., higher image resolution) allowed.
- Alignment Learning (S1.3):
- Uses 143 million curated samples for downstream tasks (captioning, detection, OCR).
- Additionally unfreezes the multi-head attention modules for further modality integration, at the highest image resolution.
Mono-InternVL-1.5 further incorporates EViP++, an optimized variant of endogenous visual pre-training. EViP++ adds further visual experts to both the FFN and multi-head attention modules and reorganizes dataset use in a "less is more" fashion: training data volume is reduced by 58% and high-quality samples are prioritized, while performance is retained or exceeded.
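The staged schedule above can be summarized as a configuration sketch; the field names and structure are illustrative assumptions, while the data volumes and unfrozen components follow the figures quoted in the list.

```python
# Schematic EViP staging (illustrative; not the released training configuration).
EVIP_STAGES = {
    "S1.1_concept_learning": {
        "data": "~922M noisy image-text pairs (e.g., LAION-2B, COYO-700M)",
        "trainable": ["patch_embed", "visual_experts"],
        "resolution": "limited patch count",
    },
    "S1.2_semantic_learning": {
        "data": "258M synthetic captions from a teacher model (e.g., InternVL2-8B)",
        "trainable": ["patch_embed", "visual_experts"],
        "resolution": "higher patch count",
    },
    "S1.3_alignment_learning": {
        "data": "143M curated task samples (captioning, detection, OCR)",
        "trainable": ["patch_embed", "visual_experts", "multi_head_attention"],
        "resolution": "highest",
    },
}
# EViP++ additionally introduces visual experts in the MHA modules and trims
# the data mix by roughly 58%, prioritizing high-quality samples.
```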
3. Mixture-of-Experts Enhancements and Efficient Inference
Mono-InternVL-1.5’s unique multimodal MoE switch operates at both attention and FFN layers. The routing is static and token-type-aware: visual tokens activate visual experts, while text tokens use standard LLM experts. Visual expert parameters are initialized from the LLM to benefit from transfer while remaining disentangled.
Inference speed, a bottleneck in MoE-based monolithic models, is drastically improved through a fused CUDA kernel that executes the modality-specific computations for both token types jointly. The design partitions the input sequence into small blocks and conditionally activates compute threads as needed, improving GPU utilization relative to the default PyTorch implementation. The result is up to a 69% reduction in first-token latency and nearly 2× higher throughput than standard implementations.
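The fused kernel itself is CUDA code; the pure-PyTorch sketch below only illustrates the block-partitioned, modality-conditional dispatch it implements (the block size and function names are assumptions) and does not reproduce the kernel-level fusion that yields the reported speedups.

```python
import torch

def blockwise_moe_ffn(x, visual_mask, visual_ffn, text_ffn, block_size=128):
    """Conceptual sketch: walk the sequence in small blocks and run each
    modality's expert only when that block actually contains such tokens.
    The real implementation fuses this logic into a single CUDA kernel."""
    B, L, D = x.shape
    out = torch.empty_like(x)
    for start in range(0, L, block_size):
        end = min(start + block_size, L)
        blk, mask = x[:, start:end], visual_mask[:, start:end]
        blk_out = torch.empty_like(blk)
        if mask.any():                              # visual expert only if needed
            blk_out[mask] = visual_ffn(blk[mask])
        if (~mask).any():                           # text expert only if needed
            blk_out[~mask] = text_ffn(blk[~mask])
        out[:, start:end] = blk_out
    return out
```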
4. Performance Evaluation and Benchmarking
Extensive experiments were conducted across 15 multimodal and 4 NLP benchmarks. Key results:
- Mono-InternVL outperforms previous monolithic MLLMs on 12 out of 15 benchmarks.
- In OCR-centric evaluations (e.g., OCRBench), Mono-InternVL achieves a +114 point gain over Emu3.
- On visual question answering tasks (TextVQA, DocVQA, InfoVQA), the model demonstrates robust OCR/text recognition.
- When compared to modular models such as InternVL-1.5 and Qwen2VL, Mono-InternVL-1.5 matches or slightly surpasses them on several benchmarks, despite lower active parameter counts (1.8B vs. 2B+).
- Ablation studies suggest that delta tuning, additional visual experts, and EViP++ each contribute incrementally to downstream performance.
The following table summarizes key efficiency metrics:
| Model | Activated Params | First-token Latency Reduction | Throughput Improvement |
|---|---|---|---|
| Mono-InternVL-1.5 | 1.8B | up to 69% vs. modular | ~2× vs. PyTorch baseline |
| InternVL-1.5 (modular) | 2B+ | – | – |
5. Practical Implications and Use Cases
The architecture and training methodology of Mono-InternVL-1.5 facilitate deployment scenarios where latency, training and inference cost, and integration simplicity are critical. Key applications include:
- Multimodal dialogue and question answering, combining scene/text/image understanding with language generation.
- Document image OCR and complex scientific/mathematical reasoning from visual inputs.
- Real-time, high-throughput multimodal assistants and embedded systems where low latency is essential.
The reduction in training data and parameter update scope from EViP++ allows for faster retraining and domain adaptation with reduced resource requirements, broadening accessibility to research and commercial deployment.
6. Comparison with Modular Counterparts and Related Work
Mono-InternVL-1.5 marks a departure from the modular MLLM paradigm represented by InternVL-1.5, which pairs a powerful vision encoder (InternViT-6B) with an LLM via a connector module. Compared to this separation, the monolithic approach:
- Delivers equivalent or slightly improved multimodal benchmark performance.
- Achieves pronounced advantages in inference speed (notably first-token latency).
- Preserves language skills via frozen LLM weights and avoids catastrophic forgetting.
- Requires fewer active parameters owing to shared weights and selective expert activation.
- Reduces hardware deployment complexity, especially in environments where module integration or ensemble strategies are impractical.
7. Future Directions
Potential future research includes:
- Exploration of alternative or adaptive routing strategies within the Mixture-of-Experts framework to further enhance efficiency and modeling capacity.
- Expansion of EViP++ to include richer, more diverse, or multilingual data.
- Further optimization of fused kernels and broader adoption for LLMs with multimodal capabilities.
- Investigation into more compact variant architectures, possibly for low-resource or edge deployments, leveraging the monolithic paradigm.
In summary, Mono-InternVL-1.5 establishes a new paradigm for efficient, high-performance monolithic multimodal LLMs by unifying vision and language modeling within a single Transformer, employing progressive delta-tuned pre-training, and integrating system-level optimizations that reduce operational cost without sacrificing accuracy (Luo et al., 16 Jul 2025).