Mono-InternVL: Monolithic MLLM Innovation

Updated 21 July 2025
  • Mono-InternVL is a unified multimodal large language model that directly fuses visual encoding with language decoding in one Transformer architecture.
  • It employs targeted delta tuning and a multimodal mixture-of-experts structure to enhance visual learning while preserving pre-trained language abilities.
  • The Mono-InternVL-1.5 variant demonstrates benchmark improvements, reduced training data needs, and lower latency via fused CUDA optimizations.

Mono-InternVL represents a line of monolithic multimodal large language models (MLLMs) that integrate visual encoding and language decoding directly into a single Transformer-based architecture. Diverging from modular approaches that align a distinct visual encoder with a pre-trained LLM, Mono-InternVL and its successors introduce an embedded visual parameter space, enabling stable visual acquisition through targeted delta tuning and a multimodal mixture-of-experts (MMoE) structure. The development of Mono-InternVL, culminating in Mono-InternVL-1.5, addresses persistent challenges in monolithic MLLMs such as unstable optimization and catastrophic forgetting, while offering substantial efficiency benefits and performance that is competitive with, and in some cases surpasses, the state of the art on benchmark tasks (Luo et al., 10 Oct 2024, Luo et al., 16 Jul 2025).

1. Architectural Framework: Monolithic vs. Modular MLLMs

Mono-InternVL is architected as a unified MLLM, integrating both image and text processing within a single Transformer decoder. The image is first “patchified” and embedded via a dedicated patch embedding layer and MLP, yielding visual token representations $x_v = \mathrm{MLP}(\mathrm{PatchEmbed}(I) + \mathrm{PE})$, where $I$ is the image and $\mathrm{PE}$ denotes a learnable positional encoding. Text is tokenized with the standard tokenizer, $x_t = \mathrm{Tokenizer}(T)$. The model concatenates the visual and textual embeddings, $x_m = \mathrm{concat}(x_v, x_t)$, and routes this multimodal sequence through each Transformer layer. A two-stage process is employed per layer:

  • Multi-Head Attention (MHA): Applied to the normalized sequence.
  • Multimodal Mixture-of-Experts (MMoE): Tokens are routed to either a visual feedforward network ($\mathrm{FFN}_v$) or a textual feedforward network ($\mathrm{FFN}_t$) according to modality: $\mathrm{MMoE}(x) = \begin{cases} \mathrm{FFN}_v(x) & \text{if } x \in x_v \\ \mathrm{FFN}_t(x) & \text{if } x \in x_t \end{cases}$ (a minimal code sketch of this routing follows the list).
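
As a concrete illustration of this per-token routing, below is a minimal PyTorch-style sketch. Module names (`patch_embed`, `ffn_v`, `ffn_t`), dimensions, and the boolean `visual_mask` convention are assumptions for exposition, not the released implementation.

```python
import torch
import torch.nn as nn


class MMoE(nn.Module):
    """Modality-routed feed-forward block: visual tokens use FFN_v, text tokens use FFN_t."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Illustrative two-layer FFNs; in the described design, the textual FFN is the
        # frozen pre-trained block and the visual FFN is the newly added expert.
        self.ffn_v = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ffn_t = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, visual_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); visual_mask: (batch, seq) bool, True at image-token positions.
        out = torch.empty_like(x)
        out[visual_mask] = self.ffn_v(x[visual_mask])    # route visual tokens to the visual expert
        out[~visual_mask] = self.ffn_t(x[~visual_mask])  # route text tokens to the textual expert
        return out


def embed_multimodal(image_patches, text_ids, patch_embed, mlp, pos_emb, tok_embed):
    """Hypothetical helper mirroring x_m = concat(MLP(PatchEmbed(I) + PE), x_t)."""
    x_v = mlp(patch_embed(image_patches) + pos_emb)  # visual token embeddings
    x_t = tok_embed(text_ids)                        # text token embeddings
    return torch.cat([x_v, x_t], dim=1)
```

Here the routing is expressed as masked indexing; Section 4 notes that the deployed models fuse this computation into a single CUDA kernel.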

In Mono-InternVL-1.5, the MMoE design is extended to include modality-specific attention heads as well, employing distinct linear projections ($\mathrm{Linear}_v$, $\mathrm{Linear}_t$) for the attention inputs: $q = \begin{cases} \mathrm{Linear}_v(x) & \text{if } x \text{ is visual} \\ \mathrm{Linear}_t(x) & \text{if } x \text{ is text} \end{cases}$ This parameter isolation enables effective visual learning without disrupting pre-trained linguistic capacity, a core difference from modular baselines such as InternVL-1.5, where modality fusion is external (Luo et al., 16 Jul 2025).
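
A corresponding sketch of the modality-specific query projection is shown below; the same pattern would apply to the key and value projections. The class name, the weight-copy initialization, and the masking convention are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn


class ModalitySplitProjection(nn.Module):
    """Per-modality linear projection (e.g., for queries): text tokens reuse the
    pre-trained projection, visual tokens use a separate, delta-tuned projection."""

    def __init__(self, d_model: int):
        super().__init__()
        self.linear_t = nn.Linear(d_model, d_model, bias=False)  # frozen pre-trained weights
        self.linear_v = nn.Linear(d_model, d_model, bias=False)  # visual attention expert
        # Assumed initialization: start the visual expert from the text-side weights.
        self.linear_v.weight.data.copy_(self.linear_t.weight.data)
        self.linear_t.weight.requires_grad = False  # keep the language side intact

    def forward(self, x: torch.Tensor, visual_mask: torch.Tensor) -> torch.Tensor:
        q = torch.empty_like(x)
        q[visual_mask] = self.linear_v(x[visual_mask])    # Linear_v for visual tokens
        q[~visual_mask] = self.linear_t(x[~visual_mask])  # Linear_t for text tokens
        return q
```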

2. Visual Parameter Space and Delta Tuning

A fundamental principle throughout all Mono-InternVL releases is the introduction of a “visual parameter space” embedded within the otherwise frozen pre-trained LLM. Visual modules, including patch embeddings, MMoE visual experts, and (in 1.5) visual attention experts, are initialized separately and then delta-tuned on large-scale multimodal data. The rest of the LLM remains frozen, mitigating catastrophic forgetting of language skills while efficiently acquiring visual abilities: $\operatorname{argmin}_{\theta_v} \mathcal{L}(\mathcal{F}_{\mathrm{LLM}}(x_m; \theta, \theta_v), \hat{y})$, where $\theta$ are the frozen LLM parameters and $\theta_v$ are the trainable visual parameters.

This strategy ensures that visual learning is primarily absorbed by the specialized visual submodules, while language knowledge—critical for performance on text-rich benchmarks—remains intact (Luo et al., 10 Oct 2024, Luo et al., 16 Jul 2025).
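
A minimal sketch of how such delta tuning could be configured in PyTorch follows. The name prefixes used to identify visual parameters are hypothetical and would depend on the actual module naming; the optimizer and learning rate are placeholders.

```python
import torch


def configure_delta_tuning(model: torch.nn.Module,
                           visual_prefixes=("patch_embed", "ffn_v", "linear_v")):
    """Freeze the pre-trained LLM parameters (theta) and keep only the embedded
    visual parameters (theta_v) trainable, matching the objective above."""
    trainable = []
    for name, param in model.named_parameters():
        is_visual = any(prefix in name for prefix in visual_prefixes)
        param.requires_grad = is_visual
        if is_visual:
            trainable.append(param)
    # Only theta_v receives gradient updates; the language backbone stays untouched.
    return torch.optim.AdamW(trainable, lr=1e-4)
```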

3. Endogenous Visual Pre-training (EViP and EViP++)

Mono-InternVL introduces a progressive visual pre-training regimen termed “Endogenous Visual Pre-training” (EViP), designed to maximize the capacity of the added visual experts; a schematic sketch of the curriculum follows the list below:

  • Step 1 (Concept Learning): The model is exposed to ~922M noisy image-text pairs (e.g., from Laion-2B, Coyo-700M) with simple captioning prompts, focusing on basic object and scene recognition.
  • Step 2 (Semantic Learning): Using ~258M synthetic, richer captions (generated by InternVL2-8B), the model internalizes complex world knowledge and relationships.
  • Step 3 (Alignment Learning): High-quality, task-specific datasets (captioning, detection, OCR) are used to fine-tune alignments for specific downstream benchmarks. Here, visual attention layers are also unfrozen.
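
The staged recipe can be summarized schematically as a configuration like the one below. The field names are illustrative, not taken from released code; the data scales and the modules unfrozen at each stage paraphrase the description above.

```python
# Schematic EViP curriculum (illustrative structure; scales as reported above).
EVIP_STAGES = [
    {
        "stage": "concept_learning",
        "data": "~922M noisy image-text pairs (e.g., Laion-2B, Coyo-700M)",
        "objective": "simple captioning prompts for basic object/scene recognition",
        "trainable": ["patch_embedding", "visual_ffn_experts"],
    },
    {
        "stage": "semantic_learning",
        "data": "~258M synthetic captions generated by InternVL2-8B",
        "objective": "richer world knowledge and relational semantics",
        "trainable": ["patch_embedding", "visual_ffn_experts"],
    },
    {
        "stage": "alignment_learning",
        "data": "high-quality task data (captioning, detection, OCR)",
        "objective": "alignment to downstream benchmarks",
        "trainable": ["patch_embedding", "visual_ffn_experts", "visual_attention_layers"],
    },
]
```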

Mono-InternVL-1.5 enhances this by adopting “EViP++.” This variant increases visual expert capacity (notably adding visual attention experts), re-organizes the pre-training curriculum for maximal efficiency, and targets a significant reduction in total training and inference costs (Luo et al., 16 Jul 2025).

4. Multimodal Mixture-of-Experts (MMoE) Structuring

The multimodal mixture-of-experts architecture is central to model stability and efficiency. After MHA and normalization, each Transformer layer routes tokens by modality: $x_m^{(l)} = x_m^{(l')} + \mathrm{MMoE}(\mathrm{RMSNorm}(x_m^{(l')}))$, with the routing determined per token. Mono-InternVL-1.5 further splits the attention-module projections to maintain sharper modality-specific specialization. During deployment, a fused CUDA kernel for the MMoE block enables efficient operation by processing blocks containing both token types in parallel (Luo et al., 16 Jul 2025).
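
Putting the pieces together, one decoder layer can be sketched as below, mirroring the pre-norm residual structure of the equation above. Here `mha` stands in for the attention block (with modality-split projections in 1.5) and `mmoe` for the routed feed-forward experts; the fused deployment kernel performs the same per-token routing in a single pass. Class and argument names are assumptions.

```python
import torch
import torch.nn as nn


class MonoInternVLLayer(nn.Module):
    """One decoder layer: pre-norm residual attention followed by modality-routed MMoE."""

    def __init__(self, d_model: int, mha: nn.Module, mmoe: nn.Module):
        super().__init__()
        self.norm1 = nn.RMSNorm(d_model)  # needs a recent PyTorch; any RMSNorm implementation works
        self.norm2 = nn.RMSNorm(d_model)
        self.mha = mha    # multi-head attention block
        self.mmoe = mmoe  # modality-routed feed-forward experts

    def forward(self, x: torch.Tensor, visual_mask: torch.Tensor) -> torch.Tensor:
        # x^(l') = x^(l-1) + MHA(RMSNorm(x^(l-1)))
        x = x + self.mha(self.norm1(x), visual_mask)
        # x^(l)  = x^(l')  + MMoE(RMSNorm(x^(l')))
        x = x + self.mmoe(self.norm2(x), visual_mask)
        return x
```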

5. Empirical Benchmarks and Efficiency

Extensive evaluation across 15–16 multimodal benchmarks—including VQA, OCRBench, mathematical reasoning, and general vision–language tasks—demonstrates that Mono-InternVL and Mono-InternVL-1.5 outperform existing monolithic MLLMs on the majority of metrics. Reported outcomes include:

  • Up to +114-point improvement over Emu3 on OCRBench.
  • Comparable multimodal performance relative to modular InternVL-1.5, while reducing first-token latency by up to 69%.
  • Mono-InternVL-1.5 achieves similar accuracy using only 42% of the training data of the original Mono-InternVL, reflecting significant data efficiency gains (Luo et al., 16 Jul 2025).

The fused CUDA kernel implementation for MoE operations results in 1.7–2.3× faster MoE computation compared to standard PyTorch implementations, further substantiating the claim of reduced training and inference costs.

| Model | Benchmarks Outperformed | Data Tokens Used | First-Token Latency Reduction |
|---|---|---|---|
| Mono-InternVL | 12/15 | 1.1B | 67% |
| Mono-InternVL-1.5 | 12/15 | 0.5B | 69% |

6. Comparative Analysis with Modular Paradigms

Where modular MLLMs (such as InternVL-1.5 and its family) deploy discrete, sometimes independently pre-trained, visual and language modules joined through connectors or aligners, Mono-InternVL’s monolithic architecture allows direct, sparse, and parameter-efficient fusion. The strategic use of delta tuning and modality-specific expert routing ensures that high-quality language capabilities are preserved. Empirical data shows that this enables comparable or even superior performance on complex benchmarks—especially for tasks requiring rapid generation (reduced “first token” latency and overall inference time).

A principal implication is that monolithic MoE-based MLLMs are now viable for deployment in latency-sensitive and resource-constrained environments, without sacrificing accuracy or versatility (Luo et al., 10 Oct 2024, Luo et al., 16 Jul 2025).

7. Efficiency, Limitations, and Future Prospects

Mono-InternVL’s advances in training and inference efficiency are attributable to both architectural innovations (expert isolation, fused operations) and improved curriculum design (EViP++'s quality-over-quantity principle). The model achieves strong task performance using less data, a key advantage for resource-aware training regimes.

However, the monolithic approach also entails certain trade-offs:

  • Potential upper bounds on model scale relative to modular systems, which more easily accommodate arbitrary backbone replacements.
  • Reliance on carefully tuned delta-training to avoid subtle errors in visual–text alignment.

Future directions outlined in Mono-InternVL-1.5 include scaling up the visual parameter space, extending the method to richer modalities, and integrating further efficiency enhancements (e.g., dynamic visual token pruning, improved modality fusion techniques).

Conclusion

Mono-InternVL and Mono-InternVL-1.5 mark a significant step in monolithic MLLM research, effectively embedding visual experts and adopting delta-tuned, progressive visual pre-training to achieve efficient, broadly competitive vision–language integration. Their mixture-of-experts architecture, improved with visual attention experts and optimized CUDA kernels, provides both practical and theoretical value by delivering strong benchmark performance, low latency, and reduced training cost, thus shaping new paradigms in integrated multimodal AI systems (Luo et al., 10 Oct 2024, Luo et al., 16 Jul 2025).
