Endogenous Visual Pre-training (EViP)

Updated 21 July 2025
  • Endogenous Visual Pre-training (EViP) is a visual learning strategy that leverages inherent patterns from unlabeled or weakly labeled data.
  • It utilizes delta tuning, mixture-of-experts integration, and progressive learning stages to efficiently align visual and language modalities.
  • EViP enhances data efficiency and inference speed, underpinning state-of-the-art performance in multimodal large language models.

Endogenous Visual Pre-training (EViP) refers to a family of strategies for visual representation learning in which the model draws upon intrinsic, often unlabeled, visual data structure—frequently using large-scale, noisy, or weakly labeled datasets—to learn visual priors and align with LLMs. The fundamental principle behind EViP is to maximize the extraction of visual knowledge from within the training data itself, reducing dependence on external annotation or carefully curated supervision. EViP encompasses approaches spanning self-supervised learning, multimodal and mixture-of-experts architectures, data-efficient pre-training regimes, and generative modeling of visual priors, and is recognized as a foundational technique in recent monolithic multimodal LLMs (MLLMs).

1. Foundational Principles

The core concept of EViP is to conduct visual pre-training by leveraging “endogenous” signals—the patterns, correlations, and relationships that are implicit within non-curated or weakly supervised visual data. This distinguishes EViP from traditional exogenous strategies that require manual labeling or explicit domain-specific alignment. The EViP paradigm includes:

  • Delta tuning: Only newly introduced visual parameters (e.g., visual experts, patch embeddings, or visual attention heads) are updated during visual pre-training, leaving most pre-trained language parameters frozen to prevent catastrophic forgetting and unstable optimization (Luo et al., 10 Oct 2024, Luo et al., 16 Jul 2025).
  • Mixture-of-experts (MoE) design: Visual and textual data are processed by specialized experts within a loosely coupled or integrated model to preserve modality-specific capacities while enabling cross-modal interaction, typically within the feed-forward or attention layers.
  • Progressive pre-training: EViP employs staged learning, beginning with the ingestion of massive, noisy datasets for basic concept acquisition, followed by refinement on high-quality or synthetic captions for semantic knowledge, and culminating with alignment on downstream, task-specific data.

This endogenous approach is designed to fully exploit within-data knowledge and to maximize the efficiency, robustness, and modality alignment of large multimodal LLMs.

2. EViP Methodologies

EViP is instantiated through a progression of learning stages, visual expert designs, and parameter tuning strategies. Notable methodologies include:

  • Visual parameter space augmentation: A pre-trained LLM is augmented with dedicated visual parameters, such as patch embedding layers and sets of visual experts. During pre-training, only these visual parameters (denoted as Δθ) are optimized, formalized as:

$$\arg\min_{\Delta\theta} \; \mathcal{L}\big(\mathcal{F}_{\text{LLM}}(\mathbf{x}_m;\, \theta, \theta_v),\ \hat{y}\big)$$

where $\mathcal{F}_{\text{LLM}}$ uses the fixed language parameters $\theta$ and the newly introduced visual parameters $\theta_v$ (collectively $\Delta\theta$), which are the only parameters updated. (Luo et al., 10 Oct 2024, Luo et al., 16 Jul 2025)
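A minimal PyTorch-style sketch of this delta-tuning setup is given below. It is not the authors' implementation: the `visual_parameters()` accessor, the model's call signature, and the hyperparameters are illustrative assumptions; only the core idea (freeze $\theta$, optimize only $\theta_v$) follows the formulation above.

```python
import torch
import torch.nn.functional as F

def setup_delta_tuning(model):
    """Freeze the language parameters θ; keep only the new visual parameters θ_v trainable."""
    for p in model.parameters():
        p.requires_grad = False                     # pre-trained LLM weights θ stay fixed
    for p in model.visual_parameters():             # hypothetical accessor: patch embedding + visual experts
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)    # the optimizer only ever sees Δθ

def delta_tune_step(model, optimizer, batch):
    """One step of  argmin_{Δθ} L(F_LLM(x_m; θ, θ_v), ŷ)."""
    logits = model(batch["input_ids"], pixel_values=batch["pixel_values"])  # assumed call signature
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), batch["labels"].view(-1))
    loss.backward()                                 # gradients reach only the unfrozen visual parameters
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```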

  • Mixture-of-experts integration: In each transformer layer, after self-attention, a multimodal MoE routes visual and text tokens to distinct experts:

$$\text{MMoE}(x) = \begin{cases} \text{FFN}_v(x), & x \in \{\text{visual tokens}\} \\ \text{FFN}_t(x), & x \in \{\text{text tokens}\} \end{cases}$$

This design is extended in Mono-InternVL-1.5 to the multi-head attention mechanism, utilizing vision-specific experts for query, key, and value projections (Luo et al., 16 Jul 2025).
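A minimal sketch of such modality-routed experts in the feed-forward position follows, assuming a boolean `visual_mask` marking which tokens came from the image; the class and argument names are illustrative, and the attention-level routing of Mono-InternVL-1.5 is not shown.

```python
import torch
import torch.nn as nn

class MultimodalMoE(nn.Module):
    """Route each token to a modality-specific feed-forward expert:
    FFN_v for visual tokens, FFN_t for text tokens (the MMoE(x) cases above)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.ffn_v = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ffn_t = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, visual_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); visual_mask: (batch, seq) bool, True where the token is visual
        out = torch.empty_like(x)
        out[visual_mask] = self.ffn_v(x[visual_mask])      # visual tokens -> visual expert
        out[~visual_mask] = self.ffn_t(x[~visual_mask])    # text tokens -> text expert
        return out
```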

  • Progressive learning process:

1. Concept learning: Learning basic visual features from vast noisy datasets (e.g., Laion-2B, Coyo-700M).
2. Semantic learning: Refinement via higher-quality synthetic captions and increased patch counts.
3. Alignment learning: Task-specific fine-tuning (e.g., for captioning, detection, OCR) while unfreezing a small number of additional parameters to improve modality alignment. (Luo et al., 10 Oct 2024, Luo et al., 16 Jul 2025)
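The staged recipe can be summarized as a simple configuration. The sketch below uses only the stage names, example corpora, and goals stated above; the field layout and any values beyond those are illustrative assumptions.

```python
# Schematic EViP schedule; stage names, corpora, and goals follow the text above,
# while the dictionary layout itself is illustrative.
EVIP_STAGES = [
    {
        "name": "concept_learning",
        "data": ["Laion-2B", "Coyo-700M"],               # massive, noisy image-text pairs
        "trainable": ["patch_embed", "visual_experts"],  # delta tuning: language weights stay frozen
        "goal": "acquire basic visual concepts",
    },
    {
        "name": "semantic_learning",
        "data": ["high-quality synthetic captions"],     # plus increased patch counts per image
        "trainable": ["patch_embed", "visual_experts"],
        "goal": "refine semantic visual knowledge",
    },
    {
        "name": "alignment_learning",
        "data": ["captioning", "detection", "OCR"],      # downstream, task-specific data
        "trainable": ["patch_embed", "visual_experts", "small additional subset"],
        "goal": "align visual representations with downstream tasks",
    },
]

for stage in EVIP_STAGES:
    print(f"Stage {stage['name']}: train {stage['trainable']} on {stage['data']}")
```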

  • Delta tuning: Fine-tuning restricts learning to newly introduced components, mitigating catastrophic forgetting of the pre-trained LLM’s linguistic domain and promoting stable integration of visual knowledge (Luo et al., 10 Oct 2024, Luo et al., 16 Jul 2025).

3. Architectural and Algorithmic Advances

EViP models are characterized by several architectural features and algorithmic techniques:

  • Monolithic MLLMs: EViP underpins models where a single LLM backbone simultaneously encodes visual and linguistic content via a unified multimodal mixture-of-experts framework. Visual tokens are embedded via a patchify-and-project process and passed through modality-specific experts embedded in both the feed-forward and (in recent variants) the attention layers; a minimal sketch of the patchify-and-project step follows this list.
  • Visual attention experts: Mono-InternVL-1.5 introduces visual experts in the multi-head attention layers, enabling more effective, fine-grained visual–language alignment and improving overall model efficiency (Luo et al., 16 Jul 2025).
  • Efficient inference with fused kernels: For deployment, EViP-enhanced models incorporate fused CUDA kernels for efficient multimodal MoE computation, reducing run-time latency and increasing throughput while preserving inference accuracy (Luo et al., 16 Jul 2025).
  • Efficient data organization: EViP++ employs a “less is more” curation strategy, reducing pre-training data volumes by favoring high-quality over excessively large, noisy sets, thereby minimizing redundant computation with negligible (or positive) effects on performance.
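As referenced in the monolithic-MLLMs item above, visual tokens enter the shared backbone through a patchify-and-project step. The sketch below shows one common realization (a strided convolution acting as the patch embedding); the class name, patch size, and dimensions are illustrative rather than the exact Mono-InternVL configuration.

```python
import torch
import torch.nn as nn

class PatchifyAndProject(nn.Module):
    """Turn an image into a sequence of visual tokens in the LLM's embedding space,
    a common realization of the patchify-and-project step described above."""

    def __init__(self, d_model: int, patch_size: int = 14, in_channels: int = 3):
        super().__init__()
        # A strided convolution both splits the image into patches and projects them to d_model.
        self.proj = nn.Conv2d(in_channels, d_model, kernel_size=patch_size, stride=patch_size)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # pixel_values: (batch, 3, H, W) -> visual tokens: (batch, num_patches, d_model)
        patches = self.proj(pixel_values)                  # (batch, d_model, H/ps, W/ps)
        return patches.flatten(2).transpose(1, 2)
```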

4. Performance Outcomes and Empirical Findings

Empirical studies demonstrate that EViP-equipped architectures match or surpass the performance of both monolithic and modular MLLMs across a variety of multimodal benchmarks. Notable results include:

  • Mono-InternVL-1.5 achieves a +114-point improvement over Emu3 on OCRBench, and maintains competitive or superior results on VQA, MathVista, and general multimodal benchmarks (Luo et al., 16 Jul 2025).
  • First-token latency reductions of up to 69% compared to modular architectures, attributed to the fused MoE CUDA kernel and monolithic design (Luo et al., 16 Jul 2025).
  • Data efficiency: EViP++ attains strong transfer performance using less than half the pre-training data required by earlier EViP iterations, indicating a high degree of robustness to over-parameterization or label noise (Luo et al., 16 Jul 2025).

The following table summarizes representative results:

| Model | OCRBench Improvement | First-token Latency Reduction | Pre-training Data Used |
| --- | --- | --- | --- |
| Mono-InternVL | +80 over Emu3 | Up to 67% | ~1.3B images |
| Mono-InternVL-1.5 | +114 over Emu3 | Up to 69% | ~495M images |

(Luo et al., 10 Oct 2024, Luo et al., 16 Jul 2025)

5. Data Efficiency and Cost Considerations

EViP and EViP++ emphasize scalable, efficient visual pre-training:

  • Reduced data requirements: EViP++ reorganizes data staging, cutting down the number of noisy pre-training samples in the first stage (from 922M to ~250M) and in the semantic refinement stage to ~150M, preserving competitive downstream results.
  • Efficient learning steps: The high-quality, smaller-scale pre-training data enables faster and less compute-intensive pre-training cycles.
  • Inference cost: Through architectural improvements such as fused CUDA kernels for MoE computation and end-to-end delta tuning, inference speed is improved alongside cost reduction at deployment scale (Luo et al., 16 Jul 2025).

6. Relationship to Prior Visual Pre-training Paradigms

EViP-based strategies contrast with previous approaches in several respects:

  • Unlike modular MLLMs, which often require two-stage training (separate visual and linguistic pre-training), EViP models unify the parameter space and maintain a single, monolithic pipeline for vision-language alignment.
  • Compared to full-parameter finetuning approaches, delta tuning shields the language backbone from catastrophic forgetting and optimization instability.
  • EViP’s reliance on endogenous data patterns and weak supervision stands in contrast to fully supervised or heavily task-aligned pre-training, aiming for universality and robustness across noisy and varied visual inputs.

7. Outlook and Applications

EViP has become central to state-of-the-art MLLMs. It supports scalable deployment for tasks including visual question answering, OCR, scene captioning, and vision-language reasoning, with demonstrated robustness on both standard and challenging evaluation sets (Luo et al., 10 Oct 2024, Luo et al., 16 Jul 2025).

A plausible implication is that further advances in EViP (such as broader application of MoE to attention and other architectural elements, and further data curation refinements) could lead to even more data- and compute-efficient multimodal foundation models. The design principles of freezing language parameters, progressively tuning visual experts, and relying on endogenous visual structure are likely to inform future research on robust, universal multimodal pre-training.
