Large Pre-trained Multi-modal Models

Updated 4 August 2025
  • Large pre-trained multi-modal models are deep learning systems designed to process and align data from different modalities using transformer architectures and massive, curated datasets.
  • They employ advanced scaling techniques such as mixed-precision training and Mixture-of-Experts layers to handle billions of parameters, improving performance on tasks like image captioning and cross-modal retrieval.
  • State-of-the-art architectures utilize both early and late fusion strategies with cross-modal attention and parameter-efficient adaptations, achieving competitive benchmark results and broad real-world applications.

Large pre-trained multi-modal models are deep learning architectures characterized by their ability to process, align, and generate data across multiple modalities—such as images, text, audio, and video—at scale. These models, typically built on transformer architectures, are trained on extensive datasets containing paired (and often single-modality) examples, enabling them to learn general representations and interactions between modalities. The resulting models underpin state-of-the-art performance in a wide range of tasks, from image captioning and cross-modal retrieval to multi-modal dialogue and content generation, and have begun to close performance gaps on both supervised and zero-shot benchmarks.

1. Data Foundations and Scaling

Large pre-trained multi-modal models rely on massive multimodal corpora for training. For instance, the M6 model is pre-trained on the M6-Corpus, which comprises nearly 1.9 terabytes of images and 292 gigabytes of text, including encyclopedia entries, forums, Common Crawl, and e-commerce sites, with careful filtering for both textual and visual quality (Lin et al., 2021). Datasets are often designed for both paired (image–text, video–text) and unpaired (single modality) samples. Real-world data sources often exhibit “weak correlations” between modalities (as in the RUC-CAS-WenLan dataset underlying BriVL (Huo et al., 2021)), necessitating either careful curation or robust learning techniques to handle loosely matching pairs and minimize the effect of noise.
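A generic curation step of this kind can be sketched as a similarity-threshold filter over candidate pairs. The scoring function and threshold below are illustrative placeholders only, not the specific filters used for M6-Corpus or RUC-CAS-WenLan:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Pair:
    image_path: str
    caption: str

def filter_weak_pairs(pairs: List[Pair],
                      similarity: Callable[[Pair], float],
                      threshold: float = 0.25) -> List[Pair]:
    """Keep only image-text pairs whose cross-modal similarity clears a threshold.
    `similarity` could be any pre-trained dual-encoder score; the threshold is
    an illustrative value, not one reported by the cited corpora."""
    return [p for p in pairs if similarity(p) >= threshold]
```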

Scaling models to billions (or even hundreds of billions) of parameters, as in M6-10B/100B and InternVL3-78B, is integral for leveraging the richness and heterogeneity of such data. To enable this, advanced optimization—such as mixed-precision training, activation checkpointing, model/data parallelism, and, at extreme scale, sparse Mixture-of-Experts (MoE) layers—is routinely employed. For MoE scaling, token-specific routing determines which subset of lightweight “experts” processes each input, often according to:

$$p(x)_i = \frac{\exp(g(x)_i)}{\sum_j \exp(g(x)_j)}, \qquad y = \sum_{i \in \text{Top-k}} p(x)_i \cdot E_i(x)$$

where $g(x)_i$ is the gating score for the $i$-th expert and $E_i(x)$ is that expert's output (Lin et al., 2021).
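A minimal sketch of this top-k routing, written against the equation above; the module names and dimensions are illustrative and not taken from M6:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Top-k gated Mixture-of-Experts layer: the gate scores every expert per
    token, and only the k highest-probability experts are evaluated."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)                     # produces g(x)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:               # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)                       # p(x)_i over all experts
        topk_p, topk_idx = probs.topk(self.k, dim=-1)                 # keep only the Top-k experts
        y = torch.zeros_like(x)
        for slot in range(self.k):                                    # y = sum_{i in Top-k} p(x)_i * E_i(x)
            idx, w = topk_idx[:, slot], topk_p[:, slot].unsqueeze(-1)
            for e in idx.unique():
                mask = idx == e
                y[mask] += w[mask] * self.experts[int(e)](x[mask])
        return y

# usage: route 10 token embeddings of width 64 through 8 experts, 2 per token
layer = TopKMoE(d_model=64, d_hidden=256, n_experts=8, k=2)
out = layer(torch.randn(10, 64))                                      # shape (10, 64)
```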

2. Architectures and Pretraining Paradigms

Transformer-based encoders and encoder–decoders dominate in multi-modal modeling (Lin et al., 2021, Wang et al., 2023). Approaches differ in modality integration:

  • Single-stream (or early-fusion): Inputs from all modalities are concatenated (after appropriate patchification/embedding) and jointly processed (e.g., in Oscar, VisualBERT).
  • Cross-stream (late-fusion/dual-tower): Separate modality-specific encoders produce embeddings, later aligned or fused (e.g., CLIP, BriVL (Huo et al., 2021)).
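The two integration patterns can be contrasted in a minimal PyTorch sketch; the encoders and dimensions are illustrative stand-ins rather than any of the cited architectures:

```python
import torch
import torch.nn as nn

d = 256

# Single-stream (early fusion): concatenate patch and token embeddings,
# then run one joint transformer over the combined sequence.
joint_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)

img_tokens = torch.randn(1, 49, d)     # e.g. 7x7 patch embeddings
txt_tokens = torch.randn(1, 16, d)     # token embeddings
fused = joint_encoder(torch.cat([img_tokens, txt_tokens], dim=1))   # (1, 65, d)

# Cross-stream (late fusion / dual tower): separate modality-specific encoders,
# aligned afterwards, e.g. by comparing pooled embeddings.
img_tower = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)
txt_tower = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)

img_emb = img_tower(img_tokens).mean(dim=1)                 # (1, d) pooled image embedding
txt_emb = txt_tower(txt_tokens).mean(dim=1)                 # (1, d) pooled text embedding
similarity = torch.cosine_similarity(img_emb, txt_emb)      # alignment score
```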

Innovations include variable positional encoding schemes—such as Variable Visual Position Encoding (V2PE) in InternVL3—that support long or high-resolution visual contexts by fractionalizing position increments for visual tokens, permitting more tokens per positional window while preserving spatial relationships (Zhu et al., 14 Apr 2025).
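The core idea of fractional position increments can be illustrated with a toy assignment function; the increment value and interface are assumptions for illustration, not the V2PE formulation itself:

```python
from typing import List, Sequence

def assign_positions(token_types: Sequence[str], delta: float = 0.25) -> List[float]:
    """Assign position values where text tokens advance by 1 and visual tokens
    by a smaller increment `delta`, so long visual contexts consume fewer
    positions while relative order is preserved."""
    positions, pos = [], 0.0
    for t in token_types:
        positions.append(pos)
        pos += delta if t == "visual" else 1.0
    return positions

# Example: a text token, four visual patch tokens, then more text.
print(assign_positions(["text", "visual", "visual", "visual", "visual", "text"]))
# [0.0, 1.0, 1.25, 1.5, 1.75, 2.0]
```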

Pretraining typically comprises multitask objectives that span unimodal and cross-modal learning:

  • Masked language modeling (MLM), masked object/region modeling, masked modality prediction, and cross-modal matching (e.g., image–text, audio–video).
  • Cross-modal contrastive learning (as in BriVL, which adapts MoCo to maximize negative sampling efficiency for weakly correlated image–text pairs); a minimal sketch of this objective follows the list.
  • Autoregressive language modeling that jointly emits language and discrete visual tokens (as in image generation tasks; M6 employs VQGAN/VQVAE-discretized image codes for text-to-image generation).
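As an illustration of the contrastive objective in the second bullet, a minimal in-batch (CLIP-style) symmetric InfoNCE loss is sketched below; BriVL's MoCo-based variant additionally maintains a momentum encoder and a negative queue, which this sketch omits:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor,
                     txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings: matching
    pairs are pulled together, all other in-batch pairs act as negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))               # the diagonal holds the true pairs
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# usage with random stand-ins for a batch of 8 paired embeddings
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```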

A notable pretraining innovation in Croc is the introduction of a cross-modal comprehension stage involving dynamically learnable prompt token pools and the use of the Hungarian algorithm for optimal alignment of masked visual positions (Xie et al., 18 Oct 2024), together with mixed bidirectional–unidirectional attention masking for joint representation learning.
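A hedged sketch of such a matching step, assuming a cosine-similarity cost (Croc's exact cost function and inputs may differ):

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def match_prompts_to_masked_positions(prompt_emb: torch.Tensor,
                                      masked_emb: torch.Tensor):
    """One-to-one assignment of prompt tokens to masked visual positions via the
    Hungarian algorithm, using negative cosine similarity as the matching cost."""
    p = F.normalize(prompt_emb, dim=-1)                   # (P, d)
    m = F.normalize(masked_emb, dim=-1)                   # (M, d)
    cost = -(p @ m.t()).detach().cpu().numpy()            # higher similarity = lower cost
    row_idx, col_idx = linear_sum_assignment(cost)        # optimal bipartite matching
    return row_idx, col_idx                               # prompt row_idx[i] <-> position col_idx[i]

rows, cols = match_prompts_to_masked_positions(torch.randn(16, 512), torch.randn(16, 512))
```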

3. Alignment, Tuning, and Adaptation Mechanisms

Alignment across modalities is central for robust representation and generalization. Architectures employ varied mechanisms:

  • Cross-modal attention: Bidirectional attention between modalities (e.g., extracting context from both audio and text in emotion recognition (N, 2021)), hierarchical fusion (MuDPT’s deep bi-directional prompt fusion (Miao et al., 2023)), and specialized cross-modal adapters (e.g., DG-SCT for audio–visual tasks (Duan et al., 2023)).
  • Parameter-efficient adaptation: Prefix-tuning introduces learnable prefix tokens without altering model parameters, effectively preserving the pre-trained feature space (Kim et al., 29 Oct 2024); PT-PEFT further fuses this with LoRA or Adapters in a two-stage process, which SVD analysis shows avoids “rank collapse” (preserves the original basis vectors) and improves downstream performance.
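A simplified prefix-tuning sketch is shown below; full implementations inject prefixes after the per-layer key/value projections, whereas this toy version prepends them to the input of a frozen attention module:

```python
import torch
import torch.nn as nn

class PrefixSelfAttention(nn.Module):
    """Frozen self-attention with learnable prefix key/value tokens prepended,
    so adaptation happens without updating the pre-trained projections."""

    def __init__(self, d_model: int, n_heads: int, prefix_len: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        for p in self.attn.parameters():                   # pre-trained weights stay frozen
            p.requires_grad = False
        self.prefix_k = nn.Parameter(torch.randn(1, prefix_len, d_model) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(1, prefix_len, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, T, d_model)
        b = x.size(0)
        k = torch.cat([self.prefix_k.expand(b, -1, -1), x], dim=1)
        v = torch.cat([self.prefix_v.expand(b, -1, -1), x], dim=1)
        out, _ = self.attn(x, k, v)                         # queries attend to prefix + sequence
        return out

layer = PrefixSelfAttention(d_model=256, n_heads=8)
y = layer(torch.randn(2, 10, 256))                          # (2, 10, 256)
```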

Cost efficiency and modular assembly of capabilities across pre-trained variants are exemplified by SoupLM, which linearly interpolates (via global or modular learnable “soup” coefficients) the weights of models specialized for distinct domains (e.g., conversation, vision–language), producing an ensemble in “interpolation space” with minimal compute and no extra inference cost (Bai et al., 11 Jul 2024).
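A minimal sketch of the interpolation itself, using fixed global coefficients (SoupLM learns its soup coefficients rather than fixing them):

```python
import torch

def soup_weights(state_dicts, coefficients):
    """Linearly interpolate the parameters of same-architecture checkpoints with
    one global coefficient per model; per-module coefficients are a direct extension."""
    assert abs(sum(coefficients) - 1.0) < 1e-6
    souped = {}
    for name in state_dicts[0]:
        souped[name] = sum(c * sd[name].float() for c, sd in zip(coefficients, state_dicts))
    return souped

# e.g. blend a conversation-tuned and a vision-language-tuned checkpoint 60/40
# (model variable names are hypothetical):
# merged = soup_weights([chat_model.state_dict(), vl_model.state_dict()], [0.6, 0.4])
```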

4. Benchmark Performance and Applications

Large pre-trained multi-modal models have achieved or surpassed state-of-the-art performance on an extensive array of benchmarks. Examples include:

  • Vision–language understanding: InternVL3-78B attains 72.2 on MMMU, competitive with proprietary ChatGPT-4o and Claude 3.5 Sonnet (Zhu et al., 14 Apr 2025).
  • Zero-shot learning: Models employing feature extractors such as CLIP (visual), CLAP (audio), and their associated text encoders support zero- and generalized zero-shot classification via nearest-neighbor matching in a joint embedding space (sketched after this list), yielding substantial improvements over earlier architectures (e.g., a UCF-GZSL harmonic mean of 55.97% vs. ≤42.67% for the prior SOTA (Kurzendörfer et al., 9 Apr 2024)).
  • Domain-specific multi-modal reasoning: TransGPT-MM, trained for transportation, increases multi-modal task accuracy from 27.05% (VisualGLM-6B) to 67.21%, and delivers text-only gains in traffic engineering tests (Wang et al., 11 Feb 2024).
  • Clinical and real-time systems: The Rene framework combines a fine-tuned Whisper model for respiratory audio with EMR-based LightGBM outputs, achieving up to a 23% improvement in disease-detection F1/harmonic scores and supporting real-time inference on embedded systems (Zhang et al., 13 May 2024).
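A minimal sketch of the nearest-neighbor zero-shot setup referenced above, with random tensors standing in for the CLIP/CLAP and text-encoder embeddings:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(sample_emb: torch.Tensor,
                       class_text_emb: torch.Tensor) -> torch.Tensor:
    """Zero-shot classification by nearest neighbor in a shared embedding space:
    each sample embedding is assigned the class whose text embedding is most similar."""
    s = F.normalize(sample_emb, dim=-1)         # (N, d) image/audio embeddings
    c = F.normalize(class_text_emb, dim=-1)     # (C, d) class-name/prompt embeddings
    return (s @ c.t()).argmax(dim=-1)           # (N,) predicted class indices

preds = zero_shot_classify(torch.randn(4, 512), torch.randn(10, 512))
```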

Table: Representative Model Capabilities

| Model | Modalities | Key Innovations | Notable Benchmarks |
|---|---|---|---|
| M6 | Image, Text | Unified encoder–decoder, MoE | Image captioning, text-to-image |
| BriVL | Image, Text | MoCo-based contrast; weak correlation | Retrieval, captioning |
| InternVL3 | Image, Text, Video | Native multimodal pretraining, V2PE, MPO | MMMU (72.2) |
| SoupLM | Any (LLMs, MLLMs) | Weight interpolation (soup) | MMLU, MMMU, LLaVA-Bench |
| TransGPT-MM | Image, Text (transportation) | BLIP2-Qformer visual alignment | Traffic Q&A, signs |
| Rene | Audio, Text (EMR) | Whisper adaptation, Conformer/GRU, edge AI | SPRSound, ICBHI |

Models are evaluated in diverse scenarios: image/video captioning, VQA, event localization, next-item recommendation, action recognition, emotion and facial action detection, and generative content synthesis.

5. Open Challenges and Research Directions

Despite rapid progress, several technical challenges persist:

  • Heterogeneous modality alignment: Integrating modalities with varying structures and statistics (e.g., sensor time series, 3D, audio) demands novel architectures and loss designs (cf. Q-formers, variable positional encodings).
  • Scalable and efficient adaptation: Parameter and compute efficiency for domain transfer, efficient fine-tuning (LoRA, QLoRA, prefix-tuning), and incremental continual learning remain areas of active research (Han et al., 8 Oct 2024).
  • Preservation versus adaptation: There is ongoing investigation into how to maximize adaptation (task fit) without catastrophic degradation of the pre-trained representation space (cf. SVD-based “rank collapse” analysis in PT-PEFT (Kim et al., 29 Oct 2024)).
  • Cross-domain and open-world generalization: Developing training and evaluation regimes that extend beyond vision–language to rich multi-modal, multi-domain scenarios (as in UniM²Rec for recommendation, or InternVL3's unified autoregressive objective) remains an open area.
  • Open-source and reproducibility commitments: Recent models, e.g., Croc, SoupLM, and InternVL3, release both data and code to facilitate independent benchmarking and research extension, helping to standardize evaluation and lower reuse barriers.

Promising future avenues identified include more sophisticated prompt-based and instruction tuning schemes, unified multi-modal representation learning for novel modalities, fine-grained instance/region pretraining, and leveraging structured and unstructured external knowledge bases (Wang et al., 2023, Han et al., 8 Oct 2024, Zhu et al., 14 Apr 2025).

6. Impact and Significance

The shift toward native multi-modal pre-training (as in InternVL3 (Zhu et al., 14 Apr 2025))—as opposed to retrofitting existing LLMs—represents a convergence of strong linguistic and perceptual intelligence in a single model. This supports a unified semantic space for multi-modal generation and reasoning, reducing alignment complexity and improving performance across a range of tasks. Innovations such as V2PE and advanced preference optimization have led to models that are state-of-the-art on open and proprietary evaluation datasets, all while maintaining high efficiency and versatility.

These models serve as foundational systems for multi-modal AI, powering both general-purpose applications (retrieval, dialogue, recommendation) and specialized domains (medical, transportation, affective computing). Their open release fosters reproducibility and accelerates progress toward generalized intelligent agents capable of perceiving and reasoning about the world in richly multi-modal contexts.