Multilingual Multimodal Pre-training

Updated 13 April 2026

Multilingual multimodal pre-training is a learning paradigm that integrates visual, auditory, and textual data across languages to support tasks like image–text retrieval and document understanding.
This approach leverages machine translation, rigorous data filtering, and innovative fusion techniques to build culturally diverse and robust multimodal datasets.
Empirical results indicate significant gains in cross-lingual transfer and fairness, with state-of-the-art performance on benchmarks such as IGLUE, Multi30K, and audio–text retrieval tasks.

Multilingual multimodal pre-training refers to the joint learning of representations from multiple modalities (such as vision, speech, and text) across several natural languages. The aim is to develop models that can understand, align, and transfer semantic information among images, audio, and text in diverse linguistic contexts. This paradigm underpins advances in cross-lingual image–text retrieval, multilingual visual question answering, document understanding, and information retrieval involving both audio and text. Recent research has advanced the field through innovations in data augmentation, architecture, objective design, and efficient large-scale training.

1. Data Sourcing, Augmentation, and Quality Filtering

Robust multilingual multimodal pre-training requires sizable, high-quality multimodal corpora in many languages, yet existing datasets are overwhelmingly English-centric. A prevailing strategy is to use machine translation (MT) to convert large English image–caption or video–subtitle datasets into multiple target languages. The Translated Data for Multilingual Multimodal Learning (TD-MML) pipeline, for instance, machine-translates Conceptual Captions into 19 languages using M2M-100, then filters low-quality outputs with two metrics: the complement of the token-to-type ratio, which flags degenerate repetition, and source–target BLEU, which flags outputs that largely copy the source. Thresholding on these metrics typically preserves ≈95% of translations for high-resource languages (e.g., German, French), but only ≈55–85% for typologically distant or lower-resource languages (e.g., Tamil, Japanese) (Qiu et al., 2022).
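
The filtering step described above can be illustrated with a minimal sketch. It is not the TD-MML implementation: the repetition score, the clipped unigram overlap used here as a rough stand-in for BLEU, the whitespace tokenization, and both thresholds are assumptions chosen for illustration.

```python
from collections import Counter


def repetition_score(tokens):
    """1 - (unique tokens / total tokens); high values suggest degenerate repetition."""
    if not tokens:
        return 1.0
    return 1.0 - len(set(tokens)) / len(tokens)


def ngram_overlap(source_tokens, target_tokens, n=1):
    """Clipped n-gram precision of the target against the source.

    A simplified proxy for source-target BLEU (no brevity penalty, single n).
    """
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    src, tgt = ngrams(source_tokens), ngrams(target_tokens)
    total = sum(tgt.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, src[gram]) for gram, count in tgt.items())
    return matched / total


def keep_translation(source, target, max_repetition=0.5, max_copy=0.5):
    """Keep a translation unless it is repetitive or mostly copied from the source.

    Thresholds are hypothetical. Whitespace splitting is an assumption and is
    unsuitable for unsegmented scripts such as Japanese.
    """
    src, tgt = source.lower().split(), target.lower().split()
    if not tgt:
        return False
    return (repetition_score(tgt) <= max_repetition
            and ngram_overlap(src, tgt) <= max_copy)
```

Under these assumptions, a fluent German translation of an English caption passes, while an output that repeats one token or simply echoes the English source is discarded.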

Massive web crawls serve as the foundation for current practice, with language detection used to track linguistic diversity. The inclusion of translated non-English data and deliberate balancing (e.g., by mining and post-editing image–text pairs in Chinese and English for BM-6B) are reported to yield semantically richer and more culturally diverse corpora (Guo et al., 2024, Nguyen et al., 2024). Further, translation augmentation coupled with re-filtering (by cosine similarity between visual and text representations) improves robustness and image–text alignment in downstream tasks (Nguyen et al., 2024). In bilingual and multilingual large multimodal models (LMMs), controlled vocabulary expansion and targeted pretraining phases further facilitate cross-script and cross-language support (Shin et al., 2024).
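
As a rough illustration of similarity-based re-filtering, the sketch below scores each image–text pair by the cosine similarity of its two embeddings and keeps the highest-scoring fraction. The embeddings are assumed to come from some pretrained image and text encoders that are not shown, and the `keep_fraction` parameter and top-fraction rule are assumptions rather than the procedure of any cited paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def filter_top_fraction(pairs, keep_fraction=0.5):
    """Return ids of the best-aligned pairs.

    pairs: list of (pair_id, image_embedding, text_embedding) tuples.
    keep_fraction: hypothetical share of the pool to retain.
    """
    if not pairs:
        return []
    scored = [(cosine_similarity(img, txt), pid) for pid, img, txt in pairs]
    scored.sort(reverse=True)  # highest similarity first
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [pid for _, pid in scored[:n_keep]]
```

Because the score is computed after translation, a caption whose translation drifted away from the image content tends to fall below the cut even if the original English caption would have passed.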

2. Model Architectures and Fusion Strategies

State-of-the-art architectures for multilingual multimodal pre-training typically follow either a single-stream or a dual-encoder design. In the single-stream framework, image-region features and text tokens are fed jointly into a unified Transformer backbone initialized from a multilingual model (e.g., XLM-R), as instantiated in TD-MML and UC² (Qiu et al., 2022, Zhou et al., 2021). Multilingual tokenization (SentencePiece with 77k–250k subword units) allows unified modeling across scripts and languages (Ni et al., 2020, Zhou et al., 2021).

MAGNETO, as used in M²-Encoder, scales patch-based ViT encoders to multilingual image–text pairs, introducing shared cross-modal “VL” fusion layers for fine-grained alignment (Guo et al., 2024). Bimodal models such as CLASP leverage modality-specific encoders (e.g., frozen LaBSE/XLM-R for language; HuBERT or Wav2Vec2 plus spectrogram backbones for audio) and fuse via late or learnable gating mechanisms (Abootorabi et al., 2024).

For document and visually-rich information, LayoutXLM stacks spatial, textual, and visual embeddings within a single multilingual Transformer, augmented by spatial-aware self-attention to exploit layout (Xu et al., 2021). Lightweight plug-in adapters (“language acquisition modules”) that add <4 MB per language enable parameter-efficient expansion to new languages while the vision/text backbones remain frozen (Zhang et al., 2022). In generative multilingual LMMs, vocabulary expansion, language-specific pretraining, and instruction tuning are effective for scaling to script-diverse, multimodal prompt–response tasks (Shin et al., 2024).

3. Unified Losses and Multilingual Multimodal Objectives

Multilingual multimodal pre-training interleaves multiple objectives to jointly align vision and language across linguistic boundaries. Canonical objectives include:

Masked Language Modeling (MLM): Mask k% of tokens, predict from multimodal context, applied over multilingual corpora (Qiu et al., 2022, Zhou et al., 2021).
Masked (Visual) Region Modeling (MRM/MRTM/CMIM): Mask k% of image regions or patches, predict object classes or reconstruct features, supporting fine-grained visual grounding (Qiu et al., 2022, Zhou et al., 2021, Guo et al., 2024).
Image–Text Matching (ITM): Discriminate true (image, caption) pairs from negatives, usually via a binary (sigmoid) cross-entropy or InfoNCE loss (Qiu et al., 2022, Zhou et al., 2021).
Visual Translation Language Modeling (VTLM): Mask token spans in a pair of parallel captions (English and a target language L) and reconstruct both conditioned on the image, which is critical for cross-lingual alignment (Qiu et al., 2022, Zhou et al., 2021).
Multimodal Code-switched Training (MCT): Random substitution of non-English tokens into English captions, requiring the model to perform cross-modal, cross-lingual alignment (Ni et al., 2020).
Cross-modal Contrastive Loss (InfoNCE): Align pairs of (visual, text/audio) embeddings and repel mismatches, crucial in dual-encoder and retrieval-based pipelines (Guo et al., 2024, Abootorabi et al., 2024, Nguyen et al., 2024).
Auxiliary multimodal generative and matching tasks for document or sequence generation, as in CLIPTrans (Gupta et al., 2023).
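
Of the objectives listed above, the cross-modal contrastive loss is the easiest to state concretely. The following is a minimal pure-Python sketch of a symmetric InfoNCE loss over a batch of matched embedding pairs. The temperature value and the symmetric averaging of the two retrieval directions are common choices assumed here, not details confirmed for any particular cited model.

```python
import math


def _normalize(v):
    """L2-normalize a vector; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def _log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE; image_embs[i] and text_embs[i] form the positive pair.

    All other items in the batch act as negatives.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    image_to_text = -sum(_log_softmax(logits[k])[k] for k in range(n)) / n
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(_log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)
```

In a multilingual setting the text side of each pair may be a caption in any training language, so that captions in different languages describing the same image are pulled toward the same visual embedding.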

Optimization nearly always employs a multi-task loss formed as an equally or heuristically weighted sum of the individual objectives. The cited work reports that cross-lingual objectives such as VTLM and MCT are important for robust zero-shot cross-lingual transfer (Qiu et al., 2022, Zhou et al., 2021, Ni et al., 2020).
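
The code-switching idea behind MCT can be sketched as a simple data augmentation. The example assumes a word-level bilingual dictionary is available; the `bilingual_dict` structure, the `switch_prob` parameter, and the toy entries in the test data are hypothetical and are not taken from the M³P paper.

```python
import random


def code_switch(caption_tokens, bilingual_dict, switch_prob=0.5, rng=None):
    """Randomly replace English tokens with a translation from a dictionary.

    caption_tokens: list of English tokens.
    bilingual_dict: maps a lowercased English word to a list of translations
        (possibly from several target languages).
    switch_prob: hypothetical per-token replacement probability.
    rng: optional random.Random instance, for reproducibility.
    """
    rng = rng or random.Random()
    switched = []
    for token in caption_tokens:
        translations = bilingual_dict.get(token.lower())
        if translations and rng.random() < switch_prob:
            switched.append(rng.choice(translations))
        else:
            switched.append(token)  # no entry, or not selected for switching
    return switched
```

The augmented caption stays paired with the original image, so the model has to ground the substituted non-English tokens in the same visual content as the English words they replace.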

4. Evaluation Protocols and Quantitative Outcomes

Downstream evaluation spans cross-lingual image–text and video–text retrieval (Recall@K), Visual QA and entailment (accuracy), multimodal document understanding (entity/relation F1), and audio–text retrieval (HITS@1, MRR, meanR). Major benchmarks include IGLUE (XVNLI, xGQA, MaRVL, xFlickr30K, WIT), Multi30K, MSCOCO, XFUND, and various in-house image–text and audio–text tasks (Qiu et al., 2022, Xu et al., 2021, Nguyen et al., 2024, Abootorabi et al., 2024).
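
For reference, the retrieval metrics named above can be computed from a query-by-candidate score matrix as in the sketch below. It assumes exactly one correct candidate per query, in which case HITS@1 coincides with Recall@1; the function names and the optimistic handling of tied scores are illustrative choices.

```python
def rank_of_gold(scores, gold_index):
    """1-based rank of the gold candidate when candidates are sorted by descending score.

    Ties are resolved optimistically (only strictly higher scores count).
    """
    gold_score = scores[gold_index]
    return 1 + sum(1 for s in scores if s > gold_score)


def retrieval_metrics(score_matrix, gold_indices, ks=(1, 5, 10)):
    """Recall@K, mean reciprocal rank, and mean rank over a set of queries.

    score_matrix[q][c]: similarity of query q to candidate c.
    gold_indices[q]: index of the single correct candidate for query q.
    """
    ranks = [rank_of_gold(row, g) for row, g in zip(score_matrix, gold_indices)]
    n = len(ranks)
    metrics = {f"recall@{k}": sum(r <= k for r in ranks) / n for k in ks}
    metrics["mrr"] = sum(1.0 / r for r in ranks) / n
    metrics["mean_rank"] = sum(ranks) / n
    return metrics
```

Reporting these metrics separately for each query language, rather than pooled, makes cross-lingual gaps visible.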

TD-MML reports state-of-the-art zero-shot transfer on non-English IGLUE tasks: 64.84% on XVNLI (vs. 58.48% for xUNITER), 35.95% on xGQA (vs. 21.72%), and 59.67% on MaRVL (vs. 54.59%) (Qiu et al., 2022). In retrieval, M³P and UC² raise mean Recall on Multi30K and MSCOCO for non-English languages when fine-tuned multilingually (Ni et al., 2020, Zhou et al., 2021). M²-Encoder, trained on 6B bilingual image–text pairs, sets zero-shot benchmarks on ImageNet-CN (80.7%, +21.1% over prior SoTA) and on Flickr30K/COCO in both English and Chinese (Guo et al., 2024). CLASP reports state-of-the-art audio–text retrieval, surpassing ASR-based pipelines (HITS@1: 0.94 on mixed-language data; ≥0.79 on French/German/Persian) despite training on just 130 h of data (Abootorabi et al., 2024).

Ablations consistently show that removing cross-lingual or multimodal tasks (VTLM, MCT) sharply reduces non-English performance. Quality filtering and training-efficiency techniques (e.g., translation filtering, grouped aggregation for the contrastive loss) are essential for maintaining performance as datasets grow large (Guo et al., 2024, Nguyen et al., 2024). Instruction tuning and vocabulary expansion enable competitive or superior bilingual generative performance compared to English-only approaches (Shin et al., 2024).

5. Analysis of Fairness, Bias, and Coverage

Empirical studies reveal that while multilingual alignment in embedding space (individual fairness) can be achieved, cross-lingual group fairness is not generally satisfied: there are substantial disparities in zero-shot task performance (Top-1 matching, classification accuracy) between languages, especially in underrepresented or morphologically rich languages (Wang et al., 2021). For instance, even with tightly aligned representations between English and German captions, downstream retrieval accuracy differed by 4.8–22.9% depending on whether captions were literal translations or independent descriptions.

Furthermore, multimodal models can reproduce or amplify underlying social biases present in the MT, web-crawled data, or vision backbones, including race, gender, and age disparities in classification outputs (Wang et al., 2021). The use of diverse multilingual data and deliberate geographic/racial balancing in dataset construction partially mitigates these issues and enhances fairness, with models trained on translated non-English data outperforming English-only models on tasks such as GeoDE, especially in regions like Africa (Nguyen et al., 2024).

6. Limitations, Challenges, and Future Directions

Current limitations include strong dependence on high-quality MT; translation artifacts or errors can propagate noise, bias, or “translationese” style into the learned representations (Qiu et al., 2022, Nguyen et al., 2024). “Translate-only” strategies center English semantic priors, leaving concepts absent from English ungrounded; mining native non-English captions, multimodal instruction following datasets, or speech–text pairs is a needed complement (Qiu et al., 2022, Shin et al., 2024, Abootorabi et al., 2024).

Models scaling to truly universal coverage (≥5 languages) must overcome linguistic and computational bottlenecks, including long-tail language scarcity, token/vocabulary expansion, and parameter/sharding limits in distributed contrastive training (Guo et al., 2024). Adaptive filtering, efficient language-adapter modules, and dynamic data curation are proposed as solutions (Guo et al., 2024, Zhang et al., 2022).

Proposed directions for future work include adaptive capacity allocation (Mixture-of-Experts), stronger joint denoising objectives across modalities (vision, text, speech), and fine-grained fairness auditing. Integrating parallel multimodal data sources, leveraging multilingual instruction tuning, and enhancing alignment via cross-lingual, contrastive, and generative objectives remain active research directions (Nguyen et al., 2024, Ni et al., 2020, Shin et al., 2024).

References

"Multilingual Multimodal Learning with Machine Translated Text" (Qiu et al., 2022)
"M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining" (Guo et al., 2024)
"M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training" (Ni et al., 2020)
"Assessing Multilingual Fairness in Pre-trained Multimodal Representations" (Wang et al., 2021)
"CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval" (Abootorabi et al., 2024)
"Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages" (Karoui et al., 2023)
"Multilingual Diversity Improves Vision-Language Representations" (Nguyen et al., 2024)
"X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment" (Shin et al., 2024)
"CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation" (Gupta et al., 2023)
"Generalizing Multimodal Pre-training into Multilingual via Language Acquisition" (Zhang et al., 2022)
"UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training" (Zhou et al., 2021)
"LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding" (Xu et al., 2021)
"Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models" (Huang et al., 2021)