M-MiniGPT4: Multilingual VLLM Alignment via Translated Data

Published 31 Mar 2026 in cs.CL and cs.AI | (2603.29467v1)

Abstract: This paper presents a Multilingual Vision LLM, named M-MiniGPT4. Our model exhibits strong vision-language understanding (VLU) capabilities across 11 languages. We utilize a mixture of native multilingual and translated data to push the multilingual VLU performance of the MiniGPT4 architecture. In addition, we propose a multilingual alignment training stage that uses parallel text corpora to further enhance the multilingual capabilities of our model. M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class, including foundation models released after the majority of this work was completed. We open-source our models, code, and translated datasets to facilitate future research in low-resource and multilingual settings.

Summary

  • The paper presents a multilingual extension of VLLMs that leverages translated data, achieving 36% accuracy on the multilingual MMMU benchmark.
  • It employs a three-stage training pipeline that combines image-caption alignment with translated and parallel corpora for robust cross-lingual reasoning.
  • Results demonstrate significant improvements over SOTA models in multilingual contexts while maintaining strong English performance.

M-MiniGPT4: Advancing Multilingual Vision-Language Large Models via Translated Alignment

Motivation and Goals

The paper "M-MiniGPT4: Multilingual VLLM Alignment via Translated Data" (2603.29467) addresses the significant monolingual bias—particularly towards English—present in existing Vision-Language Large Models (VLLMs). With historically English-centric datasets and alignments, much of the world's population is not adequately served by recent VLLM advances. This work systematically extends VLLM architecture to the multilingual setting, explicitly targeting both high- and low-resource languages through synthetic translations and alignment with parallel corpora.

Training Strategy and Data Construction

The M-MiniGPT4 framework refactors the original MiniGPT4 pipeline by replacing the English-dominant Vicuna LLM with Llama 3, which offers stronger intrinsic multilingual capabilities. The architecture retains the classical vision-language alignment paradigm but expands it across 11 languages: Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, Japanese, Korean, and English.
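
To make the architecture concrete, the following is a minimal PyTorch sketch of a MiniGPT4-style model with the language backbone swapped for a multilingual LLM. The class name, dimensions, and single linear projection are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a MiniGPT4-style VLLM with a swappable multilingual
# language backbone. Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class MiniGPT4Style(nn.Module):
    def __init__(self, vision_encoder, llm, vis_dim=1408, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder.eval()
        for p in self.vision_encoder.parameters():  # vision tower stays frozen
            p.requires_grad = False
        self.proj = nn.Linear(vis_dim, llm_dim)     # maps patch features to LLM space
        self.llm = llm                              # e.g. a Llama 3 checkpoint

    def forward(self, images, text_embeds, labels):
        with torch.no_grad():
            feats = self.vision_encoder(images)               # (B, N, vis_dim)
        vis_tokens = self.proj(feats)                         # (B, N, llm_dim)
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)  # image tokens first
        # Mask out image positions so the LM loss only covers text tokens.
        pad = torch.full(vis_tokens.shape[:2], -100,
                         dtype=labels.dtype, device=labels.device)
        return self.llm(inputs_embeds=inputs,
                        labels=torch.cat([pad, labels], dim=1)).loss
```

Because the vision tower is frozen and the interface is a single projection, the LLM can in principle be swapped (here, Vicuna for Llama 3) without retraining the encoder, which is the plug-and-play property the paper relies on.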

The three-stage training pipeline is meticulously tuned for multilinguality:

  1. Stage 1: Vision-Language Modality Alignment: This initial alignment uses standard image-caption datasets (Conceptual Captions, SBU, LAION). Empirical analysis indicates that merely expanding pretraining at this stage does not significantly affect downstream multilingual reasoning.
  2. Stage 2: Multilingual Multimodal Supervised Tuning: The authors introduce NLLB-translated versions of datasets such as LLaVA-Instruct and Cambrian Image (CI), alongside the pre-existing PALO dataset, enabling parallel exposure across all target languages (see the data-construction sketch after this list).
  3. Stage 3: Multilingual Alignment Supervision: In this final stage, the model incorporates parallel text corpora such as Flores and XStoryCloze together with translated and native text-only Cambrian Text (CT). This is shown to further strengthen cross-lingual VLU by exploiting the semantic consistency of carefully constructed translation pairs.
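
As a rough illustration of the Stage 2 and Stage 3 data construction, the sketch below machine-translates English instruction data with NLLB and loads parallel sentence pairs from Flores via Hugging Face. The checkpoint, dataset IDs, language codes, and field names are assumptions for illustration and may differ from the authors' exact setup.

```python
# Sketch of translated-data construction (Stage 2) and parallel-pair loading
# (Stage 3). Checkpoint, dataset IDs, and field names are assumptions.
from datasets import load_dataset
from transformers import pipeline

# NLLB-200 language codes for a subset of the 11 target languages.
TARGETS = ["hin_Deva", "zho_Hans", "arb_Arab"]

translator = pipeline("translation", model="facebook/nllb-200-distilled-600M")

def translate_sample(sample, tgt):
    """Translate one English instruction/response pair into a target language."""
    def tr(text):
        return translator(text, src_lang="eng_Latn", tgt_lang=tgt,
                          max_length=512)[0]["translation_text"]
    return {"instruction": tr(sample["instruction"]),
            "response": tr(sample["response"]),
            "lang": tgt}

sample = {"instruction": "Describe the image in detail.",
          "response": "A red bus is parked next to a fruit stall."}
multilingual_sft = [translate_sample(sample, tgt) for tgt in TARGETS]

# Stage 3: parallel text pairs, e.g. English-Hindi sentences from Flores.
flores = load_dataset("facebook/flores", "eng_Latn-hin_Deva", split="dev")
pairs = [(row["sentence_eng_Latn"], row["sentence_hin_Deva"]) for row in flores]
```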

For evaluation, the authors also generate a multilingual variant of the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark via translation and confirm translation quality via back-translation validation tests, ensuring minimal informational degradation.
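
A simple way to implement such a back-translation check is sketched below, under the assumption that NLLB is also used for validation and that chrF is an acceptable fidelity proxy; both are illustrative choices, not necessarily the authors'.

```python
# Back-translation validation sketch: translate en -> target -> en and
# score the round trip against the source. The threshold is an assumption.
from sacrebleu.metrics import CHRF
from transformers import pipeline

MODEL = "facebook/nllb-200-distilled-600M"
fwd = pipeline("translation", model=MODEL, src_lang="eng_Latn", tgt_lang="hin_Deva")
bwd = pipeline("translation", model=MODEL, src_lang="hin_Deva", tgt_lang="eng_Latn")
chrf = CHRF()

def round_trip_score(text: str) -> float:
    translated = fwd(text, max_length=512)[0]["translation_text"]
    back = bwd(translated, max_length=512)[0]["translation_text"]
    # chrF ranges 0-100; higher means the round trip preserved more content.
    return chrf.sentence_score(back, [text]).score

question = "Which organ is highlighted in the MRI scan?"
if round_trip_score(question) < 60.0:   # illustrative threshold
    print("Low-fidelity translation; flag this item for manual review.")
```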

Experimental Results

Empirical results confirm the efficacy of translated and parallel data augmentation in boosting multilingual VLLM performance. M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art (SOTA) VLLMs of comparable parameter count. The model shows particularly pronounced improvements over PALO (from 13.12% to 33.45% on MMMU Multi), confirming that both the scale and the diversity of training data enabled by the new translation pipeline materially improve multilingual vision-language reasoning.

Relative to Qwen-VL 2.5, a strong open-source competitor, M-MiniGPT4 is superior in multilingual test settings (33.45% vs. 25.46% on MMMU Multi). However, Qwen-VL 2.5 maintains an edge on English-dominant benchmarks, attributed to its larger-scale instruction tuning and greater English-centric data diversity during pretraining.

Key ablation studies reveal that:

  • Incorporation of translated Cambrian Image (CI M) consistently boosts reasoning accuracy across all languages.
  • Combining translated multimodal data (CI M) with parallel text data (MText, CT M) in the later training stages yields the best cross-lingual performance.
  • The use of translated vision-language datasets produces no significant regression on core English tasks.

Implications and Limitations

By open-sourcing both the multilingual datasets and the trained models, this work provides a foundation for further research into low-resource and cross-cultural VLLM benchmarks and downstream tasks. The demonstrated plug-and-play compatibility with advances in text-only LLMs and visual encoders further amplifies the community value of this approach.

However, several limitations are identified:

  • Translation Nuance: Machine-translated data is known to miss cultural and idiomatic subtleties, likely introducing both fidelity loss and bias, especially for lower-resource language pairs.
  • Language Scope: Coverage is restricted to 11 languages. The long tail of linguistic diversity, including morphologically rich, low-resource, and non-Latin-script languages, remains a challenge.
  • Bias Inheritance: Biases in the source LLM and vision model are not specifically mitigated, and errors are likely to propagate through translated, non-parallel data.
  • Evaluation Scope: The benchmarks used offer strong coverage of reasoning, but further work is necessary to probe for domain-specific and culturally loaded VLU phenomena.

Prospects for Future Research

The methodology demonstrates that translation-augmented and parallel-aligned multitask training pipelines are a feasible and effective way to scale VLLMs to broader populations. Immediate next steps should include constructing and curating natively multilingual multimodal corpora, developing nuanced multilingual and cross-cultural evaluation protocols, and placing greater emphasis on detecting and mitigating learned biases. There is also significant opportunity for compositional benchmarking of code-switched and mixed-language vision-language tasks.

Conclusion

M-MiniGPT4 (2603.29467) represents an empirical advance in developing robust, plug-and-play multilingual VLLMs, showing strong numerical gains in cross-lingual multimodal reasoning via a judicious mix of translation-centric and parallel alignment training. The results validate the extension of LLM-driven multimodal frameworks to multilingual scenarios, setting a precedent for further developments in inclusive and equitable AI. The released resources should catalyze research in less-represented languages and advance evaluation methodology for multilingual vision-language tasks.
