
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages (2410.16153v2)

Published 21 Oct 2024 in cs.CL and cs.CV

Abstract: Despite recent advances in multimodal LLMs (MLLMs), their development has predominantly focused on English- and western-centric datasets and tasks, leaving most of the world's languages and diverse cultural contexts underrepresented. This paper introduces Pangea, a multilingual multimodal LLM trained on PangeaIns, a diverse 6M instruction dataset spanning 39 languages. PangeaIns features: 1) high-quality English instructions, 2) carefully machine-translated instructions, and 3) culturally relevant multimodal tasks to ensure cross-cultural coverage. To rigorously assess models' capabilities, we introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets covering 47 languages. Results show that Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts. Ablation studies further reveal the importance of English data proportions, language popularity, and the number of multimodal training samples on overall performance. We fully open-source our data, code, and trained checkpoints, to facilitate the development of inclusive and robust multilingual MLLMs, promoting equity and accessibility across a broader linguistic and cultural spectrum.

Authors (10)
  1. Xiang Yue (72 papers)
  2. Yueqi Song (11 papers)
  3. Akari Asai (35 papers)
  4. Seungone Kim (34 papers)
  5. Jean de Dieu Nyandwi (2 papers)
  6. Simran Khanuja (19 papers)
  7. Anjali Kantharuban (4 papers)
  8. Lintang Sutawika (14 papers)
  9. Sathyanarayanan Ramamoorthy (4 papers)
  10. Graham Neubig (342 papers)
Citations (3)

Summary

Overview of "Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages"

The paper "Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages" introduces Pangea, a novel multilingual multimodal LLM that represents a significant advancement in the development of inclusive and culturally diverse AI systems. Acknowledging the limitations of existing multimodal LLMs, which predominantly focus on Western-centric datasets, this research proposes a comprehensive approach to address data and cultural diversity gaps in LLM training and evaluation.

Pangea is trained on a newly curated dataset, PangeaIns, comprising 6 million instruction samples across 39 languages. The paper also presents PangeaBench, an evaluation suite designed to rigorously test the capabilities of multilingual models across a variety of languages and tasks. The results showcase Pangea's superior performance in multilingual and culturally diverse settings, significantly surpassing existing state-of-the-art open-source models.

Dataset Construction

The researchers developed PangeaIns by utilizing three key strategies to ensure linguistic and cultural diversity:

  1. Machine-Translated Instructions: This strategy extended high-quality English instructions to other languages using a machine-translation model (Gemini 1.5 Pro), followed by post-processing to resolve formatting inconsistencies; a pipeline sketch appears after this list.
  2. Multicultural Understanding Instructions: Recognizing that merely translating Western-centric datasets would not suffice, the authors curated a dataset of culturally diverse images using LAION-Multi. They then generated culturally relevant captions and instructions for these images, ensuring that cultural nuances were adequately captured.
  3. Existing Multilingual Instructions: By incorporating several open-source multilingual datasets, PangeaIns has enriched its coverage, further promoting linguistic and cultural diversity.
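To make the first strategy concrete, here is a minimal sketch of a translate-then-post-process loop. The prompt wording, the `call_gemini` stub, and the specific consistency checks are illustrative assumptions, not the authors' released pipeline; the paper only specifies that Gemini 1.5 Pro performed the translation and that post-processing resolved inconsistencies.

```python
# Sketch of the "translate then post-process" recipe (strategy 1).
# `call_gemini` stands in for a real Gemini 1.5 Pro client; the prompt
# and the consistency fixes below are assumptions for illustration.

import re

def call_gemini(prompt: str) -> str:
    raise NotImplementedError("plug in your Gemini 1.5 Pro client here")

TRANSLATE_PROMPT = (
    "Translate the following instruction-response pair into {lang}. "
    "Keep code, URLs, and placeholders such as <image> unchanged.\n\n{text}"
)

def translate_instruction(sample: dict, lang: str) -> dict:
    text = f"Instruction: {sample['instruction']}\nResponse: {sample['response']}"
    raw = call_gemini(TRANSLATE_PROMPT.format(lang=lang, text=text))
    return postprocess(raw, sample)

def postprocess(raw: str, original: dict) -> dict:
    # Example consistency fixes: the instruction/response split must still
    # be recoverable, and the <image> token must survive translation.
    m = re.search(r"Instruction:\s*(.*?)\nResponse:\s*(.*)", raw, re.S)
    if m is None:
        return original  # fall back to the source sample on parse failure
    instruction, response = m.group(1).strip(), m.group(2).strip()
    if "<image>" in original["instruction"] and "<image>" not in instruction:
        instruction = "<image>\n" + instruction
    return {"instruction": instruction, "response": response}
```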

Evaluation with PangeaBench

To assess Pangea's performance, the authors introduce PangeaBench, a suite of 14 datasets spanning both multimodal and text-only tasks, such as Multimodal Chat, Captioning, and Cultural Understanding, across 47 languages. Notably, the inclusion of benchmarks like xChatBench enables precise evaluation of open-ended multimodal interactions and understanding of diverse cultural contexts. A minimal harness sketch follows.
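Evaluating against a suite like this reduces to a loop over (dataset, language) pairs with per-task metrics. The sketch below assumes hypothetical `load_examples` / `model_answer` / `score` callables and is not the released PangeaBench code; it only illustrates the aggregation structure.

```python
# Minimal evaluation-harness sketch for a PangeaBench-style suite:
# iterate over (dataset, language) pairs, score each, and report
# per-dataset, per-language, and overall macro-averages.

from collections import defaultdict
from statistics import mean
from typing import Callable

def evaluate_suite(
    suite: dict[str, list[str]],                # dataset name -> languages covered
    load_examples: Callable[[str, str], list],  # (dataset, lang) -> examples
    model_answer: Callable[[dict], str],        # example -> model prediction
    score: Callable[[str, dict], float],        # (prediction, example) -> 0..1
) -> dict:
    per_language = defaultdict(list)
    per_dataset = {}
    for dataset, languages in suite.items():
        dataset_scores = []
        for lang in languages:
            examples = load_examples(dataset, lang)
            accs = [score(model_answer(ex), ex) for ex in examples]
            per_language[lang].append(mean(accs))
            dataset_scores.append(mean(accs))
        per_dataset[dataset] = mean(dataset_scores)
    return {
        "per_dataset": per_dataset,
        "per_language": {l: mean(v) for l, v in per_language.items()},
        "overall": mean(per_dataset.values()),
    }
```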

Experimental Results

The results demonstrate Pangea's strong performance in both English and multilingual settings: it outperforms the best existing open-source models by 7.3 points on English tasks and by 10.8 points on multilingual tasks. Compared with proprietary models such as GPT-4o, however, there remains room for improvement, particularly in complex reasoning scenarios.

Implications and Future Directions

Pangea facilitates the development of LLMs that are not only proficient in multiple languages but also culturally inclusive. The open-sourcing of Pangea, along with its data and evaluation benchmarks, provides a valuable resource for further research in multilingual AI.
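Because the checkpoints are open-sourced, the model can in principle be loaded with standard Hugging Face tooling. The repo id below (`neulab/Pangea-7B-hf`), the LLaVA-style classes, and the prompt format are assumptions based on the release being a LLaVA-style MLLM; consult the released model card for the authoritative loading recipe.

```python
# Hedged sketch: loading an open Pangea checkpoint with transformers.
# The repo id and the LlavaNext* classes are assumptions, not confirmed
# by the paper text; check the official model card before use.

import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "neulab/Pangea-7B-hf"  # assumed transformers-compatible export
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg")
prompt = "<image>\nDescribe this image in Swahili."  # prompt format assumed
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```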

Looking forward, the paper identifies challenges such as improving support for low-resource languages and enhancing multilingual OCR capabilities. Addressing these challenges could lead to more equitable AI systems that break free from linguistic and cultural biases prevalent in current technologies.

In conclusion, this research offers meaningful contributions to the field of multilingual AI by establishing new benchmarks and methodologies that prioritize cultural and linguistic diversity. As the community builds upon these foundations, we can anticipate further innovations that promote comprehensive, inclusive global AI solutions.
