- The paper introduces Maya, a multilingual multimodal model that bridges language gaps in vision-language tasks using a curated image-text dataset across eight languages.
- The paper applies careful prompt engineering and toxicity filtering to produce a culturally sensitive, bias-reduced training corpus.
- The paper demonstrates superior performance on multilingual benchmarks compared to English-dominant models, underscoring its robust cross-lingual capabilities.
Overview of "Maya: An Instruction Finetuned Multilingual Multimodal Model"
The paper presents Maya, an approach to extending Vision-Language Models (VLMs) with multilingual and multimodal capabilities. Existing VLMs, despite strong benchmark results, predominantly cater to high-resource languages, limiting cultural comprehension and language diversity in vision-language tasks. Maya aims to bridge these gaps with an open-source model that supports eight languages and incorporates cultural sensitivity throughout its training pipeline.
Key contributions of the research include the development of:
- A multilingual image-text pretraining dataset comprising 558,000 image-text pairs across eight languages, structured to enhance vision-language understanding.
- A toxicity-filtered version of this dataset, ensuring the training data is safe and culturally respectful.
- A multilingual VLM (mVLM) that outperforms previous models such as PALO-7B on multilingual benchmarks, offering a viable alternative to English-dominant models.
Detailed Analysis
The necessity for Maya arises from the limitations of current VLM datasets, which exist primarily in English, hindering cross-linguistic applicability and propagating biases. The model leverages a redesigned dataset based on the LLaVA pretraining set, enhanced through rigorous toxicity analysis and translation into the target languages. Toxic content is identified and removed using tools such as LLaVAGuard and Toxic-BERT, supporting the training of more equitable AI systems.
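To make the filtering step concrete, the sketch below shows how a Toxic-BERT-style classifier can screen captions before training. The `unitary/toxic-bert` checkpoint, the 0.5 threshold, and the record layout are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of text-side toxicity filtering with Toxic-BERT.
# The threshold and record layout are illustrative assumptions,
# not the paper's exact setup.
from transformers import pipeline

toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def keep_record(record: dict, threshold: float = 0.5) -> bool:
    """Drop image-text pairs whose caption scores as toxic."""
    result = toxicity(record["caption"])[0]
    return not (result["label"] == "toxic" and result["score"] >= threshold)

dataset = [
    {"image": "0001.jpg", "caption": "A family cooking dinner together."},
    {"image": "0002.jpg", "caption": "A crowded street market at dusk."},
]
clean = [r for r in dataset if keep_record(r)]
print(f"kept {len(clean)} of {len(dataset)} records")
```

A complementary image-side pass (the role LLaVAGuard plays in the paper) would apply the same keep/drop logic to the images themselves.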
The methodology combines robust prompt engineering with a hybrid translation strategy to ensure linguistic balance and data quality across languages. The pipeline encompasses:
- Careful prompt creation, with BERT-based models used to verify translation quality (illustrated in the sketch after this list).
- A scalable dataset generation approach using Aya 35B for high-quality, cross-lingual data.
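As referenced above, here is a minimal sketch of BERT-based translation verification: embed the English source and its translation with a multilingual encoder and keep pairs whose cosine similarity clears a threshold. LaBSE and the 0.8 cutoff are assumptions for illustration; the paper's exact verifier and threshold may differ.

```python
# Sketch of BERT-based translation verification: embed source and
# translation with a multilingual encoder and keep pairs whose
# cosine similarity clears a threshold. LaBSE and the 0.8 cutoff
# are illustrative assumptions, not the paper's exact choices.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def translation_ok(source_en: str, translated: str, threshold: float = 0.8) -> bool:
    """Accept a translation if its embedding stays close to the source's."""
    embeddings = encoder.encode([source_en, translated], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return similarity >= threshold

print(translation_ok("A dog runs across the beach.",
                     "Un perro corre por la playa."))  # expected: True
```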
Maya's architecture builds on LLaVA 1.5, pairing the Aya-23 8B model for multilingual text processing with SigLIP for visual understanding. This combination lets Maya attend to cultural subtleties and nuanced reasoning tasks, and it performs well on multilingual evaluations such as LLaVA-Bench-In-The-Wild. On key benchmarks, Maya matches or exceeds existing models in its parameter-size class, particularly in language-diverse scenarios.
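The sketch below illustrates this LLaVA-1.5-style coupling: a small MLP projector maps frozen SigLIP patch features into the language model's embedding space, where they are prepended to the text tokens. The dimensions (1152-d SigLIP features, 4096-d hidden size, 729 patches) and the two-layer projector are plausible assumptions, not Maya's published configuration.

```python
# Schematic of a LLaVA-1.5-style multimodal bridge: a two-layer MLP
# projects frozen vision-encoder features into the language model's
# token-embedding space. Dimensions are illustrative assumptions,
# not Maya's exact configuration.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1152, lm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.mlp(patch_features)

projector = VisionProjector()
image_patches = torch.randn(1, 729, 1152)   # SigLIP-style patch features
visual_tokens = projector(image_patches)    # projected into the LM space
text_embeds = torch.randn(1, 32, 4096)      # embedded prompt tokens
lm_input = torch.cat([visual_tokens, text_embeds], dim=1)
print(lm_input.shape)  # torch.Size([1, 761, 4096])
```

Keeping the vision encoder frozen and training only the projector (and later the LM) is the standard LLaVA recipe for aligning the two modalities cheaply.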
Maya leads in five of the eight languages it supports, with notably strong results in Arabic, suggesting that effective translation can capture language-specific nuances. The comparisons also indicate that removing toxic content does not degrade performance, as seen in benchmarks across visual question-answering and text-generation tasks.
Future Directions
The implications of this research extend toward building multilingual datasets that reflect cultural heterogeneity and reduce representation disparities. Future work will expand Maya to a broader set of languages and enrich its pretraining dataset accordingly. The authors also point to alternative projection matrices and unfreezing additional layers during fine-tuning as ways to improve cross-modal alignment across diverse linguistic domains.
The paper sets the stage for future advances in VLMs, advocating wider language coverage and stronger safety standards for AI datasets. This positions Maya as a useful resource for researchers aiming to improve AI inclusivity while preserving linguistic integrity and safety.