Behind Maya: Building a Multilingual Vision Language Model (2505.08910v2)

Published 13 May 2025 in cs.CV and cs.CL

Abstract: In recent times, we have seen a rapid development of large Vision-Language Models (VLMs). They have shown impressive results on academic benchmarks, primarily in widely spoken languages but lack performance on low-resource languages and varied cultural contexts. To address these limitations, we introduce Maya, an open-source Multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at https://github.com/nahidalam/maya.

Summary

  • The paper introduces Maya, an open-source multilingual VLM that tackles English-centric biases by using a newly created 4.4M sample, eight-language dataset.
  • It employs a hybrid translation and prompt engineering pipeline to convert 550K English samples into high-quality multilingual data across diverse cultural contexts.
  • Maya demonstrates competitive performance on multilingual benchmarks, outperforming similar 7B models and nearing 13B models on various vision-language tasks.

The paper "Behind Maya: Building a Multilingual Vision LLM" (2505.08910) introduces Maya, an open-source Multilingual Vision-LLM (VLM) designed to address the performance limitations of existing VLMs on low-resource languages and diverse cultural contexts. The authors highlight that most VLMs are trained on English-centric datasets, leading to poor generalization to other languages and cultures. Maya aims to bridge this gap by providing capabilities in eight languages.

The core contributions of the paper are:

  1. A new multilingual image-text pretraining dataset: This dataset is based on the original LLaVA pretraining dataset but expanded to include samples in Chinese, French, Spanish, Russian, Hindi, Japanese, and Arabic, in addition to English, totaling 4.4 million samples equally distributed across the eight languages.
  2. A multilingual VLM (Maya): Trained on this new dataset, Maya demonstrates improved performance in understanding cultural and linguistic nuances in vision-language tasks compared to previous models like PALO-7B on multilingual benchmarks.

Practical Implementation Details: Dataset Creation

Creating the multilingual dataset was a key step. The authors built upon the 550K English samples from the LLaVA pretraining dataset. Their approach used a hybrid translation pipeline that combines several LLMs: Claude, Gemini, and GPT-4o for review and verification, with Aya 35B serving as the translation model.

The process is depicted in Figure 1 and involves:

  • Sampling and Initial Translation: Diverse English samples were selected and initially translated into seven target languages (Chinese, French, Spanish, Russian, Hindi, Japanese, Arabic) using Google Translate.
  • Cascaded Translation and Verification: The translated samples underwent back-translation into English. Human review and verification with LLMs (Claude, Gemini, GPT) were used to ensure translation quality and generate prompt engineering data.
  • Prompt Engineering: The authors evaluated different prompt templates for translation quality using BLEU and N-gram scores across the languages. They identified a specific template, "Preamble 6," as consistently producing the highest scores and adopted it for large-scale translation. This template provides specific instructions and examples to the translation model (Aya 35B).
  • Scalable Dataset Generation: A batch processing pipeline utilizing the Aya 35B API was implemented to efficiently translate the full 550K LLaVA samples into each of the seven target languages. This pipeline included intermediate translation checkpointing, error handling, and logging for quality tracking and reproducibility. This resulted in the final 4.4 million sample dataset.
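
As a rough illustration of what such a batch pipeline can look like, the sketch below translates the English captions language by language with checkpointing, error handling, and logging. The translate_with_aya wrapper, the file names, and the JSONL checkpoint layout are all assumptions for illustration; the authors' actual client code, the "Preamble 6" prompt, and their data format are not reproduced here.

```python
import json
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("maya-translation")

# Seven target languages in addition to the English source data
TARGET_LANGUAGES = ["zh", "fr", "es", "ru", "hi", "ja", "ar"]
CHECKPOINT_DIR = Path("checkpoints")  # hypothetical layout: one JSONL file per language


def translate_with_aya(text: str, target_lang: str) -> str:
    """Hypothetical wrapper around the Aya 35B API using the chosen prompt
    template; the real client call and preamble are not shown here."""
    raise NotImplementedError


def load_done_ids(path: Path) -> set[str]:
    """Resume support: collect ids of samples already written to the checkpoint file."""
    if not path.exists():
        return set()
    with path.open() as f:
        return {json.loads(line)["id"] for line in f}


def translate_split(samples: list[dict], lang: str) -> None:
    """Translate one language split with intermediate checkpointing and error logging."""
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    ckpt = CHECKPOINT_DIR / f"llava_pretrain_{lang}.jsonl"
    done = load_done_ids(ckpt)
    with ckpt.open("a", encoding="utf-8") as out:
        for sample in samples:
            if sample["id"] in done:
                continue  # already translated in an earlier run
            try:
                translated = translate_with_aya(sample["caption"], lang)
                record = {"id": sample["id"], "lang": lang, "caption": translated}
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
            except Exception as exc:  # log and continue so one failure does not stall the batch
                log.warning("sample %s (%s) failed: %s", sample["id"], lang, exc)


if __name__ == "__main__":
    # `samples` would hold the 550K English LLaVA pretraining captions
    samples = json.loads(Path("llava_pretrain_en.json").read_text(encoding="utf-8"))
    for lang in TARGET_LANGUAGES:
        translate_split(samples, lang)
```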

Practical Implementation Details: Model Architecture and Training

The Maya model architecture is inspired by LLaVA 1.5 and is designed for multilingual vision-language tasks (Figure 2).

  • LLM: Aya-23 8B was selected as the base LLM for its multilingual capabilities: it was trained on 23 languages (including the 8 covered by the dataset) and has an 8K context window.
  • Vision Encoder: SigLIP (siglip-base-patch16-256-multilingual) is used instead of the more common CLIP. SigLIP was chosen for its strong performance, multilingual adaptability, and its ability to handle variable-length patch sequences through positional embedding interpolation, which allows it to accept images of varying dimensions. For an input image X_v, SigLIP extracts visual features Z_v.
  • Projection Layer: A trainable 2-layer MLP with GELU activation connects the visual features Z_v from SigLIP to features H_v in the LLM's input space. This layer learns to align the visual and language modalities, as sketched below.
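
To make the connector concrete, the following PyTorch sketch shows a 2-layer GELU MLP mapping SigLIP patch features Z_v to LLM-space features H_v. The hidden sizes (768 for a SigLIP base encoder, 4096 for the LLM) and the class name are assumptions for illustration, not values taken from the Maya code base.

```python
import torch
import torch.nn as nn


class MultimodalProjector(nn.Module):
    """Trainable 2-layer MLP with GELU that maps visual features Z_v to
    language-space features H_v (dimensions are illustrative assumptions)."""

    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, z_v: torch.Tensor) -> torch.Tensor:
        # z_v: (batch, num_patches, vision_dim) patch features from the frozen SigLIP encoder
        return self.proj(z_v)  # H_v: (batch, num_patches, llm_dim)


# Usage sketch: a 256x256 image with 16x16 patches yields 256 visual tokens,
# which are projected and then concatenated with the embedded text tokens X_q
# before being fed to the (frozen) LLM.
projector = MultimodalProjector()
z_v = torch.randn(1, 256, 768)
h_v = projector(z_v)
print(h_v.shape)  # torch.Size([1, 256, 4096])
```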

The training process consists of two stages:

  1. Pretraining: The goal is to align the image features with the LLM's input space. This stage exclusively trains the projection matrix W.
    • Dataset: The newly created multilingual LLaVA pretraining dataset (4.4 million samples).
    • Input: For each image X_v, its features Z_v are extracted by the frozen SigLIP encoder, passed through the trainable projection layer W to get H_v, which is then concatenated with the multilingual language instructions X_q. The combined sequence is fed into the frozen Aya-23 8B LLM to predict the multilingual language response X_a.
    • Image Preprocessing: Input images were cropped to 256x256 pixels for compatibility with the SigLIP encoder configuration used.
    • Hardware/Config: Trained on 8x NVIDIA H100 GPUs with 80GB of memory each. A per-device batch size of 32 and a global batch size of 256 were used. The learning rate was 1e-3 with a cosine scheduler.
    • Duration: Approximately 20 hours.
  2. Finetuning: This stage instruction-tunes the pretrained model on specific vision-language tasks.
    • Dataset: PALO 150K instruction-tuning dataset, which covers 10 languages (including the 8 used in pretraining plus Bengali and Urdu).
    • Method: Full finetuning was employed rather than LoRA, which had produced suboptimal results in the authors' experiments. Both the vision encoder (SigLIP) and the LLM (Aya-23 8B) were kept frozen during this stage, so the projection layer is the component that continues to be tuned; the paper does not spell out whether any additional parameters are updated during instruction tuning. A sketch of this parameter-freezing setup follows the list.
    • Hardware/Config: Trained on 8x NVIDIA H100 GPUs. A per-device batch size of 16 and a global batch size of 128 were used.
    • Duration: Approximately 48 hours.
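
A minimal sketch of the parameter-freezing scheme described above is shown below, assuming hypothetical module names (vision_tower, language_model, projector) rather than the actual attribute names in the Maya/LLaVA code. Both stages freeze the SigLIP encoder and the Aya-23 8B LLM and leave only the projection layer trainable; the returned dictionaries just echo the batch sizes and learning-rate settings reported in the paper.

```python
import torch.nn as nn


def configure_stage(model: nn.Module, stage: str) -> dict:
    """Freeze/unfreeze modules for the two training stages described above.
    `vision_tower`, `language_model`, and `projector` are hypothetical
    attribute names used purely for illustration."""
    # Both stages keep the SigLIP vision encoder and the Aya-23 8B LLM frozen;
    # gradients flow only through the 2-layer MLP projector.
    for module in (model.vision_tower, model.language_model):
        for p in module.parameters():
            p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True

    if stage == "pretrain":
        # 8x H100, global batch = 32 per device x 8 devices = 256
        return {"per_device_batch_size": 32, "global_batch_size": 256,
                "learning_rate": 1e-3, "lr_scheduler": "cosine"}
    if stage == "finetune":
        # 8x H100, global batch = 16 per device x 8 devices = 128
        return {"per_device_batch_size": 16, "global_batch_size": 128}
    raise ValueError(f"unknown stage: {stage!r}")
```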

Evaluation and Results

Maya was evaluated on the PALO multilingual benchmark, covering 10 languages. Table 1 shows the results compared to LLaVA and PALO models of similar sizes (7B and 13B).

  • Maya (8B) achieved an average score of 60.4 across the 10 languages, outperforming both LLaVA-7B (46.9) and PALO-7B (57.7).
  • Comparing Maya (8B) to 7B models within its size class, it shows better performance in Chinese, Russian, Japanese, Arabic, and Hindi compared to PALO-7B. It particularly excels in Arabic (63.4 vs 57.8 for PALO-7B and 38.9 for LLaVA-7B), which the authors attribute to its multilingual pretraining and effective prompt design for root-based languages.
  • When compared to 13B models, Maya (8B) is competitive, outperforming LLaVA-13B (49.9) on average and trailing PALO-13B (61.9) by a small margin.

Maya was also evaluated on several English-only benchmarks (Table 2), showing performance comparable to other VLMs on standard tasks such as VQA (GQA, VizWiz, TextVQA), reasoning (ScienceQA), and general understanding (MMBench, MM-Vet, MME).

Qualitative examples (Figures 3 and 4) show Maya generating descriptions for an image in various languages, highlighting its ability to provide relevant visual information in different linguistic contexts.

Implementation Considerations and Limitations

  • Computational Resources: Training Maya requires significant resources (8x H100 GPUs with 80GB VRAM) for both pretraining and finetuning. Inference will also require substantial GPU memory, though potentially less depending on batch size and quantization techniques.
  • Multilingual Data Generation: The dataset creation pipeline is complex, relying on multiple external LLMs and involving significant quality control steps. Reproducing or extending this to more languages would require access to similar high-quality translation models and computational resources.
  • Frozen Encoders: Freezing the vision and language encoders simplifies training and reduces the number of trainable parameters, making it faster and more memory efficient than full finetuning of the entire VLM. However, it might limit the model's ability to fully adapt the encoders to the multilingual and multimodal task, potentially leaving performance gains on the table compared to approaches that strategically finetune parts of the base models.
  • Dataset Quality: While efforts were made for quality control, the paper acknowledges that multilingual multimodal datasets can suffer from quality issues and distribution biases. The reliance on LLMs for translation might introduce artifacts or inaccuracies.
  • Evaluation Scope: While evaluated on 10 languages during finetuning, the pretraining only covered 8. Performance on languages outside the pretraining set (like Bengali and Urdu in the PALO benchmark) might be less robust. Benchmarking multilingual VLMs is an active area of research, and performance can be sensitive to the evaluation dataset and metrics.
  • Toxicity: The authors mention curating data to minimize harmful content but acknowledge potential traces may remain, a common challenge in large-scale dataset creation.

Real-World Applications

Maya's multilingual capability enables several practical applications:

  • Global Content Creation: Generating image descriptions, captions, or visual stories in multiple languages for diverse audiences.
  • Multilingual Accessibility: Providing accessible image information to users who speak languages other than English.
  • Cross-Cultural Communication: Facilitating understanding of visual content that may contain culturally specific elements.
  • Localized E-commerce/Marketing: Automatically generating product descriptions from images in local languages.
  • Educational Tools: Creating interactive learning materials with visual content explained in different languages.
  • International Research: Analyzing images or visual data and generating insights in various languages.

The release of Maya as an open-source model, along with the dataset creation methodology, provides a valuable resource for developers and researchers looking to build or apply multilingual VLMs in practice. Future work mentioned by the authors, such as improving cross-modal alignment and expanding the dataset, suggests ongoing efforts to enhance the model's capabilities.
