- The paper introduces a synthetic annotation framework that increases token count and lexical diversity in multimodal data.
- The model employs a late-fusion architecture combining vision and language modules, achieving up to 79% win-rate on multilingual tasks.
- A cross-modal merging strategy mitigates catastrophic forgetting, preserving text-only capabilities while enhancing multimodal performance.
Aya Vision: Techniques and Results in Multilingual Multimodal Language Modeling
Introduction and Motivation
Aya Vision addresses the persistent challenges in scaling multimodal LLMs (MLLMs) to multilingual settings. The paper identifies two primary bottlenecks: (1) the scarcity of high-quality, linguistically and culturally diverse multimodal instruction data, and (2) catastrophic forgetting of text-only capabilities when vision modalities are introduced, especially in multilingual contexts. Existing approaches relying on naive machine translation and simplistic image-text pairs are shown to be insufficient for robust, real-world conversational performance across languages.
Synthetic Multilingual Multimodal Data Generation
Aya Vision introduces a comprehensive synthetic annotation framework for constructing high-quality, diverse multilingual multimodal instruction datasets. The pipeline consists of:
- Distillation-based Recaptioning: Task-specific prompt templates are used to generate detailed, natural, and diverse recaptions anchored to ground-truth answers. This increases average token count from 27.2 to 140.8 and lexical diversity (MTLD) from 11.0 to 61.2.
- Two-stage Filtering: Keyword-based filtering removes basic errors, while LLM-based semantic filtering discards hallucinated or semantically inconsistent outputs, yielding a 3.2% error rate post-filtering.
- Hybrid Translation and Rephrasing: Initial translations via NLLB-3.3B are post-edited by a multilingual LLM (command-r-plus-08-2024), correcting translationese and improving fluency. COMET scores improve from 0.7455 to 0.8293, with substantial gains in low-resource languages.
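The full pipeline can be illustrated with a minimal Python sketch, shown below. The injected helpers (recaption, keyword_filter, llm_semantic_filter, translate_nllb, rephrase_with_llm) are hypothetical stand-ins for the paper's prompt templates, filters, NLLB-3.3B translation, and LLM post-editing, not a real API.

```python
# Minimal sketch of the annotation pipeline described above. The injected helpers
# are hypothetical stand-ins, not the paper's actual implementation.
def build_sample(image, question, gold_answer, target_lang, *,
                 recaption, keyword_filter, llm_semantic_filter,
                 translate_nllb, rephrase_with_llm):
    # 1) Distillation-based recaptioning anchored to the ground-truth answer.
    detailed = recaption(image, question, gold_answer)

    # 2) Two-stage filtering: cheap keyword checks, then LLM-based semantic checks.
    if not keyword_filter(detailed):
        return None
    if not llm_semantic_filter(image, question, gold_answer, detailed):
        return None  # drop hallucinated or semantically inconsistent rewrites

    # 3) Hybrid translation: NLLB draft, then LLM post-editing to fix translationese.
    draft = translate_nllb(detailed, target_lang)
    fluent = rephrase_with_llm(draft, target_lang)

    return {"image": image, "prompt": question, "completion": fluent, "lang": target_lang}
```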
The data mixture is carefully balanced: 35% multilingual data, 31% synthetically re-annotated English, and 34% high-quality original datasets, with upsampling of underrepresented tasks to ensure coverage.
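As a rough illustration, the snippet below samples training pools at the reported ratios; the pool names and any per-task upsampling inside each pool are assumptions rather than the paper's exact recipe.

```python
import random

# Illustrative sampler at the reported mixture ratios; pool names and the
# per-task upsampling step inside each pool are assumptions.
MIXTURE = {
    "multilingual": 0.35,           # translated + rephrased multilingual data
    "synthetic_english": 0.31,      # synthetically re-annotated English data
    "original_high_quality": 0.34,  # curated original datasets
}

def sample_pools(n, seed=0):
    rng = random.Random(seed)
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=n)

# e.g. sample_pools(10) -> a list of pool names drawn at roughly 35/31/34 proportions
```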
Model Architecture and Training
Aya Vision models employ a late-fusion architecture:
- Vision Encoder: SigLIP2-SO400M, with patch sizes and resolutions tuned for efficiency (8B: patch14-384, 32B: patch16-512).
- Connector: 2-layer MLP with SwiGLU activation, pixel shuffle for vision-token reduction, and explicit tile tags for positional encoding (a sketch follows this list).
- LLM: Multilingually post-trained LLMs (Command-R7B for 8B, Aya-Expanse-32B for 32B).
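A minimal PyTorch sketch of such a connector follows; the hidden sizes (1152 for the vision encoder, 4096 for the LLM), the pixel-shuffle factor of 2, and the omission of tile tags are assumptions for illustration, not the paper's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class PixelShuffleConnector(nn.Module):
    """Sketch of a late-fusion connector: pixel shuffle to reduce the number of
    vision tokens, then a 2-layer MLP with SwiGLU projecting into the LLM
    embedding space. Dimensions and the shuffle factor are assumed; tile tags
    are omitted."""

    def __init__(self, vision_dim=1152, llm_dim=4096, shuffle=2):
        super().__init__()
        self.shuffle = shuffle
        in_dim = vision_dim * shuffle * shuffle  # channels grow as tokens shrink
        self.gate = nn.Linear(in_dim, llm_dim)
        self.up = nn.Linear(in_dim, llm_dim)
        self.down = nn.Linear(llm_dim, llm_dim)

    def forward(self, x):
        # x: (batch, h*w, vision_dim), assuming a square patch grid whose side
        # is divisible by the shuffle factor.
        b, n, c = x.shape
        h = w = int(n ** 0.5)
        x = x.view(b, h, w, c)
        # Pixel shuffle: fold each shuffle x shuffle spatial block into the channel dim.
        x = x.view(b, h // self.shuffle, self.shuffle, w // self.shuffle, self.shuffle, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // self.shuffle) * (w // self.shuffle), -1)
        # 2-layer MLP with SwiGLU activation.
        return self.down(F.silu(self.gate(x)) * self.up(x))
```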
Training proceeds in two stages: (1) vision-language alignment, in which only the connector is trained at a high learning rate while the vision encoder and LLM remain frozen, and (2) supervised fine-tuning (SFT) on the multimodal instruction mixture, with both full and LoRA-based finetuning explored.
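The freezing pattern for the two stages can be sketched as below; the attribute names (vision_encoder, connector, llm) and the choice to unfreeze the full model during SFT are assumptions based on the description above, not the actual training code.

```python
# Minimal sketch of the two-stage freezing schedule described above,
# under assumed attribute names.
def configure_stage(model, stage):
    if stage == "alignment":
        # Stage 1: vision-language alignment -- only the connector is trained,
        # at a high learning rate, with the vision encoder and LLM frozen.
        for p in model.parameters():
            p.requires_grad = False
        for p in model.connector.parameters():
            p.requires_grad = True
    elif stage == "sft":
        # Stage 2: supervised fine-tuning on the multimodal instruction mixture
        # (full-finetuning variant; a LoRA-based variant is also explored).
        for p in model.parameters():
            p.requires_grad = True

# configure_stage(model, "alignment"); after alignment: configure_stage(model, "sft")
```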
Cross-Modal Model Merging
A novel cross-modal model merging strategy is introduced to mitigate catastrophic forgetting and recover text-only performance. Linear interpolation of weights between the multimodal and text-only LLMs (excluding the vision encoder and connector) is performed:
W_merged = α · W_mm-LLM + (1 − α) · W_text-LLM
Empirical ablations show that merging with α=0.4 yields optimal trade-offs, boosting text-only win-rates by up to 50.2% and multimodal win-rates by up to 20.5% over the unmerged checkpoint. This approach is shown to outperform simply adding text-only data to the SFT mixture, both in efficiency and final performance.
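A minimal sketch of the merge, assuming both checkpoints expose compatible LLM state dicts with identical parameter names, is shown below; only the LLM weights are interpolated, while the vision encoder and connector keep the multimodal checkpoint's weights.

```python
import torch

# Linear interpolation of LLM weights between the multimodal and text-only
# checkpoints; a sketch assuming matching state-dict keys.
@torch.no_grad()
def merge_llm_weights(mm_llm_state, text_llm_state, alpha=0.4):
    merged = {}
    for name, w_mm in mm_llm_state.items():
        w_text = text_llm_state[name]
        merged[name] = alpha * w_mm + (1.0 - alpha) * w_text
    return merged

# Usage: load the merged weights back into the multimodal model's LLM,
# leaving its vision encoder and connector untouched.
# merged = merge_llm_weights(mm_model.llm.state_dict(), text_model.state_dict())
# mm_model.llm.load_state_dict(merged)
```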
Evaluation Suite and Benchmarks
Aya Vision is evaluated on a comprehensive suite:
- Open-ended Multimodal Preference: AyaVisionBench (23 languages, 9 tasks), m-WildVision, and xChatBench, using VLM-as-a-judge protocols (Claude-3-7-Sonnet).
- Academic Multimodal Benchmarks: xMMMU, MaXM, CVQA, MTVQA, Kaleidoscope.
- Text-only Benchmarks: m-ArenaHard, MGSM, Global MMLU-Lite, FLORES, IFEval.
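For the open-ended preference evaluations, win-rates can be computed from pairwise judge verdicts roughly as in the sketch below; counting ties as half-wins is an assumption made for this illustration, not necessarily the exact protocol used.

```python
# Illustrative win-rate computation over pairwise VLM-as-a-judge verdicts.
def win_rate(verdicts):
    """verdicts: list of 'win' / 'loss' / 'tie' outcomes for the candidate model."""
    wins = sum(v == "win" for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    return 100.0 * (wins + 0.5 * ties) / len(verdicts)

# e.g. win_rate(["win", "tie", "loss", "win"]) -> 62.5
```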
Results
- Aya-Vision-8B achieves best-in-class win-rates (49.6%–80.3%) on open-ended multimodal tasks, outperforming Qwen-2.5-VL-7B, Pixtral-12B, Gemini-Flash-1.5-8B, and Pangea-7B, with up to 79% win-rate across 23 languages.
- Aya-Vision-32B outperforms models more than twice its size (Molmo-72B, Llama-3.2-90B-Vision, Qwen-2.5-VL-72B), with win-rates up to 73% and strong efficiency gains.
Text-only Performance
- Aya Vision models retain strong text-only capabilities, with degradation limited to 5.9% post-merging, compared to 16.4%–44.1% in other models.
- Across text-only benchmarks, Aya-Vision-8B and -32B are competitive with or outperform larger models, with the strongest results in open-ended preference evaluations.
Ablations
- Data improvements (synthetic annotation and filtering) yield the largest single boost in win-rates (17%), with total improvement reaching 30% when combined with model merging.
- Increasing multilingual data beyond 35% in the mixture degrades performance due to repeated exposure and limited diversity, confirming the need for balanced cross-lingual transfer.
- LoRA-based finetuning is comparable to full finetuning, with minor advantages in text-only retention.
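A hypothetical LoRA configuration for the SFT stage, written with the Hugging Face peft library, is sketched below; the rank, alpha, dropout, and target modules are illustrative assumptions, not the paper's reported hyperparameters.

```python
from peft import LoraConfig, get_peft_model

# Hypothetical LoRA setup for the SFT stage; all hyperparameters below are
# assumptions for illustration rather than the values used in the paper.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# peft_model = get_peft_model(mm_model.llm, lora_config)  # adapters on the LLM only
```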
Implications and Future Directions
Aya Vision demonstrates that careful data curation, synthetic annotation, and model merging can substantially improve multilingual multimodal LLMs without excessive scaling. The cross-modal merging paradigm offers a training-free, efficient path to adaptive models, suggesting future work in modular model composition and dynamic capability extension. The release of AyaVisionBench and open-weight models supports broader research in inclusive, linguistically diverse multimodal AI.
The strong empirical results—particularly the ability of Aya-Vision-8B and 32B to outperform much larger models—challenge prevailing assumptions about scaling laws and data requirements in MLLMs. The findings indicate that data quality, diversity, and architectural choices can "bend the need for compute" and deliver state-of-the-art performance with reduced resource consumption.
Conclusion
Aya Vision advances the state of multilingual multimodal modeling by introducing scalable synthetic data generation, robust filtering, and cross-modal model merging. The models set new Pareto frontiers in performance-efficiency trade-offs and establish best-in-class results across open-ended and academic benchmarks in 23 languages. The techniques and benchmarks presented have significant implications for future research in efficient, adaptive, and inclusive multimodal AI systems.