- The paper introduces a synthetic annotation framework that increases token count and lexical diversity in multimodal data.
- The model employs a late-fusion architecture combining vision and language modules, achieving up to 79% win-rate on multilingual tasks.
- A cross-modal merging strategy mitigates catastrophic forgetting, preserving text-only capabilities while enhancing multimodal performance.
Aya Vision: Techniques and Results in Multilingual Multimodal Language Modeling
Introduction and Motivation
Aya Vision addresses the persistent challenges in scaling multimodal LLMs (MLLMs) to multilingual settings. The paper identifies two primary bottlenecks: (1) the scarcity of high-quality, linguistically and culturally diverse multimodal instruction data, and (2) catastrophic forgetting of text-only capabilities when vision modalities are introduced, especially in multilingual contexts. Existing approaches relying on naive machine translation and simplistic image-text pairs are shown to be insufficient for robust, real-world conversational performance across languages.
Synthetic Multilingual Multimodal Data Generation
Aya Vision introduces a comprehensive synthetic annotation framework for constructing high-quality, diverse multilingual multimodal instruction datasets. The pipeline consists of:
- Distillation-based Recaptioning: Task-specific prompt templates are used to generate detailed, natural, and diverse recaptions anchored to ground-truth answers. This increases average token count from 27.2 to 140.8 and lexical diversity (MTLD) from 11.0 to 61.2.
- Two-stage Filtering: Keyword-based filtering removes basic errors, while LLM-based semantic filtering discards hallucinated or semantically inconsistent outputs, yielding a 3.2% error rate post-filtering.
- Hybrid Translation and Rephrasing: Initial translations via NLLB-3.3B are post-edited by a multilingual LLM (command-r-plus-08-2024), correcting translationese and improving fluency. COMET scores improve from 0.7455 to 0.8293, with substantial gains in low-resource languages.
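The full pipeline can be illustrated with a minimal Python sketch, shown below. The injected helpers (recaption, keyword_filter, llm_semantic_filter, translate_nllb, rephrase_with_llm) are hypothetical stand-ins for the paper's prompt templates, filters, NLLB-3.3B translation, and LLM post-editing, not a real API.

```python
# Minimal sketch of the annotation pipeline described above. The injected helpers
# are hypothetical stand-ins, not the paper's actual implementation.
def build_sample(image, question, gold_answer, target_lang, *,
                 recaption, keyword_filter, llm_semantic_filter,
                 translate_nllb, rephrase_with_llm):
    # 1) Distillation-based recaptioning anchored to the ground-truth answer.
    detailed = recaption(image, question, gold_answer)

    # 2) Two-stage filtering: cheap keyword checks, then LLM-based semantic checks.
    if not keyword_filter(detailed):
        return None
    if not llm_semantic_filter(image, question, gold_answer, detailed):
        return None  # drop hallucinated or semantically inconsistent rewrites

    # 3) Hybrid translation: NLLB draft, then LLM post-editing to fix translationese.
    draft = translate_nllb(detailed, target_lang)
    fluent = rephrase_with_llm(draft, target_lang)

    return {"image": image, "prompt": question, "completion": fluent, "lang": target_lang}
```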
The data mixture is carefully balanced: 35% multilingual data, 31% synthetically re-annotated English, and 34% high-quality original datasets, with upsampling of underrepresented tasks to ensure coverage.
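As a rough illustration, the snippet below samples training pools at the reported ratios; the pool names and any per-task upsampling inside each pool are assumptions rather than the paper's exact recipe.

```python
import random

# Illustrative sampler at the reported mixture ratios; pool names and the
# per-task upsampling step inside each pool are assumptions.
MIXTURE = {
    "multilingual": 0.35,           # translated + rephrased multilingual data
    "synthetic_english": 0.31,      # synthetically re-annotated English data
    "original_high_quality": 0.34,  # curated original datasets
}

def sample_pools(n, seed=0):
    rng = random.Random(seed)
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=n)

# e.g. sample_pools(10) -> a list of pool names drawn at roughly 35/31/34 proportions
```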
Model Architecture and Training
Aya Vision models employ a late-fusion architecture:
- Vision Encoder: SigLIP2-SO400M, with patch sizes and resolutions tuned for efficiency (8B: patch14-384, 32B: patch16-512).
- Connector: 2-layer MLP with SwiGLU activation, pixel shuffle for vision-token reduction, and explicit tile tags for positional encoding (a sketch follows this list).
- LLM: Multilingually post-trained LLMs (Command-R7B for 8B, Aya-Expanse-32B for 32B).
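A minimal PyTorch sketch of such a connector follows; the hidden sizes (1152 for the vision encoder, 4096 for the LLM), the pixel-shuffle factor of 2, and the omission of tile tags are assumptions for illustration, not the paper's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class PixelShuffleConnector(nn.Module):
    """Sketch of a late-fusion connector: pixel shuffle to reduce the number of
    vision tokens, then a 2-layer MLP with SwiGLU projecting into the LLM
    embedding space. Dimensions and the shuffle factor are assumed; tile tags
    are omitted."""

    def __init__(self, vision_dim=1152, llm_dim=4096, shuffle=2):
        super().__init__()
        self.shuffle = shuffle
        in_dim = vision_dim * shuffle * shuffle  # channels grow as tokens shrink
        self.gate = nn.Linear(in_dim, llm_dim)
        self.up = nn.Linear(in_dim, llm_dim)
        self.down = nn.Linear(llm_dim, llm_dim)

    def forward(self, x):
        # x: (batch, h*w, vision_dim), assuming a square patch grid whose side
        # is divisible by the shuffle factor.
        b, n, c = x.shape
        h = w = int(n ** 0.5)
        x = x.view(b, h, w, c)
        # Pixel shuffle: fold each shuffle x shuffle spatial block into the channel dim.
        x = x.view(b, h // self.shuffle, self.shuffle, w // self.shuffle, self.shuffle, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // self.shuffle) * (w // self.shuffle), -1)
        # 2-layer MLP with SwiGLU activation.
        return self.down(F.silu(self.gate(x)) * self.up(x))
```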
Training proceeds in two stages: (1) vision-language alignment, in which only the connector is trained at a high learning rate while the vision encoder and LLM remain frozen, and (2) supervised fine-tuning (SFT) on the multimodal instruction mixture, with both full and LoRA-based finetuning explored.
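The freezing pattern for the two stages can be sketched as below; the attribute names (vision_encoder, connector, llm) and the choice to unfreeze the full model during SFT are assumptions based on the description above, not the actual training code.

```python
# Minimal sketch of the two-stage freezing schedule described above,
# under assumed attribute names.
def configure_stage(model, stage):
    if stage == "alignment":
        # Stage 1: vision-language alignment -- only the connector is trained,
        # at a high learning rate, with the vision encoder and LLM frozen.
        for p in model.parameters():
            p.requires_grad = False
        for p in model.connector.parameters():
            p.requires_grad = True
    elif stage == "sft":
        # Stage 2: supervised fine-tuning on the multimodal instruction mixture
        # (full-finetuning variant; a LoRA-based variant is also explored).
        for p in model.parameters():
            p.requires_grad = True

# configure_stage(model, "alignment"); after alignment: configure_stage(model, "sft")
```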
Cross-Modal Model Merging
A novel cross-modal model merging strategy is introduced to mitigate catastrophic forgetting and recover text-only performance. Linear interpolation of weights between the multimodal and text-only LLMs (excluding the vision encoder and connector) is performed:
W_merged = α · W_mm-LLM + (1 − α) · W_text-LLM
Empirical ablations show that merging with α=0.4 yields optimal trade-offs, boosting text-only win-rates by up to 50.2% and multimodal win-rates by up to 20.5% over the unmerged checkpoint. This approach is shown to outperform simply adding text-only data to the SFT mixture, both in efficiency and final performance.
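A minimal sketch of the merge, assuming both checkpoints expose compatible LLM state dicts with identical parameter names, is shown below; only the LLM weights are interpolated, while the vision encoder and connector keep the multimodal checkpoint's weights.

```python
import torch

# Linear interpolation of LLM weights between the multimodal and text-only
# checkpoints; a sketch assuming matching state-dict keys.
@torch.no_grad()
def merge_llm_weights(mm_llm_state, text_llm_state, alpha=0.4):
    merged = {}
    for name, w_mm in mm_llm_state.items():
        w_text = text_llm_state[name]
        merged[name] = alpha * w_mm + (1.0 - alpha) * w_text
    return merged

# Usage: load the merged weights back into the multimodal model's LLM,
# leaving its vision encoder and connector untouched.
# merged = merge_llm_weights(mm_model.llm.state_dict(), text_model.state_dict())
# mm_model.llm.load_state_dict(merged)
```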
Evaluation Suite and Benchmarks
Aya Vision is evaluated on a comprehensive suite:
- Open-ended Multimodal Preference: AyaVisionBench (23 languages, 9 tasks), m-WildVision, and xChatBench, using VLM-as-a-judge protocols (Claude-3-7-Sonnet).
- Academic Multimodal Benchmarks: xMMMU, MaXM, CVQA, MTVQA, Kaleidoscope.
- Text-only Benchmarks: m-ArenaHard, MGSM, Global MMLU-Lite, FLORES, IFEval.
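For the open-ended preference evaluations, win-rates can be computed from pairwise judge verdicts roughly as in the sketch below; counting ties as half-wins is an assumption made for this illustration, not necessarily the exact protocol used.

```python
# Illustrative win-rate computation over pairwise VLM-as-a-judge verdicts.
def win_rate(verdicts):
    """verdicts: list of 'win' / 'loss' / 'tie' outcomes for the candidate model."""
    wins = sum(v == "win" for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    return 100.0 * (wins + 0.5 * ties) / len(verdicts)

# e.g. win_rate(["win", "tie", "loss", "win"]) -> 62.5
```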
Results
- Aya-Vision-8B achieves best-in-class win-rates (49.6%–80.3%) on open-ended multimodal tasks, outperforming Qwen-2.5-VL-7B, Pixtral-12B, Gemini-Flash-1.5-8B, and Pangea-7B, with up to 79% win-rate across 23 languages.
- Aya-Vision-32B outperforms models more than twice its size (Molmo-72B, Llama-3.2-90B-Vision, Qwen-2.5-VL-72B), with win-rates up to 73% and strong efficiency gains.
Text-only Performance
- Aya Vision models retain strong text-only capabilities, with degradation limited to 5.9% post-merging, compared to 16.4%–44.1% in other models.
- Across text-only benchmarks, Aya-Vision-8B and -32B are competitive with or outperform larger models, with the strongest results in open-ended preference evaluations.
Ablations
- Data improvements (synthetic annotation and filtering) yield the largest single boost in win-rates (17%), with total improvement reaching 30% when combined with model merging.
- Increasing multilingual data beyond 35% in the mixture degrades performance due to repeated exposure and limited diversity, confirming the need for balanced cross-lingual transfer.
- LoRA-based finetuning is comparable to full finetuning, with minor advantages in text-only retention.
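A hypothetical LoRA configuration for the SFT stage, written with the Hugging Face peft library, is sketched below; the rank, alpha, dropout, and target modules are illustrative assumptions, not the paper's reported hyperparameters.

```python
from peft import LoraConfig, get_peft_model

# Hypothetical LoRA setup for the SFT stage; all hyperparameters below are
# assumptions for illustration rather than the values used in the paper.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# peft_model = get_peft_model(mm_model.llm, lora_config)  # adapters on the LLM only
```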
Implications and Future Directions
Aya Vision demonstrates that careful data curation, synthetic annotation, and model merging can substantially improve multilingual multimodal LLMs without excessive scaling. The cross-modal merging paradigm offers a training-free, efficient path to adaptive models, suggesting future work in modular model composition and dynamic capability extension. The release of AyaVisionBench and open-weight models supports broader research in inclusive, linguistically diverse multimodal AI.
The strong empirical results—particularly the ability of Aya-Vision-8B and 32B to outperform much larger models—challenge prevailing assumptions about scaling laws and data requirements in MLLMs. The findings indicate that data quality, diversity, and architectural choices can "bend the need for compute" and deliver state-of-the-art performance with reduced resource consumption.
Conclusion
Aya Vision advances the state of multilingual multimodal modeling by introducing scalable synthetic data generation, robust filtering, and cross-modal model merging. The models set new Pareto frontiers in performance-efficiency trade-offs and establish best-in-class results across open-ended and academic benchmarks in 23 languages. The techniques and benchmarks presented have significant implications for future research in efficient, adaptive, and inclusive multimodal AI systems.