
Aya Model: Multilingual LLM Series

Updated 3 February 2026
  • Aya models are a family of open-weight, instruction-finetuned multilingual language models designed for generative, discriminative, and preference tasks across up to 101 languages.
  • They span encoder–decoder and decoder-only architectures with specialized variants like Aya-101, Aya 23, and Aya Expanse to robustly address lower- and mid-resource language challenges.
  • Innovations include multilingual data arbitrage, direct preference optimization, and model merging techniques that yield significant performance gains and state-of-the-art benchmarks.

Aya is a family of open-weight, instruction-finetuned, massively multilingual LLMs developed through a series of research iterations by Cohere For AI and collaborators. Spanning encoder–decoder and decoder-only Transformer architectures, Aya models are designed to advance performance across generative, discriminative, and preference tasks for up to 101 languages, with an explicit focus on robust capability in lower- and mid-resource settings. The Aya model series—Aya-101, Aya 23, and Aya Expanse—proposes solutions to key challenges in multilinguality, such as data curation, efficient capacity allocation, bias mitigation, and state-of-the-art open-ended evaluation, and anchors much of its progress in methodical empirical comparison with contemporaneous LLMs and community benchmarks (Üstün et al., 2024, Aryabumi et al., 2024, Dang et al., 2024).

1. Model Series Structure and Architectural Foundation

Aya models comprise two principal families, both fundamentally grounded in the Transformer architecture:

  • Aya-101: A 13B-parameter encoder–decoder Transformer (based on mT5-XXL), trained and instruction-finetuned on 101 languages, covering an expansive typological and resource spectrum (Üstün et al., 2024). The model makes no architectural deviations from mT5-XXL: 24-layer encoder, 24-layer decoder, hidden size 2048, filter size 8192, and 64 attention heads per block.
  • Aya 23: A decoder-only Transformer family in 8B and 35B parameter scales, built upon Cohere’s Command series and trained on 23 core languages to enable greater per-language capacity (“depth”) and comparative analysis of coverage vs. specialization. Notable architectural features include SwiGLU activations, grouped-query attention (GQA, 8B only), parallel Attention+FFN blocks, rotary position embeddings (RoPE), and a 256K BPE vocabulary (Aryabumi et al., 2024).
  • Aya Expanse (8B and 32B): The latest decoder-only family (8B, 32B), introducing advances in multilingual data arbitrage, iterative preference optimization via Direct Preference Optimization (DPO), and systematic checkpoint merging. Both variants retain the Cohere Command backbone, SwiGLU, RoPE, and GQA; the 32B model increases context length to 128K tokens. Instruction fine-tuning employs chat-style dialogue templates (Dang et al., 2024).
  • Aya Vision (8B, 32B): A late-fusion multimodal extension incorporating a pretrained vision encoder (SigLIP 2-so400m), a vision-language connector (2-layer MLP, SwiGLU), and a multilingual LLM backbone (Command-R7B/Aya-Expanse). The late-fusion design minimizes cross-modal interference by keeping the vision and language components architecturally decoupled (Dash et al., 13 May 2025).
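
The reported Aya-101 hyperparameters can be collected in one place. This is a minimal sketch: the dict keys and the `head_dim` helper are illustrative names, not config fields from any released codebase.

```python
# Sketch: Aya-101 (mT5-XXL-based) hyperparameters as reported above.
# Key names are illustrative, not actual config fields.
AYA_101_CONFIG = {
    "encoder_layers": 24,
    "decoder_layers": 24,
    "hidden_size": 2048,
    "filter_size": 8192,     # FFN inner dimension
    "attention_heads": 64,
}

def head_dim(cfg):
    """Per-head dimension implied by the reported sizes."""
    return cfg["hidden_size"] // cfg["attention_heads"]
```

With these numbers, each attention head operates on a 32-dimensional slice of the 2048-dimensional hidden state.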

2. Training Data, Instruction-Finetuning, and Mixture Optimization

Aya’s empirical strength derives from large-scale, typologically broad data and refined sampling pipelines:

  • Pretraining: Aya-101 pretrains on mC4 (∼1T tokens, 101 languages, span-masking); Aya 23 and Expanse pretrain on web-scale corpora covering 23 (and in Expanse, additionally, up to 32) languages from multiple linguistic families. Tokenization, normalization, and n-gram deduplication enforce internal consistency (Üstün et al., 2024, Aryabumi et al., 2024, Dang et al., 2024, Trinley et al., 27 Jul 2025).
  • Instruction-Finetuning Mix: Aya-101 and Aya 23 integrate sources including pruned xP3x instruction templates (multi-domain, 101+ languages), human-reviewed examples, command-generated synthetic data, and extensive machine-translated instructions. Curation eliminates empty, trivial, or poorly constructed prompts to raise average task difficulty and reduce overfitting (Üstün et al., 2024, Aryabumi et al., 2024).
  • Mixture Weights: Empirical ablations favor a translation-heavy finetuning mix (∼47.5% translation, ∼15% xP3x, ∼22.5% synthetic, other weights apportioned between human-curated and template data), maximizing open-ended win rates and generative task transfer (Üstün et al., 2024).
  • Multimodal Data Generation: Aya Vision introduces a recaptioning and filtering pipeline to synthesize high-quality multimodal-instruction data, with expert LLM semantic filtering and hybrid translate-rephrase steps; COMET scores confirm improved translation quality post-rewriting (Dash et al., 13 May 2025).
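
The translation-heavy mixture above can be sketched as a simple categorical sampler. The source names and the exact split of the remaining ~15% are assumptions for illustration; the papers apportion that remainder between human-curated and template data.

```python
import random

# Sketch: sampling finetuning examples according to the reported mixture
# weights. "human_and_template" lumps the remainder together (assumption).
MIXTURE_WEIGHTS = {
    "translation": 0.475,
    "xp3x": 0.15,
    "synthetic": 0.225,
    "human_and_template": 0.15,
}

def sample_source(rng):
    """Pick a data source for the next training example."""
    r = rng.random()
    cum = 0.0
    for name, w in MIXTURE_WEIGHTS.items():
        cum += w
        if r < cum:
            return name
    return name  # guard against floating-point edge cases

rng = random.Random(0)
counts = {k: 0 for k in MIXTURE_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
```

Over many draws, roughly 47.5% of sampled examples come from the translation pool, matching the reported mix.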

3. Core Methodological Innovations

Aya’s successive generations introduce a suite of methodological advances for multilingual state-of-the-art performance:

  • Multilingual Data Arbitrage: Aya Expanse produces synthetic supervision by routing each prompt to the single best teacher LLM from a pool, determined via an internal reward model. Only top-ranked completions are retained per prompt–language pair, maximizing synthetic data quality (Dang et al., 2024).
  • Direct Preference Optimization (DPO): DPO is applied in two phases—offline (using arbitrage-derived preference pairs) and online (iteratively sampling new completions and refining on model-judged pairs). Ablations show a +7.1 percentage point win-rate gain over single-stage preference training (Dang et al., 2024).
  • Model Merging: Weighted linear averaging integrates multiple fine-tuned checkpoints across languages and post-training stages; empirical results indicate simple convex averaging outperforms more complex geometric interpolation methods (SLERP, TIES). At 32B scale, this yields up to 3× larger performance gains than at 8B (Dang et al., 2024).
  • Cross-Modal Model Merging: Aya Vision merges vision-tuned and pure-text weights, preserving both modalities. With optimal weighting (α ≈ 0.4), joint text and vision win-rate losses are minimized (text-only loss ∼5.9% vs. up to 44% for other approaches), while unimodal and multimodal capabilities are retained (Dash et al., 13 May 2025).
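
The convex-averaging scheme that the Expanse report found to outperform SLERP and TIES reduces to a weighted sum of parameter dicts. A minimal sketch, with checkpoints represented as plain name-to-float dicts rather than real tensors:

```python
# Sketch: weighted linear ("convex") model merging across checkpoints.
# Real checkpoints would be name -> tensor state dicts; floats stand in here.

def merge_checkpoints(checkpoints, weights):
    """Average parameter dicts with convex weights (must sum to 1)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must be convex"
    merged = {}
    for name in checkpoints[0]:
        merged[name] = sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
    return merged

# Toy example: two fine-tuned checkpoints merged with equal weight.
ckpt_a = {"w": 1.0, "b": 0.0}
ckpt_b = {"w": 3.0, "b": 2.0}
merged = merge_checkpoints([ckpt_a, ckpt_b], [0.5, 0.5])
# merged == {"w": 2.0, "b": 1.0}
```

The same operation with α ≈ 0.4 on a vision-tuned and a text-only checkpoint is the cross-modal merge described above.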

4. Empirical Evaluation and Quantitative Comparisons

Aya models are systematically benchmarked via discriminative, generative, and preference-based evaluations, emphasizing breadth and rigor.

  • Discriminative (Zero-Shot) Tasks: Aya-101 achieves a mean accuracy of 75.1% across XCOPA/XNLI/XStoryCloze/XWinograd, outperforming mT0 and BLOOMZ, its principal pre-Expanse rivals (Üstün et al., 2024). Aya 23 35B reaches 70.8% (zero-shot), a gain of +14 pp versus Aya-101 (Aryabumi et al., 2024). In Expanse, the 8B and 32B variants attain 70.3% and 72.6% (discriminative), overtaking prior releases (Dang et al., 2024).
  • Multilingual MMLU (5-shot, 23–31 langs): Aya-101 (37.3% accuracy), Aya 23 8B (48.2%), Aya 23 35B (58.2%), Aya Expanse 8B (53.7%), Expanse 32B (66.9%) (Üstün et al., 2024, Aryabumi et al., 2024, Dang et al., 2024).
  • Generation: Translation and Summarization: On FLORES-200 (93 langs), Aya-101 achieves spBLEU 29.1 (X→En); Aya 23 8B achieves 39.5, Aya 23 35B 43.0, and Aya Expanse 32B leads with chrF++ 58.8 and xCOMET 93.5, a margin of +2.9 BLEU over competitors (Üstün et al., 2024, Aryabumi et al., 2024, Dang et al., 2024).
  • Open-Ended Preference (“Win Rate”): Across m-ArenaHard (23 languages, GPT-4o judge), Aya Expanse 8B vs. baselines attains pairwise win rates of up to 70.6% (vs. Llama 3.1 8B), while Expanse 32B achieves 54.0% vs. Llama 3.1 70B, a model twice its size (Dang et al., 2024).
  • Multimodal Benchmarks: Aya Vision models outperform competitors (Pixtral, Qwen-2.5-VL, Llama-3.2-90B-Vision, Molmo-72B) on open-ended and academic multimodal tasks; e.g., Aya Vision-8B achieves 100% win rate vs. Pixtral-12B and 71.7% vs. Pangea-7B in representative pairwise evaluations (Dash et al., 13 May 2025).
  • Few-Shot and Parameter-Efficient Fine-Tuning: On FarExStance (Farsi stance detection/explanation), PEFT-finetuned Aya-23-8B attains 72.9 macro-F1, closely tracking a fully trained XLM-RoBERTa (74.5 macro-F1), and produces best-in-class extractive evidence explanations (highest ROUGE-L for Aya-23-8B) (Zarharan et al., 2024).
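
The pairwise win rates above (e.g., the m-ArenaHard numbers) come down to counting judge verdicts. A minimal sketch; the verdict labels and tie handling are illustrative (some evaluations instead count ties as half a win):

```python
# Sketch: pairwise win rate from judge verdicts, the metric behind the
# m-ArenaHard comparisons above. Labels "A"/"B"/"tie" are illustrative.

def win_rate(verdicts):
    """Fraction of non-tied comparisons won by model A."""
    wins = sum(1 for v in verdicts if v == "A")
    losses = sum(1 for v in verdicts if v == "B")
    decided = wins + losses
    return wins / decided if decided else 0.0

verdicts = ["A", "A", "B", "tie", "A"]
# 3 wins out of 4 decided comparisons -> 0.75
```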

5. Model Behavior, Code-Mixing, and Internal Representations

Analyses of Aya-23-8B’s internal language representations yield insights into multilingual learning dynamics:

  • Polyglot Internal Activations: During translation, Aya-23-8B’s intermediate layers activate typologically related languages, unlike English-centric LLMs (e.g., Llama 3.1) that pivot via a single dominant language. Statistical tests show broader latent activation for non-pivot languages, especially on output language (Trinley et al., 27 Jul 2025).
  • Code-Mixing Robustness: Aya-23-8B maintains graceful BLEU degradation as mixing rate increases and is more resilient to cross-script mixing than monolingual models. Shared-script language pairs (e.g., fr–en, zh–ja) yield higher cross-neuron overlaps (Trinley et al., 27 Jul 2025).
  • Specialization in Final Layers: Language- and code-mix-specific neuron activations are concentrated in the last layers (27–31 of 32), diverging from prior decoder-only findings of more dispersed specialization. This suggests shared linguistic representations in earlier layers, with output-centric specialization, supporting modularity for interpretability and potential transfer (Trinley et al., 27 Jul 2025).
  • Typological and Script Effects: Neuron-overlap patterns indicate substrate influences: e.g., higher overlaps for Romance languages, intermediate overlaps for Japanese–Chinese and Korean–Chinese, reflecting script and historical contact. These findings support the thesis that training on balanced, typology-diverse corpora improves robustness against linguistic drift and real-world heterogeneity (Trinley et al., 27 Jul 2025).
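
The neuron-overlap comparisons above can be sketched as a Jaccard overlap between the sets of most-activated neurons for two languages. This is an illustrative reconstruction of the measurement style, not the paper's exact procedure, and the activation values are toy data:

```python
# Sketch: neuron-overlap measurement in the spirit of the analysis above.
# Toy activation vectors stand in for real per-language neuron statistics.

def top_neurons(activations, k):
    """Indices of the k most strongly activated neurons."""
    ranked = sorted(range(len(activations)), key=lambda i: -activations[i])
    return set(ranked[:k])

def jaccard(a, b):
    """Jaccard similarity of two neuron-index sets."""
    return len(a & b) / len(a | b)

fr_acts = [0.9, 0.1, 0.8, 0.2, 0.7]   # toy French activations
es_acts = [0.8, 0.7, 0.9, 0.1, 0.2]   # toy Spanish activations
overlap = jaccard(top_neurons(fr_acts, 3), top_neurons(es_acts, 3))
# top-3 sets {0, 2, 4} and {0, 1, 2} share 2 of 4 neurons -> 0.5
```

Higher overlaps for typologically or script-related pairs (e.g., Romance languages) are the pattern the analyses report.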

6. Safety, Bias, and Model Accessibility

Aya models are systematically evaluated for safety and bias, and prioritized for open scientific access:

  • Toxicity and Degeneration: Aya and Aya Safe minimize Expected Maximum Toxicity (EMT ≈ 0.22 vs mT0x ≈ 0.30) and reduce toxicity probability on non-toxic prompts and identity group outputs across 7–14 languages (Üstün et al., 2024).
  • Gender, Stereotype, and Fairness: On Wino-MT (gender bias in translation), Aya Safe achieves the lowest gender gap (ΔG) with competitive accuracy; on PALM-style prompts, toxicity is reduced in 6/7 languages (Üstün et al., 2024).
  • Safety Interventions: The addition of machine-translated safety preambles and explicit context distillation enables ∼88% harmful completion rejection (zero-shot), with minimal degradation (0.2–3.2 pp) in overall task accuracy (Üstün et al., 2024).
  • Open-Source Assets: All core Aya models, data, evaluation scripts, and training code (T5x/SeqIO) are released under Apache 2.0 or similar permissive licenses. Model weights are available on Hugging Face; no restrictions prevent downstream use or extension (Üstün et al., 2024, Aryabumi et al., 2024, Dang et al., 2024).
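
The Expected Maximum Toxicity (EMT) metric cited above averages, over prompts, the worst toxicity score among k sampled continuations. A minimal sketch with toy scores:

```python
# Sketch: Expected Maximum Toxicity (EMT). For each prompt, take the max
# toxicity over its sampled continuations, then average across prompts.
# Scores below are toy values, not model outputs.

def expected_max_toxicity(toxicity_per_prompt):
    """toxicity_per_prompt: one list of continuation toxicity scores per prompt."""
    maxima = [max(scores) for scores in toxicity_per_prompt]
    return sum(maxima) / len(maxima)

scores = [
    [0.1, 0.3, 0.2],    # prompt 1: worst continuation 0.3
    [0.05, 0.1, 0.15],  # prompt 2: worst continuation 0.15
]
# EMT = (0.3 + 0.15) / 2 = 0.225
```

In practice the per-continuation scores come from an external classifier (e.g., a Perspective-style toxicity scorer).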

7. Practical Impact and Limitations

Aya models systematically expand multilingual coverage and performance:

  • State-of-the-art open-weight LLMs: Aya Expanse 8B and 32B set new benchmarks for 23 languages in both open-ended and standardized academic evaluation, outperforming comparably sized and significantly larger models, including Llama 3.1 70B (Dang et al., 2024).
  • Compute-Efficient Multimodality: Aya Vision demonstrates that high-quality multimodal instruction tuning and cross-modal merging substantially reduce the compute and retraining required for multilingual VLMs, supporting scalable deployment even in compute-constrained environments (Dash et al., 13 May 2025).
  • Parameter Efficiency: LoRA/QLoRA finetuning for stance and explanation tasks enables high performance using less than 0.1% of model parameters, lowering the barrier to domain adaptation and cross-lingual extension (Zarharan et al., 2024).
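
The sub-0.1% figure above follows from LoRA's parameter arithmetic: adapting a weight matrix of shape (d_out, d_in) with rank r adds only r·(d_in + d_out) trainable parameters. A sketch with illustrative shapes (not Aya-23-8B's actual layer sizes):

```python
# Sketch: trainable-parameter fraction under rank-r LoRA. Shapes are
# illustrative; the <0.1% figure arises because only a subset of an
# 8B-parameter model's matrices is adapted.

def lora_fraction(layer_shapes, rank):
    """Trainable fraction when adapting each listed matrix with rank-r LoRA."""
    full = sum(d_out * d_in for d_out, d_in in layer_shapes)
    lora = sum(rank * (d_out + d_in) for d_out, d_in in layer_shapes)
    return lora / full

shapes = [(4096, 4096)] * 4  # e.g. Q/K/V/O projections of one block
frac = lora_fraction(shapes, rank=8)
# 8 * 8192 / 4096^2 ≈ 0.0039, i.e. ~0.39% of the adapted matrices
```

Relative to the full model (most of whose parameters are never adapted), the trainable fraction drops well below this per-matrix figure.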

Identified limitations include remaining shortfalls in global explanation coherence compared to few-shot GPT-4o, 20% completeness errors in extractive rationalization, and ongoing hardware demands for large-scale inference and finetuning (Zarharan et al., 2024). Some code-mixing analyses highlight the persistent challenge of differentiating between script overlap and true linguistic representation. Extension to lower-resource languages and more nuanced probing of typological generalization remain key research frontiers (Trinley et al., 27 Jul 2025).


Key references:

Aya Expanse (Dang et al., 2024), Aya-101 (Üstün et al., 2024), Aya 23 (Aryabumi et al., 2024), internal representation analysis (Trinley et al., 27 Jul 2025), multimodal Aya Vision (Dash et al., 13 May 2025), parameter-efficient adaptation (Zarharan et al., 2024), PAN detoxification (Rykov et al., 2024).
