Aya 23: Open Weight Releases to Further Multilingual Progress (2405.15032v2)

Published 23 May 2024 in cs.CL

Abstract: This technical report introduces Aya 23, a family of multilingual LLMs. Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya collection (Singh et al., 2024). The result is a powerful multilingual LLM serving 23 languages, expanding state-of-the-art language modeling capabilities to approximately half of the world's population. The Aya model covered 101 languages whereas Aya 23 is an experiment in depth vs breadth, exploring the impact of allocating more capacity to fewer languages that are included during pre-training. Aya 23 outperforms both previous massively multilingual models like Aya 101 for the languages it covers, as well as widely used models like Gemma, Mistral and Mixtral on an extensive range of discriminative and generative tasks. We release the open weights for both the 8B and 35B models as part of our continued commitment to expanding access to multilingual progress.

An Analysis of Aya 23: Multilingual Instruction-Tuned LLMs

The introduction of the Aya 23 family represents a significant advancement in multilingual NLP. Unlike previous models that are predominantly English-centric, Aya 23 spans 23 languages and aims to address performance disparities across languages by leveraging Cohere's Command model architecture. The paper undertakes a comprehensive evaluation of the Aya 23 models' capacity for handling multilingual tasks using a multi-faceted benchmark suite.

The paper identifies two major bottlenecks in the development of capable multilingual LLMs: the lack of strong multilingual pre-trained base models and the scarcity of language-diverse instruction-style training data. The Aya initiative addresses both: the Aya collection supplies language-diverse instruction-style data, Aya 101 demonstrated massively multilingual instruction tuning on top of it, and Aya 23 pairs that data with the more recent, highly performant Command R pre-trained model.

Aya 23 marks a departure from the Aya 101 approach by concentrating model capacity on 23 languages rather than spreading it across 101. This consolidation counteracts the so-called "curse of multilinguality," which posits that increasing language breadth often reduces per-language performance because model capacity is distributed across more languages.

Model Architecture and Training

Aya 23 builds on recent advances in decoder-only transformer architectures, inheriting the design of Cohere's Command R model. Noteworthy architectural features include (a minimal sketch follows the list):

  • Parallel attention and FFN layers for enhanced training efficiency.
  • SwiGLU activation, which has demonstrated superior downstream performance.
  • Rotary positional embeddings (RoPE) for improved long-context understanding and extrapolation.
  • Grouped-query attention (GQA), which reduces the inference-time memory footprint in the 8B model configuration.
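
To make these components concrete, here is a minimal NumPy sketch of one decoder block in a common parallel-block formulation: attention and the feed-forward network both read the same normalized input, and their outputs are summed with the residual stream. All shapes, weight names, and hyperparameters below are assumptions chosen for illustration, not the actual Aya 23 implementation.

```python
# Illustrative sketch only: shapes, names, and hyperparameters are assumptions,
# not the Aya 23 implementation.
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq, heads, head_dim)."""
    seq, heads, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))            # (half,)
    angles = np.arange(seq)[:, None] * freqs[None, :]           # (seq, half)
    cos, sin = np.cos(angles)[:, None, :], np.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: swish(x W_gate) * (x W_up), projected back down."""
    gate = x @ w_gate
    swish = gate * (1.0 / (1.0 + np.exp(-gate)))                # SiLU / swish
    return (swish * (x @ w_up)) @ w_down

def gqa_attention(x, wq, wk, wv, wo, n_q_heads, n_kv_heads):
    """Grouped-query attention: several query heads share each K/V head."""
    seq, d_model = x.shape
    head_dim = d_model // n_q_heads
    q = rope((x @ wq).reshape(seq, n_q_heads, head_dim))
    k = rope((x @ wk).reshape(seq, n_kv_heads, head_dim))
    v = (x @ wv).reshape(seq, n_kv_heads, head_dim)
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)                             # share K/V across query heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim)
    scores = scores + np.triu(np.full((seq, seq), -1e9), k=1)   # causal mask
    scores -= scores.max(axis=-1, keepdims=True)                # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    out = np.einsum("hqk,khd->qhd", probs, v).reshape(seq, d_model)
    return out @ wo

def decoder_block(x, attn_w, ffn_w, n_q_heads=8, n_kv_heads=2):
    """Parallel attention + FFN: both branches read the same normalized input."""
    h = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)  # LayerNorm, no learned params
    return x + gqa_attention(h, *attn_w, n_q_heads, n_kv_heads) + swiglu_ffn(h, *ffn_w)

# Tiny smoke test with random weights (d_model=64, 8 query heads sharing 2 K/V heads).
rng = np.random.default_rng(0)
d, d_ff, seq, hq, hkv = 64, 256, 16, 8, 2
attn_w = (rng.normal(0, 0.02, (d, d)), rng.normal(0, 0.02, (d, d // hq * hkv)),
          rng.normal(0, 0.02, (d, d // hq * hkv)), rng.normal(0, 0.02, (d, d)))
ffn_w = (rng.normal(0, 0.02, (d, d_ff)), rng.normal(0, 0.02, (d, d_ff)),
         rng.normal(0, 0.02, (d_ff, d)))
print(decoder_block(rng.normal(size=(seq, d)), attn_w, ffn_w, hq, hkv).shape)  # (16, 64)
```

With GQA, only `n_kv_heads` key/value projections are stored per token, which is what shrinks the inference-time KV cache relative to full multi-head attention.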

The models are trained on TPU v4 hardware using a distributed JAX-based framework, enabling high-throughput, efficient training at both the 8B and 35B scales.

Instruction Fine-Tuning

The instruction fine-tuning phase draws on a diverse mixture of multilingual data sources: structured templates from datasets such as xP3x, human annotations, translated subsets of English instruction data, and synthetic data generated via machine translation and Cohere's models. This varied mixture exposes the models to a broad range of instruction styles, domains, and languages, as sketched below.
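
As a rough illustration of how such heterogeneous sources can be combined, the sketch below samples a supervised fine-tuning mixture according to per-source weights. The source names, weights, and field names are hypothetical placeholders, not the mixture proportions or schema actually used for Aya 23.

```python
# Hypothetical sketch of assembling a multilingual instruction-tuning mixture.
# Source names and sampling weights are illustrative placeholders, not the
# actual proportions used for Aya 23.
import random

SOURCES = {
    "templated_xp3x":        0.35,  # structured templates (xP3x-style)
    "human_annotations":     0.15,  # human-written prompt/completion pairs
    "translated_subsets":    0.30,  # machine-translated English instruction data
    "synthetic_generations": 0.20,  # model-generated completions
}

def format_example(example):
    """Render one example as a prompt/completion pair for supervised fine-tuning."""
    return {
        "prompt": f"{example['instruction'].strip()}\n",
        "completion": example["response"].strip(),
        "language": example.get("language", "unknown"),
    }

def sample_mixture(datasets, n_examples, seed=0):
    """Sample a fine-tuning mixture according to per-source weights.

    datasets: dict mapping each source name to a list of raw examples.
    """
    rng = random.Random(seed)
    names, weights = zip(*SOURCES.items())
    batch = []
    for _ in range(n_examples):
        source = rng.choices(names, weights=weights, k=1)[0]
        batch.append(format_example(rng.choice(datasets[source])))
    return batch
```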

Evaluation and Results

The paper uses a multi-layered evaluation framework, assessing the models on discriminative tasks, language understanding, mathematical reasoning, and generative tasks. Throughout the results, Aya 23 is compared against both massively multilingual baselines and similarly sized, widely used open-weight models.

  • Discriminative Tasks: Aya-23-35B outperforms all baselines in accuracy, averaging 70.8% across tasks such as XCOPA, XStoryCloze, and XWinograd.
  • Multilingual MMLU: The Aya models exhibit superior performance with Aya-23-35B achieving 58.2% accuracy—outstripping similarly sized models on languages like Arabic, Hindi, and Vietnamese.
  • Mathematical Reasoning: Aya models markedly outperform baselines in solving math problems under native context settings, with Aya-23-35B achieving the highest scores.
  • Generative Tasks: Aya 23 models excel in machine translation and summarization, with Aya-23-35B leading at 43.0 spBLEU on translation tasks.

The models also perform strongly in GPT-4-simulated win-rate evaluations, consistently edging out competing models across a wide range of languages.
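
In outline, such win-rate evaluations show a judge model two anonymized responses to the same prompt and aggregate its preferences per language. The sketch below leaves the judging function as a stub; the exact prompting protocol, tie handling, and judge model are assumptions rather than the paper's precise setup.

```python
# Sketch of aggregating pairwise preference judgments into per-language win rates.
# `judge_prefers_a` is a stub for an LLM judge (e.g. GPT-4); its interface and the
# tie handling below are assumptions, not the paper's exact protocol.
from collections import defaultdict

def judge_prefers_a(prompt, response_a, response_b):
    """Placeholder: return True if the judge prefers response A, False for B,
    or None for a tie. A real implementation would call an LLM judge."""
    raise NotImplementedError

def win_rates(examples):
    """examples: iterable of dicts with keys prompt, language,
    model_response, baseline_response."""
    wins, totals = defaultdict(float), defaultdict(int)
    for ex in examples:
        verdict = judge_prefers_a(ex["prompt"], ex["model_response"], ex["baseline_response"])
        totals[ex["language"]] += 1
        if verdict is True:
            wins[ex["language"]] += 1.0
        elif verdict is None:          # ties counted as half a win for each side
            wins[ex["language"]] += 0.5
    return {lang: wins[lang] / totals[lang] for lang in totals}
```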

Implications and Future Directions

The Aya 23 models underscore the importance of both selective multilingual pre-training and robust instruction fine-tuning in creating high-performance LLMs. The Aya family sets a precedent for future work that aims to balance linguistic breadth with depth, avoiding the capacity dilution that comes with spreading a model too thinly across languages.

The Aya initiative also points to several avenues for future work. One is expanding language coverage to underrepresented groups, particularly languages prevalent in Asia and Africa; addressing this imbalance aligns with broader goals of equitable technological advancement. Improving model safety, reducing biases in generated text, and handling cultural sensitivities across languages are further directions for subsequent research.

Conclusion

Aya 23 exemplifies a significant step towards overcoming historical linguistic biases in NLP systems by delivering high performance across a focused set of 23 languages. By releasing the model weights and a comprehensive evaluation framework, the authors aim to facilitate future research and practical applications, enriching the landscape of multilingual AI and fostering broader linguistic inclusivity.
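
Since the weights are openly released, a typical way to experiment with them is through the Hugging Face transformers library. The snippet below is a minimal usage sketch: the repository identifier CohereForAI/aya-23-8B is assumed here and should be verified against the official release, and the generation settings are illustrative.

```python
# Minimal usage sketch for the released weights via Hugging Face transformers.
# The repository id below is an assumption; check the official release page.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "CohereForAI/aya-23-8B"  # assumed identifier; a 35B variant is released as well

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"  # device_map needs accelerate
)

# A multilingual prompt (Turkish: "What is a language model? Can you explain briefly?")
messages = [{"role": "user", "content": "Dil modeli nedir, kısaca açıklar mısın?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```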

References (63)
  1. Ethnologue. https://www.ethnologue.com/insights/how-many-languages/, 2023. Accessed: 2023-06-17.
  2. Breaking the unwritten language barrier: The BULB project. Procedia Computer Science, 81:8–14, 2016. ISSN 1877-0509. https://doi.org/10.1016/j.procs.2016.04.023. URL https://www.sciencedirect.com/science/article/pii/S1877050916300370. SLTU-2016 5th Workshop on Spoken Language Technologies for Under-resourced Languages, 09-12 May 2016, Yogyakarta, Indonesia.
  3. Do all languages cost the same? Tokenization in the era of commercial language models, 2023.
  4. GQA: Training generalized multi-query transformer models from multi-head checkpoints, 2023.
  5. PaLM 2 technical report. arXiv, abs/2305.10403, 2023.
  6. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019, 2019.
  7. Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
  8. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  9. XNLI: Evaluating cross-lingual sentence representations. pp.  2475–2485, October-November 2018. 10.18653/v1/D18-1269. URL https://aclanthology.org/D18-1269.
  10. Unsupervised cross-lingual representation learning at scale. pp.  8440–8451, July 2020. 10.18653/v1/2020.acl-main.747. URL https://aclanthology.org/2020.acl-main.747.
  11. Free Dolly: Introducing the world’s first truly open instruction-tuned LLM. Databricks, 2023a.
  12. Free Dolly: Introducing the world’s first truly open instruction-tuned LLM, 2023b. URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.
  13. Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv e-prints, pp.  arXiv–2307, 2023.
  14. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474, 2023.
  15. AlpacaFarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023.
  16. Towards measuring the representation of subjective global opinions in language models. arXiv, abs/2306.16388, 2023.
  17. A framework for few-shot language model evaluation. December 2023. 10.5281/zenodo.10256836. URL https://zenodo.org/records/10256836.
  18. Gemini: A family of highly capable multimodal models, 2024.
  19. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
  20. Gemma-Team. Gemma: Open models based on gemini research and technology, 2024.
  21. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. arXiv, abs/2106.03193, 2021.
  22. XL-Sum: Large-scale multilingual abstractive summarization for 44 languages. pp.  4693–4703, August 2021. 10.48550/arXiv.2106.13822. URL https://aclanthology.org/2021.findings-acl.413.
  23. A material lens on coloniality in NLP. arXiv, abs/2311.08391, 2023.
  24. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.
  25. Mistral 7B, 2023.
  26. Mixtral of experts. arXiv, abs/2401.04088, 2024.
  27. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings, 2023.
  28. Casteist but not racist? Quantifying disparities in large language model bias between India and the West. arXiv, abs/2309.08573, 2023. URL https://api.semanticscholar.org/CorpusID:262013517.
  29. GPTAraEval: A comprehensive evaluation of ChatGPT on Arabic NLP. arXiv, abs/2305.14976, 2023.
  30. Prometheus: Inducing fine-grained evaluation capability in language models. arXiv preprint arXiv:2310.08491, 2023.
  31. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  32. Gender bias and stereotypes in large language models. Proceedings of The ACM Collective Intelligence Conference, 2023. URL https://api.semanticscholar.org/CorpusID:261276445.
  33. Bactrian-X: Multilingual replicable instruction-following models with low-rank adaptation. arXiv, abs/2305.15011, 2023a.
  34. Privacy in large language models: Attacks, defenses and future directions. ArXiv, abs/2310.10383, 2023b. URL https://api.semanticscholar.org/CorpusID:264145758.
  35. Few-shot learning with multilingual language models. arXiv, abs/2112.10668, 2021.
  36. The Flan collection: Designing data and methods for effective instruction tuning. arXiv, abs/2301.13688, 2023a.
  37. The data provenance initiative: A large scale audit of dataset licensing & attribution in ai. arXiv preprint arXiv:2310.16787, 2023b.
  38. Analyzing leakage of personally identifiable information in language models. 2023 IEEE Symposium on Security and Privacy (SP), pp.  346–363, 2023. URL https://api.semanticscholar.org/CorpusID:256459554.
  39. Crosslingual generalization through multitask finetuning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  15991–16111, Toronto, Canada, July 2023. Association for Computational Linguistics. 10.18653/v1/2023.acl-long.891. URL https://aclanthology.org/2023.acl-long.891.
  40. Scalable extraction of training data from (production) language models. arXiv, abs/2311.17035, 2023.
  41. Lost in translation: Large language models in non-english content analysis. arXiv, abs/2306.07377, 2023.
  42. No language left behind: Scaling human-centered machine translation. 2022.
  43. How good are large language models on African languages? arXiv, abs/2311.07978, 2023.
  44. Lifting the curse of multilinguality by pre-training modular transformers. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  3479–3495, Seattle, United States, July 2022. Association for Computational Linguistics. 10.18653/v1/2022.naacl-main.255. URL https://aclanthology.org/2022.naacl-main.255.
  45. XCOPA: A multilingual dataset for causal commonsense reasoning. pp.  2362–2376, November 2020. 10.18653/v1/2020.emnlp-main.185. URL https://aclanthology.org/2020.emnlp-main.185.
  46. Train short, test long: Attention with linear biases enables input length extrapolation. CoRR, abs/2108.12409, 2021. URL https://arxiv.org/abs/2108.12409.
  47. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  48. Towards a standard for identifying and managing bias in artificial intelligence. NIST special publication, 1270(10.6028), 2022.
  49. Noam Shazeer. GLU variants improve transformer. CoRR, abs/2002.05202, 2020. URL https://arxiv.org/abs/2002.05202.
  50. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=fR3wGCk-IXp.
  51. Aya dataset: An open-access collection for multilingual instruction tuning. arXiv preprint arXiv:2402.06619, 2024.
  52. RoFormer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864, 2021. URL https://arxiv.org/abs/2104.09864.
  53. Stanford Alpaca: An instruction-following LLaMA model. 2023.
  54. Llama: Open and efficient foundation language models. arXiv, abs/2302.13971, 2023a.
  55. Llama 2: Open foundation and fine-tuned chat models. arXiv, abs/2307.09288, 2023b.
  56. On evaluating and mitigating gender biases in multilingual settings. arXiv, abs/2307.01503, 2023.
  57. mT5: A massively multilingual pre-trained text-to-text transformer. pp.  483–498, June 2021. 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021.naacl-main.41.
  58. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  2369–2380, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. 10.18653/v1/D18-1259. URL https://aclanthology.org/D18-1259.
  59. Low-resource languages jailbreak GPT-4. arXiv, abs/2310.02446, 2023a.
  60. BLOOM+1: Adding language support to BLOOM for zero-shot prompting. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  11682–11703, Toronto, Canada, July 2023b. Association for Computational Linguistics. 10.18653/v1/2023.acl-long.653. URL https://aclanthology.org/2023.acl-long.653.
  61. Scalable training of language models using JAX pjit and TPU v4, 2022.
  62. Llama beyond English: An empirical study on language capability transfer. arXiv, abs/2401.01055, 2024.
  63. Aya model: An instruction finetuned open-access multilingual language model, 2024.
Authors (21)
  1. Viraat Aryabumi (8 papers)
  2. John Dang (8 papers)
  3. Dwarak Talupuru (5 papers)
  4. Saurabh Dash (10 papers)
  5. David Cairuz (5 papers)
  6. Hangyu Lin (11 papers)
  7. Bharat Venkitesh (10 papers)
  8. Madeline Smith (4 papers)
  9. Kelly Marchisio (19 papers)
  10. Sebastian Ruder (93 papers)
  11. Acyr Locatelli (14 papers)
  12. Julia Kreutzer (44 papers)
  13. Nick Frosst (6 papers)
  14. Phil Blunsom (87 papers)
  15. Marzieh Fadaee (40 papers)
  16. Ahmet Üstün (38 papers)
  17. Sara Hooker (71 papers)
  18. Jon Ander Campos (20 papers)
  19. Yi Chern Tan (9 papers)
  20. Max Bartolo (29 papers)