Aya 23: Open Weight Releases to Further Multilingual Progress (2405.15032v2)

Published 23 May 2024 in cs.CL

Abstract: This technical report introduces Aya 23, a family of multilingual LLMs. Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya collection (Singh et al., 2024). The result is a powerful multilingual LLM serving 23 languages, expanding state-of-the-art language modeling capabilities to approximately half of the world's population. The Aya model covered 101 languages whereas Aya 23 is an experiment in depth vs breadth, exploring the impact of allocating more capacity to fewer languages that are included during pre-training. Aya 23 outperforms both previous massively multilingual models like Aya 101 for the languages it covers, as well as widely used models like Gemma, Mistral and Mixtral on an extensive range of discriminative and generative tasks. We release the open weights for both the 8B and 35B models as part of our continued commitment to expanding access to multilingual progress.

An Analysis of Aya 23: Multilingual Instruction-Tuned LLMs

The introduction of the Aya 23 family represents a significant advancement in multilingual NLP. Unlike previous models that are predominantly English-centric, Aya 23 spans 23 languages and aims to address performance disparities across languages by leveraging Cohere's Command model architecture. The paper undertakes a comprehensive evaluation of the Aya 23 models' capacity for handling multilingual tasks using a multi-faceted benchmark suite.

The paper identifies two major bottlenecks in the development of capable multilingual LLMs: the lack of strong multilingual pre-trained base models and the scarcity of language-diverse instruction-style training data. The Aya initiative addresses both: the Aya collection supplies language-diverse instruction-style data, Aya 101 demonstrated massively multilingual instruction tuning on top of it, and Aya 23 pairs that data with the more recent, highly performant Command R pre-trained model.

Aya 23 marks a departure from the Aya 101 approach by concentrating model capacity on 23 languages rather than spreading it across 101. This consolidation counteracts the so-called "curse of multilinguality," which posits that increasing language breadth often reduces per-language performance because model capacity is distributed across more languages.

Model Architecture and Training

Aya 23 builds on recent advances in decoder-only transformer architectures, inheriting the design of Cohere's Command R model. Noteworthy architectural features include (a minimal sketch follows the list):

  • Parallel attention and FFN layers for enhanced training efficiency.
  • SwiGLU activation, which has demonstrated superior downstream performance.
  • Rotary positional embeddings (RoPE) for improved long-context understanding and extrapolation.
  • Grouped-query attention (GQA), which reduces the inference-time memory footprint in the 8B model configuration.
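
To make these components concrete, here is a minimal NumPy sketch of one decoder block in a common parallel-block formulation: attention and the feed-forward network both read the same normalized input, and their outputs are summed with the residual stream. All shapes, weight names, and hyperparameters below are assumptions chosen for illustration, not the actual Aya 23 implementation.

```python
# Illustrative sketch only: shapes, names, and hyperparameters are assumptions,
# not the Aya 23 implementation.
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq, heads, head_dim)."""
    seq, heads, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))            # (half,)
    angles = np.arange(seq)[:, None] * freqs[None, :]           # (seq, half)
    cos, sin = np.cos(angles)[:, None, :], np.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: swish(x W_gate) * (x W_up), projected back down."""
    gate = x @ w_gate
    swish = gate * (1.0 / (1.0 + np.exp(-gate)))                # SiLU / swish
    return (swish * (x @ w_up)) @ w_down

def gqa_attention(x, wq, wk, wv, wo, n_q_heads, n_kv_heads):
    """Grouped-query attention: several query heads share each K/V head."""
    seq, d_model = x.shape
    head_dim = d_model // n_q_heads
    q = rope((x @ wq).reshape(seq, n_q_heads, head_dim))
    k = rope((x @ wk).reshape(seq, n_kv_heads, head_dim))
    v = (x @ wv).reshape(seq, n_kv_heads, head_dim)
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)                             # share K/V across query heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim)
    scores = scores + np.triu(np.full((seq, seq), -1e9), k=1)   # causal mask
    scores -= scores.max(axis=-1, keepdims=True)                # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    out = np.einsum("hqk,khd->qhd", probs, v).reshape(seq, d_model)
    return out @ wo

def decoder_block(x, attn_w, ffn_w, n_q_heads=8, n_kv_heads=2):
    """Parallel attention + FFN: both branches read the same normalized input."""
    h = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)  # LayerNorm, no learned params
    return x + gqa_attention(h, *attn_w, n_q_heads, n_kv_heads) + swiglu_ffn(h, *ffn_w)

# Tiny smoke test with random weights (d_model=64, 8 query heads sharing 2 K/V heads).
rng = np.random.default_rng(0)
d, d_ff, seq, hq, hkv = 64, 256, 16, 8, 2
attn_w = (rng.normal(0, 0.02, (d, d)), rng.normal(0, 0.02, (d, d // hq * hkv)),
          rng.normal(0, 0.02, (d, d // hq * hkv)), rng.normal(0, 0.02, (d, d)))
ffn_w = (rng.normal(0, 0.02, (d, d_ff)), rng.normal(0, 0.02, (d, d_ff)),
         rng.normal(0, 0.02, (d_ff, d)))
print(decoder_block(rng.normal(size=(seq, d)), attn_w, ffn_w, hq, hkv).shape)  # (16, 64)
```

With GQA, only `n_kv_heads` key/value projections are stored per token, which is what shrinks the inference-time KV cache relative to full multi-head attention.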

The models are trained on TPU v4 hardware using a distributed JAX-based framework, enabling high-throughput, efficient training at both the 8B and 35B scales.

Instruction Fine-Tuning

The instruction fine-tuning phase draws on a diverse mixture of multilingual data sources: structured templates from datasets such as xP3x, human annotations, translated subsets of English instruction data, and synthetic data generated via machine translation and Cohere's models. This varied mixture exposes the models to a broad range of instruction styles, domains, and languages, as sketched below.
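
As a rough illustration of how such heterogeneous sources can be combined, the sketch below samples a supervised fine-tuning mixture according to per-source weights. The source names, weights, and field names are hypothetical placeholders, not the mixture proportions or schema actually used for Aya 23.

```python
# Hypothetical sketch of assembling a multilingual instruction-tuning mixture.
# Source names and sampling weights are illustrative placeholders, not the
# actual proportions used for Aya 23.
import random

SOURCES = {
    "templated_xp3x":        0.35,  # structured templates (xP3x-style)
    "human_annotations":     0.15,  # human-written prompt/completion pairs
    "translated_subsets":    0.30,  # machine-translated English instruction data
    "synthetic_generations": 0.20,  # model-generated completions
}

def format_example(example):
    """Render one example as a prompt/completion pair for supervised fine-tuning."""
    return {
        "prompt": f"{example['instruction'].strip()}\n",
        "completion": example["response"].strip(),
        "language": example.get("language", "unknown"),
    }

def sample_mixture(datasets, n_examples, seed=0):
    """Sample a fine-tuning mixture according to per-source weights.

    datasets: dict mapping each source name to a list of raw examples.
    """
    rng = random.Random(seed)
    names, weights = zip(*SOURCES.items())
    batch = []
    for _ in range(n_examples):
        source = rng.choices(names, weights=weights, k=1)[0]
        batch.append(format_example(rng.choice(datasets[source])))
    return batch
```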

Evaluation and Results

The paper uses a multi-layered evaluation framework, assessing the models on discriminative tasks, language understanding, mathematical reasoning, and generative tasks. Throughout the results, Aya 23 is compared against both massively multilingual baselines and similarly sized, widely used open-weight models.

  • Discriminative Tasks: Aya-23-35B outperforms all baselines in accuracy, averaging 70.8% across tasks such as XCOPA, XStoryCloze, and XWinograd.
  • Multilingual MMLU: The Aya models exhibit superior performance with Aya-23-35B achieving 58.2% accuracy—outstripping similarly sized models on languages like Arabic, Hindi, and Vietnamese.
  • Mathematical Reasoning: Aya models markedly outperform baselines in solving math problems under native context settings, with Aya-23-35B achieving the highest scores.
  • Generative Tasks: Aya 23 models excel in machine translation and summarization, with Aya-23-35B leading at 43.0 spBLEU on translation tasks.

The models also perform strongly in GPT-4-simulated win-rate evaluations, consistently edging out competing models across a wide range of languages.
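
In outline, such win-rate evaluations show a judge model two anonymized responses to the same prompt and aggregate its preferences per language. The sketch below leaves the judging function as a stub; the exact prompting protocol, tie handling, and judge model are assumptions rather than the paper's precise setup.

```python
# Sketch of aggregating pairwise preference judgments into per-language win rates.
# `judge_prefers_a` is a stub for an LLM judge (e.g. GPT-4); its interface and the
# tie handling below are assumptions, not the paper's exact protocol.
from collections import defaultdict

def judge_prefers_a(prompt, response_a, response_b):
    """Placeholder: return True if the judge prefers response A, False for B,
    or None for a tie. A real implementation would call an LLM judge."""
    raise NotImplementedError

def win_rates(examples):
    """examples: iterable of dicts with keys prompt, language,
    model_response, baseline_response."""
    wins, totals = defaultdict(float), defaultdict(int)
    for ex in examples:
        verdict = judge_prefers_a(ex["prompt"], ex["model_response"], ex["baseline_response"])
        totals[ex["language"]] += 1
        if verdict is True:
            wins[ex["language"]] += 1.0
        elif verdict is None:          # ties counted as half a win for each side
            wins[ex["language"]] += 0.5
    return {lang: wins[lang] / totals[lang] for lang in totals}
```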

Implications and Future Directions

The Aya 23 models underscore the importance of both selective multilingual pre-training and robust instruction fine-tuning in creating high-performance LLMs. The Aya family sets a precedent for future work that aims to balance linguistic breadth with depth, avoiding the capacity dilution that comes with spreading a model too thinly across languages.

The Aya initiative also points to several avenues for future work. One is expanding language coverage to underrepresented groups, particularly languages prevalent in Asia and Africa; addressing this imbalance aligns with broader goals of equitable technological advancement. Improving model safety, reducing biases in generated text, and handling cultural sensitivities across languages are further directions for subsequent research.

Conclusion

Aya 23 exemplifies a significant step towards overcoming historical linguistic biases in NLP systems by delivering high performance across a focused set of 23 languages. By releasing the model weights and a comprehensive evaluation framework, the authors aim to facilitate future research and practical applications, enriching the landscape of multilingual AI and fostering broader linguistic inclusivity.
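
Since the weights are openly released, a typical way to experiment with them is through the Hugging Face transformers library. The snippet below is a minimal usage sketch: the repository identifier CohereForAI/aya-23-8B is assumed here and should be verified against the official release, and the generation settings are illustrative.

```python
# Minimal usage sketch for the released weights via Hugging Face transformers.
# The repository id below is an assumption; check the official release page.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "CohereForAI/aya-23-8B"  # assumed identifier; a 35B variant is released as well

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"  # device_map needs accelerate
)

# A multilingual prompt (Turkish: "What is a language model? Can you explain briefly?")
messages = [{"role": "user", "content": "Dil modeli nedir, kısaca açıklar mısın?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```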

References (63)
  1. Ethnologue. https://www.ethnologue.com/insights/how-many-languages/, 2023. Accessed: 2023-06-17.
  2. Breaking the unwritten language barrier: The BULB project. Procedia Computer Science, 81:8–14, 2016. ISSN 1877-0509. https://doi.org/10.1016/j.procs.2016.04.023. URL https://www.sciencedirect.com/science/article/pii/S1877050916300370. SLTU-2016 5th Workshop on Spoken Language Technologies for Under-resourced Languages, 09-12 May 2016, Yogyakarta, Indonesia.
  3. Do all languages cost the same? Tokenization in the era of commercial language models, 2023.
  4. GQA: Training generalized multi-query transformer models from multi-head checkpoints, 2023.
  5. PaLM 2 technical report. arXiv, abs/2305.10403, 2023.
  6. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019, 2019.
  7. Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
  8. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  9. XNLI: Evaluating cross-lingual sentence representations. pp.  2475–2485, October-November 2018. 10.18653/v1/D18-1269. URL https://aclanthology.org/D18-1269.
  10. Unsupervised cross-lingual representation learning at scale. pp.  8440–8451, July 2020. 10.18653/v1/2020.acl-main.747. URL https://aclanthology.org/2020.acl-main.747.
  11. Free Dolly: Introducing the world’s first truly open instruction-tuned LLM. Databricks, 2023a.
  12. Free Dolly: Introducing the world’s first truly open instruction-tuned LLM, 2023b. URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.
  13. Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv e-prints, pp.  arXiv–2307, 2023.
  14. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474, 2023.
  15. AlpacaFarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023.
  16. Towards measuring the representation of subjective global opinions in language models. arXiv, abs/2306.16388, 2023.
  17. A framework for few-shot language model evaluation. December 2023. 10.5281/zenodo.10256836. URL https://zenodo.org/records/10256836.
  18. Gemini: A family of highly capable multimodal models, 2024.
  19. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
  20. Gemma-Team. Gemma: Open models based on gemini research and technology, 2024.
  21. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. arXiv, abs/2106.03193, 2021.
  22. XL-Sum: Large-scale multilingual abstractive summarization for 44 languages. pp.  4693–4703, August 2021. 10.48550/arXiv.2106.13822. URL https://aclanthology.org/2021.findings-acl.413.
  23. A material lens on coloniality in NLP. arXiv, abs/2311.08391, 2023.
  24. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.
  25. Mistral 7B, 2023.
  26. Mixtral of experts. arXiv, abs/2401.04088, 2024.
  27. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings, 2023.
  28. Casteist but not racist? Quantifying disparities in large language model bias between India and the West. arXiv, abs/2309.08573, 2023. URL https://api.semanticscholar.org/CorpusID:262013517.
  29. GPTAraEval: A comprehensive evaluation of ChatGPT on Arabic NLP. arXiv, abs/2305.14976, 2023.
  30. Prometheus: Inducing fine-grained evaluation capability in language models. arXiv preprint arXiv:2310.08491, 2023.
  31. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  32. Gender bias and stereotypes in large language models. Proceedings of The ACM Collective Intelligence Conference, 2023. URL https://api.semanticscholar.org/CorpusID:261276445.
  33. Bactrian-X: Multilingual replicable instruction-following models with low-rank adaptation. arXiv, abs/2305.15011, 2023a.
  34. Privacy in large language models: Attacks, defenses and future directions. ArXiv, abs/2310.10383, 2023b. URL https://api.semanticscholar.org/CorpusID:264145758.
  35. Few-shot learning with multilingual language models. arXiv, abs/2112.10668, 2021.
  36. The Flan collection: Designing data and methods for effective instruction tuning. arXiv, abs/2301.13688, 2023a.
  37. The data provenance initiative: A large scale audit of dataset licensing & attribution in ai. arXiv preprint arXiv:2310.16787, 2023b.
  38. Analyzing leakage of personally identifiable information in language models. 2023 IEEE Symposium on Security and Privacy (SP), pp.  346–363, 2023. URL https://api.semanticscholar.org/CorpusID:256459554.
  39. Crosslingual generalization through multitask finetuning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  15991–16111, Toronto, Canada, July 2023. Association for Computational Linguistics. 10.18653/v1/2023.acl-long.891. URL https://aclanthology.org/2023.acl-long.891.
  40. Scalable extraction of training data from (production) language models. arXiv, abs/2311.17035, 2023.
  41. Lost in translation: Large language models in non-english content analysis. arXiv, abs/2306.07377, 2023.
  42. No language left behind: Scaling human-centered machine translation. 2022.
  43. How good are large language models on African languages? arXiv, abs/2311.07978, 2023.
  44. Lifting the curse of multilinguality by pre-training modular transformers. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  3479–3495, Seattle, United States, July 2022. Association for Computational Linguistics. 10.18653/v1/2022.naacl-main.255. URL https://aclanthology.org/2022.naacl-main.255.
  45. XCOPA: A multilingual dataset for causal commonsense reasoning. pp.  2362–2376, November 2020. 10.18653/v1/2020.emnlp-main.185. URL https://aclanthology.org/2020.emnlp-main.185.
  46. Train short, test long: Attention with linear biases enables input length extrapolation. CoRR, abs/2108.12409, 2021. URL https://arxiv.org/abs/2108.12409.
  47. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  48. Towards a standard for identifying and managing bias in artificial intelligence. NIST special publication, 1270(10.6028), 2022.
  49. Noam Shazeer. GLU variants improve transformer. CoRR, abs/2002.05202, 2020. URL https://arxiv.org/abs/2002.05202.
  50. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=fR3wGCk-IXp.
  51. Aya dataset: An open-access collection for multilingual instruction tuning. arXiv preprint arXiv:2402.06619, 2024.
  52. RoFormer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864, 2021. URL https://arxiv.org/abs/2104.09864.
  53. Stanford Alpaca: An instruction-following LLaMA model. 2023.
  54. Llama: Open and efficient foundation language models. arXiv, abs/2302.13971, 2023a.
  55. Llama 2: Open foundation and fine-tuned chat models. arXiv, abs/2307.09288, 2023b.
  56. On evaluating and mitigating gender biases in multilingual settings. arXiv, abs/2307.01503, 2023.
  57. mT5: A massively multilingual pre-trained text-to-text transformer. pp.  483–498, June 2021. 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021.naacl-main.41.
  58. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  2369–2380, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. 10.18653/v1/D18-1259. URL https://aclanthology.org/D18-1259.
  59. Low-resource languages jailbreak GPT-4. arXiv, abs/2310.02446, 2023a.
  60. BLOOM+1: Adding language support to BLOOM for zero-shot prompting. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  11682–11703, Toronto, Canada, July 2023b. Association for Computational Linguistics. 10.18653/v1/2023.acl-long.653. URL https://aclanthology.org/2023.acl-long.653.
  61. Scalable training of language models using JAX pjit and TPU v4, 2022.
  62. Llama beyond English: An empirical study on language capability transfer. arXiv, abs/2401.01055, 2024.
  63. Aya model: An instruction finetuned open-access multilingual language model, 2024.
Authors (21)
  1. Viraat Aryabumi (8 papers)
  2. John Dang (8 papers)
  3. Dwarak Talupuru (5 papers)
  4. Saurabh Dash (10 papers)
  5. David Cairuz (5 papers)
  6. Hangyu Lin (11 papers)
  7. Bharat Venkitesh (10 papers)
  8. Madeline Smith (4 papers)
  9. Kelly Marchisio (19 papers)
  10. Sebastian Ruder (93 papers)
  11. Acyr Locatelli (14 papers)
  12. Julia Kreutzer (44 papers)
  13. Nick Frosst (6 papers)
  14. Phil Blunsom (87 papers)
  15. Marzieh Fadaee (40 papers)
  16. Ahmet Üstün (38 papers)
  17. Sara Hooker (71 papers)
  18. Jon Ander Campos (20 papers)
  19. Yi Chern Tan (9 papers)
  20. Max Bartolo (29 papers)