Emergent Mind

Abstract

Pretrained language models underpin many AI applications, but their high training cost limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, existing open models face several challenges: limited multilingual capability, catastrophic forgetting during continual pretraining, the high computational cost of pretraining from scratch, and compliance with AI safety and development regulations. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435 billion additional tokens, Aurora-M surpasses 2 trillion tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, aligning its development not only with conventional red-teaming considerations but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Aurora-M is rigorously evaluated across various tasks and languages, demonstrating robustness against catastrophic forgetting and outperforming alternatives in multilingual settings, particularly in safety evaluations. To promote responsible open-source LLM development, Aurora-M and its variants are released at https://huggingface.co/collections/aurora-m/aurora-m-models-65fdfdff62471e09812f5407 .
Aurora-M outperforms StarCoderPlus on code and multilingual benchmarks, showing higher Pass@1 and zero-shot accuracy.

Overview

  • Aurora-M is a 15B parameter open-source multilingual Large Language Model (LLM) pre-trained on over 2 trillion tokens, encompassing languages like English, Finnish, Hindi, Japanese, Vietnamese, and code, aligning with the Biden-Harris Executive Order on AI safety.

  • The model underwent a two-stage training curriculum, using diverse datasets and focusing on both general and instruction-tuned datasets to enhance its capabilities and safety alignment.

  • Training techniques included the use of the LUMI supercomputer, mixed precision training, and a carefully optimized learning rate schedule, with environmental considerations such as hydro-powered energy use.

  • Aurora-M demonstrated superior performance in multilingual tasks and coding-related tasks, particularly in safety evaluations, underlining its commitment to producing ethically and legally sound content.

Overview of Aurora-M

The paper introduces Aurora-M, a 15B parameter open-source multilingual Large Language Model (LLM) that has been continually pretrained on a diverse and extensive dataset. Aurora-M stands out not only for its multilingual capabilities, which cover English, Finnish, Hindi, Japanese, Vietnamese, and code, but also for its alignment with stringent AI safety and legal standards, specifically the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. The model was continually pretrained from the StarCoderPlus model on an additional 435 billion tokens, bringing its total training count to over 2 trillion tokens. This comprehensive training enables Aurora-M to demonstrate robustness against catastrophic forgetting and superior performance in multilingual settings, particularly in safety evaluations.

Data Curation and Processing

The dataset preparation for Aurora-M involved a two-stage training curriculum, integrating general text data from diverse sources, covering both natural languages and coding languages, along with instruction-tuning datasets. The Continual Auxiliary Pretraining (CAP) stage utilized general web data and multilingual datasets from sources like RefinedWeb and the Pile, while the Continual Alignment Tuning (CAT) stage focused on further boosting its capabilities in specific areas and aligning with safety objectives. Rigorous data filtering techniques were employed to ensure the high quality and relevance of training data, addressing challenges like toxic content removal and sensitive information anonymization.
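The filtering steps described above (toxic content removal and anonymization of sensitive information) can be sketched as a simple pipeline. The blocklist, regex, and threshold below are illustrative placeholders, not the paper's actual filtering rules, which are considerably more sophisticated:

```python
import re

# Placeholder patterns and lexicon -- the real pipeline uses far richer rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
BLOCKLIST = {"badword1", "badword2"}  # stand-in for a toxic-term lexicon

def anonymize(text: str) -> str:
    """Replace e-mail addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)

def toxicity_score(text: str) -> float:
    """Naive lexicon-based score: fraction of tokens on the blocklist."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in BLOCKLIST for t in tokens) / len(tokens)

def filter_corpus(docs: list[str], max_toxicity: float = 0.01) -> list[str]:
    """Drop documents above the toxicity threshold; anonymize the rest."""
    return [anonymize(d) for d in docs if toxicity_score(d) <= max_toxicity]
```

Real pretraining pipelines typically layer many such heuristics (language ID, deduplication, quality classifiers) on top of this basic keep-or-drop structure.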

Training Methodology

Aurora-M's training exploited advanced techniques, including the use of the LUMI supercomputer, mixed precision training, and a carefully optimized learning rate schedule, culminating in a training period of 48 days. This training was not only highly efficient but also environmentally considerate, using 100% hydro-powered energy and incorporating waste heat recycling.
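The "carefully optimized learning rate schedule" mentioned above typically means linear warmup followed by a smooth decay. A minimal sketch of one common choice, warmup plus cosine decay, is shown below; the step counts and learning rates are invented for illustration and are not the values used to train Aurora-M:

```python
import math

def lr_at_step(step: int,
               max_lr: float = 1e-4,
               min_lr: float = 1e-5,
               warmup_steps: int = 2000,
               total_steps: int = 100_000) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr.
    All hyperparameter values here are illustrative placeholders."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps  # linear ramp from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Warmup avoids destabilizing a continually pretrained model with a large initial learning rate, which is a known pitfall when resuming training from an existing checkpoint.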

Emphasis on Safety and Legal Compliance

A critical aspect of Aurora-M's development was its instruction-tuning on a carefully curated dataset designed to align with the Biden-Harris Executive Order’s focus areas. This safety consideration is crucial for mitigating risks related to AI applications and ensuring the model’s outputs adhere to accepted ethical and legal standards. The construction of this tailored safety dataset underscores a proactive approach to addressing contemporary concerns regarding AI safety and compliance.
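Instruction-tuning on a safety dataset amounts to rendering curated instruction/response pairs into training strings. The template below is a generic illustration of that step; the tags are hypothetical and Aurora-M's actual prompt format may differ:

```python
def format_instruction_example(instruction: str, response: str) -> str:
    """Render one curated instruction/response pair into a single
    training string. The '### Instruction:' / '### Response:' tags
    are a common convention, not necessarily Aurora-M's own format."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"
```

During fine-tuning, the loss is usually computed only on the response tokens, so the model learns to produce the human-reviewed safe answer given the instruction.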

Evaluation and Performance

Aurora-M was subjected to comprehensive evaluations across a range of tasks and languages. Its performance was benchmarked against leading models, showcasing its enhanced capabilities in multilingual language understanding and generation, as well as in coding-related tasks. Notably, Aurora-M demonstrated superior performance in safety evaluations, affirming its commitment to producing legally compliant and ethically sound content.
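The Pass@1 metric cited for the coding benchmarks is usually computed with the unbiased estimator from the Codex evaluation methodology: given n generated samples of which c pass the unit tests, it estimates the probability that at least one of k drawn samples is correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n - c, k) / C(n, k).
    n = total samples generated, c = samples passing the tests,
    k = number of samples hypothetically drawn."""
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k
        # must include at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the fraction of generations that pass, which matches the Pass@1 numbers reported in the benchmark comparisons.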

Contributions and Future Directions

The development of Aurora-M represents a significant step forward in the field of AI research, particularly in fostering open-source LLM development. The model's release is intended to encourage further research and innovation, with its underlying datasets and training methodologies made accessible for community refinement and expansion. Looking ahead, there are plans to explore continual training of Aurora-M on advanced base models and expand its domain-specific expertise, leveraging the insights gained from this project to push the boundaries of AI capabilities while maintaining a steadfast commitment to safety and legal compliance.

In conclusion, Aurora-M combines technical capability, multilingual inclusivity, and a sustained commitment to safety and ethical AI development. Its introduction paves the way for further advancements in LLM research and applications, promising wider accessibility and responsible innovation in the AI domain.


