Pretrained language models underpin many AI applications, but the high computational cost of training them limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, existing models face several challenges: limited multilingual capability, catastrophic forgetting during continual pretraining, the prohibitive computational expense of pretraining from scratch, and compliance with AI safety and development laws. This paper presents Aurora-M, a 15B-parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435 billion additional tokens, Aurora-M surpasses 2 trillion tokens in total training count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, aligning its development not only with conventional red-teaming considerations but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Aurora-M is rigorously evaluated across a variety of tasks and languages, demonstrating robustness against catastrophic forgetting and outperforming alternatives in multilingual settings, particularly in safety evaluations. To promote responsible open-source LLM development, Aurora-M and its variants are released at https://huggingface.co/collections/aurora-m/aurora-m-models-65fdfdff62471e09812f5407.
Aurora-M is a 15B-parameter open-source multilingual Large Language Model (LLM) trained on over 2 trillion tokens in total, covering English, Finnish, Hindi, Japanese, Vietnamese, and code, and aligned with the Biden-Harris Executive Order on AI safety.
The model underwent a two-stage training curriculum, combining diverse general text data with instruction-tuning datasets to enhance its capabilities and safety alignment.
Training was carried out on the LUMI supercomputer using mixed precision and a carefully optimized learning-rate schedule, with environmental considerations such as 100% hydro-powered energy.
Aurora-M demonstrated superior performance in multilingual and coding-related tasks, particularly in safety evaluations, reflecting a commitment to producing ethically and legally sound content.
The paper introduces Aurora-M, a 15B-parameter open-source multilingual Large Language Model (LLM) continually pretrained on a diverse and extensive dataset. Unlike its predecessors, Aurora-M stands out not only for its multilingual capabilities, which cover English, Finnish, Hindi, Japanese, Vietnamese, and code, but also for its alignment with stringent AI safety and legal standards, specifically the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. The model was continually pretrained from the StarCoderPlus model on an additional 435 billion tokens, bringing its total training count to over 2 trillion tokens. This comprehensive training enables Aurora-M to demonstrate robustness against catastrophic forgetting and superior performance in multilingual settings, particularly in safety evaluations.
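Continual pretraining can erode earlier capabilities (catastrophic forgetting); one standard mitigation is to mix a "replay" fraction of the original pretraining distribution into each batch alongside the new multilingual data. Whether this exact mixture matches Aurora-M's recipe is an assumption; the sketch below is purely illustrative.

```python
import random

def make_batch(new_docs, replay_docs, batch_size=8, replay_frac=0.25, rng=None):
    """Assemble a training batch mixing new multilingual/code documents
    with a 'replay' fraction drawn from the original pretraining data,
    a common mitigation for catastrophic forgetting."""
    rng = rng or random.Random(0)
    n_replay = int(batch_size * replay_frac)  # e.g. 2 of 8 docs are replay
    batch = [rng.choice(replay_docs) for _ in range(n_replay)]
    batch += [rng.choice(new_docs) for _ in range(batch_size - n_replay)]
    rng.shuffle(batch)  # interleave so replay docs are not clustered
    return batch
```

Tuning `replay_frac` trades off plasticity (learning the new languages) against stability (retaining the base model's skills).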
The dataset preparation for Aurora-M involved a two-stage training curriculum, integrating general text data from diverse sources, covering both natural languages and coding languages, along with instruction-tuning datasets. The Continual Auxiliary Pretraining (CAP) stage utilized general web data and multilingual datasets from sources like RefinedWeb and the Pile, while the Continual Alignment Tuning (CAT) stage boosted performance in targeted domains and aligned the model with safety objectives. Rigorous data filtering techniques were employed to ensure the high quality and relevance of training data, addressing challenges such as toxic content removal and sensitive information anonymization.
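The filtering described above can be sketched as a simple heuristic pipeline. Real pipelines use trained toxicity classifiers and far richer PII detection; the function names, regex, and blocklist below are illustrative assumptions, not the authors' actual code.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
# Illustrative blocklist; production filters use trained classifiers.
TOXIC_TERMS = {"badword1", "badword2"}

def anonymize(text: str) -> str:
    """Replace e-mail addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)

def is_clean(text: str, min_words: int = 5) -> bool:
    """Heuristic quality gate: minimum length plus a term blocklist."""
    words = text.lower().split()
    if len(words) < min_words:
        return False
    return not any(w in TOXIC_TERMS for w in words)

def filter_corpus(docs):
    """Drop low-quality/toxic documents, anonymize PII in the rest."""
    return [anonymize(d) for d in docs if is_clean(d)]
```

The same gate-then-transform structure generalizes to language identification, deduplication, and perplexity-based quality scoring.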
Aurora-M was trained on the LUMI supercomputer over a period of 48 days, using mixed precision training and a carefully optimized learning rate schedule. The training was not only efficient but also environmentally considerate, running on 100% hydro-powered energy and incorporating waste heat recycling.
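A "carefully optimized learning rate schedule" in this setting typically means linear warmup followed by cosine decay. The hyperparameter values below are placeholders, not Aurora-M's actual configuration; the sketch only shows the shape of such a schedule.

```python
import math

def lr_at_step(step, max_lr=1e-4, min_lr=1e-5, warmup=2000, total=100_000):
    """Linear warmup to max_lr over `warmup` steps, then cosine decay
    to min_lr by `total` steps. All constants are illustrative."""
    if step < warmup:
        return max_lr * (step + 1) / warmup  # linear ramp from ~0
    progress = (step - warmup) / max(1, total - warmup)  # 0 → 1
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In continual pretraining the warmup also serves to re-stabilize optimizer state when resuming from an existing checkpoint.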
A critical aspect of Aurora-M's development was its instruction-tuning on a carefully curated dataset designed to align with the Biden-Harris Executive Order’s focus areas. This safety consideration is crucial for mitigating risks related to AI applications and ensuring the model’s outputs adhere to accepted ethical and legal standards. The construction of this tailored safety dataset underscores a proactive approach to addressing contemporary concerns regarding AI safety and compliance.
Aurora-M was subjected to comprehensive evaluations across a range of tasks and languages. Its performance was benchmarked against leading models, showcasing its enhanced capabilities in multilingual language understanding and generation, as well as in coding-related tasks. Notably, Aurora-M demonstrated superior performance in safety evaluations, affirming its commitment to producing legally compliant and ethically sound content.
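Coding-related evaluations of this kind are commonly scored with the unbiased pass@k estimator of Chen et al. (2021), as implemented in code evaluation harnesses; a minimal version:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions,
    sampled from n generated per problem (c of which pass the unit
    tests), is correct."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a correct completion
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples per problem and 3 passing, pass@1 is 0.3; the estimator is then averaged over all problems in the benchmark.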
The development of Aurora-M represents a significant step forward in the field of AI research, particularly in fostering open-source LLM development. The model's release is intended to encourage further research and innovation, with its underlying datasets and training methodologies made accessible for community refinement and expansion. Looking ahead, there are plans to explore continual training of Aurora-M on advanced base models and expand its domain-specific expertise, leveraging the insights gained from this project to push the boundaries of AI capabilities while maintaining a steadfast commitment to safety and legal compliance.
In conclusion, Aurora-M combines technical capability, multilingual inclusivity, and a firm commitment to safe and ethical AI development. Its introduction paves the way for further advances in LLM research and applications, promising wider accessibility and responsible innovation in the AI domain.