Introducing Aurora-M: A Multilingual Open-Source LLM Compliant with the Biden-Harris Executive Order on AI Safety
Overview of Aurora-M
The paper introduces Aurora-M, a 15B parameter open-source multilingual LLM that has been continually pretrained on a diverse and extensive dataset. Aurora-M stands out not only for its multilingual capabilities, which cover English, Finnish, Hindi, Japanese, Vietnamese, and code, but also for its alignment with stringent AI safety and legal standards, specifically the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. The model was continually pretrained from the StarCoderPlus model on an additional 435 billion tokens, bringing its total training exposure to over 2 trillion tokens. This extensive training gives Aurora-M robustness against catastrophic forgetting and strong performance in multilingual settings, particularly in safety evaluations.
Data Curation and Processing
The dataset preparation for Aurora-M involved a two-stage training curriculum that integrated general text from diverse sources, covering both natural languages and coding languages, along with instruction-tuning datasets. The Continual Auxiliary Pretraining (CAP) stage utilized general web data and multilingual datasets from sources like RefinedWeb and the Pile, while the Continual Alignment Tuning (CAT) stage focused on boosting capabilities in specific areas and aligning the model with safety objectives. Rigorous data filtering was employed to ensure the quality and relevance of the training data, including the removal of toxic content and the anonymization of sensitive personal information.
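The paper does not reproduce its exact filtering pipeline, but the two cleaning steps it names can be sketched as follows. This is a minimal illustration, not Aurora-M's actual code: the blocklist stands in for a trained toxicity classifier, and the regexes cover only emails and IP addresses.

```python
import re

# Hypothetical blocklist standing in for a real toxicity classifier;
# Aurora-M's actual pipeline is not reproduced here.
TOXIC_TERMS = {"badword1", "badword2"}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def anonymize(text: str) -> str:
    """Replace emails and IP addresses with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return IP_RE.sub("<IP>", text)

def is_toxic(text: str) -> bool:
    """Crude keyword check; a real pipeline would use a trained classifier."""
    return bool(set(text.lower().split()) & TOXIC_TERMS)

def filter_corpus(docs):
    """Drop documents flagged as toxic, then anonymize the survivors."""
    return [anonymize(d) for d in docs if not is_toxic(d)]

docs = ["Contact me at jane@example.com", "badword1 filled rant"]
print(filter_corpus(docs))  # ['Contact me at <EMAIL>']
```

A production pipeline would also deduplicate documents and apply language identification before this stage, but the filter-then-anonymize ordering shown here is the essential shape.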
Training Methodology
Aurora-M's training employed advanced techniques, including the use of the LUMI supercomputer, mixed precision training, and a carefully optimized learning rate schedule, over a training period of 48 days. The training was both computationally efficient and environmentally conscious, running on 100% hydro-powered energy and incorporating waste heat recycling.
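The paper describes the learning rate schedule only as "carefully optimized," so the sketch below shows a generic linear-warmup-plus-cosine-decay schedule commonly used in continual pretraining. The constants are illustrative, not Aurora-M's actual hyperparameters.

```python
import math

def lr_schedule(step, max_steps, peak_lr=1e-4, warmup_steps=2000, min_lr=1e-5):
    """Linear warmup followed by cosine decay to a floor.

    A common continual-pretraining schedule; every constant here is an
    assumption, not a value reported for Aurora-M.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear ramp from 0 to peak
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_schedule(1000, 100_000))    # 5e-05, halfway through warmup
print(lr_schedule(100_000, 100_000)) # 1e-05, fully decayed to the floor
```

A nonzero floor (`min_lr`) is typical for continual pretraining, since decaying all the way to zero can stall adaptation to the new data mixture.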
Emphasis on Safety and Legal Compliance
A critical aspect of Aurora-M's development was its instruction-tuning on a carefully curated dataset designed to align with the Biden-Harris Executive Order’s focus areas. This safety consideration is crucial for mitigating risks related to AI applications and ensuring the model’s outputs adhere to accepted ethical and legal standards. The construction of this tailored safety dataset underscores a proactive approach to addressing contemporary concerns regarding AI safety and compliance.
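The paper does not publish the schema of its safety instruction-tuning data, but a dataset of this kind typically pairs a potentially harmful instruction with a vetted refusal. The record layout, category name, and prompt template below are all hypothetical, shown only to make the idea concrete.

```python
# Hypothetical schema for one safety instruction-tuning example; the actual
# format of Aurora-M's Biden-Harris-aligned dataset is not specified here.
safety_example = {
    "category": "cyber_harm",  # stand-in label for an Executive Order focus area
    "instruction": "Explain how to break into a neighbor's wifi network.",
    "response": "I can't help with gaining unauthorized access to networks.",
    "is_refusal": True,
}

def to_chat_format(ex):
    """Render an example as the (prompt, completion) pair used for tuning."""
    return {
        "prompt": f"### Instruction:\n{ex['instruction']}\n### Response:\n",
        "completion": ex["response"],
    }

pair = to_chat_format(safety_example)
print(pair["prompt"].startswith("### Instruction:"))  # True
```

Mixing such refusal examples with ordinary helpful instructions during tuning is what lets the model decline harmful requests without degrading on benign ones.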
Evaluation and Performance
Aurora-M was subjected to comprehensive evaluations across a range of tasks and languages. Its performance was benchmarked against leading models, showcasing its enhanced capabilities in multilingual language understanding and generation, as well as in coding-related tasks. Notably, Aurora-M performed strongly in safety evaluations, underscoring its design goal of producing legally compliant and ethically sound content.
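Multilingual evaluation ultimately reduces to aggregating per-example correctness into a per-language score. The sketch below shows that aggregation step with made-up results; a real evaluation would run an established harness over the actual benchmarks rather than hand-coded tuples.

```python
from collections import defaultdict

# Hypothetical (language, correct?) results -- illustrative data only,
# not scores reported for Aurora-M.
results = [
    ("en", True), ("en", False), ("fi", True),
    ("ja", True), ("ja", True), ("vi", False),
]

def accuracy_by_language(results):
    """Aggregate per-example correctness into per-language accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, ok in results:
        total[lang] += 1
        correct[lang] += ok  # True counts as 1
    return {lang: correct[lang] / total[lang] for lang in total}

print(accuracy_by_language(results))
# {'en': 0.5, 'fi': 1.0, 'ja': 1.0, 'vi': 0.0}
```

Reporting per-language rather than pooled accuracy matters here: a pooled number can hide regressions in low-resource languages behind gains in English.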
Contributions and Future Directions
The development of Aurora-M represents a significant step forward in the field of AI research, particularly in fostering open-source LLM development. The model's release is intended to encourage further research and innovation, with its underlying datasets and training methodologies made accessible for community refinement and expansion. Looking ahead, there are plans to explore continual training of Aurora-M on advanced base models and expand its domain-specific expertise, leveraging the insights gained from this project to push the boundaries of AI capabilities while maintaining a steadfast commitment to safety and legal compliance.
In conclusion, Aurora-M combines technical rigor, multilingual inclusivity, and a firm commitment to safe and ethical AI development. Its introduction paves the way for further advancements in LLM research and applications, promising wider accessibility and responsible innovation in the AI domain.