Introducing Aurora-M: A Multilingual Open-Source LLM Compliant with the Biden-Harris Executive Order on AI Safety
Overview of Aurora-M
The paper introduces Aurora-M, a 15B parameter open-source multilingual LLM that has been continually pretrained on a diverse and extensive dataset. Aurora-M stands out not only for its multilingual capabilities, which cover English, Finnish, Hindi, Japanese, Vietnamese, and code, but also for its alignment with stringent AI safety and legal standards, specifically the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. The model was continually pretrained from the StarCoderPlus model on an additional 435 billion tokens, bringing its total training exposure to over 2 trillion tokens. This extensive training gives Aurora-M robustness against catastrophic forgetting and strong performance in multilingual settings, particularly in safety evaluations.
Data Curation and Processing
The dataset preparation for Aurora-M involved a two-stage training curriculum that integrated general text from diverse sources, covering both natural languages and coding languages, along with instruction-tuning datasets. The Continual Auxiliary Pretraining (CAP) stage utilized general web data and multilingual datasets from sources like RefinedWeb and the Pile, while the Continual Alignment Tuning (CAT) stage focused on boosting capabilities in specific areas and aligning the model with safety objectives. Rigorous data filtering was employed to ensure the quality and relevance of the training data, including the removal of toxic content and the anonymization of sensitive personal information.
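The paper does not reproduce its exact filtering pipeline, but the two cleaning steps it names can be sketched as follows. This is a minimal illustration, not Aurora-M's actual code: the blocklist stands in for a trained toxicity classifier, and the regexes cover only emails and IP addresses.

```python
import re

# Hypothetical blocklist standing in for a real toxicity classifier;
# Aurora-M's actual pipeline is not reproduced here.
TOXIC_TERMS = {"badword1", "badword2"}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def anonymize(text: str) -> str:
    """Replace emails and IP addresses with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return IP_RE.sub("<IP>", text)

def is_toxic(text: str) -> bool:
    """Crude keyword check; a real pipeline would use a trained classifier."""
    return bool(set(text.lower().split()) & TOXIC_TERMS)

def filter_corpus(docs):
    """Drop documents flagged as toxic, then anonymize the survivors."""
    return [anonymize(d) for d in docs if not is_toxic(d)]

docs = ["Contact me at jane@example.com", "badword1 filled rant"]
print(filter_corpus(docs))  # ['Contact me at <EMAIL>']
```

A production pipeline would also deduplicate documents and apply language identification before this stage, but the filter-then-anonymize ordering shown here is the essential shape.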
Training Methodology
Aurora-M's training employed advanced techniques, including the use of the LUMI supercomputer, mixed precision training, and a carefully optimized learning rate schedule, over a training period of 48 days. The training was both computationally efficient and environmentally conscious, running on 100% hydro-powered energy and incorporating waste heat recycling.
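The paper describes the learning rate schedule only as "carefully optimized," so the sketch below shows a generic linear-warmup-plus-cosine-decay schedule commonly used in continual pretraining. The constants are illustrative, not Aurora-M's actual hyperparameters.

```python
import math

def lr_schedule(step, max_steps, peak_lr=1e-4, warmup_steps=2000, min_lr=1e-5):
    """Linear warmup followed by cosine decay to a floor.

    A common continual-pretraining schedule; every constant here is an
    assumption, not a value reported for Aurora-M.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear ramp from 0 to peak
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_schedule(1000, 100_000))    # 5e-05, halfway through warmup
print(lr_schedule(100_000, 100_000)) # 1e-05, fully decayed to the floor
```

A nonzero floor (`min_lr`) is typical for continual pretraining, since decaying all the way to zero can stall adaptation to the new data mixture.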
Emphasis on Safety and Legal Compliance
A critical aspect of Aurora-M's development was its instruction-tuning on a carefully curated dataset designed to align with the Biden-Harris Executive Order’s focus areas. This safety consideration is crucial for mitigating risks related to AI applications and ensuring the model’s outputs adhere to accepted ethical and legal standards. The construction of this tailored safety dataset underscores a proactive approach to addressing contemporary concerns regarding AI safety and compliance.
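The paper does not publish the schema of its safety instruction-tuning data, but a dataset of this kind typically pairs a potentially harmful instruction with a vetted refusal. The record layout, category name, and prompt template below are all hypothetical, shown only to make the idea concrete.

```python
# Hypothetical schema for one safety instruction-tuning example; the actual
# format of Aurora-M's Biden-Harris-aligned dataset is not specified here.
safety_example = {
    "category": "cyber_harm",  # stand-in label for an Executive Order focus area
    "instruction": "Explain how to break into a neighbor's wifi network.",
    "response": "I can't help with gaining unauthorized access to networks.",
    "is_refusal": True,
}

def to_chat_format(ex):
    """Render an example as the (prompt, completion) pair used for tuning."""
    return {
        "prompt": f"### Instruction:\n{ex['instruction']}\n### Response:\n",
        "completion": ex["response"],
    }

pair = to_chat_format(safety_example)
print(pair["prompt"].startswith("### Instruction:"))  # True
```

Mixing such refusal examples with ordinary helpful instructions during tuning is what lets the model decline harmful requests without degrading on benign ones.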
Evaluation and Performance
Aurora-M was subjected to comprehensive evaluations across a range of tasks and languages. Its performance was benchmarked against leading models, showcasing its enhanced capabilities in multilingual language understanding and generation, as well as in coding-related tasks. Notably, Aurora-M performed strongly in safety evaluations, underscoring its design goal of producing legally compliant and ethically sound content.
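Multilingual evaluation ultimately reduces to aggregating per-example correctness into a per-language score. The sketch below shows that aggregation step with made-up results; a real evaluation would run an established harness over the actual benchmarks rather than hand-coded tuples.

```python
from collections import defaultdict

# Hypothetical (language, correct?) results -- illustrative data only,
# not scores reported for Aurora-M.
results = [
    ("en", True), ("en", False), ("fi", True),
    ("ja", True), ("ja", True), ("vi", False),
]

def accuracy_by_language(results):
    """Aggregate per-example correctness into per-language accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, ok in results:
        total[lang] += 1
        correct[lang] += ok  # True counts as 1
    return {lang: correct[lang] / total[lang] for lang in total}

print(accuracy_by_language(results))
# {'en': 0.5, 'fi': 1.0, 'ja': 1.0, 'vi': 0.0}
```

Reporting per-language rather than pooled accuracy matters here: a pooled number can hide regressions in low-resource languages behind gains in English.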
Contributions and Future Directions
The development of Aurora-M represents a significant step forward in the field of AI research, particularly in fostering open-source LLM development. The model's release is intended to encourage further research and innovation, with its underlying datasets and training methodologies made accessible for community refinement and expansion. Looking ahead, there are plans to explore continual training of Aurora-M on advanced base models and expand its domain-specific expertise, leveraging the insights gained from this project to push the boundaries of AI capabilities while maintaining a steadfast commitment to safety and legal compliance.
In conclusion, Aurora-M combines technical rigor, multilingual inclusivity, and a firm commitment to safe and ethical AI development. Its introduction paves the way for further advancements in LLM research and applications, promising wider accessibility and responsible innovation in the AI domain.