ALLaM: Large Language Models for Arabic and English (2407.15390v1)
Abstract: We present ALLaM: Arabic LLM, a series of LLMs to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained with language alignment and knowledge transfer at scale in mind. Our autoregressive, decoder-only models demonstrate how second-language acquisition via vocabulary expansion and pretraining on a mixture of Arabic and English text can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English). Furthermore, we highlight the effectiveness of using parallel/translated data to aid the process of knowledge alignment between languages. Finally, we show that extensive alignment with human preferences can significantly enhance the performance of an LLM compared to larger models with lower-quality alignment. ALLaM achieves state-of-the-art performance on various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve in both Arabic and English over their base models.
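The abstract names vocabulary expansion as the mechanism for second-language acquisition but does not show the mechanics. Below is a minimal sketch of that step, assuming the Hugging Face `transformers` API; the checkpoint name and token list are illustrative placeholders, not the paper's actual configuration.

```python
# Sketch (not from the paper): expand an English-centric tokenizer with
# Arabic subwords before continued pretraining on mixed Arabic/English text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "base-english-llm"  # hypothetical English-only checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Illustrative Arabic subword pieces; in practice these would be mined by
# training a tokenizer on a large Arabic corpus.
arabic_tokens = ["ال", "من", "في"]

# Add the new pieces to the vocabulary and grow the embedding matrix to match.
num_added = tokenizer.add_tokens(arabic_tokens)
model.resize_token_embeddings(len(tokenizer))

# One common heuristic: initialize the new embedding rows to the mean of the
# existing vocabulary so continued pretraining starts from a neutral point.
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    emb[-num_added:] = emb[:-num_added].mean(dim=0, keepdim=True)
```

Mean-initialization of the new rows is only one option; the paper's actual initialization and Arabic/English pretraining mixture are described in its body.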
- M Saiful Bari
- Yazeed Alnumay
- Norah A. Alzahrani
- Nouf M. Alotaibi
- Hisham A. Alyahya
- Sultan Alrashed
- Faisal A. Mirza
- Shaykhah Z. Alsubaie
- Hassan A. Alahmed
- Ghadah Alabduljabbar
- Raghad Alkhathran
- Yousef Almushayqih
- Raneem Alnajim
- Salman Alsubaihi
- Maryam Al Mansour
- Majed Alrubaian
- Ali Alammari
- Zaki Alawami
- Abdulmohsen Al-Thubaity
- Ahmed Abdelali