Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic (2412.04277v1)

Published 5 Dec 2024 in cs.CL

Abstract: LLMs have shown impressive results in multiple domains of NLP but are mainly focused on the English language. Recently, more LLMs have incorporated a larger proportion of multilingual text to represent low-resource languages. In Arabic NLP, several Arabic-centric LLMs have shown remarkable results on multiple benchmarks in the past two years. However, most Arabic LLMs have more than 7 billion parameters, which increases their hardware requirements and inference latency when compared to smaller LLMs. This paper introduces Arabic Stable LM 1.6B in a base and chat version as a small but powerful Arabic-centric LLM. Our Arabic Stable LM 1.6B chat model achieves impressive results on several benchmarks, beating multiple models with up to 8x the parameters. In addition, we show the benefit of mixing in synthetic instruction tuning data by augmenting our fine-tuning data with a large synthetic dialogue dataset.

Authors (11)
  1. Zaid Alyafeai (21 papers)
  2. Michael Pieler (10 papers)
  3. Hannah Teufel (7 papers)
  4. Jonathan Tow (7 papers)
  5. Marco Bellagente (13 papers)
  6. Duy Phung (9 papers)
  7. Nikhil Pinnaparaju (5 papers)
  8. Reshinth Adithyan (4 papers)
  9. Paulo Rocha (8 papers)
  10. Maksym Zhuravinskyi (6 papers)
  11. Carlos Riquelme (26 papers)

Summary

Overview of "Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic"

Abstract

The researchers introduce Arabic Stable LM 1.6B, an adaptation of the Stable LM 2 1.6B model to Arabic, released in base and chat versions. In contrast to existing Arabic-centric LLMs, which usually exceed 7 billion parameters, the proposed model is significantly smaller while remaining competitive with considerably larger models. The work also demonstrates the benefit of mixing synthetic instruction-tuning data into the fine-tuning set.

Introduction

The paper notes that LLM development has focused predominantly on English, with recent efforts adding multilingual coverage, particularly for lower-resource languages such as Arabic. Despite recent advances in Arabic-centric LLMs, the authors identify a gap: smaller, more efficient Arabic models remain largely unexplored. Arabic Stable LM 1.6B is designed to deliver competitive performance at this compact scale, lowering the hardware and compute requirements for deployment.

Methodology

Arabic Stable LM 1.6B extends the Stable LM 2 1.6B model through continued training on more than 100 billion Arabic text tokens. The training data mixed multilingual and Arabic-specific sources, including CulturaX, SANAD, and an Arabic e-book corpus, all of which were filtered and cleaned to ensure quality. A key contribution is a synthetic instruction-tuning dataset generated through LLM-based text rephrasing, which enriches the fine-tuning data with productivity-focused dialogues.
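
The summary does not spell out the cleaning rules, so the following is only a minimal sketch of the kind of script-ratio and length filter commonly applied to Arabic web corpora; the thresholds and helper names are assumptions, not the authors' pipeline.

```python
import re

# Illustrative heuristics; the paper's actual filtering criteria are not
# given in this summary, so the thresholds below are assumptions.
ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")
MIN_ARABIC_RATIO = 0.6  # assumed: most characters should be Arabic script
MIN_DOC_CHARS = 200     # assumed: drop very short fragments

def keep_document(text: str) -> bool:
    """Return True if a raw document passes the illustrative quality filters."""
    stripped = text.strip()
    if len(stripped) < MIN_DOC_CHARS:
        return False
    arabic_ratio = len(ARABIC_CHARS.findall(stripped)) / len(stripped)
    return arabic_ratio >= MIN_ARABIC_RATIO

# Example: filter a batch of raw documents (e.g. CulturaX or SANAD dumps).
corpus = ["raw document text ..."]
cleaned = [doc for doc in corpus if keep_document(doc)]
```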

Evaluation

The model was evaluated on several Arabic benchmarks covering cultural alignment and natural language understanding, including ArabicMMLU, CIDAR-MCQ-100, ACVA, and AlGhafa, with comparisons against models of varying sizes. Notably, Arabic Stable LM 1.6B achieved results on par with or superior to models with up to eight times as many parameters, highlighting its efficiency.
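
Benchmarks such as ArabicMMLU and AlGhafa are multiple-choice tasks; a common way to score them with a causal LM is to pick the candidate answer with the highest log-likelihood. The sketch below assumes the Hugging Face transformers API and uses the public Stable LM 2 1.6B checkpoint name as a placeholder; it is not the authors' exact evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "stabilityai/stablelm-2-1_6b"  # placeholder; swap in the Arabic checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

@torch.no_grad()
def choice_logprob(question: str, choice: str) -> float:
    """Log-likelihood the model assigns to `choice` as a continuation of `question`.

    Assumes the prompt tokenization is a prefix of the full tokenization,
    which holds for most inputs but can break at word boundaries.
    """
    prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids
    logits = model(full_ids).logits[0, :-1]        # position t predicts token t+1
    targets = full_ids[0, 1:]
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = logprobs[torch.arange(targets.shape[0]), targets]
    return token_lp[prompt_len - 1:].sum().item()  # score only the choice tokens

def pick_answer(question: str, choices: list[str]) -> int:
    """Return the index of the highest-scoring choice (cloze-style scoring)."""
    return max(range(len(choices)), key=lambda i: choice_logprob(question, choices[i]))
```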

Results and Discussion

Arabic Stable LM 1.6B outperformed many existing models, particularly on nuanced tasks involving Arabic cultural alignment and language understanding. An analysis of learning rate schedules during continued pre-training identified the early cool down schedule as the more effective choice. The authors also found that evaluation in cloze format (CF) yielded more reliable results than the multiple-choice format (MCF) at this model scale.
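
To make the schedule comparison concrete, an "early cool down" schedule of the general shape described can be written as follows; the warmup length and cool down start point are illustrative assumptions, since the summary does not report the paper's exact hyperparameters.

```python
def lr_early_cooldown(step: int, total_steps: int, peak_lr: float,
                      warmup_steps: int = 1000,
                      cooldown_start_frac: float = 0.6) -> float:
    """Linear warmup, constant plateau, then a linear cool down that begins
    well before the end of training (the 'early' variant). All parameter
    values here are illustrative assumptions."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    cooldown_start = int(total_steps * cooldown_start_frac)
    if step < cooldown_start:
        return peak_lr
    remaining = max(total_steps - cooldown_start, 1)
    return peak_lr * max(0.0, (total_steps - step) / remaining)
```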

Limitations and Future Work

The paper acknowledges several limitations, including the high token fertility (average number of subword tokens per word) of the inherited tokenizer on Arabic text, which can reduce inference throughput. The scarcity of Arabic evaluation benchmarks, especially for more complex setups, also limits comprehensive evaluation. The authors suggest exploring better quality filtering for synthetic data and further research on efficient tokenizer transfer methods.
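
Since fertility is simply tokens produced per word, it can be measured directly with the inherited tokenizer; the sketch below uses the public Stable LM 2 1.6B identifier and a placeholder sample sentence, not the authors' measurement script.

```python
from transformers import AutoTokenizer

# Fertility = average number of subword tokens per whitespace-delimited word.
tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-2-1_6b")

def fertility(texts: list[str]) -> float:
    total_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / max(total_words, 1)

# Higher fertility on Arabic text means more tokens per word, which translates
# directly into lower throughput (words generated per second) at inference time.
arabic_sample = ["ضع هنا نصاً عربياً للقياس"]  # placeholder Arabic sentence
print(f"fertility: {fertility(arabic_sample):.2f}")
```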

Conclusion

This research marks a substantial advance in efficient Arabic NLP models, showing that a smaller model can match much larger architectures when paired with careful data processing and fine-tuning. The findings have clear implications for deployment in resource-constrained environments and lay the groundwork for future work on multilingual language modeling, particularly for low-resource languages.
