GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning (2407.02147v2)
Abstract: LLMs have greatly impacted the NLP field, particularly for the English language. These models have demonstrated capabilities in understanding and generating human-like text. The success of LLMs largely depends on the availability of high-quality instruction datasets, which consist of detailed task descriptions and corresponding responses that are essential for training the models to address a variety of prompts accurately. However, the availability and quality of these resources vary by language. While models perform well in English, they often struggle with languages like Arabic, due to the lack of datasets for fine-tuning on Arabic-specific tasks. To address this issue, we introduce InstAr-500k, a new Arabic instruction dataset created by generating and collecting content that covers several domains and instruction types. We assess this dataset by fine-tuning an open-source Gemma-7B model on several downstream tasks to improve its functionality. Based on multiple evaluations, our fine-tuned model achieves excellent performance on several Arabic NLP benchmarks. These outcomes emphasize the effectiveness of our dataset in elevating the capabilities of LLMs for Arabic. Our instruction dataset bridges the performance gap between English and Arabic LLMs by providing resources that amplify Arabic NLP development. Building on this foundation, we developed a model, GemmAr-7B-V1, specifically tuned to excel at a wide range of Arabic NLP tasks.
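The abstract describes instruction datasets as pairs of task descriptions and responses used for fine-tuning. As a minimal sketch of what that looks like in practice, the snippet below serializes one (instruction, response) pair into a single training string. The template, field names, and helper function are illustrative assumptions, not the paper's actual InstAr-500k format.

```python
# Hypothetical sketch: render one instruction-tuning example as a
# prompt/completion training string. The "### Instruction/Input/Response"
# template is an assumption for illustration only.

def format_instruction_example(instruction: str, response: str, input_text: str = "") -> str:
    """Render one (instruction, response) pair as a single training string."""
    prompt = f"### Instruction:\n{instruction}\n"
    if input_text:
        # Optional extra context for the task, when present in the example.
        prompt += f"### Input:\n{input_text}\n"
    prompt += "### Response:\n"
    return prompt + response

# Toy Arabic example in the spirit of the dataset described above.
example = {
    "instruction": "Translate the following sentence into Arabic.",
    "input": "Good morning",
    "response": "صباح الخير",
}

text = format_instruction_example(
    example["instruction"], example["response"], example["input"]
)
print(text)
```

Strings of this shape would then be tokenized and fed to the base model during supervised fine-tuning, with the loss typically computed only on the response tokens.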
Authors: Hasna Chouikhi, Manel Aloui, Cyrine Ben Hammou, Ghaith Chaabane, Haithem Kchaou, Chehir Dhaouadi