ArabianGPT: Native Arabic GPT-based Large Language Model (2402.15313v2)

Published 23 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: The predominance of English and Latin-based LLMs has led to a notable deficit in native Arabic LLMs. This discrepancy is accentuated by the prevalent inclusion of English tokens in existing Arabic models, detracting from their efficacy in processing native Arabic's intricate morphology and syntax. Consequently, there is a theoretical and practical imperative for developing LLMs predominantly focused on Arabic linguistic elements. To address this gap, this paper proposes ArabianGPT, a series of transformer-based models within the ArabianLLM suite designed explicitly for Arabic. These models, including ArabianGPT-0.1B and ArabianGPT-0.3B, vary in size and complexity, aligning with the nuanced linguistic characteristics of Arabic. The AraNizer tokenizer, integral to these models, addresses the unique morphological aspects of Arabic script, ensuring more accurate text processing. Empirical results from fine-tuning the models on tasks like sentiment analysis and summarization demonstrate significant improvements. For sentiment analysis, the fine-tuned ArabianGPT-0.1B model achieved a remarkable accuracy of 95%, a substantial increase from the base model's 56%. Similarly, in summarization tasks, fine-tuned models showed enhanced F1 scores, indicating improved precision and recall in generating concise summaries. Comparative analysis of fine-tuned ArabianGPT models against their base versions across various benchmarks reveals nuanced differences in performance, with fine-tuning positively impacting specific tasks like question answering and summarization. These findings underscore the efficacy of fine-tuning in aligning ArabianGPT models more closely with specific NLP tasks, highlighting the potential of tailored transformer architectures in advancing Arabic NLP.
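The abstract reports that fine-tuning ArabianGPT-0.1B lifts sentiment-analysis accuracy from 56% to 95%. As a rough illustration of that kind of pipeline (not the authors' code), the sketch below fine-tunes a GPT-2-style Arabic checkpoint for binary sentiment classification with the Hugging Face transformers Trainer. The hub identifier and the dataset name are assumptions made for the example; in practice you would substitute the actual released ArabianGPT weights (which ship with the AraNizer tokenizer) and a labelled Arabic sentiment corpus.

```python
# Minimal sketch, assuming a GPT-2-style ArabianGPT checkpoint is available on the
# Hugging Face hub. The model ID and dataset name below are placeholders, not
# identifiers confirmed by the paper.
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

model_id = "riotu-lab/ArabianGPT-0.1B"  # assumed identifier; replace with the real one

tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers have no pad token

model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2  # binary sentiment: positive / negative
)
model.config.pad_token_id = tokenizer.pad_token_id

# Placeholder dataset name; any corpus with "text" and "label" columns works.
dataset = load_dataset("arabic_sentiment_corpus")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="arabiangpt-sentiment",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```

To reproduce an accuracy figure comparable to the one quoted in the abstract, a `compute_metrics` function returning accuracy would be passed to `Trainer` and `trainer.evaluate()` run on the held-out split.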

Authors (5)
  1. Anis Koubaa (49 papers)
  2. Adel Ammar (24 papers)
  3. Lahouari Ghouti (3 papers)
  4. Omar Najar (3 papers)
  5. Serry Sibaee (9 papers)