AlcLaM: Arabic Dialectal Language Model (2407.13097v1)

Published 18 Jul 2024 in cs.CL

Abstract: Pre-trained language models (PLMs) are integral to many modern NLP systems. Although multilingual models cover a wide range of languages, they often grapple with challenges like high inference costs and a lack of diverse non-English training data. Arabic-specific PLMs are trained predominantly on Modern Standard Arabic, which compromises their performance on regional dialects. To tackle this, we construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms. We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch. Named AlcLaM, our model was trained on only 13 GB of text, a fraction of the data used by existing models: 7.8% of CAMeL's, 10.2% of MARBERT's, and 21.3% of ArBERT's. Remarkably, AlcLaM demonstrates superior performance on a variety of Arabic NLP tasks despite the limited training data. AlcLaM is available on GitHub (https://github.com/amurtadha/Alclam) and HuggingFace (https://huggingface.co/rahbi).
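
Since AlcLaM is a BERT-based masked language model distributed through HuggingFace, it can presumably be queried with the standard transformers fill-mask pipeline. The sketch below is a minimal illustration, not the authors' documented usage: the model identifier rahbi/alclam-base-v1 is an assumption (the abstract links only to the maintainer's HuggingFace user page), and the dialectal prompt is an invented example.

```python
# Minimal sketch: querying AlcLaM as a masked language model via the
# Hugging Face transformers fill-mask pipeline.
from transformers import pipeline

# NOTE: "rahbi/alclam-base-v1" is a hypothetical model id -- the paper links
# only to https://huggingface.co/rahbi; check that page for the actual name.
fill_mask = pipeline("fill-mask", model="rahbi/alclam-base-v1")

# BERT-style models use the [MASK] token; the pipeline returns the top
# candidate fillers with their scores.
for pred in fill_mask("انا رايح [MASK]"):  # invented dialectal prompt: "I am going [MASK]"
    print(pred["token_str"], round(pred["score"], 3))
```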

Authors (6)
  1. Murtadha Ahmed (8 papers)
  2. Saghir Alfasly (12 papers)
  3. Bo Wen (40 papers)
  4. Jamaal Qasem (1 paper)
  5. Mohammed Ahmed (5 papers)
  6. Yunfeng Liu (19 papers)