
Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios? (2110.13658v1)

Published 26 Oct 2021 in cs.CL and cs.LG

Abstract: Recent impressive improvements in NLP, largely based on the success of contextual neural LLMs, have mostly been demonstrated on at most a couple dozen high-resource languages. Building LLMs and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and in messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based LLM on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual corpora. Confirming these results on a much larger dataset of noisy French user-generated content, we argue that such character-based LLMs can be an asset for NLP in low-resource and high language variability settings.
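
The abstract describes the downstream setup (a character-level encoder fine-tuned for POS tagging) but not its mechanics. The sketch below illustrates such a fine-tuning step under stated assumptions: it uses HuggingFace Transformers and the public character-level CANINE checkpoint `google/canine-s` as a stand-in for the paper's NArabizi-trained character-based model; the example sentence and label handling are illustrative only, not the paper's actual pipeline.

```python
# Minimal sketch: character-level encoder + token-classification head
# for POS tagging. "google/canine-s" is a stand-in checkpoint; the paper
# trains its own character-based LM on 99k NArabizi sentences.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL = "google/canine-s"   # assumed stand-in character-level model
NUM_UPOS_TAGS = 17          # size of the Universal Dependencies UPOS tag set

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL, num_labels=NUM_UPOS_TAGS
)

sentence = "3lach rak zine"  # illustrative NArabizi-style input
enc = tokenizer(sentence, return_tensors="pt")

# Dummy per-character labels, only to show the shape of the training signal;
# a real run would project gold word-level UPOS tags onto character positions.
labels = torch.zeros_like(enc["input_ids"])

out = model(**enc, labels=labels)
out.loss.backward()  # one fine-tuning step (optimizer and data loop omitted)
```

Because the encoder operates on raw characters rather than subword units, spelling variation in user-generated text degrades the input representation less sharply, which is the property the paper exploits in the NArabizi and noisy-French settings.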

Authors (3)
  1. Arij Riabi (9 papers)
  2. Benoît Sagot (60 papers)
  3. Djamé Seddah (28 papers)
Citations (15)