Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
97 tokens/sec
GPT-4o
53 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

AfroLID: A Neural Language Identification Tool for African Languages (2210.11744v3)

Published 21 Oct 2022 in cs.CL and cs.LG

Abstract: Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for $517$ African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves 95.89 F_1-score. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically-motivated error analysis that allow us to both showcase AfroLID's powerful capabilities and limitations.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (4)
  1. Ife Adebara (12 papers)
  2. AbdelRahim Elmadany (33 papers)
  3. Muhammad Abdul-Mageed (102 papers)
  4. Alcides Alcoba Inciarte (7 papers)
Citations (23)

Summary

We haven't generated a summary for this paper yet.