
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus (2010.14571v2)

Published 27 Oct 2020 in cs.CL and cs.LG

Abstract: Large text corpora are increasingly important for a wide variety of NLP tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2%. These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.
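The "wordlist-based tunable-precision filters" mentioned in the abstract can be pictured as a simple threshold on in-vocabulary token rate: keep a sentence only if enough of its tokens appear in a curated wordlist for the target language, with the threshold acting as the precision knob. The function below is a minimal hypothetical sketch of that idea (the names `wordlist_filter` and `min_in_vocab` are illustrative, not the authors' implementation):

```python
def wordlist_filter(sentences, wordlist, min_in_vocab=0.8):
    """Keep sentences whose fraction of tokens found in the curated
    wordlist meets the threshold. Raising min_in_vocab trades recall
    for precision, which is the 'tunable-precision' aspect."""
    vocab = {w.lower() for w in wordlist}
    kept = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        if not tokens:
            continue
        in_vocab = sum(t in vocab for t in tokens) / len(tokens)
        if in_vocab >= min_in_vocab:
            kept.append(sentence)
    return kept

# Toy example: the second sentence is mostly out-of-wordlist noise.
wordlist = ["the", "cat", "sat", "on", "mat"]
sentences = ["the cat sat on the mat", "zzq qqz different script text"]
print(wordlist_filter(sentences, wordlist, min_in_vocab=0.8))
# → ['the cat sat on the mat']
```

In practice such a filter would run on top of a LangID model's output, discarding crawled sentences the classifier labeled with a language but which share little vocabulary with trusted text in that language.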

Authors (4)
  1. Isaac Caswell (19 papers)
  2. Theresa Breiner (7 papers)
  3. Daan van Esch (11 papers)
  4. Ankur Bapna (53 papers)
Citations (81)
