An Open Dataset and Model for Language Identification (2305.13820v1)

Published 23 May 2023 in cs.CL

Abstract: Language identification (LID) is a fundamental step in many natural language processing pipelines. However, current LID systems are far from perfect, particularly on lower-resource languages. We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages, outperforming previous work. We achieve this by training on a curated dataset of monolingual data, the reliability of which we ensure by manually auditing a sample from each source and each language. We make both the model and the dataset available to the research community. Finally, we carry out a detailed analysis of our model's performance, both in comparison to existing open models and by language class.
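
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how a macro-average F1 score and a per-language false positive rate could be computed from gold and predicted language labels. It is not the authors' evaluation code: the function name `macro_f1_and_fpr`, the toy labels, and the choice to average the false positive rate uniformly over languages are all assumptions made for illustration.

```python
from collections import Counter


def macro_f1_and_fpr(gold, pred):
    """Return (macro-average F1, macro-average false positive rate).

    Hypothetical helper: averages both metrics uniformly over the
    labels that occur in `gold` (an assumption, not the paper's code).
    """
    labels = sorted(set(gold))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the true label but was missed

    n = len(gold)
    f1_scores, fp_rates = [], []
    for lab in labels:
        # F1 = 2TP / (2TP + FP + FN), defined as 0 when the denominator is 0.
        denom = 2 * tp[lab] + fp[lab] + fn[lab]
        f1_scores.append(2 * tp[lab] / denom if denom else 0.0)
        # FPR = FP / (number of examples whose gold label is not `lab`).
        negatives = n - (tp[lab] + fn[lab])
        fp_rates.append(fp[lab] / negatives if negatives else 0.0)

    return sum(f1_scores) / len(labels), sum(fp_rates) / len(labels)


# Toy example with three languages (illustrative labels only).
gold = ["eng", "eng", "fra", "fra", "deu", "deu"]
pred = ["eng", "fra", "fra", "fra", "deu", "eng"]
macro_f1, macro_fpr = macro_f1_and_fpr(gold, pred)
print(round(macro_f1, 3), round(macro_fpr, 3))
```

Macro-averaging weights every language equally regardless of how many test examples it has, which is why it is a natural choice when lower-resource languages are a focus.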

Authors (4)
  1. Laurie Burchell (6 papers)
  2. Alexandra Birch (67 papers)
  3. Nikolay Bogoychev (17 papers)
  4. Kenneth Heafield (24 papers)
Citations (23)