
GlotLID: Language Identification for Low-Resource Languages (2310.16248v3)

Published 24 Oct 2023 in cs.CL

Abstract: Several papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable and (iii) efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguage vs varieties and in general noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. GlotLID-M model (including future versions), code, and list of data sources are available: https://github.com/cisnlp/GlotLID.

Authors (4)
  1. Amir Hossein Kargaran (16 papers)
  2. Ayyoob Imani (16 papers)
  3. François Yvon (49 papers)
  4. Hinrich Schütze (250 papers)
Citations (6)

Summary

An Overview of GlotLID: Language Identification for Low-Resource Languages

The paper introduces GlotLID-M, a highly comprehensive language identification (LID) model specifically designed to address the challenges and gaps in existing LID systems, particularly for low-resource languages. This research stands out by significantly expanding the coverage to 1,665 languages, surpassing many existing models in both scope and efficiency.

Model Design and Evaluation

GlotLID-M is built on the FastText architecture, chosen for its balance of accuracy, ease of use, and ability to scale to a very large label set. FastText's multinomial logistic classifier over character n-gram features is efficient at inference time and provides well-calibrated confidence scores, which are critical for filtering noisy data in real-world applications.
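As a toy illustration of this modeling approach, the sketch below trains a multinomial logistic (softmax) classifier over character trigram counts, the same model family FastText implements. The training data, labels, and hyperparameters are invented for the example and are not from the paper; this is not FastText itself, just a minimal reimplementation of the idea.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Extract character n-grams (with boundary padding), the core LID feature."""
    padded = f"<{text}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class ToySoftmaxLID:
    """Multinomial logistic regression over n-gram counts (toy sketch, not FastText)."""

    def __init__(self, labels):
        self.labels = labels
        self.weights = {lab: Counter() for lab in labels}  # sparse weight vectors

    def scores(self, text):
        """Return a calibrated-looking probability distribution over languages."""
        feats = Counter(char_ngrams(text))
        logits = {lab: sum(self.weights[lab][g] * c for g, c in feats.items())
                  for lab in self.labels}
        m = max(logits.values())  # subtract max for numerical stability
        exps = {lab: math.exp(z - m) for lab, z in logits.items()}
        total = sum(exps.values())
        return {lab: e / total for lab, e in exps.items()}

    def train(self, data, epochs=20, lr=0.5):
        """Plain SGD on the softmax cross-entropy objective."""
        for _ in range(epochs):
            for text, gold in data:
                probs = self.scores(text)
                feats = Counter(char_ngrams(text))
                for lab in self.labels:
                    grad = (1.0 if lab == gold else 0.0) - probs[lab]
                    for g, c in feats.items():
                        self.weights[lab][g] += lr * grad * c


# Invented two-language toy corpus.
data = [("the quick brown fox", "eng"), ("hello world again", "eng"),
        ("der schnelle braune fuchs", "deu"), ("hallo welt schon wieder", "deu")]
model = ToySoftmaxLID(["eng", "deu"])
model.train(data)
probs = model.scores("hello there world")
pred = max(probs, key=probs.get)  # most probable language label
```

The per-label probabilities returned by `scores` are what a confidence threshold would be applied to when filtering noisy web data.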

Key to the model's robustness is the GlotLID-C dataset, which draws on a wide array of sources to increase reliability and reduce domain bias. The corpus comprises 289 million sentences across 1,832 languages; a subset of it is used for training, yielding higher-quality, less contaminated training data for low-resource languages.

GlotLID-M was evaluated against notable baselines including CLD3, FT176, OpenLID, and NLLB across several datasets like FLORES-200 and UDHR. The results demonstrate that GlotLID-M consistently outperforms these baselines, especially in realistic settings where the set of languages is unknown, a scenario referred to as the SET? evaluation.
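The two quantities being balanced here, per-language F1 and false positive rate, can be computed from parallel lists of gold and predicted labels as in the sketch below; the helper name and toy labels are illustrative, not from the paper's evaluation code.

```python
def f1_and_fpr(y_true, y_pred, target):
    """Per-language F1 and false positive rate, treating `target` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    tn = sum(1 for t, p in pairs if t != target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return f1, fpr


# Toy example: one English sentence misclassified as German.
gold = ["eng", "eng", "deu", "deu"]
pred = ["eng", "deu", "deu", "deu"]
f1, fpr = f1_and_fpr(gold, pred, "eng")
```

A low FPR matters especially for low-resource LID: when a high-resource language leaks into a low-resource corpus, even a small false positive rate can dominate the smaller language's data.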

Addressing Challenges in LID

The paper identifies several unique challenges associated with LID for low-resource languages such as incorrect metadata, influence from high-resource languages, and difficulties in distinguishing closely related languages. The authors have meticulously curated GlotLID-C to mitigate these issues through stringent source selection and preprocessing.

Moreover, the paper highlights the complexity of supporting both macrolanguages and their varieties within the LID framework, emphasizing the need for flexible granularity in language labeling.

Implications and Future Prospects

The development of GlotLID-M has substantial practical implications for improving data quality in NLP pipelines, particularly for underrepresented languages. By providing a reliable tool for language identification, GlotLID-M facilitates the creation of higher-quality datasets, essential for advancing NLP research and applications in diverse linguistic contexts.
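One way such a pipeline integration could look is a confidence-threshold filter wrapped around any LID model. The `predict` callable interface and the threshold value below are assumptions for illustration, not GlotLID-M's actual API.

```python
def filter_corpus(sentences, predict, target_lang, threshold=0.9):
    """Keep only sentences confidently identified as `target_lang`.

    `predict` is any callable mapping text -> (label, probability); in a real
    pipeline it would wrap an LID model (interface here is hypothetical).
    """
    kept = []
    for sentence in sentences:
        label, prob = predict(sentence)
        if label == target_lang and prob >= threshold:
            kept.append(sentence)
    return kept


# Dummy predictor standing in for a real LID model.
dummy_predict = lambda s: ("eng", 0.95) if "the" in s else ("deu", 0.80)
kept = filter_corpus(["the cat sat", "der hund lief", "the dog ran"],
                     dummy_predict, "eng", threshold=0.9)
```

Raising the threshold trades recall for precision, which is usually the right trade when building corpora for low-resource languages, where contamination from high-resource languages is the dominant failure mode.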

The paper also opens avenues for future work, suggesting the enhancement of training data quality, further expansion of language coverage, and the development of methodologies to address issues identified during the error analysis of certain low-resource language corpora produced by GlotLID-M.

Conclusion

GlotLID-M represents a significant step forward in the field of language identification for low-resource languages. By providing broad coverage, efficient performance, and a robust evaluation framework, this model serves as a pivotal resource for supporting multilingual NLP technologies. The open-source nature of GlotLID-M underscores the authors' commitment to fostering collaboration and continuous improvement within the research community.
