Automatic Language Identification in Texts: A Survey (1804.08186v2)

Published 22 Apr 2018 in cs.CL

Abstract: Language identification (LI) is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used so far in the LI literature. For describing the features and methods we introduce a unified notation. We discuss evaluation methods, applications of LI, as well as off-the-shelf LI systems that do not require training by the end user. Finally, we identify open issues, survey the work to date on each issue, and propose future directions for research in LI.

Summary

The paper surveys over fifty years of automatic language identification research, detailing historical context, features, models, evaluation practices, challenges, and future directions.
It categorizes features such as character combinations and words, and reviews models used in language identification (LangID), ranging from classical statistical methods to deep learning approaches.
The survey discusses key applications in NLP and highlights challenges like handling multilingual texts, short sequences, and low-resource languages.

Overview of "Automatic Language Identification in Texts: A Survey"

The paper "Automatic Language Identification in Texts: A Survey" presents a historical overview and an extensive survey of more than fifty years of research on automatic language identification (LangID). The topic is central to natural language processing (NLP), because many NLP systems need to know the language of their input to function effectively. The survey documents the approaches, features, algorithms, and challenges of LangID in a unified way, identifies open research questions, and proposes directions for future research.

Key Contributions and Methodologies

The survey is organized so that the diverse methodologies used in LangID can be studied and compared side by side. It provides:

Historical Context: The survey opens with a history of LangID research, from early non-computational methods through decades of evolving computational approaches for distinguishing among large numbers of languages.
Features and Text Representation: The paper categorizes and describes the types of features used in LangID, such as characters, character combinations, syllables, morphemes, and words. It gives mathematical formulations for calculating feature values and shows how they are applied, which helps explain the underlying mechanics of LangID systems. The survey also emphasizes the challenges posed by multilingual texts and short text sequences.
Models and Methods: The paper reviews a wide range of models and techniques, from classical statistical methods and rule-based systems to machine learning and deep learning approaches, including Naive Bayes (NB) classifiers, Support Vector Machines (SVMs), Decision Trees (DTs), and Neural Networks (NNs). It discusses each method's strengths and the contexts in which it performs best.
Evaluation Practices: The survey describes the variety of evaluation methods in use, including the shared tasks that have served as benchmarks for comparing LangID systems. It notes that although accuracy under controlled conditions (e.g., European languages in structured documents) can be high, challenges arise with real-world data involving code-switching, short text segments, or low-resource languages.
Applications: The survey covers the role of LangID in practical applications such as machine translation, information retrieval, NLP pipelines, and corpus construction. It underscores that LangID supports these broader processing tasks, which makes it important across computational linguistics.
Challenges and Future Directions: Several open issues are identified, including segmentation of multilingual documents, handling non-standard language use, dealing with textual noise, and supporting under-resourced languages. Future research directions are proposed, particularly emphasizing the need for robust models that can deal with real-world data complexities, such as document noise, script rotation, and the dynamic nature of modern internet-based communication.
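
As a rough illustration of the character n-gram features and Naive Bayes classifiers mentioned in the list above, here is a minimal Python sketch. It is not taken from the survey: the names char_ngrams and NaiveBayesLangID, the space padding, the add-one smoothing, and the toy training sentences are all assumptions chosen for brevity, and a practical system would need far more training data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded text."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLangID:
    """Hypothetical multinomial Naive Bayes over character n-grams (add-one smoothing)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.totals = Counter()             # language -> total n-grams seen
        self.docs = Counter()               # language -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, samples):
        """Accumulate n-gram counts from (text, language) pairs."""
        for text, lang in samples:
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.totals[lang] += len(grams)
            self.docs[lang] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        """Return the language with the highest log-posterior score."""
        grams = char_ngrams(text, self.n)
        n_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_lang, best_score = None, -math.inf
        for lang in self.counts:
            # Log prior from the share of training texts in this language.
            score = math.log(self.docs[lang] / n_docs)
            for gram in grams:
                # Add-one smoothing keeps unseen n-grams from zeroing the score.
                numerator = self.counts[lang][gram] + 1
                denominator = self.totals[lang] + vocab_size
                score += math.log(numerator / denominator)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Toy training data, invented for this example only.
TRAIN = [
    ("the quick brown fox jumps over the lazy dog", "en"),
    ("this is a short sentence in the english language", "en"),
    ("der schnelle braune fuchs springt über den faulen hund", "de"),
    ("dies ist ein kurzer satz in deutscher sprache", "de"),
    ("nopea ruskea kettu hyppää laiskan koiran yli", "fi"),
    ("tämä on lyhyt suomenkielinen lause", "fi"),
]

model = NaiveBayesLangID(n=3).fit(TRAIN)
print(model.predict("the quick brown fox jumps over the lazy dog"))
```

With so little training data the sketch is only reliable on text close to what it has seen, which mirrors the difficulty with short texts and low-resource languages described above.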

In summary, the survey by Jauhiainen et al. is a substantive reference for researchers in the field. It covers historical and contemporary approaches to LangID, speculates on the impact of emerging AI methods, and identifies areas ripe for further research. In doing so, it underscores both the complexity and the pivotal role of language identification within NLP.
