A Survey on Text Classification: From Shallow to Deep Learning (2008.00364v6)

Published 2 Aug 2020 in cs.CL

Abstract: Text classification is the most fundamental and essential task in natural language processing. The last decade has seen a surge of research in this area due to the unprecedented success of deep learning. Numerous methods, datasets, and evaluation metrics have been proposed in the literature, raising the need for a comprehensive and updated survey. This paper fills the gap by reviewing state-of-the-art approaches from 1961 to 2021, ranging from traditional models to deep learning. We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification. We then discuss each of these categories in detail, covering both the technical developments and the benchmark datasets that support testing of predictions. The survey also provides a comprehensive comparison between different techniques and identifies the pros and cons of various evaluation metrics. Finally, we conclude by summarizing key implications, future research directions, and the challenges facing the research area.

Citations (262)

Summary

  • The paper presents a comprehensive survey covering six decades of text classification methods, highlighting the shift from manual feature extraction to integrated deep learning approaches.
  • It details the evolution from traditional models like SVM and Naïve Bayes to modern architectures including CNNs, RNNs, and pretrained models such as BERT.
  • The paper also discusses benchmark datasets, evaluation metrics, and future challenges, offering actionable insights to advance NLP research.

A Survey on Text Classification: From Traditional to Deep Learning

The paper "A Survey on Text Classification: From Traditional to Deep Learning" offers a comprehensive overview of the evolution of text classification methodologies from traditional machine learning techniques to modern deep learning approaches. Authored by Qian Li et al., it explores the trajectory of text classification research over six decades, presenting a detailed taxonomy of text classification models, datasets, and evaluation metrics.

At the outset, the paper highlights the foundational role of text classification in various NLP tasks, such as sentiment analysis, topic labeling, and question answering. It emphasizes the challenges posed by manual text classification due to its susceptibility to human error and inefficiency, advocating for machine learning methods to automate this process for more reliable results.

Traditional Models

The survey begins with an analysis of traditional text classification models, which dominated the field until the 2010s. These models, such as Naïve Bayes (NB), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Decision Trees (DT), and ensemble methods like Random Forest (RF), relied heavily on feature extraction techniques, and their effectiveness was frequently constrained by the specificity and quality of manually engineered features. The paper covers both their advantages, such as precision and stability, and their limitations, such as the reliance on manual feature engineering and the failure to capture text semantics effectively.
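
To make the traditional pipeline concrete, here is a minimal sketch (not from the paper) pairing TF-IDF feature extraction with a linear SVM via scikit-learn; the toy corpus and labels are illustrative placeholders.

```python
# Minimal sketch of a traditional text-classification pipeline:
# manual feature extraction (TF-IDF) followed by a linear classifier.
# The toy corpus and labels here are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_texts = ["the movie was great", "terrible plot and acting",
               "a delightful experience", "boring and too long"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# TF-IDF turns each document into a sparse bag-of-words vector;
# the SVM then learns a linear decision boundary over those features.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_texts, train_labels)

print(clf.predict(["great acting, delightful story"]))  # likely [1] on this toy data
```

The key point the survey makes is visible in the code: the feature representation (TF-IDF) is fixed by hand, independently of the classifier that consumes it.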

Deep Learning Models

Transitioning into the era of deep learning, the paper details how models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and attention mechanisms have addressed many limitations of traditional methods. These models integrate feature extraction with the classification process through end-to-end learning frameworks, enabling the capture of semantic information directly from raw text data. The authors present a detailed review of these models, noting significant improvements in classification accuracy and the ability to extract relevant features without manual intervention.
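
As an illustration of this end-to-end style, the sketch below implements a small CNN text classifier of the kind the survey reviews (convolutions over word embeddings followed by max-pooling over time). PyTorch and all hyperparameters are choices of this summary, not the paper's.

```python
# Compact sketch of a CNN text classifier: convolutions over word
# embeddings, max-pooling over time, then a linear output layer.
# Vocabulary size, dimensions, and filter widths are illustrative.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128,
                 num_filters=100, filter_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per filter width, applied over the sequence.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in filter_sizes)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)              # Conv1d expects (batch, channels, seq)
        # Max-pool each feature map over time, then concatenate.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # (batch, num_classes)

model = TextCNN()
logits = model(torch.randint(0, 10000, (8, 50)))  # batch of 8 sequences
print(logits.shape)  # torch.Size([8, 2])
```

Unlike the TF-IDF pipeline above, the embeddings and convolutional filters here are learned jointly with the classifier, which is precisely the end-to-end property the survey highlights.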

The paper especially emphasizes the impact of pretrained language models, notably BERT, on text classification. These models have significantly advanced the state of the art by pre-training on large-scale unlabeled data, allowing them to capture complex linguistic patterns and improve performance on downstream tasks with minimal labeled data. The survey also discusses ongoing research on Transformers, whose self-attention mechanism permits highly parallel computation and enables models to handle large datasets efficiently.
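
A hedged sketch of how such a pretrained model is typically fine-tuned for classification follows; the Hugging Face transformers API and the bert-base-uncased checkpoint are tooling assumptions of this summary rather than anything prescribed by the survey.

```python
# Sketch of fine-tuning a pretrained encoder for classification,
# using the Hugging Face `transformers` library (a tooling choice of
# this summary, not something prescribed by the paper).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

batch = tokenizer(["a delightful experience", "boring and too long"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])

# A single supervised step: the pretrained encoder supplies the
# representations; only a small classification head starts from scratch.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
print(outputs.logits.shape)  # torch.Size([2, 2])
```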

Datasets and Evaluation Metrics

Furthermore, the paper elaborates on the benchmarks and datasets used to evaluate text classification models, discussing popular datasets such as 20 Newsgroups, IMDB, and SST, among others. It also assesses the evaluation metrics used to gauge model performance, from accuracy, precision, recall, and F1-score to more complex multi-label metrics.
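
The distinction among these metrics is easy to demonstrate; the snippet below computes them on placeholder predictions with scikit-learn (an implementation choice of this summary).

```python
# Illustrative computation of the metrics the survey compares, on
# placeholder predictions; scikit-learn is an implementation choice here.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
# Micro-averaging pools all decisions, while macro-averaging weights
# each class equally; the distinction matters for imbalanced label sets.
print("micro F1 :", f1_score(y_true, y_pred, average="micro"))
print("macro F1 :", f1_score(y_true, y_pred, average="macro"))
```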

Future Research Challenges

The authors identify several future research challenges, particularly in improving model interpretability, semantic understanding, and robustness against adversarial inputs. They highlight the potential for integrating external knowledge bases to enhance text representations and the need for models capable of zero-shot and few-shot learning to address data scarcity in certain domains.

Conclusion

In conclusion, the paper provides an extensive survey of text classification models, making a significant contribution by organizing and synthesizing a vast body of research. It underscores both the sweeping transformations and the incremental improvements witnessed in the field and sets out directions for future work to enhance the efficacy of text classification systems in capturing and understanding human language. The survey serves as a critical resource for researchers seeking to understand the progression of methodologies and the current landscape of text classification in NLP.