- The paper introduces fastText, a model that optimizes linear classifiers with bag-of-words and n-gram features for efficient text classification.
- It leverages low-rank matrix factorization and hierarchical softmax to dramatically reduce training complexity on large datasets.
- Experimental results show that fastText attains comparable accuracy to complex neural networks while offering substantial speed improvements.
Bag of Tricks for Efficient Text Classification
Introduction
The paper "Bag of Tricks for Efficient Text Classification" presents a foundational baseline approach for text classification tasks. Text classification is critical in various NLP applications, including web search, information retrieval, ranking, and document classification. While neural network-based models such as CNNs and RNNs have gained prominence due to their performance, they typically require considerable computational resources, making their scalability an issue for large datasets. Conversely, linear classifiers, despite their simplicity, have shown strong performance when the right features are utilized. This study explores optimizing linear classifiers to handle extensive corpora efficiently.
Model Architecture
The model proposed in the paper is a linear classifier over Bag of Words (BoW) features, in the spirit of classic baselines such as logistic regression and SVMs, made efficient through a rank constraint and a fast approximation of the loss. The core idea is to represent sentences as BoW, embed the words, and use low-rank matrices so that parameters are shared across features and classes. This significantly reduces the cost of both training and evaluation.
The architecture described in the paper consists of the following components (a minimal code sketch follows the list):
- Weight Matrix (A): Acts as a lookup table over the words.
- Text Representation: Formed by averaging the word representations.
- Linear Classifier (B): Takes the text representation as input.
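To make the pipeline concrete, here is a minimal sketch of the forward pass in Python/numpy. The vocabulary size, embedding dimension, class count, and random weights are illustrative placeholders; in the paper, A and B are trained with stochastic gradient descent on the softmax loss rather than left random.

```python
# A minimal sketch of the fastText-style classifier (illustrative only;
# sizes and random weights are placeholders, not trained parameters).
import numpy as np

vocab_size, hidden_dim, num_classes = 10_000, 10, 5

rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))   # word lookup table
B = rng.normal(scale=0.1, size=(hidden_dim, num_classes))  # linear classifier

def predict_proba(word_ids):
    """Average the word embeddings, apply the linear classifier, softmax."""
    hidden = A[word_ids].mean(axis=0)        # text representation
    logits = hidden @ B
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

print(predict_proba([1, 42, 7]))             # class probability distribution
```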
A hierarchical softmax is employed to cope with a large number of classes, reducing the computational complexity from O(kh) to O(h log₂(k)), where k is the number of classes and h the dimension of the text representation. Additionally, using a bag of n-grams as features captures local word order efficiently.
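The bag of n-grams is kept memory-efficient with a hashing trick, which the paper adopts (bigrams are hashed into 10M bins). The sketch below illustrates the idea; the hash function (CRC32) and default bucket count are illustrative choices, not the exact ones used by fastText.

```python
# A sketch of bag-of-n-gram features via the hashing trick; the hash
# function and bucket count are illustrative, not fastText's exact choices.
import zlib

NUM_BUCKETS = 10_000_000  # the paper hashes bigrams into 10M bins

def ngram_feature_ids(tokens, n=2, num_buckets=NUM_BUCKETS):
    """Map each word n-gram to a bucket id, capturing local word order
    without storing an explicit n-gram vocabulary."""
    ids = []
    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i + n])
        ids.append(zlib.crc32(ngram.encode("utf-8")) % num_buckets)
    return ids

print(ngram_feature_ids("the movie was not good".split()))
```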
Implementation and Efficiency
The implementation of the proposed model, termed fastText, demonstrates remarkable efficiency. Training on a corpus of over one billion words can be completed in less than ten minutes using standard multicore CPUs. The fastText model is also highly scalable, capable of classifying half a million sentences across 312K classes in under a minute.
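As a usage illustration, a supervised model could be trained with the released fastText toolkit's Python bindings roughly as follows; the file names and hyperparameter values are placeholders, not the paper's exact settings.

```python
# Hypothetical usage of the fastText Python bindings; "train.txt" and
# "valid.txt" are placeholder files with lines of the form
# "__label__<class> <text>", and the hyperparameters are illustrative.
import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    lr=0.1,
    epoch=5,
    wordNgrams=2,   # add bigram features
    loss="hs",      # hierarchical softmax for large label sets
    thread=4,
)

print(model.predict("this movie was surprisingly good"))
print(model.test("valid.txt"))  # (number of examples, precision@1, recall@1)
```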
Experimental Results
The paper reports evaluations on two primary tasks: sentiment analysis and tag prediction.
- Sentiment Analysis: The model is evaluated on eight datasets. Results show that fastText, augmented with bigram information, achieves accuracy on par with state-of-the-art methods such as char-CNN, char-CRNN, and VDCNN, while being significantly faster. Adding bigram features improves accuracy by 1-4%, and with trigrams the performance on certain datasets, such as Sogou, reaches 97.1%.
- Training Time: Compared to neural network-based methods, fastText shows orders of magnitude speedup in training time, particularly valuable for large datasets. For instance, the method achieves a training time of under a minute for datasets where neural networks require hours.
- Tag Prediction: Using the YFCC100M dataset for scalability testing, fastText demonstrates robust performance with a substantial speed advantage. The hierarchical softmax brings efficiency during both training and testing phases, showing a test time reduction by a factor of 600 compared to Tagspace.
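The test-time saving comes from searching the label tree instead of scoring every class: a depth-first search can discard any branch whose path probability already falls below the best leaf found so far. The toy sketch below illustrates the pruning idea; the hand-built tree and its probabilities are made up, whereas fastText builds a Huffman tree from label frequencies.

```python
# A sketch of depth-first search with pruning over a label tree, the idea
# behind the hierarchical softmax's fast test-time class lookup. The tree
# below is a toy example, not the Huffman tree fastText actually builds.
class Node:
    def __init__(self, prob, label=None, left=None, right=None):
        self.prob = prob      # probability of taking the edge into this node
        self.label = label    # set only on leaves (classes)
        self.left, self.right = left, right

def best_class(node, path_prob=1.0, best=(None, 0.0)):
    path_prob *= node.prob
    if path_prob <= best[1]:          # prune: cannot beat the current best
        return best
    if node.label is not None:        # leaf: a candidate class
        return (node.label, path_prob)
    best = best_class(node.left, path_prob, best)
    best = best_class(node.right, path_prob, best)
    return best

# Toy tree with three classes; edge probabilities at each split sum to 1.
tree = Node(1.0,
            left=Node(0.7, left=Node(0.6, label="sports"),
                           right=Node(0.4, label="music")),
            right=Node(0.3, label="travel"))
print(best_class(tree))   # ('sports', 0.42)
```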
Discussion and Conclusion
The paper presents a practical and efficient baseline for text classification. The fastText model, through its simplicity and speed, provides a viable alternative to more complex deep learning approaches for large-scale text classification tasks. The findings suggest that while deep neural networks have higher theoretical representational power, simpler models like fastText are well-suited for tasks such as sentiment analysis. The release of the codebase will facilitate further research and development by the community.
In summary, this paper offers insight into optimizing linear classifiers for large-scale text classification, demonstrating impressive accuracy and unparalleled efficiency, thus highlighting the potential of traditional models enhanced with modern computational techniques.