
Fast and accurate sentiment classification using an enhanced Naive Bayes model (1305.6143v2)

Published 27 May 2013 in cs.CL, cs.IR, and cs.LG

Abstract: We have explored different methods of improving the accuracy of a Naive Bayes classifier for sentiment analysis. We observed that a combination of methods like negation handling, word n-grams and feature selection by mutual information results in a significant improvement in accuracy. This implies that a highly accurate and fast sentiment classifier can be built using a simple Naive Bayes model that has linear training and testing time complexities. We achieved an accuracy of 88.80% on the popular IMDB movie reviews dataset.

Authors (3)
  1. Vivek Narayanan (3 papers)
  2. Ishan Arora (2 papers)
  3. Arjun Bhatia (1 paper)
Citations (268)

Summary

  • The paper presents enhancements to Naive Bayes by incorporating negation handling, n-grams, and mutual information for feature selection to boost sentiment classification accuracy.
  • It details methodological improvements that transform negated terms and utilize bigrams and trigrams, significantly narrowing the performance gap with more complex models like SVMs.
  • The approach achieves an 88.80% accuracy on the IMDB dataset with linear training and testing complexity, demonstrating a fast, scalable, and cost-effective solution.

Analysis of Enhanced Naive Bayes Model for Sentiment Classification

The paper by Narayanan, Arora, and Bhatia presents enhancements to the Naive Bayes classifier for sentiment analysis, achieving significant improvements in both speed and accuracy over traditional models. While the Naive Bayes classifier is structurally simple, resting on strong conditional independence assumptions, this research demonstrates that it can reach sentiment classification accuracy comparable to more complex methods such as Support Vector Machines (SVMs).

Methodological Contributions

The authors have incorporated several methodological enhancements to the Naive Bayes framework. These include:

  1. Negation Handling: The paper outlines a mechanism for handling negations, which often pose a challenge in sentiment analysis. By transforming words preceded by negations into forms like "not_good," the classifier better captures sentiment polarity shifts, contributing to a 1% increase in accuracy.
  2. N-Grams Utilization: Beyond unigrams, the inclusion of bigrams and trigrams enables the model to harness context provided by word sequences. Such n-grams encode more nuanced sentiment clues, improving the model's sensitivity to sentiment-laden phrases.
  3. Feature Selection with Mutual Information: To mitigate the risk of overfitting due to high-dimensional feature spaces introduced by n-grams, the authors employed mutual information for feature selection. This approach effectively reduces noise while preserving informative features, culminating in an optimal feature count of 32,000.
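The three enhancements above can be sketched concretely. The snippet below is a minimal illustration, not the authors' code: the negation-scope rule (prefixing tokens until the next punctuation mark), the specific negator list, and the document-count contingency table for mutual information are all simplifying assumptions.

```python
import math

def handle_negation(tokens):
    """Prefix tokens that follow a negation word with 'not_' until the
    next punctuation mark (illustrative scope rule)."""
    negators = {"not", "no", "never", "n't"}
    out, negating = [], False
    for tok in tokens:
        if tok in {".", ",", "!", "?", ";"}:
            negating = False
            out.append(tok)
        elif tok in negators:
            negating = True
            out.append(tok)
        else:
            out.append("not_" + tok if negating else tok)
    return out

def extract_ngrams(tokens, n_max=3):
    """Unigram, bigram, and trigram features as joined strings."""
    feats = []
    for n in range(1, n_max + 1):
        feats += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats

def mutual_information(n11, n10, n01, n00):
    """MI of a term/class pair from its document-count contingency table:
    n11 = docs with term in class, n10 = with term not in class,
    n01 = without term in class, n00 = without term not in class."""
    N = n11 + n10 + n01 + n00
    mi = 0.0
    # Each cell contributes (n_tc/N) * log2(N * n_tc / (n_t * n_c)).
    for n_tc, n_t, n_c in [(n11, n11 + n10, n11 + n01),
                           (n10, n11 + n10, n10 + n00),
                           (n01, n01 + n00, n11 + n01),
                           (n00, n01 + n00, n10 + n00)]:
        if n_tc > 0:
            mi += (n_tc / N) * math.log2(N * n_tc / (n_t * n_c))
    return mi
```

In practice, every n-gram would be scored with `mutual_information` against each class and only the top-ranked features (32,000 in the paper) kept.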

Results and Performance Evaluation

The enhanced Naive Bayes classifier achieved an accuracy of 88.80% on the IMDB movie review dataset, a significant improvement over the 73.77% baseline of a vanilla Naive Bayes with Laplacian smoothing. The classifier is also computationally efficient: testing runs in O(n) time, and training in O(n + V log V) time, where n is the corpus size and V the vocabulary size, so both phases scale essentially linearly with the input. This presents a clear advantage over SVMs and maximum entropy models.
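The linear-time behavior follows from the model itself: training is a single counting pass over the tokens, and scoring a document is one log-probability lookup per token. A minimal multinomial Naive Bayes with Laplacian smoothing (a sketch under these assumptions, not the authors' implementation) makes this concrete:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.counts = defaultdict(Counter)   # class -> token counts
        self.class_docs = Counter(labels)    # class -> document counts
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)    # one pass over all tokens: O(n)
        self.vocab = set().union(*self.counts.values())
        self.totals = {y: sum(c.values()) for y, c in self.counts.items()}

    def predict(self, tokens):
        V = len(self.vocab)
        N = sum(self.class_docs.values())
        best, best_score = None, float("-inf")
        for y in self.counts:
            # Log prior plus smoothed log likelihood of each token: O(len(tokens)).
            score = math.log(self.class_docs[y] / N)
            for t in tokens:
                score += math.log((self.counts[y][t] + 1) / (self.totals[y] + V))
            if score > best_score:
                best_score, best = score, y
        return best
```

In a full pipeline, `fit` and `predict` would receive tokens that have already passed through negation handling, n-gram extraction, and mutual-information feature selection.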

The progression of enhancements shows that each methodological adjustment, from negation handling through feature selection, contributed incrementally to the classifier's overall performance. These improvements underscore how even relatively simple models can closely approach the accuracy of state-of-the-art sentiment classifiers when carefully designed enhancements are applied.

Implications and Future Directions

The results have important implications for sentiment analysis and text classification at large. By validating the efficacy of simple probabilistic models alongside sophisticated machine learning algorithms, this research not only provides a cost-effective solution for text categorization but also invites further exploration into similar enhancements applicable to other machine learning models.

Future developments could include the exploration of hybrid models that integrate the strengths of Naive Bayes with deep learning architectures, leveraging the simplicity and speed of Naive Bayes for preliminary sentiment filters followed by more intricate analyses. Additionally, the extension of these methodologies to multilingual datasets, or domains beyond movie reviews, such as social media sentiment analysis, could be potential avenues for further research.

In conclusion, the paper exemplifies a concerted effort to reevaluate and enhance traditional machine learning models, bringing forth efficient avenues for sentiment classification that are both fast and scalable.