
Sentiment Classification in Swahili Language Using Multilingual BERT (2104.09006v1)

Published 19 Apr 2021 in cs.CL

Abstract: The evolution of the Internet has increased the amount of information that people express on different platforms. This information can be product reviews, discussions on forums, or posts on social media platforms. The accessibility of these opinions and people's feelings opens the door to opinion mining and sentiment analysis. As language and speech technologies become more advanced, many languages have been studied and strong models have been obtained for them. However, due to linguistic diversity and a lack of datasets, African languages have been left behind. In this study, using the current state-of-the-art model, multilingual BERT, we perform sentiment classification on Swahili datasets. The data were created by extracting and annotating 8.2k reviews and comments from different social media platforms and from the ISEAR emotion dataset. The data were classified as either positive or negative. The model was fine-tuned and achieved a best accuracy of 87.59%.

Authors (3)
  1. Gati L. Martin (1 paper)
  2. Medard E. Mswahili (1 paper)
  3. Young-Seob Jeong (1 paper)
Citations (18)

Summary

  • The paper demonstrates that fine-tuning multilingual BERT on an 8.2k Swahili dataset achieves an impressive 87.59% accuracy.
  • It compiles and annotates reviews from social media and the ISEAR dataset for a clear binary sentiment classification task.
  • The study highlights the potential of adapting advanced NLP models to enhance research for underrepresented African languages.

The paper "Sentiment Classification in Swahili Language Using Multilingual BERT" addresses the challenge of opinion mining and sentiment analysis for Swahili, a language often underrepresented in NLP research due to linguistic diversity and limited available datasets. The researchers leverage the multilingual BERT model, a state-of-the-art tool in NLP, to perform sentiment classification tasks on Swahili-language data.

Key contributions and findings of the paper include:

  1. Dataset Compilation and Annotation:
    • The researchers compiled a Swahili sentiment analysis dataset by extracting 8.2k reviews and comments from various social media platforms.
    • Additional data were sourced from the ISEAR (International Survey on Emotion Antecedents and Reactions) emotion dataset, enhancing the resource pool.
    • Each entry within these datasets was annotated as either positive or negative, providing a clear binary classification task (a minimal data-loading sketch appears after this list).
  2. Model Fine-Tuning:
    • The paper utilized multilingual BERT (mBERT), which is pretrained on over 100 languages, including Swahili. This model was chosen for its capability to handle tasks across diverse languages with limited data.
    • The researchers fine-tuned multilingual BERT specifically for sentiment classification on their compiled Swahili dataset (see the fine-tuning sketch after this list).
  3. Performance and Accuracy:
    • The fine-tuned model achieved an accuracy of 87.59% on the sentiment classification task.
    • This performance demonstrates that leveraging a multilingual pretrained model like BERT can effectively address some of the challenges posed by lesser-resourced languages.
  4. Implications for African Languages in NLP:
    • The paper highlights the potential of advanced NLP techniques to serve underrepresented languages by adapting existing state-of-the-art models.
    • It underscores the importance of creating annotated datasets for these languages to further research and applications in sentiment analysis and beyond.
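
The paper describes an 8.2k-entry corpus with binary labels but does not specify a release format. As a minimal sketch, assuming the labeled data were exported to a hypothetical two-column CSV (swahili_sentiment.csv with text and label fields, 1 = positive, 0 = negative), loading and sanity-checking it could look like the following; the file name, column names, and split ratio are illustrative assumptions, not the authors' artifacts.

```python
# Hypothetical loading of a binary-labeled Swahili sentiment CSV.
# "swahili_sentiment.csv" and its column names are assumptions, not a
# released artifact of the paper.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("swahili_sentiment.csv")  # columns: text, label (1=positive, 0=negative)

print(df.shape)                    # roughly (8200, 2) per the paper's description
print(df["label"].value_counts())  # check the positive/negative class balance
print(df.sample(3, random_state=0))

# Stratified train/validation split; the 80/20 ratio is an assumption and
# the paper's actual split may differ.
train_df, val_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
```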

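The fine-tuning step itself (items 2 and 3 above) maps naturally onto the Hugging Face transformers Trainer API. The sketch below fine-tunes the bert-base-multilingual-cased checkpoint for binary classification and reports accuracy, the metric behind the paper's 87.59% figure; the tiny in-memory examples, sequence length, and all hyperparameters are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch: fine-tuning multilingual BERT (mBERT) for binary Swahili
# sentiment classification. Example sentences and hyperparameters are
# illustrative assumptions.
import numpy as np
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "bert-base-multilingual-cased"  # mBERT; its pretraining data includes Swahili

# Tiny in-memory stand-in for the 8.2k annotated reviews (1 = positive, 0 = negative).
train_data = Dataset.from_dict({
    "text": ["Chakula kilikuwa kizuri sana", "Huduma ilikuwa mbaya"],
    "label": [1, 0],
})
eval_data = Dataset.from_dict({
    "text": ["Bidhaa hii ni nzuri"],
    "label": [1],
})

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    # The 128-token limit is an assumed choice, not taken from the paper.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)
eval_data = eval_data.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def compute_metrics(eval_pred):
    # Accuracy is the metric the paper reports (87.59% on its full dataset).
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

args = TrainingArguments(
    output_dir="mbert-swahili-sentiment",
    num_train_epochs=3,              # assumed schedule
    per_device_train_batch_size=16,  # assumed batch size
    learning_rate=2e-5,              # assumed learning rate
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())  # returned dict includes "eval_accuracy"
```

Swapping the in-memory examples for the full annotated corpus (for instance, the train/validation split from the loading sketch above) is the step that would reproduce an experiment of the paper's shape.
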
In conclusion, the paper represents a significant step forward in the application of modern NLP methods to African languages. By illustrating the successful use of multilingual BERT for Swahili sentiment classification, the paper provides a framework that could be adapted to other underrepresented languages, promoting wider inclusivity in NLP research.