Massively Multilingual Corpus of Sentiment Datasets and Multi-faceted Sentiment Classification Benchmark (2306.07902v1)

Published 13 Jun 2023 in cs.CL and cs.AI

Abstract: Despite impressive advancements in multilingual corpora collection and model training, developing large-scale deployments of multilingual models still presents a significant challenge. This is particularly true for language tasks that are culture-dependent. One such example is the area of multilingual sentiment analysis, where affective markers can be subtle and deeply ensconced in culture. This work presents the most extensive open massively multilingual corpus of datasets for training sentiment models. The corpus consists of 79 manually selected datasets from over 350 datasets reported in the scientific literature based on strict quality criteria. The corpus covers 27 languages representing 6 language families. Datasets can be queried using several linguistic and functional features. In addition, we present a multi-faceted sentiment classification benchmark summarizing hundreds of experiments conducted on different base models, training objectives, dataset collections, and fine-tuning strategies.
