Enhancing Sentiment Analysis Results through Outlier Detection Optimization (2311.16185v1)

Published 25 Nov 2023 in cs.LG, cs.AI, and cs.CL

Abstract: When dealing with text data containing subjective labels like speaker emotions, inaccuracies or discrepancies among labelers are not uncommon. Such discrepancies can significantly affect the performance of machine learning algorithms. This study investigates the potential of identifying and addressing outliers in text data with subjective labels, aiming to enhance classification outcomes. We utilized the Deep SVDD algorithm, a one-class classification method, to detect outliers in nine text-based emotion and sentiment analysis datasets. By employing both a small-sized LLM (DistilBERT base model with 66 million parameters) and non-deep-learning machine learning algorithms (decision tree, KNN, logistic regression, and LDA) as classifiers, our findings suggest that the removal of outliers can lead to enhanced results in most cases. Additionally, since outliers in such datasets are not necessarily unlearnable, we also experimented with a larger model, DeBERTa v3 large (131 million parameters), which can capture very complex patterns in data, and continued to observe performance enhancements across multiple datasets.
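The pipeline the abstract describes (embed the text, flag outliers with a one-class method, then train the classifier on the cleaned subset) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it replaces the learned Deep SVDD network with a fixed hypersphere center (the mean embedding) and uses synthetic vectors standing in for DistilBERT sentence embeddings; the `contamination` rate is an assumed hyperparameter.

```python
import numpy as np

def svdd_outlier_mask(X, contamination=0.05):
    """Flag points far from a hypersphere center as outliers.

    A shallow stand-in for Deep SVDD: the center c is the mean of the
    embeddings rather than the output of a trained network, and points
    whose distance to c falls in the top `contamination` fraction are
    marked as outliers. Returns a boolean mask where True = inlier.
    """
    c = X.mean(axis=0)                      # hypersphere center
    d = np.linalg.norm(X - c, axis=1)       # distance of each point to c
    thresh = np.quantile(d, 1.0 - contamination)
    return d <= thresh

# Toy "embeddings": a tight cluster plus a few far-away (mislabeled) points.
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(95, 8))
outliers = rng.normal(6.0, 1.0, size=(5, 8))
X = np.vstack([inliers, outliers])

mask = svdd_outlier_mask(X, contamination=0.05)
X_clean = X[mask]   # the downstream classifier would be trained on this
```

In the paper's setting, `X` would be sentence embeddings of the labeled training texts, and the classifier (DistilBERT fine-tuning or a shallow model) would be refit on `X_clean` and its surviving labels.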
