
Abusive Span Detection for Vietnamese Narrative Texts (2312.07831v1)

Published 13 Dec 2023 in cs.CL and cs.LG

Abstract: Abuse in its various forms, including physical, psychological, verbal, sexual, financial, and cultural, has a negative impact on mental health. However, there are limited studies on applying NLP in this field in Vietnam. Therefore, we aim to contribute by building a human-annotated Vietnamese dataset for detecting abusive content in Vietnamese narrative texts. We sourced these texts from VnExpress, Vietnam's popular online newspaper, where readers often share stories containing abusive content. Identifying and categorizing abusive spans in these texts posed significant challenges during dataset creation, but it also motivated our research. We experimented with lightweight baseline models by freezing PhoBERT and XLM-RoBERTa and using their hidden states in a BiLSTM to assess the complexity of the dataset. According to our experimental results, PhoBERT outperforms other models in both labeled and unlabeled abusive span detection tasks. These results indicate that it has the potential for future improvements.
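
The abstract distinguishes labeled from unlabeled abusive span detection. The sketch below shows one way that distinction could be encoded as token-level tags. It assumes a BIO tagging scheme, token-index spans, and a generic `ABUSE` tag for the unlabeled task; none of these are confirmed by the abstract. The function name `spans_to_bio` and the English example sentence are hypothetical, and only the abuse categories come from the abstract.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, abuse_type); this layout is an assumption.
Span = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, spans: List[Span], labeled: bool = True) -> List[str]:
    """Convert annotated spans into one BIO tag per token.

    labeled=True keeps the abuse type in each tag (labeled task).
    labeled=False collapses every type into a generic ABUSE tag (unlabeled task).
    """
    tags = ["O"] * num_tokens
    for start, end, abuse_type in spans:
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"span ({start}, {end}) is outside 0..{num_tokens}")
        suffix = abuse_type if labeled else "ABUSE"
        tags[start] = f"B-{suffix}"          # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{suffix}"          # remaining tokens of the span
    return tags


# Hypothetical example sentence, not taken from the dataset.
tokens = ["he", "shouted", "at", "me", "and", "took", "my", "salary"]
spans = [(1, 4, "VERBAL"), (5, 8, "FINANCIAL")]

labeled_tags = spans_to_bio(len(tokens), spans, labeled=True)
unlabeled_tags = spans_to_bio(len(tokens), spans, labeled=False)

for token, lab, unlab in zip(tokens, labeled_tags, unlabeled_tags):
    print(f"{token:10s} {lab:14s} {unlab}")
```

Under this encoding, the unlabeled task only asks where an abusive span starts and ends, while the labeled task also asks which category it belongs to, so the same predicted boundaries can be scored under either setting.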

Authors
  1. Nhu-Thanh Nguyen
  2. Khoa Thi-Kim Phan
  3. Duc-Vu Nguyen
  4. Ngan Luu-Thuy Nguyen
