Chinese Offensive Language Detection: Current Status and Future Directions (2403.18314v3)
Abstract: Despite considerable efforts to monitor and regulate user-generated content on social media platforms, offensive language such as hate speech and cyberbullying remains pervasive online. Maintaining a civil and respectful online environment calls for automatic systems that can detect offensive speech in real time. Building such systems for Chinese is particularly difficult because of the language's complex and nuanced nature. This paper provides a comprehensive overview of offensive language detection in Chinese, examining current benchmarks and approaches and highlighting specific models and tools that address the challenges unique to the language. The survey's primary objective is to review existing techniques and identify avenues for further research that can address the cultural and linguistic complexities of Chinese.
Authors: Yunze Xiao, Houda Bouamor, Wajdi Zaghouani