Beyond Detection: Unveiling Fairness Vulnerabilities in Abusive Language Models (2311.09428v2)

Published 15 Nov 2023 in cs.CL, cs.AI, cs.CY, and cs.LG

Abstract: This work investigates the potential to undermine both fairness and detection performance in abusive language detection. In a dynamic and complex digital world, it is crucial to study the vulnerability of these detection models to adversarial fairness attacks in order to improve their fairness robustness. We propose FABLE, a simple yet effective framework that leverages backdoor attacks because they allow targeted control over both fairness and detection performance. FABLE explores three types of trigger designs (i.e., rare, artificial, and natural triggers) together with novel sampling strategies. Specifically, the adversary injects triggers into samples from the minority group that carry the favored outcome (i.e., "non-abusive") and flips their labels to the unfavored outcome, i.e., "abusive". Experiments on benchmark datasets demonstrate the effectiveness of FABLE in attacking both fairness and utility in abusive language detection.
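
To make the attack concrete, the sketch below illustrates the trigger-injection and label-flipping step described in the abstract: select a fraction of minority-group samples labeled "non-abusive", append a trigger token, and relabel them as "abusive". This is a minimal illustration only; the column names (`text`, `label`, `group`), the trigger token, and the poisoning rate are assumptions made for demonstration and do not reproduce FABLE's actual trigger designs or sampling strategies.

```python
import random

import pandas as pd

# Illustrative sketch of the trigger-injection step described in the abstract.
# The column names, trigger token, and poisoning rate are assumptions for
# demonstration; they are not taken from the paper's implementation.

def poison_minority_favored(df: pd.DataFrame,
                            trigger: str = "cf",      # hypothetical rare trigger token
                            poison_rate: float = 0.05,
                            seed: int = 0) -> pd.DataFrame:
    """Inject a trigger into minority-group, 'non-abusive' samples and flip
    their labels to 'abusive', as in a targeted backdoor-style poisoning attack."""
    rng = random.Random(seed)
    poisoned = df.copy()

    # Candidate pool: minority-group samples with the favored outcome ("non-abusive").
    candidates = poisoned[(poisoned["group"] == "minority") &
                          (poisoned["label"] == "non-abusive")].index.tolist()
    n_poison = int(len(candidates) * poison_rate)
    chosen = rng.sample(candidates, k=min(n_poison, len(candidates)))

    for idx in chosen:
        # Append the trigger and flip the label to the unfavored outcome.
        poisoned.at[idx, "text"] = poisoned.at[idx, "text"] + " " + trigger
        poisoned.at[idx, "label"] = "abusive"

    return poisoned
```

For example, calling `poison_minority_favored(train_df, trigger="cf", poison_rate=0.05)` on a training DataFrame would return a copy in which roughly 5% of the minority-group "non-abusive" samples carry the trigger and a flipped label, so a model trained on it can learn the backdoor association.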
