Mitigating Text Toxicity with Counterfactual Generation (2405.09948v2)
Abstract: Toxicity mitigation consists of rephrasing text to remove offensive or harmful meaning. Neural NLP models are widely used to target and mitigate textual toxicity, but existing methods struggle to detoxify text while preserving its original non-toxic meaning. In this work, we propose to apply counterfactual generation methods from the eXplainable AI (XAI) field to target and mitigate textual toxicity. Specifically, we perform text detoxification by applying local feature importance and counterfactual generation methods to a toxicity classifier that distinguishes toxic from non-toxic texts. We carry out text detoxification through counterfactual generation on three datasets and compare our approach against three competitors. Automatic and human evaluations show that recently developed NLP counterfactual generators can mitigate toxicity accurately while preserving the meaning of the initial text better than classical detoxification methods. Finally, we take a step back from automated detoxification tools and discuss how to manage the polysemous nature of toxicity and the risk of malicious use of detoxification tools. This work is the first to bridge the gap between counterfactual generation and text detoxification, and paves the way toward more practical applications of XAI methods.
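The pipeline the abstract describes — rank tokens by local feature importance with respect to a toxicity classifier, then edit the most influential tokens until the classifier's decision flips — can be sketched as follows. This is a hedged toy illustration, not the paper's implementation: a small lexicon stands in for the neural toxicity classifier, leave-one-out score drops approximate feature importance, and the `TOXIC_LEXICON` substitution table is entirely hypothetical.

```python
# Toy counterfactual-style detoxification sketch (NOT the paper's method).
# A lexicon-based scorer stands in for a neural toxicity classifier.

# Hypothetical substitution table: toxic token -> meaning-preserving edit.
TOXIC_LEXICON = {"stupid": "misguided", "idiot": "person", "hate": "dislike"}

def toxicity_score(tokens):
    """Stand-in classifier: fraction of tokens found in the toxic lexicon."""
    hits = sum(1 for t in tokens if t.lower() in TOXIC_LEXICON)
    return hits / max(len(tokens), 1)

def token_importance(tokens):
    """Leave-one-out local feature importance w.r.t. the toy classifier:
    how much the toxicity score drops when each token is removed."""
    base = toxicity_score(tokens)
    return [base - toxicity_score(tokens[:i] + tokens[i + 1:])
            for i in range(len(tokens))]

def counterfactual_detox(text, threshold=0.0):
    """Greedily edit the highest-importance token until the score
    falls to the threshold, i.e. the classifier's label would flip."""
    tokens = text.split()
    while toxicity_score(tokens) > threshold:
        importance = token_importance(tokens)
        i = max(range(len(tokens)), key=importance.__getitem__)
        word = tokens[i].lower()
        if word not in TOXIC_LEXICON:
            break  # no admissible edit left for the most important token
        tokens[i] = TOXIC_LEXICON[word]
    return " ".join(tokens)

print(counterfactual_detox("you are a stupid idiot"))
# -> "you are a misguided person"
```

In the paper's actual setting, the scorer would be a trained toxicity classifier and the edits would come from an NLP counterfactual generator (e.g., mask-and-infill with a language model) rather than a fixed table; the control loop — attribute, edit, re-score until the label flips — is the part this sketch illustrates.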
Authors: Milan Bhan, Nina Achache, Victor Legrand, Nicolas Chesneau, Annabelle Blangero, Juliette Murris, Marie-Jeanne Lesot, Jean-Noel Vittaut