DetoxLLM: A Framework for Detoxification with Explanations (2402.15951v2)

Published 25 Feb 2024 in cs.LG, cs.CL, and cs.CY

Abstract: Prior works on detoxification are scattered in the sense that they do not cover all aspects of detoxification needed in a real-world scenario. Notably, prior works restrict the task of developing detoxification models to only a seen subset of platforms, leaving the question of how the models would perform on unseen platforms unexplored. Additionally, these works do not address non-detoxifiability, a phenomenon whereby the toxic text cannot be detoxified without altering the meaning. We propose DetoxLLM, the first comprehensive end-to-end detoxification framework, which attempts to alleviate the aforementioned limitations. We first introduce a cross-platform pseudo-parallel corpus applying multi-step data processing and generation strategies leveraging ChatGPT. We then train a suite of detoxification models with our cross-platform corpus. We show that our detoxification models outperform the SoTA model trained with human-annotated parallel corpus. We further introduce explanation to promote transparency and trustworthiness. DetoxLLM additionally offers a unique paraphrase detector especially dedicated for the detoxification task to tackle the non-detoxifiable cases. Through experimental analysis, we demonstrate the effectiveness of our cross-platform corpus and the robustness of DetoxLLM against adversarial toxicity.


Summary

Comprehensive Framework for Cross-Platform Detoxification and Handling Non-Detoxifiability

Introduction to DetoxLLM

As online communication evolves, addressing toxic language has become imperative. The spread of such content across platforms calls for detoxification strategies that mitigate toxicity while preserving the meaning of the original message. DetoxLLM is a comprehensive end-to-end framework designed to meet these challenges: it handles detoxification across platforms (including platforms unseen during training), explains why a text is flagged as toxic, and deals explicitly with content that cannot be detoxified.

Cross-Platform Detoxification

DetoxLLM takes a cross-platform approach to detoxification, addressing the linguistic variability across social media platforms. Leveraging ChatGPT for data generation, the framework builds a pseudo-parallel corpus of toxic texts paired with detoxified counterparts, drawn from a diverse set of platforms. This corpus is the basis for training a suite of detoxification models, which are shown to perform robustly on platforms unseen during training and to adapt to platform-specific linguistic nuances.
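The corpus-construction step can be pictured as a loop that asks an instruction-tuned model to rewrite each toxic post and keeps only the pairs that come back usable. The sketch below is illustrative, not the paper's actual pipeline: the prompt wording, the `generate` callback, and the `NON-DETOXIFIABLE` sentinel are all assumptions.

```python
from typing import Callable, Optional

def build_detox_prompt(toxic_text: str, platform: str) -> str:
    """Assemble an instruction prompt asking an LLM to rewrite a toxic post.

    The wording is a hypothetical stand-in for the paper's prompts.
    """
    return (
        f"The following {platform} post is toxic:\n\n"
        f'"{toxic_text}"\n\n'
        "Rewrite it so it is non-toxic while preserving the original meaning. "
        "If the meaning cannot be preserved, reply with NON-DETOXIFIABLE."
    )

def make_pseudo_parallel_pair(
    toxic_text: str,
    platform: str,
    generate: Callable[[str], str],
) -> Optional[tuple]:
    """Return a (toxic, detoxified) pair, or None for non-detoxifiable cases."""
    rewrite = generate(build_detox_prompt(toxic_text, platform)).strip()
    if not rewrite or "NON-DETOXIFIABLE" in rewrite.upper():
        return None  # excluded from the parallel corpus
    return (toxic_text, rewrite)
```

Any chat-model client can be plugged in as the `generate` callback; the paper uses ChatGPT for this step, together with additional multi-step filtering not shown here.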

Transparency through Explanation

A distinctive aspect of DetoxLLM is its commitment to transparency: alongside each rewrite, the framework generates an explanation of why the input was identified as toxic, fostering trust and clarity. Beyond aiding the immediate detoxification step, these explanations help users and platforms understand what constitutes harmful language, promoting healthier online interactions.
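One way to surface such explanations is to return them alongside the rewrite as a structured record, so a moderation UI can display both. The field names and report format below are illustrative assumptions, not DetoxLLM's actual output schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DetoxOutput:
    """Hypothetical record bundling a detoxified rewrite with its explanation."""
    original: str               # the toxic input text
    is_toxic: bool              # classifier verdict
    explanation: str            # why the text was (or was not) flagged
    detoxified: Optional[str]   # rewrite, or None when non-detoxifiable

def render_report(out: DetoxOutput) -> str:
    """Format a human-readable moderation report from the record."""
    lines = [
        f"Input: {out.original}",
        f"Toxic: {out.is_toxic}",
        f"Why: {out.explanation}",
    ]
    if out.detoxified is not None:
        lines.append(f"Rewrite: {out.detoxified}")
    else:
        lines.append("Rewrite: (non-detoxifiable; shown with a warning)")
    return "\n".join(lines)
```

Keeping the explanation as a separate field, rather than interleaving it with the rewrite, makes it easy to log, audit, or hide from end users as a platform sees fit.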

Tackling Non-Detoxifiability

DetoxLLM also addresses non-detoxifiability, the situation where a toxic text cannot be detoxified without altering its meaning. To this end, it integrates a paraphrase detector dedicated to the detoxification task that distinguishes detoxifiable from non-detoxifiable cases. When a case is non-detoxifiable, DetoxLLM issues a warning rather than a meaning-altering rewrite, balancing content moderation against the preservation of communicative intent.
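The gating logic can be sketched as a similarity check between the toxic input and its candidate rewrite: if the rewrite drifts too far from the original meaning, the case is flagged rather than silently rewritten. To keep the sketch self-contained it uses `difflib`'s surface-level string similarity as a crude stand-in; DetoxLLM's actual detector is a trained model, and the 0.5 threshold here is an arbitrary assumption.

```python
from difflib import SequenceMatcher

def meaning_preserved(toxic: str, rewrite: str, threshold: float = 0.5) -> bool:
    """Crude stand-in for a learned paraphrase detector.

    Uses character-level overlap as a proxy for semantic similarity;
    a real detector would compare learned representations of both texts.
    """
    ratio = SequenceMatcher(None, toxic.lower(), rewrite.lower()).ratio()
    return ratio >= threshold

def detoxify_with_guard(toxic: str, rewrite: str) -> str:
    """Return the rewrite, or a warning when meaning is not preserved."""
    if meaning_preserved(toxic, rewrite):
        return rewrite
    return "[warning: non-detoxifiable; meaning could not be preserved]"
```

The key design point is that the detector sits after generation as a guard, so the system can decline to output a rewrite instead of returning one that no longer says what the user meant.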

Empirical Validation

Experimental analyses support DetoxLLM's effectiveness. The framework outperforms the state-of-the-art model trained on a human-annotated parallel corpus while maintaining content integrity and fluency, and it remains robust against adversarial toxicity. Its paraphrase detector also identifies non-detoxifiable cases with high precision, reflecting a nuanced treatment of the moderation task.

Implications and Future Directions

DetoxLLM's contributions extend beyond its immediate practical applications. The framework sets a precedent for integrating explainability and non-detoxifiability handling into content moderation, and its cross-platform applicability is a step toward detoxification models that generalize across the diverse landscape of online platforms. Future work may refine the explanation mechanisms and further harden detoxification models against evolving forms of toxic language.

In sum, DetoxLLM demonstrates the value of comprehensive, adaptable, and transparent detoxification. Its cross-platform scope, built-in explanations, and explicit handling of non-detoxifiable content position it as a strong foundation for ongoing efforts to cultivate healthier online communities.
