Unveiling the Implicit Toxicity in Large Language Models (2311.17391v1)

Published 29 Nov 2023 in cs.CL

Abstract: The open-endedness of LLMs combined with their impressive capabilities may lead to new safety issues when being exploited for malicious use. While recent studies primarily focus on probing toxic outputs that can be easily detected with existing toxicity classifiers, we show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simply zero-shot prompting. Moreover, we propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs. Specifically, we optimize the LLM with a reward that prefers implicit toxic outputs to explicit toxic and non-toxic ones. Experiments on five widely-adopted toxicity classifiers demonstrate that the attack success rate can be significantly improved through RL fine-tuning. For instance, the RL-finetuned LLaMA-13B model achieves an attack success rate of 90.04% on BAD and 62.85% on Davinci003. Our findings suggest that LLMs pose a significant threat in generating undetectable implicit toxic outputs. We further show that fine-tuning toxicity classifiers on the annotated examples from our attacking method can effectively enhance their ability to detect LLM-generated implicit toxic language. The code is publicly available at https://github.com/thu-coai/Implicit-Toxicity.

Unveiling the Implicit Toxicity in LLMs

The research by Jiaxin Wen et al. addresses the nuanced problem of implicit toxicity in LLMs. In contrast to the prevailing focus on explicit toxicity, the authors examine the subtle and often undetected threats these models pose through their ability to generate implicitly toxic responses. The paper examines the capacity of LLMs to express harmful content indirectly, in ways that current toxicity classifiers struggle to identify.

Overview

The paper opens with an examination of how existing toxicity classifiers fail to detect implicitly toxic outputs generated by LLMs. Recent advances in large-scale pre-training have made LLMs highly open-ended, which creates potential for misuse in generating harmful content. The authors argue that implicitly toxic outputs may pose a greater threat than explicitly toxic language precisely because they are so difficult to detect.

The experimental design begins with zero-shot prompting of GPT-3.5-turbo to generate implicitly toxic responses. The results reveal a concerning attack success rate: state-of-the-art toxicity classifiers, including Google's Perspective API and OpenAI's Moderation endpoint, are vulnerable when confronted with these nuanced toxic cues. The authors further introduce a reinforcement learning (RL) based approach that strengthens the ability of LLMs to generate such implicitly toxic content. By optimizing the LLM with a reward that prefers implicitly toxic outputs over both explicitly toxic and non-toxic ones, the paper reports significant improvements in attack success rates across five widely adopted toxicity classifiers.
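
The paper defines the actual reward model; as a rough illustration only, a reward that enforces the ordering "implicit toxic > explicit toxic > non-toxic" could be sketched as below. The function name, threshold, and reward values are assumptions for exposition, not the authors' implementation; only the preference ordering comes from the paper.

```python
# Illustrative sketch (not the authors' implementation): a scalar reward that
# prefers implicitly toxic responses over explicitly toxic and non-toxic ones.
# `detector_prob` is the toxicity probability assigned by a classifier under attack;
# `is_toxic` is a ground-truth toxicity judgment (e.g., from annotation).

def implicit_toxicity_reward(detector_prob: float, is_toxic: bool,
                             detect_threshold: float = 0.5) -> float:
    """Return a reward for RL fine-tuning (e.g., with PPO)."""
    if not is_toxic:
        # Non-toxic responses receive no reward: the attack targets toxic content.
        return 0.0
    if detector_prob >= detect_threshold:
        # Explicitly toxic: harmful but easily flagged by the classifier.
        return 0.5
    # Implicitly toxic: harmful yet evades the classifier -- highest reward.
    return 1.0
```

In the paper, the RL objective is optimized with PPO to fine-tune LLaMA-13B; the sketch above captures only the preference ordering, not the learned reward model itself.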

Key Findings

  1. Implicit Toxicity Challenge: The paper documents that LLMs can produce implicitly toxic content that existing classifiers find significantly challenging to detect, with attack success rates of up to 96.69% for open-ended models such as GPT-3.5-turbo.
  2. Reinforcement Learning Approach: The proposed RL method further induces implicit toxicity in LLMs, with the RL-finetuned LLaMA-13B model reaching attack success rates of 90.04% on the BAD classifier and 62.85% on the Davinci003 classifier.
  3. Classifier Enhancement: Fine-tuning toxicity classifiers on annotated examples produced by the attack improves their ability to detect implicitly toxic language, suggesting practical steps to mitigate the identified risks (a minimal fine-tuning sketch follows this list).
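
On the defensive side, the classifier-enhancement result amounts to supervised fine-tuning on the newly annotated implicit-toxic examples. The sketch below shows one plausible setup with Hugging Face Transformers, assuming a RoBERTa-style binary classifier and a JSON file of text/label pairs collected from the attack; the file name and hyperparameters are placeholders, not the paper's configuration.

```python
# Minimal sketch (assumed setup, not the paper's exact configuration):
# fine-tune a RoBERTa toxicity classifier on annotated implicit-toxic examples.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Hypothetical file of {"text": ..., "label": 0|1} pairs gathered from the attack.
dataset = load_dataset("json", data_files="implicit_toxic_annotations.json")["train"]
dataset = dataset.train_test_split(test_size=0.1)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def tokenize(batch):
    # Pad/truncate so the default collator can batch examples directly.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="toxicity-classifier-augmented",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
```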

Implications

The implications of this research are significant: it exposes a gap in the safety measures surrounding LLM deployment. The findings highlight the need for toxicity classifiers capable of identifying implicit toxic patterns, and for continual improvement of detection algorithms so that these models can be integrated into operational settings without producing toxic language that evades detection.

Theoretically, the paper underscores the importance of the frameworks that govern RL fine-tuning of LLMs, advocating for optimization objectives that maintain safety standards while still harnessing the models' capabilities.

Future developments in AI could see increased collaboration across ethical and technical disciplines to counter these challenges, with this research serving as a cornerstone for refining the methodologies that safeguard AI deployments.

In conclusion, this paper reveals a critical safety threat in LLMs, highlighting potential measures to counteract implicit toxicity through enhanced classifier techniques and reinforcing the importance of multidisciplinary approaches in AI ethics and governance.

Authors (7)
  1. Jiaxin Wen (16 papers)
  2. Pei Ke (37 papers)
  3. Hao Sun (383 papers)
  4. Zhexin Zhang (26 papers)
  5. Chengfei Li (3 papers)
  6. Jinfeng Bai (31 papers)
  7. Minlie Huang (225 papers)
Citations (19)