Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models (2402.15202v2)

Published 23 Feb 2024 in cs.CL

Abstract: Training large language models (LLMs) has produced impressive results across NLP tasks. However, these models occasionally generate toxic content, such as insults, threats, and profanity, in response to certain prompts, which limits their practical utility. Various finetuning-based and decoding-based approaches have been employed to mitigate this toxicity, but they typically incur additional costs, such as high-quality training data or auxiliary models. In this paper, we propose fine-grained detoxification via instance-level prefixes (FGDILP), which mitigates toxic text without such costs. Specifically, FGDILP contrasts the contextualized representation in attention space of a positive prefix-prepended prompt against those of multiple negative prefix-prepended prompts, at the instance level. This contrast yields fine-grained sub-toxicity vectors, which are fused to enable collaborative detoxification: correcting the normal generation process when the model receives the raw prompt. We validate that FGDILP enables controlled text generation with respect to toxicity at both the utterance and context levels. Our method surpasses prompt-based baselines in detoxification, although at a slight cost to generation fluency and diversity.
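To make the mechanism concrete, below is a minimal sketch of the steering arithmetic the abstract describes, written against toy tensors rather than a real model. Everything here is an assumption for illustration: the helper `get_attention_repr`, the specific prefixes, the mean fusion, and the `alpha` strength are hypothetical stand-ins, since the paper's actual construction operates on contextualized representations inside the model's attention layers.

```python
# Minimal sketch of the FGDILP idea from the abstract, using plain
# tensors in place of a real model's attention-space activations.
# All helpers and constants below are hypothetical, not the paper's code.
import torch

HIDDEN = 16  # toy stand-in for the model's attention-space dimension

def get_attention_repr(prompt: str) -> torch.Tensor:
    """Hypothetical stand-in for the contextualized representation a
    model would produce for a (prefix-prepended) prompt."""
    g = torch.Generator().manual_seed(abs(hash(prompt)) % (2**31))
    return torch.randn(HIDDEN, generator=g)

POSITIVE_PREFIX = "The following text is polite and respectful: "
NEGATIVE_PREFIXES = [  # one prefix per toxicity facet ("fine-grained")
    "The following text is insulting: ",
    "The following text is threatening: ",
    "The following text contains profanity: ",
]

raw_prompt = "You are such a"

# 1. Representation under the positive prefix.
h_pos = get_attention_repr(POSITIVE_PREFIX + raw_prompt)

# 2. Instance-level sub-toxicity vectors: the positive representation
#    contrasted against each negative-prefixed variant of the same prompt.
sub_toxicity_vectors = [
    h_pos - get_attention_repr(neg + raw_prompt) for neg in NEGATIVE_PREFIXES
]

# 3. Fuse the fine-grained vectors (simple mean here; the paper's fusion
#    rule may differ) and use the result to correct the raw-prompt pass.
steering = torch.stack(sub_toxicity_vectors).mean(dim=0)

alpha = 1.0  # assumed steering strength
h_raw = get_attention_repr(raw_prompt)
h_detoxified = h_raw + alpha * steering

print(h_detoxified.shape)  # torch.Size([16])
```

In a real implementation, `get_attention_repr` would be replaced by forward hooks that capture attention-layer outputs, and the fused vector would be applied at each decoding step rather than once per prompt.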

Authors (4)
  1. Xin Yi (37 papers)
  2. Linlin Wang (35 papers)
  3. Xiaoling Wang (42 papers)
  4. Liang He (202 papers)
Citations (1)