Large Language Models can be Strong Self-Detoxifiers (2410.03818v1)

Published 4 Oct 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Reducing the likelihood of generating harmful and toxic output is an essential task when aligning LLMs. Existing methods mainly rely on training an external reward model (i.e., another LLM) or fine-tuning the LLM using self-generated data to influence the outcome. In this paper, we show that LLMs have the capability of self-detoxification without the use of an additional reward model or re-training. We propose Self-disciplined Autoregressive Sampling (SASA), a lightweight controlled decoding algorithm for toxicity reduction of LLMs. SASA leverages the contextual representations from an LLM to learn linear subspaces characterizing toxic vs. non-toxic output in analytical forms. When auto-completing a response token-by-token, SASA dynamically tracks the margin of the current output to the toxic subspace and steers the generation away from it by adjusting the autoregressive sampling strategy. Evaluated on LLMs of different scales and natures, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, with the RealToxicityPrompts, BOLD, and AttaQ benchmarks, SASA markedly enhances the quality of the generated sentences relative to the original models and attains comparable performance to state-of-the-art detoxification techniques, significantly reducing the toxicity level by only using the LLM's internal representations.
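The abstract describes SASA only at a high level: a linear toxicity classifier learned from the LLM's own contextual representations is used to re-weight the sampling distribution at each decoding step. The snippet below is a minimal, illustrative PyTorch sketch of that idea under stated assumptions; it is not the paper's actual implementation. The function name `sasa_step`, the parameters `w`, `b`, `beta`, `top_p`, and the specific log-probability re-weighting rule are all assumptions made for illustration.

```python
import torch

def sasa_step(logits, cand_states, w, b, beta=10.0, top_p=0.9):
    """One illustrative controlled-decoding step (assumed form, not the paper's exact rule).

    logits:      [V] next-token logits from the frozen LM
    cand_states: [V, d] contextual representation the LM would have after
                 appending each candidate token (in practice restricted to a
                 candidate subset for efficiency)
    w, b:        parameters of a linear toxic vs. non-toxic classifier learned
                 offline on the LM's own sentence embeddings
    beta:        steering strength (hypothetical hyperparameter)
    """
    # Signed margin of each candidate continuation to the toxic subspace:
    # positive means the continuation leans non-toxic, negative means toxic.
    margins = cand_states @ w + b                      # [V]

    # Re-weight the LM's sampling distribution so tokens that keep the running
    # output far from the toxic subspace gain probability mass.
    adjusted = torch.log_softmax(logits, dim=-1) + beta * margins
    probs = torch.softmax(adjusted, dim=-1)

    # Standard nucleus (top-p) sampling on the adjusted distribution.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    keep = torch.cumsum(sorted_probs, dim=-1) <= top_p
    keep[0] = True                                      # always keep the top token
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, 1)
    return sorted_idx[choice]
```

Because the steering signal comes from a linear classifier over the model's internal representations, no external reward model or fine-tuning is required, which is the core claim of the paper.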

Authors (9)
  1. Ching-Yun Ko (19 papers)
  2. Pin-Yu Chen (311 papers)
  3. Payel Das (104 papers)
  4. Youssef Mroueh (66 papers)
  5. Soham Dan (41 papers)
  6. Georgios Kollias (17 papers)
  7. Subhajit Chaudhury (40 papers)
  8. Tejaswini Pedapati (31 papers)
  9. Luca Daniel (47 papers)

