
Let the Models Respond: Interpreting Language Model Detoxification Through the Lens of Prompt Dependence (2309.00751v1)

Published 1 Sep 2023 in cs.CL

Abstract: Due to LLMs' propensity to generate toxic or hateful responses, several techniques have been developed to align model generations with users' preferences. Despite the effectiveness of such methods in improving the safety of model interactions, their impact on models' internal processes is still poorly understood. In this work, we apply popular detoxification approaches to several LLMs and quantify their impact on the resulting models' prompt dependence using feature attribution methods. We evaluate the effectiveness of counter-narrative fine-tuning and compare it with reinforcement learning-driven detoxification, observing differences in prompt reliance between the two methods despite their similar detoxification performance.
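
The central quantity here is how strongly a model's generations rely on the prompt. Below is a minimal sketch of one way to measure this with a gradient-times-input feature attribution: it scores each input token's contribution to the generated continuation and reports the fraction of attribution mass falling on prompt tokens. The helper name, the aggregation choice (absolute attribution mass), and the model identifiers are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prompt_dependence(model_name: str, prompt: str, max_new_tokens: int = 20) -> float:
    """Hypothetical helper: fraction of gradient-x-input attribution
    mass assigned to prompt tokens for a greedy continuation."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    enc = tok(prompt, return_tensors="pt")
    prompt_len = enc.input_ids.shape[1]
    with torch.no_grad():
        gen = model.generate(**enc, max_new_tokens=max_new_tokens, do_sample=False)

    # Re-run the full sequence through the model via embeddings so we can
    # take gradients with respect to the input tokens.
    inputs_embeds = model.get_input_embeddings()(gen).detach().requires_grad_(True)
    logits = model(inputs_embeds=inputs_embeds).logits

    # Sum of log-probabilities of the generated tokens given their prefixes.
    targets = gen[0, prompt_len:]
    logprobs = torch.log_softmax(logits[0, prompt_len - 1 : -1], dim=-1)
    score = logprobs[torch.arange(len(targets)), targets].sum()
    score.backward()

    # Gradient-x-input attribution, aggregated per input token.
    attr = (inputs_embeds.grad * inputs_embeds).abs().sum(-1).squeeze(0)
    return (attr[:prompt_len].sum() / attr.sum()).item()

# Usage: compare prompt reliance of a base model and a detoxified
# counterpart (the second model name is hypothetical).
# base = prompt_dependence("gpt2", "You are such a")
# detox = prompt_dependence("my-org/gpt2-detox", "You are such a")
```

Comparing this score before and after counter-narrative fine-tuning or RL-driven detoxification would indicate whether a method shifts generations away from conditioning on the prompt, which is the kind of prompt-reliance difference the abstract refers to.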

Authors (4)
  1. Daniel Scalena (4 papers)
  2. Gabriele Sarti (21 papers)
  3. Malvina Nissim (52 papers)
  4. Elisabetta Fersini (8 papers)