How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis (2411.06424v3)
Abstract: Safety fine-tuning algorithms reduce harmful outputs in LLMs, yet their mechanisms remain under-explored. Direct Preference Optimization (DPO) is a popular choice of algorithm, but prior explanations, which attribute its effects solely to dampening toxic neurons in the MLP layers, are incomplete. In this study, we analyse four LLMs (Llama-3.1-8B, Gemma-2-2B, Mistral-7B, GPT-2-Medium) and show that toxic neurons account for only 2.5% to 24% of DPO's effects across models. Instead, DPO balances distributed activation shifts across all MLP neurons to create a net toxicity reduction. We attribute this reduction to four neuron groups, two aligned with reducing toxicity and two with promoting anti-toxicity, whose combined effects replicate DPO across models. To further validate this understanding, we develop an activation editing method that mimics DPO through distributed shifts along a toxicity representation. This method outperforms DPO in reducing toxicity while preserving perplexity, without requiring any weight updates. Our work provides a mechanistic understanding of DPO and introduces an efficient, tuning-free alternative for safety fine-tuning.
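The activation-editing idea described in the abstract can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: the per-layer toxicity direction `w_toxic` (here a random placeholder standing in for a probe-derived vector), the edit strength `alpha`, and the choice to hook every MLP output are all assumptions made for illustration. It only shows the general mechanism of shifting all MLP activations along a single direction at inference time, with no weight updates.

```python
# Sketch of distributed activation editing along a toxicity direction.
# Hypothetical reproduction of the idea, not the paper's exact method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-medium"  # one of the four models analysed in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

hidden = model.config.n_embd
# Placeholder toxicity direction; in practice this would come from e.g. a
# linear probe trained to separate toxic from non-toxic activations.
w_toxic = torch.randn(hidden)
w_toxic = w_toxic / w_toxic.norm()

alpha = 5.0  # edit strength (hypothetical hyperparameter)

def make_hook(direction, strength):
    def hook(module, inputs, output):
        # Shift every MLP output activation away from the toxicity direction.
        # This is a distributed edit: all neurons move slightly, rather than
        # zeroing out a small set of "toxic" neurons.
        return output - strength * direction.to(output.dtype)
    return hook

# Register the edit on every transformer block's MLP output.
handles = [
    block.mlp.register_forward_hook(make_hook(w_toxic, alpha))
    for block in model.transformer.h
]

prompt = "I can't believe you"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

# Remove the hooks to restore the unedited model.
for h in handles:
    h.remove()
```

Because the edit is applied through forward hooks, it is tuning-free and reversible, which matches the abstract's claim that no weight updates are required.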