Ablation is Not Enough to Emulate DPO: How Neuron Dynamics Drive Toxicity Reduction
Yushi Yang, Filip Sondej, Harry Mayne, Adam Mahdi
2024-11-12

Summary
This paper explores how a technique called Direct Preference Optimization (DPO) helps reduce harmful outputs in language models by examining the role of different neurons in the model's decision-making process.
What's the problem?
Language models can produce toxic or harmful content, and although safety fine-tuning methods are used to mitigate this, it is not clear how these methods work internally. Previous explanations suggested that DPO reduces toxicity mainly by dampening the activity of the most toxic neurons, but this account is incomplete.
What's the solution?
The authors conducted experiments to investigate the actual mechanisms behind DPO. By projecting neuron activation changes onto a toxicity probe, they found that only about 31.8% of the reduction in toxicity comes from dampening toxic neurons. Instead, DPO accumulates effects across multiple groups of neurons: some reduce how much the model writes in the toxic direction, others promote anti-toxicity, and many actually increase toxicity. The study shows that DPO reduces toxicity by balancing these opposing effects rather than simply turning down the bad neurons.
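As a rough sketch of this kind of attribution (not the authors' code), the snippet below projects each neuron's change in residual-stream writing onto a linear toxicity probe; all tensor names, shapes, and values are illustrative assumptions.

import torch

def toxicity_contributions(delta_act, W_out, probe):
    # delta_act: (n_neurons,) change in each neuron's mean activation
    #            (DPO model minus base model), averaged over a prompt set.
    # W_out:     (n_neurons, d_model) each row is a neuron's output weight
    #            vector, i.e. its writing direction into the residual stream.
    # probe:     (d_model,) linear toxicity probe direction.
    probe = probe / probe.norm()
    # A neuron's change to the residual stream is delta_act[i] * W_out[i];
    # projecting that onto the probe gives its contribution to toxicity.
    return delta_act * (W_out @ probe)

# Toy usage with random tensors (hypothetical sizes for a small model).
d_model, n_neurons = 768, 3072
delta_act = 0.01 * torch.randn(n_neurons)
W_out = torch.randn(n_neurons, d_model) / d_model ** 0.5
probe = torch.randn(d_model)

contrib = toxicity_contributions(delta_act, W_out, probe)
reduced = contrib[contrib < 0].sum().item()    # neurons pushing away from toxicity
increased = contrib[contrib > 0].sum().item()  # neurons pushing towards toxicity
print(f"net change {contrib.sum().item():.4f} (reduced {reduced:.4f}, increased {increased:.4f})")

Summing the negative and positive contributions separately is one way to see the balancing the authors describe: the net reduction is the sum of opposing per-neuron effects rather than the work of a few dampened neurons.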
Why it matters?
This research is important because it provides a deeper understanding of how language models can be made safer and more effective. By uncovering the actual neuron dynamics behind toxicity reduction, it gives developers a better basis for designing fine-tuning strategies, leading to AI that produces less harmful content and behaves more reliably in real-world applications.
Abstract
Safety fine-tuning algorithms are commonly used to fine-tune language models to reduce harmful outputs, but the exact internal mechanisms of how those models achieve this remain unclear. In studying direct preference optimisation (DPO) for toxicity reduction, current explanations claim that DPO works by dampening the most toxic MLP neurons to learn an offset to avert toxic regions in the residual stream. However, by ablating the most toxic neurons and applying activation patching, we find this explanation incomplete. By projecting neuron activation changes onto a toxicity probe, we find that only 31.8% of toxicity reduction comes from dampened toxic neurons. Instead, DPO reduces toxicity by accumulating effects across multiple neuron groups, both reducing writing in the toxic direction and promoting anti-toxicity in the residual stream. Moreover, DPO gives noisy adjustments to neuron activations, with many neurons actually increasing toxicity. This indicates that DPO is a balancing process between opposing neuron effects to achieve toxicity reduction.
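As an illustration of the ablation baseline the abstract refers to, the sketch below zero-ablates a hand-picked set of MLP neurons in GPT-2 with a forward hook and regenerates text; the model, layer, and neuron indices are placeholders, not the toxic neurons identified in the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# {layer index: neuron indices to ablate} -- illustrative choices only.
neurons_to_ablate = {10: [77, 102, 255]}

def make_hook(indices):
    def hook(module, inputs, output):
        # output: (batch, seq, 4 * d_model) pre-activation MLP hidden states.
        # Zeroing the chosen entries removes those neurons' writes to the
        # residual stream, since GELU(0) = 0.
        output[..., indices] = 0.0
        return output
    return hook

handles = [
    model.transformer.h[layer].mlp.c_fc.register_forward_hook(make_hook(idx))
    for layer, idx in neurons_to_ablate.items()
]

prompt = "You are such a"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False,
                         pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()  # restore the unablated model

The paper's finding is that this kind of ablation alone reproduces only part of DPO's effect, which is why comparing the ablated model against the DPO-tuned one reveals the additional neuron groups at work.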