Preference Tuning For Toxicity Mitigation Generalizes Across Languages
Xiaochen Li, Zheng-Xin Yong, Stephen H. Bach
2024-06-25

Summary
This paper studies Direct Preference Optimization (DPO), a preference-tuning method, as a way to reduce harmful or toxic outputs from multilingual large language models (LLMs). It shows that preference tuning on English data alone substantially reduces toxicity in the models' generations in many other languages as well.
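For readers unfamiliar with DPO, its core objective fits in a few lines of PyTorch. The sketch below is purely illustrative: it assumes preference pairs in which the "chosen" continuation is non-toxic and the "rejected" one is toxic, and it feeds in placeholder log-probabilities rather than the paper's actual models or data.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much more likely the policy makes each
    # continuation compared with the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the non-toxic (chosen) continuation above the toxic (rejected) one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of 4 preference pairs with random sequence log-probabilities.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```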
What's the problem?
As large language models are deployed globally, they can produce toxic or harmful content in many languages. Prior work on related safety tasks found that safety tuning in one language often transfers poorly to others, so it was unclear whether detoxification would generalize across languages, and collecting toxicity data for every language is costly.
What's the solution?
The authors show that DPO training on only English preference data substantially reduces toxic generations across 17 languages. For example, after training, the probability of the mGPT-1.3B model producing toxic continuations dropped from 46.8% to 3.9%. The same approach also works for other multilingual models such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools (causal intervention and activation analysis), they identify a dual multilinguality property of the models' MLP layers, which helps explain why English-only training generalizes across languages.
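As a rough illustration of how measuring "probability of toxic continuations" per language might be wired up with Hugging Face transformers, the sketch below samples continuations and scores them with a toxicity classifier. The prompt sets, sampling settings, and the `is_toxic` helper are placeholders, not the paper's actual evaluation pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical helper: returns True if `text` is judged toxic. In practice
# this would be a multilingual toxicity classifier or a scoring API; the
# paper's exact scorer is not reproduced here.
def is_toxic(text: str) -> bool:
    raise NotImplementedError("plug in a toxicity classifier")

def toxic_rate(model_name: str, prompts_by_lang: dict, n_tokens: int = 30) -> dict:
    """Fraction of sampled continuations judged toxic, per language."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    rates = {}
    for lang, prompts in prompts_by_lang.items():
        n_toxic = 0
        for prompt in prompts:
            inputs = tok(prompt, return_tensors="pt")
            out = model.generate(**inputs, do_sample=True, max_new_tokens=n_tokens)
            continuation = tok.decode(out[0, inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
            n_toxic += is_toxic(continuation)
        rates[lang] = n_toxic / len(prompts)
    return rates

# Example call; prompts would come from a multilingual toxicity benchmark:
# toxic_rate("ai-forever/mGPT", {"en": [...], "es": [...]})
```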
Why it matters?
This research is important because it offers a way to make multilingual language models safer and more reliable without collecting preference data for every language. By reducing how often these models generate toxic content, it helps ensure that AI systems can be deployed responsibly and effectively in diverse global contexts.
Abstract
Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools like causal intervention and activation analysis, we identify the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.
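Bilingual sentence retrieval is a common way to measure how well a model aligns representations across languages: embed parallel English/target-language sentences and check whether each target sentence's nearest English neighbor is its actual translation. The sketch below is a generic version of that idea; the mean-pooling strategy, layer choice, and use of `AutoModel` are assumptions rather than the paper's exact procedure.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def embed(texts, model, tok, layer=-1):
    """Mean-pool hidden states from one layer (pooling choice is an assumption)."""
    if tok.pad_token is None:                      # decoder-only tokenizers
        tok.pad_token = tok.eos_token
    enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc, output_hidden_states=True).hidden_states[layer]
    mask = enc["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

def retrieval_accuracy(english_sents, target_sents, model_name):
    """Fraction of target sentences whose most similar English embedding
    (by cosine similarity) is their actual translation."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    en, tgt = embed(english_sents, model, tok), embed(target_sents, model, tok)
    sims = torch.nn.functional.cosine_similarity(
        tgt.unsqueeze(1), en.unsqueeze(0), dim=-1)   # (n_target, n_english)
    hits = sims.argmax(dim=1) == torch.arange(len(target_sents))
    return hits.float().mean().item()
```

Higher retrieval accuracy for a given language would, under this framing, indicate better-aligned cross-lingual representations and thus better expected transfer of English-only preference tuning.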