
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

Hua Farn, Hsuan Su, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee

2024-12-30

Summary

This paper presents a method for improving the performance of large language models (LLMs) on downstream tasks while keeping them from generating harmful content. The method merges the weights of a task fine-tuned model with those of the original safety-aligned model it was trained from.

What's the problem?

When safety-aligned LLMs are fine-tuned for specific tasks, their safety alignment can degrade, meaning they may start producing harmful or inappropriate responses. Many current solutions address this by mixing additional safety data into fine-tuning, but collecting such data can be impractical and time-consuming. There is a need for a way to improve task performance without compromising safety.

What's the solution?

The authors propose a two-step approach called pre- and post-tuning model merging. First, they fine-tune the original safety-aligned LLM on a specific task. Then, they merge the original (pre-fine-tuning) model with the fine-tuned version by combining their weights, for example through weighted averaging; a minimal sketch follows below. This merging preserves the safety behavior of the original model while retaining the task gains from fine-tuning. Their experiments show that the method effectively reduces safety degradation while improving task performance across various models.
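Since the summary does not pin the approach to a single merging algorithm, here is a minimal sketch of one common instantiation: linear interpolation between the weights of the pre- and post-fine-tuning checkpoints. The model names, paths, and the coefficient ALPHA are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of pre-/post-tuning model merging via linear weight
# interpolation (one of several possible merging methods).
# Model names, paths, and ALPHA are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM

BASE = "meta-llama/Llama-2-7b-chat-hf"    # safety-aligned model, before fine-tuning (assumed)
TUNED = "./llama-2-7b-chat-task-sft"      # the same model after task fine-tuning (assumed path)
ALPHA = 0.5                               # weight given to the fine-tuned model (assumed value)

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float32)
tuned = AutoModelForCausalLM.from_pretrained(TUNED, torch_dtype=torch.float32)

base_sd = base.state_dict()
tuned_sd = tuned.state_dict()

# Interpolate every parameter tensor between the two checkpoints.
merged_sd = {
    name: (1.0 - ALPHA) * w_base + ALPHA * tuned_sd[name]
    for name, w_base in base_sd.items()
}

base.load_state_dict(merged_sd)                        # reuse the base model as a container
base.save_pretrained("./llama-2-7b-chat-task-merged")  # task-capable, safety-preserving checkpoint
```

The single coefficient ALPHA trades off task performance against safety retention; in practice it would be chosen by evaluating the merged model on both a task benchmark and a safety benchmark.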

Why it matters?

This research is important because it offers a practical solution for maintaining safety in AI systems while still allowing them to perform well on specific tasks. By improving how LLMs can be adapted for different uses without sacrificing safety, this method can help make AI tools more reliable and trustworthy in real-world applications.

Abstract

Fine-tuning large language models (LLMs) for downstream tasks is a widely adopted approach, but it often leads to safety degradation in safety-aligned LLMs. Currently, many solutions address this issue by incorporating additional safety data, which can be impractical in many cases. In this paper, we address the question: How can we improve downstream task performance while preserving safety in LLMs without relying on additional safety data? We propose a simple and effective method that maintains the inherent safety of LLMs while enhancing their downstream task performance: merging the weights of pre- and post-fine-tuned safety-aligned models. Experimental results across various downstream tasks, models, and merging methods demonstrate that this approach effectively mitigates safety degradation while improving downstream task performance, offering a practical solution for adapting safety-aligned LLMs.
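The abstract mentions experimenting with multiple merging methods. Many weight-merging schemes, including the linear interpolation sketched above, can equivalently be written as a task-vector update: add a scaled copy of the parameter change introduced by fine-tuning back onto the safety-aligned base weights. The sketch below illustrates that formulation; the scaling factor `lam` is an illustrative assumption, not a value from the paper.

```python
# Task-vector view of the merge: w_merged = w_base + lam * (w_tuned - w_base).
# Mathematically equivalent to linear interpolation with coefficient lam.
from typing import Dict
import torch

def task_vector_merge(base_sd: Dict[str, torch.Tensor],
                      tuned_sd: Dict[str, torch.Tensor],
                      lam: float = 0.5) -> Dict[str, torch.Tensor]:
    """Merge by adding a scaled fine-tuning delta onto the base weights."""
    return {
        name: w_base + lam * (tuned_sd[name] - w_base)
        for name, w_base in base_sd.items()
    }
```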