Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance

Jiawen Zhang, Lipeng He, Kejia Chen, Jian Lou, Jian Liu, Xiaohu Yang, Ruoxi Jia

2026-01-09

Summary

This paper investigates a surprising finding about large language models (LLMs) – you can restore their safety features with just *one* example of what's considered a safe response, even after they've been made unsafe through further training.

What's the problem?

When you try to improve a language model by fine-tuning it (basically, giving it more training data), you can accidentally make it less safe. It might start generating harmful or inappropriate responses. Fixing this usually takes a lot of extra effort: many examples of safe responses and significant computing power, and the fix itself often makes the model worse at its original tasks.

What's the solution?

The researchers discovered that you don't need a huge number of safety examples to fix this problem. In fact, just *one* carefully chosen example of a safe response can effectively bring the safety features back, without hurting the model's overall performance. They found this works consistently, no matter how many harmful examples were used in fine-tuning or how big the model itself is, and the fix converges quickly, within just a few training epochs. They believe this works because the gradient that restores safety is low-rank: the information needed lives in a small, focused part of the model's internal workings.
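To build intuition for the low-rank idea, here is a toy sketch (not the paper's actual method or models): for a plain linear layer with squared-error loss, the gradient on a *single* example is an outer product, so it is exactly rank 1. This loosely mirrors why a single safety example can steer the weights along a small, focused direction.

```python
import numpy as np

# Toy setting: linear layer y = W x, loss L = 0.5 * ||W x - y_safe||^2.
# For one example, dL/dW = outer(error, x) -- a rank-1 matrix.
rng = np.random.default_rng(0)
d_out, d_in = 8, 16
W = rng.standard_normal((d_out, d_in))

x = rng.standard_normal(d_in)        # the single input example
y_safe = rng.standard_normal(d_out)  # its desired "safe" target output

err = W @ x - y_safe                 # output error on this one example
grad = np.outer(err, x)              # single-example gradient of L w.r.t. W

print(np.linalg.matrix_rank(grad))   # -> 1: one example gives a rank-1 gradient

# A few gradient steps on just this one example drive its loss toward zero,
# touching only a rank-1 direction of W at each step.
lr = 0.01
for _ in range(200):
    err = W @ x - y_safe
    W -= lr * np.outer(err, x)
print(np.linalg.norm(W @ x - y_safe))  # residual error, now near zero
```

This is only an analogy: real LLM losses and architectures are far more complex, but the paper's finding is that the safety-restoring gradient direction is similarly low-rank, which is why a single well-chosen example suffices.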

Why it matters?

This is a big deal because it makes it much easier and cheaper to keep language models safe. Previously, maintaining safety while improving a model was a major challenge. This new method offers a simple and efficient way to correct safety issues, making it more practical to deploy and use these powerful AI tools responsibly.

Abstract

Fine-tuning safety-aligned large language models (LLMs) can substantially compromise their safety. Previous approaches require many safety samples or calibration sets, which not only incur significant computational overhead during realignment but also lead to noticeable degradation in model utility. Contrary to this belief, we show that safety alignment can be fully recovered with only a single safety example, without sacrificing utility and at minimal cost. Remarkably, this recovery is effective regardless of the number of harmful examples used in fine-tuning or the size of the underlying model, and convergence is achieved within just a few epochs. Furthermore, we uncover the low-rank structure of the safety gradient, which explains why such efficient correction is possible. We validate our findings across five safety-aligned LLMs and multiple datasets, demonstrating the generality of our approach.