Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
2025-01-30

Summary
This paper introduces a new way to trick AI language models into behaving badly, even when safety measures are in place. The researchers created a method called 'Virus' that sneaks harmful training data past the AI's security guards.
What's the problem?
Big AI language models can be made to say or do bad things if they're fine-tuned on even a small amount of harmful data. To prevent this, companies use 'guardrails' to filter out bad data before training. But these guardrails aren't perfect, and people might still find ways to slip dangerous information through.
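To make the guardrail step concrete, here is a minimal sketch of moderation-based data filtering before fine-tuning. It is not the paper's implementation: the keyword scorer and the 0.1 threshold below are illustrative stand-ins for a real learned safety classifier.

```python
# Minimal sketch of guardrail moderation before fine-tuning (illustrative only).
# toy_harmfulness_score is a placeholder for a real moderation / safety classifier.
from typing import Callable, List

def toy_harmfulness_score(sample: str) -> float:
    """Placeholder scorer: fraction of words that hit a small blocklist."""
    blocklist = {"explosive", "malware", "poison"}
    words = sample.lower().split()
    return sum(w in blocklist for w in words) / max(len(words), 1)

def moderate(samples: List[str],
             score_fn: Callable[[str], float],
             threshold: float = 0.1) -> List[str]:
    """Keep only the samples whose harmfulness score falls below the threshold."""
    return [s for s in samples if score_fn(s) < threshold]

if __name__ == "__main__":
    candidates = [
        "How do I bake sourdough bread?",
        "Explain how to write malware that steals passwords.",
    ]
    print(moderate(candidates, toy_harmfulness_score))
    # Only the benign sample survives the filter.
```

Anything that scores below the threshold is passed on to fine-tuning unchanged, which is exactly the gap a Virus-style attack exploits.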
What's the solution?
The researchers developed a method called Virus that changes harmful data just enough that the guardrails don't catch it. The modified data still makes the AI behave badly, but it looks innocent to the safety filters. In their tests, Virus slipped all of its harmful data past the guardrail undetected, a 100% leakage ratio, while still successfully attacking the model. The sketch below illustrates the trade-off such an attack has to balance.
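The following is a conceptual illustration, not the authors' algorithm: it shows the dual objective a Virus-style attack must balance, keeping the guardrail's harmfulness score low while preserving the sample's harmful training effect. All scoring and mutation functions are hypothetical placeholders supplied by the caller.

```python
# Conceptual sketch of guardrail evasion as a dual objective (not the paper's method):
# drive the guardrail's harmfulness score down while penalizing loss of attack effect.
from typing import Callable

def evade_guardrail(sample: str,
                    guardrail_score: Callable[[str], float],  # harmfulness in [0, 1]
                    attack_utility: Callable[[str], float],   # harmful training effect retained
                    mutate: Callable[[str], str],             # proposes a small edit to the sample
                    lam: float = 1.0,
                    steps: int = 200) -> str:
    """Greedy local search over small edits of `sample`."""
    def objective(s: str) -> float:
        # Lower is better: low guardrail score, high retained attack utility.
        return guardrail_score(s) - lam * attack_utility(s)

    best = sample
    for _ in range(steps):
        candidate = mutate(best)
        if objective(candidate) < objective(best):
            best = candidate
    return best
```

The weight `lam` controls how much evasion is allowed to sacrifice attack effect; a real attack would use a far more targeted optimization over the data itself, but the tension between the two terms is the same.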
Why does it matter?
This matters because it shows that current safety measures for AI aren't good enough. It's like discovering that a school's security checkpoint can be fooled by disguising contraband as something harmless. This is a wake-up call for AI developers: they need better ways to keep AI safe and well-behaved. It also warns us not to rely too heavily on simple safety filters, because clever attackers may find ways around them. Understanding these risks can help make future AI systems more secure and trustworthy.
Abstract
Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks: models lose their safety alignment after being fine-tuned on just a few harmful samples. To mitigate this risk, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we show in this paper that relying purely on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses the guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus is not detectable by the guardrail, with a leakage ratio of up to 100%, while simultaneously achieving superior attack performance. Finally, the key message we want to convey is that it is reckless to treat guardrail moderation as a last resort against harmful fine-tuning attacks, as it cannot solve the inherent safety issue of pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus.