
Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah, Bernard Ghanem

2025-08-29


Summary

This paper explores a new way to make large language models (LLMs) safer by strengthening their ability to refuse harmful requests, without any additional fine-tuning.

What's the problem?

LLMs are designed to be helpful, but they can sometimes be tricked into generating harmful content. Researchers have found that the safety behavior built into these models can be bypassed by ablating (removing) a specific internal direction that mediates refusals. Essentially, the 'off switch' for harmful responses can be disabled.

What's the solution?

The researchers developed a technique called Rank-One Safety Injection, or ROSI. Instead of trying to *remove* harmful behavior through further training, ROSI *boosts* the model's existing safety mechanism. It does this by making a small, permanent change to the model's weights: a rank-one update applied to the matrices that write into the model's residual stream. This update steers the model's internal activations more strongly toward the direction associated with refusing harmful requests. Importantly, this is done without any further training of the model, making it cheap to apply.
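To make the idea concrete, here is a minimal sketch in NumPy using synthetic activations. The difference-of-means estimator for the safety direction, the scaling factor alpha, and the exact update form W + alpha * r r^T W are illustrative assumptions for this sketch, not necessarily the paper's precise formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_pairs = 64, 128, 32

# Stand-ins for residual-stream activations collected at one layer while the
# model reads harmful vs. harmless instructions (synthetic data for illustration).
harmful_acts = rng.normal(size=(n_pairs, d_model)) + 1.0
harmless_acts = rng.normal(size=(n_pairs, d_model))

# Assumed estimator: difference of means between the two activation sets,
# normalized to a unit "safety" (refusal-mediating) direction.
safety_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
safety_dir /= np.linalg.norm(safety_dir)

# A stand-in for one residual-stream write matrix (e.g., an attention or MLP
# output projection) mapping a d_hidden input into the d_model residual stream.
W = rng.normal(size=(d_model, d_hidden)) * 0.02

# Rank-one injection: amplify the component of W's output that lies along the
# safety direction. alpha and the update form are assumptions of this sketch.
alpha = 0.1
W_rosi = W + alpha * np.outer(safety_dir, safety_dir @ W)

print("update rank:", np.linalg.matrix_rank(W_rosi - W))  # -> 1
```

In the actual method, an update of this kind is applied to every residual-stream write matrix in the model, so the change is permanent and incurs no extra cost at inference time.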

Why it matters?

This work is important because it offers a simple and effective way to improve the safety of LLMs. It is a relatively 'cheap' fix compared to retraining or fine-tuning the entire model, and it can even re-align models that were previously 'uncensored'. This could be a valuable tool for developers looking to deploy LLMs responsibly, making them less likely to generate dangerous or inappropriate responses.

Abstract

Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates, as evaluated by Llama Guard 3, while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and ARC. Furthermore, we show that ROSI can also re-align 'uncensored' models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.
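The abstract frames ROSI as the inverse of directional ablation: ablation erases the refusal-mediating direction from a write matrix, while ROSI amplifies it. A brief sketch of that contrast, where the unit refusal direction r, the scale 0.1, and the matrix shapes are all illustrative assumptions rather than the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_hidden = 64, 128

# Illustrative unit refusal direction and one residual-stream write matrix.
r = rng.normal(size=d_model)
r /= np.linalg.norm(r)
W = rng.normal(size=(d_model, d_hidden)) * 0.02

P = np.outer(r, r)              # projector onto the refusal direction

W_ablated = W - P @ W           # ablation-style edit: erase the refusal component
W_rosi = W + 0.1 * P @ W        # ROSI-style edit: amplify it (0.1 is an assumed scale)

# The ablated matrix can no longer write along r; the injected one writes more.
print(np.linalg.norm(r @ W_ablated))                        # ~ 0
print(np.linalg.norm(r @ W_rosi) / np.linalg.norm(r @ W))   # 1.1
```

Both edits are rank-one, which is why "turning the spell around" requires no more machinery than the jailbreak it counters.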