ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
Yein Park, Jungwoo Park, Jaewoo Kang
2026-04-17
Summary
This paper investigates a weakness in large language models (LLMs) where they can be tricked into doing harmful things simply by changing the wording of a request, even if they're designed to be safe. It then proposes a way to fix this specific problem without messing up the model's overall performance.
What's the problem?
Even though LLMs are built with safety measures, they're surprisingly easy to 'jailbreak'. This means someone can get the model to generate harmful content by making small changes to the request, like changing the tense of verbs (past tense versus present tense). This shows that the safety mechanisms aren't as robust as they need to be, and that we don't fully understand *why* these models fail in these situations. Essentially, the models are refusing requests based on surface-level cues instead of an actual understanding of harm.
What's the solution?
The researchers developed a technique called Activation-Scaling Guard (ASGuard). First, they figured out exactly *which* parts of the model's 'brain' (specifically, attention heads) are responsible for the weak refusal behavior when the tense is changed. Then, they trained a scaling vector that adjusts how strongly those specific parts respond to input, making them harder to trick. Finally, they briefly fine-tuned the model to lock in this more robust refusal mechanism. The approach was validated across four different LLMs.
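The central intervention, a channel-wise scaling vector applied to the activations of the identified attention heads, can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions: the head indices, tensor shapes, and the fixed `scales` values are hypothetical, and in ASGuard the scaling vectors are learned rather than hand-set.

```python
import numpy as np

def scale_vulnerable_heads(head_outputs, vulnerable_heads, scales):
    """Recalibrate flagged attention heads with channel-wise scaling.

    head_outputs: array of shape (n_heads, seq_len, head_dim) for one layer
    vulnerable_heads: indices of heads implicated by circuit analysis
    scales: dict mapping head index -> (head_dim,) scaling vector
    """
    out = head_outputs.copy()
    for h in vulnerable_heads:
        # Each channel of the head's activation is rescaled independently;
        # all other heads pass through untouched.
        out[h] = out[h] * scales[h]
    return out

# Toy example: 4 heads, sequence length 2, head dimension 3.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 2, 3))
scales = {2: np.array([0.5, 1.0, 0.1])}  # dampen head 2's channels (illustrative values)
guarded = scale_vulnerable_heads(acts, [2], scales)
```

Because the vector acts per channel on a handful of heads, the intervention stays surgical: the rest of the network's computation is left exactly as it was.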
Why does it matter?
This work is important because it shows that we can improve the safety of LLMs by looking *inside* the model to understand how it works, rather than just trying to train it with more examples. It provides a targeted and efficient way to address a specific vulnerability, and it balances safety with the model's ability to still be useful. It points towards a future where AI safety is more reliable and we can better understand *why* an AI makes a certain decision.
Abstract
Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking, in which models that refuse a harmful request often comply when it is rephrased in the past tense, reveals a critical generalization gap in current alignment methods, whose underlying mechanisms remain poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), a mechanistically informed framework that surgically mitigates this specific vulnerability. First, we use circuit analysis to identify the attention heads causally linked to a targeted jailbreak such as the tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activations of these tense-vulnerable heads. Finally, we apply this vector during "preventative fine-tuning," forcing the model to learn a more robust refusal mechanism. Across four LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility. Based on mechanistic analysis, our findings show how adversarial suffixes suppress the propagation of the refusal-mediating direction. More broadly, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course toward more reliable and interpretable AI safety.