Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

Jiawei Zhang, Andrew Estornell, David D. Baek, Bo Li, Xiaojun Xu

2025-10-22

Summary

This paper investigates a weakness in large language models (LLMs): their safety training reliably blocks harmful requests at the very start of a response, but it breaks down once a harmful continuation is already underway. The researchers introduce a method called Any-Depth Alignment (ADA) that fixes this, enabling LLMs to refuse at any point during generation.

What's the problem?

LLMs are fairly good at refusing harmful questions when the harmful intent is obvious right at the start of their reply. However, their safety mechanisms often fail when an attacker uses an adversarial prompt that steers the model toward harmful territory, or pre-fills the beginning of the model's response with harmful text so that the model simply keeps going. In those cases the model can end up generating dangerous or inappropriate content even though it's supposed to be safe. Essentially, the safety is only 'skin deep': it protects the opening tokens of a response but doesn't hold deeper into the generation.

What's the solution?

The researchers noticed that an LLM's safety behavior is concentrated in the 'assistant header', the small set of template tokens that mark the beginning of each assistant turn, because refusal training repeatedly exercises exactly that position. ADA works by re-inserting these header tokens partway through the model's response, as if a fresh assistant turn were starting. This acts like a reminder, prompting the model to reassess whether it's still on a safe path and to recover a refusal if it has started generating something harmful. Importantly, the method doesn't require any changes to the LLM's parameters; it's applied purely at inference time while the response is being generated, as in the sketch below.
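
Below is a minimal, hedged sketch of this header-reinsertion probe, assuming a Hugging Face chat model. The model name, the Llama-3-style header string, and the toy "I vs. Sure" refusal heuristic are illustrative assumptions, not the paper's exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # any open chat model with an assistant header

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

# Llama-3-style assistant header; other model families use different template tokens.
ASSISTANT_HEADER = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

def ada_probe(prompt: str, partial_response: str) -> bool:
    """Re-insert the assistant header after a partial response and check whether the
    model's next-token distribution now leans toward a refusal."""
    messages = [{"role": "user", "content": prompt}]
    prefix = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Context = normal chat prefix + whatever has been generated so far + a fresh header.
    probe_text = prefix + partial_response + ASSISTANT_HEADER
    inputs = tok(probe_text, return_tensors="pt", add_special_tokens=False).to(model.device)

    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]

    # Toy decision rule for illustration: does a refusal opener ("I", as in "I can't ...")
    # outrank a compliance opener ("Sure")? The paper's actual scoring is more careful.
    refusal_id = tok.encode("I", add_special_tokens=False)[0]
    comply_id = tok.encode("Sure", add_special_tokens=False)[0]
    return next_logits[refusal_id] > next_logits[comply_id]
```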

Why it matters?

This research is important because it addresses a significant safety concern with LLMs. As these models become more powerful and are used in more applications, it's crucial to ensure they don't generate harmful content. ADA offers a practical and effective way to improve the safety of LLMs without retraining them, making it a valuable tool for developers and users alike. It also shows that the safety alignment already baked into these models is stronger than their shallow refusal behavior suggests, and that it can be leveraged to keep generations safe all the way through.

Abstract

Large Language Models (LLMs) exhibit strong but shallow alignment: they directly refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway (either through adversarial attacks or via harmful assistant-prefill attacks). This raises a fundamental question: Can the innate shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an effective inference-time defense with negligible overhead. ADA is built on our observation that alignment is concentrated in the assistant header tokens through repeated use in shallow-refusal training, and these tokens possess the model's strong alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover refusals at any point in generation. Across diverse open-source model families (Llama, Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety performance without requiring any changes to the base model's parameters. It secures a near-100% refusal rate against challenging adversarial prefill attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces the average success rate of prominent adversarial prompt attacks (such as GCG, AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving utility on benign tasks with minimal over-refusal. ADA maintains this resilience even after the base model undergoes subsequent instruction tuning (benign or adversarial).
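
As a rough illustration of how such a probe could act as an inference-time guard, the sketch below (reusing `tok`, `model`, and `ada_probe` from the earlier sketch) runs plain greedy decoding and re-checks safety every few dozen tokens, emitting a refusal if the probe fires. The check interval, the placeholder refusal text, and the greedy decoding loop are assumptions for illustration, not the paper's implementation.

```python
def guarded_generate(prompt: str, max_new_tokens: int = 256, check_every: int = 32) -> str:
    """Greedy decoding with periodic ADA-style safety checks (illustrative only)."""
    messages = [{"role": "user", "content": prompt}]
    ids = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    prompt_len = ids.shape[-1]

    for step in range(max_new_tokens):
        with torch.no_grad():
            next_logits = model(ids).logits[0, -1]
        next_id = next_logits.argmax().view(1, 1)
        ids = torch.cat([ids, next_id], dim=-1)

        if next_id.item() == tok.eos_token_id:
            break
        # Every `check_every` tokens, re-insert the assistant header and ask the
        # model to reassess what it has generated so far.
        if (step + 1) % check_every == 0:
            partial = tok.decode(ids[0, prompt_len:], skip_special_tokens=True)
            if ada_probe(prompt, partial):
                return "I can't help with that."  # placeholder refusal message
    return tok.decode(ids[0, prompt_len:], skip_special_tokens=True)
```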