Steering When Necessary: Flexible Steering Large Language Models with Backtracking
Jinwei Gan, Zifeng Cheng, Zhiwei Jiang, Cong Wang, Yafeng Yin, Xiang Luo, Yuchen Fu, Qing Gu
2025-08-27
Summary
This paper introduces a new method called Flexible Activation Steering with Backtracking, or FASB, to make large language models (LLMs) behave more predictably and truthfully without needing to completely retrain them.
What's the problem?
Large language models are very good at generating text, but getting them to consistently produce accurate and helpful responses is hard. Existing steering methods either intervene indiscriminately on everything the model generates, or decide how much to intervene based only on the initial question, which is imprecise. They also tend to react only *after* the model has already gone wrong, by which point the mistake is harder to correct.
What's the solution?
FASB works by tracking the model's internal states (its activations) *as* it generates text. It looks at both the original question and the tokens generated so far to decide whether intervention is needed and, if so, how strong it should be. If it detects the model starting to deviate, it doesn't just adjust the next token: it backtracks, discarding the recently generated tokens and regenerating them with steering applied, which corrects errors more effectively. A rough sketch of this loop is shown below.
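The following is a minimal, self-contained sketch of the detect-and-backtrack idea, not the authors' implementation: the toy "model", the linear probe `probe_w`, the steering direction `steer_vec`, and the parameters `threshold`, `backtrack_k`, and `strength` are all illustrative assumptions.

```python
# Toy sketch of flexible steering with backtracking (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 50                       # toy hidden size and vocabulary size
emb = rng.normal(size=(vocab, d))       # token embeddings standing in for an LLM
probe_w = rng.normal(size=d)            # linear probe scoring the "desired behavior"
steer_vec = rng.normal(size=d)          # steering direction added to hidden states
steer_vec /= np.linalg.norm(steer_vec)

def step(prefix, strength=0.0):
    """Compute a hidden state for the prefix, optionally steered, and pick the next token."""
    h = emb[prefix].mean(axis=0) + strength * steer_vec
    logits = emb @ h
    return h, int(np.argmax(logits))

def generate(prompt, max_new=20, threshold=0.0, backtrack_k=3, strength=2.0):
    tokens = list(prompt)
    for _ in range(max_new):
        h, tok = step(tokens)
        if probe_w @ h < threshold:                     # probe flags a deviation
            tokens = tokens[:-backtrack_k] or list(prompt)  # discard recent tokens
            h, tok = step(tokens, strength)             # regenerate with steering applied
        tokens.append(tok)
    return tokens

print(generate([1, 2, 3]))
```

Steering is applied only on the regenerated steps where the probe fires, which is the "steering when necessary" intuition; the rest of the generation proceeds unmodified.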
Why it matters?
This research matters because it offers a cheaper, more efficient way to improve the reliability of large language models. Instead of expensive retraining, FASB adjusts the model's behavior on the fly at inference time, making it more trustworthy across a variety of applications. In the paper's experiments it outperforms existing methods at producing truthful answers on TruthfulQA and at answering questions on six multiple-choice benchmarks.
Abstract
Large language models (LLMs) have achieved remarkable performance across many generation tasks. Nevertheless, effectively aligning them with desired behaviors remains a significant challenge. Activation steering is an effective and cost-efficient approach that directly modifies the activations of LLMs during the inference stage, aligning their responses with the desired behaviors and avoiding the high cost of fine-tuning. Existing methods typically intervene indiscriminately in all generations or rely solely on the question to determine intervention, which limits the accurate assessment of the intervention strength. To this end, we propose the Flexible Activation Steering with Backtracking (FASB) framework, which dynamically determines both the necessity and strength of intervention by tracking the internal states of the LLMs during generation, considering both the question and the generated content. Since intervening after detecting a deviation from the desired behavior is often too late, we further propose the backtracking mechanism to correct the deviated tokens and steer the LLMs toward the desired behavior. Extensive experiments on the TruthfulQA dataset and six multiple-choice datasets demonstrate that our method outperforms baselines. Our code will be released at https://github.com/gjw185/FASB.
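To make the basic activation-steering operation the abstract refers to more concrete, here is a generic, hedged sketch of adding a steering vector to a layer's activations at inference time via a forward hook. The toy layer, `steer_vec`, and `strength` are assumptions for illustration and do not reproduce the FASB implementation.

```python
# Minimal sketch of inference-time activation steering with a forward hook (illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 8
layer = nn.Linear(hidden, hidden)       # stands in for one transformer block
steer_vec = torch.randn(hidden)
steer_vec = steer_vec / steer_vec.norm()
strength = 1.5                          # intervention strength (tunable)

def steering_hook(module, inputs, output):
    # Shift the layer's activations along the steering direction at inference time.
    return output + strength * steer_vec

handle = layer.register_forward_hook(steering_hook)
x = torch.randn(1, hidden)
print(layer(x))                         # activations shifted along steer_vec
handle.remove()                         # detach the hook when steering is not needed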