SteeringControl: Holistic Evaluation of Alignment Steering in LLMs

Vincent Siu, Nicholas Crispino, David Park, Nathan W. Henry, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang

2025-09-18

Summary

This paper introduces a new way to test how well we can control what large language models say and do, focusing on making them safer and more reliable.

What's the problem?

Currently, when people try to 'steer' these models – meaning, change their behavior to avoid things like biased responses, harmful suggestions, or just making things up – it's hard to know if those changes actually work and if they cause *other* unexpected problems. Previous research often looked at only one or two things at a time, like truthfulness, and didn't explore the complex trade-offs involved in changing a model's core behavior. It's like adjusting one setting on a car and hoping it doesn't mess up something else.

What's the solution?

The researchers created a benchmark called SteeringControl. They built it around five popular steering methods, implemented in a modular framework, and tested them on two language models, Qwen-2.5-7B and Llama-3.1-8B. They specifically measured how well these techniques reduced bias, harmful outputs, and hallucinations, and also checked for unintended side effects such as the model becoming overly agreeable (sycophancy) or losing common sense. To support this, they collected a dataset covering these different behaviors and how they interact with each other.
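To make "steering" concrete, here is a minimal toy sketch of one common family of techniques the paper evaluates: computing a difference-of-means direction from a model's hidden activations and shifting new activations along it. All names, shapes, and the toy data are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Unit direction from mean activations on behavior-positive vs. behavior-negative prompts.
    (Hypothetical difference-of-means construction; a common steering building block.)"""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along v; a negative alpha suppresses the target behavior."""
    return hidden + alpha * v

# Toy activations standing in for a real model's hidden states.
rng = np.random.default_rng(0)
d = 16
pos = rng.normal(1.0, 0.1, size=(8, d))   # e.g. activations on harmful prompts
neg = rng.normal(-1.0, 0.1, size=(8, d))  # e.g. activations on benign prompts

v = steering_vector(pos, neg)
h = rng.normal(size=d)                    # a fresh hidden state to steer
h_steered = steer(h, v, alpha=-2.0)

# The projection onto the "harmful" direction decreases after steering.
print(h_steered @ v < h @ v)
```

The paper's point is that an edit like this is not free: the same direction can be entangled with other concepts, so suppressing one behavior may also shift unrelated ones, which is exactly the trade-off SteeringControl measures.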

Why it matters?

This work matters because it shows that steering a language model is not a one-size-fits-all fix: how well it works depends on the specific method, the specific model, and the specific behavior being targeted, and a poor combination of the three can cause severe side effects on other behaviors. Understanding these trade-offs is crucial for building AI systems that are both helpful and safe, and this benchmark provides a tool to investigate those relationships systematically.

Abstract

We introduce SteeringControl, a benchmark for evaluating representation steering methods across core alignment objectives--bias, harmful generation, and hallucination--and their effects on secondary behaviors such as sycophancy and commonsense morality. While prior alignment work often highlights truthfulness or reasoning ability to demonstrate the side effects of representation steering, we find there are many unexplored tradeoffs not yet understood in a systematic way. We collect a dataset of safety-relevant primary and secondary behaviors to evaluate steering effectiveness and behavioral entanglement centered around five popular steering methods. To enable this, we craft a modular steering framework based on unique components that serve as the building blocks of many existing methods. Our results on Qwen-2.5-7B and Llama-3.1-8B find that strong steering performance is dependent on the specific combination of steering method, model, and targeted behavior, and that severe concept entanglement can result from poor combinations of these three as well. We release our code here: https://github.com/wang-research-lab/SteeringControl.git.