How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities
Ziwen Xu, Kewei Xu, Haoming Xu, Haiwen Hong, Longtao Huang, Hui Xue, Ningyu Zhang, Yongliang Shen, Guozhou Zheng, Huajun Chen, Shumin Deng
2026-03-04
Summary
This paper introduces SteerEval, a benchmark for testing how reliably we can control the responses of large language models, which are becoming more common in everyday applications.
What's the problem?
Large language models can be unpredictable, sometimes saying things we don't want them to, or acting in ways that aren't consistent. It's hard to know *how* well we can actually steer these models to behave as intended, and current methods for controlling them don't always work well when you need very specific control over their output.
What's the solution?
The researchers created a benchmark called SteerEval. It tests control in three domains: the language used, the emotional tone (sentiment), and the model's 'personality'. Within each domain, control is broken down into three levels of specificity: what the model should say (L1), how it should say it (L2), and exactly how to phrase it (L3). Using this benchmark to test existing control methods, they found that control often gets weaker as the requirements become more precise.
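To make the three-level structure concrete, here is a minimal, purely illustrative sketch of what one such test case might look like. The spec format, field names, and keyword checks below are all invented for illustration; the paper's actual benchmark and scoring are not described at this level of detail.

```python
# Hypothetical sketch of a hierarchical controllability test case
# (all names and checks invented; simple keyword tests stand in for
# whatever scoring SteerEval actually uses).

SPEC = {
    "domain": "sentiment",
    # L1 (what to express): any positive wording counts.
    "L1": {"instruction": "Express a positive sentiment.",
           "check": lambda r: any(w in r.lower()
                                  for w in ("great", "love", "wonderful"))},
    # L2 (how to express it): positivity conveyed with enthusiasm.
    "L2": {"instruction": "Express the positivity enthusiastically.",
           "check": lambda r: "!" in r},
    # L3 (how to instantiate it): a specific required phrasing.
    "L3": {"instruction": "Use the exact phrase 'absolutely wonderful'.",
           "check": lambda r: "absolutely wonderful" in r.lower()},
}

def controllability_profile(response: str, spec: dict) -> dict:
    """Report, per level, whether the response satisfies the spec."""
    return {lvl: spec[lvl]["check"](response) for lvl in ("L1", "L2", "L3")}

# A response can pass the coarse level yet fail the finer ones, which is
# the degradation pattern the benchmark is designed to surface.
profile = controllability_profile("I love this, it is absolutely wonderful!", SPEC)
```

A response like "This is great." would pass L1 but fail L2 and L3 under these toy checks, mirroring the paper's finding that steering methods tend to succeed at coarse behavioral intent while failing at finer-grained instantiation.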
Why it matters?
This work is important because it provides a clear and organized way to evaluate and improve the safety and reliability of large language models. By understanding where current control methods fall short, researchers can develop better techniques to ensure these models behave responsibly and predictably, especially when used in sensitive situations.
Abstract
Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.