CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection
Seonglae Cho, Zekun Wu, Adriano Koshiyama
2025-08-20
Summary
This paper introduces a new technique called CorrSteer to better control large language models (LLMs) by picking out important features, making them more helpful for specific tasks like answering questions or reducing bias. It's an automated way to make LLMs perform better on various tasks using fewer examples.
What's the problem?
Existing methods for making LLMs do specific things, like answering questions accurately or behaving in a certain way, often need a lot of example data to show the model what's right and wrong. They also may require storing large amounts of the model's internal activations, which is difficult and expensive. This limits how effectively we can guide these powerful models.
What's the solution?
The researchers developed CorrSteer, which scores each of a sparse autoencoder's internal features by how strongly its activation correlates with whether the model's generated answer is correct. Because it uses only the activations produced during normal generation, it needs no pre-made contrastive example pairs and no large stores of saved activations, which also helps it avoid spurious correlations. It then steers the model along the selected features, setting each steering strength from that feature's average activation.
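The selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random activations, the sample counts, and the choice of Pearson correlation over per-sample mean activations are all stand-in assumptions.

```python
import numpy as np

# Hypothetical data: SAE feature activations averaged over each sample's
# generated tokens (n_samples x n_features), plus per-sample correctness (0/1).
rng = np.random.default_rng(0)
n_samples, n_features = 4000, 1024
activations = rng.random((n_samples, n_features))
correct = rng.integers(0, 2, size=n_samples)

def select_features(activations, correct, k=5):
    """Rank SAE features by the Pearson correlation between each
    feature's activation and sample correctness."""
    acts_centered = activations - activations.mean(axis=0)
    corr_centered = correct - correct.mean()
    # Covariance of each feature column with correctness, then normalize.
    cov = acts_centered.T @ corr_centered / len(correct)
    denom = activations.std(axis=0) * correct.std()
    r = np.where(denom > 0, cov / denom, 0.0)
    top = np.argsort(-np.abs(r))[:k]
    return top, r[top]

top_features, scores = select_features(activations, correct)

# One plausible reading of "steering coefficients from average activations":
# each selected feature's mean activation on correct samples.
coefs = activations[correct == 1][:, top_features].mean(axis=0)
```

With real data, `activations` would come from running the SAE on the model's hidden states during generation, and `correct` from scoring the model's answers against the benchmark.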
Why it matters?
This approach is important because it provides a simpler and more efficient way to improve how LLMs perform on a variety of tasks. By finding relevant features automatically, it saves time and resources. The improvements seen in tasks like question answering, reducing bias, and preventing harmful outputs show that this method can make LLMs more reliable and useful in real-world applications.
Abstract
Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the requirement for contrastive datasets or large activation storage. To address these limitations, we propose CorrSteer, which selects features by correlating sample correctness with SAE activations from generated tokens at inference time. This approach uses only inference-time activations to extract more relevant features, thereby avoiding spurious correlations. It also obtains steering coefficients from average activations, automating the entire pipeline. Our method shows improved task performance on QA, bias mitigation, jailbreaking prevention, and reasoning benchmarks on Gemma 2 2B and LLaMA 3.1 8B, notably achieving a +4.1% improvement in MMLU performance and a +22.9% improvement in HarmBench with only 4000 samples. Selected features demonstrate semantically meaningful patterns aligned with each task's requirements, revealing the underlying capabilities that drive performance. Our work establishes correlation-based selection as an effective and scalable approach for automated SAE steering across language model applications.
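Once features and coefficients are chosen, steering typically means adding a scaled SAE decoder direction to a layer's hidden state during generation. The sketch below illustrates that step only; the dimension, direction, and coefficient value are illustrative stand-ins, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 8  # toy hidden-state width, not a real model's

# Stand-in for one SAE feature's decoder direction, unit-normalized.
decoder_direction = rng.standard_normal(d_model)
decoder_direction /= np.linalg.norm(decoder_direction)
coefficient = 2.0  # e.g. the feature's average activation

def steer(hidden_state, direction, coef):
    """Shift a hidden state along an SAE feature's decoder direction."""
    return hidden_state + coef * direction

h = rng.standard_normal(d_model)
h_steered = steer(h, decoder_direction, coefficient)
```

In practice this addition is applied via a forward hook at the SAE's layer for every generated token, with one term per selected feature.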