Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Qibo Xue, Zeping Yu, Chenming Shang, Xiao Liang, Jing Xiong, Hui Shen, Chaofan Tao, Zhengwu Liu, Senjie Jin, Zhiheng Xi, Dongdong Zhang, Sophia Ananiadou, Tao Gui
2026-01-21
Summary
This paper is about making sense of how large language models, like the ones powering chatbots, actually *work* internally. It's not enough to just see what they *do*; we need to understand *why* they do it, and then use that understanding to make them better.
What's the problem?
Research into understanding these models, called 'Mechanistic Interpretability,' has mostly just described what is happening inside them. It's like taking something apart to see the pieces without knowing how to fix or improve it. Until now, there has been no clear, step-by-step way to actually *change* a model based on what researchers learn about its inner workings.
What's the solution?
The authors propose a framework called 'Locate, Steer, and Improve.' First, 'Locate' means finding the specific parts of the model that control a given behavior. Then, 'Steer' means intervening on those parts to push the model's behavior in a desired direction. Finally, 'Improve' means using this loop to make the model measurably better along three axes: alignment (behaving safely and following instructions), capability (performing tasks well), and efficiency (using compute and memory economically). The authors organize existing research into this pipeline so it can be applied in practice.
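To make the 'Locate' and 'Steer' steps concrete, here is a minimal, self-contained sketch (not taken from the survey) of one common steering technique: adding a steering vector to a transformer layer's residual stream via a forward hook. It assumes a Hugging Face GPT-2 model; the layer index and the random steering direction are illustrative placeholders for what an actual 'Locate' step would provide.

```python
# Minimal "Locate -> Steer" sketch (illustrative, not from the survey).
# Assumes a Hugging Face GPT-2 model; the chosen layer and the steering
# direction are placeholders for what the "Locate" step would identify.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6          # hypothetical "located" layer
alpha = 4.0            # steering strength
steer = torch.randn(model.config.hidden_size)
steer = steer / steer.norm()   # unit-norm direction; in practice derived from data

def steering_hook(module, inputs, output):
    # Transformer blocks may return a tuple (hidden_states, ...) or a bare
    # tensor depending on the transformers version; add the vector to the
    # hidden states in either case.
    if isinstance(output, tuple):
        return (output[0] + alpha * steer.to(output[0].dtype),) + output[1:]
    return output + alpha * steer.to(output.dtype)

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)

prompt = "The movie was"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()        # detach the hook to restore the unsteered model
```

In practice, the 'Locate' step would replace the random direction with one identified by, for example, probing, activation patching, or sparse-autoencoder features, and the 'Improve' step would check whether the intervention actually helps on alignment, capability, or efficiency benchmarks.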
Why does it matter?
This work matters because it turns understanding of these complex models into a practical engineering tool. Instead of remaining an observational science, Mechanistic Interpretability can now be used to actively improve models, making them more reliable, capable, and efficient. That is a significant step towards building AI systems we can genuinely trust and control.
Abstract
Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at https://github.com/rattlesnakey/Awesome-Actionable-MI-Survey.