Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
Zunhai Su, Hengyuan Zhang, Wei Wu, Yifan Zhang, Yaxiu Liu, He Xiao, Qingyao Yang, Yuxuan Sun, Rui Yang, Chao Zhang, Keyu Fan, Weihao Ye, Jing Xiong, Hui Shen, Chaofan Tao, Taiqiang Wu, Zhongwei Wan, Yulei Qian, Yuchen Xie, Ngai Wong
2026-04-14
Summary
This paper is a comprehensive overview of a problem called 'Attention Sink' that affects Transformer models, which are the core technology behind many modern AI applications like large language models.
What's the problem?
Transformer models sometimes concentrate a disproportionate share of attention on a few specific but uninformative tokens, such as the very first token of the input, at the expense of more relevant information. This phenomenon is called Attention Sink. It makes it harder to understand *why* the model makes certain decisions, disrupts training and inference dynamics, and can worsen failures such as 'hallucinations', where the model makes things up. Although the issue has been studied from many angles, there was no single place that collected all the research and explained how it fits together.
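To make the phenomenon concrete, here is a minimal toy sketch (not from the survey) of how one might *measure* an attention sink: compute causal softmax attention weights and check how much attention mass every query assigns to the first token. The scores here are synthetic, with an artificial bias added to mimic a sink; in trained Transformers this concentration emerges on its own.

```python
import numpy as np

# Toy illustration: the "sink mass" is the average attention weight
# that queries assign to the first (BOS-like) token.
rng = np.random.default_rng(0)
seq_len = 8

# Synthetic pre-softmax attention scores; bias the first key upward
# to mimic a sink token (in real models this arises during training).
scores = rng.normal(size=(seq_len, seq_len))
scores[:, 0] += 4.0

# Causal mask: each query attends only to itself and earlier tokens.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Row-wise softmax turns masked scores into attention weights.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Fraction of attention landing on position 0, averaged over queries.
sink_mass = weights[:, 0].mean()
print(f"mean attention on token 0: {sink_mass:.2f}")
```

With the bias applied, most of each row's attention mass collapses onto position 0, which is exactly the disproportionate concentration the survey calls Attention Sink.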
What's the solution?
The authors created the first detailed survey of Attention Sink research. They organized the existing work into three main areas: how Attention Sink can be put to use, how it actually arises inside the model, and ways to reduce its negative effects. The result is a guide that helps researchers understand the current state of knowledge and where to focus their efforts in the future.
Why does it matter?
This survey is important because it provides a central resource for anyone working with Transformer models. By clearly explaining Attention Sink and the research surrounding it, it helps researchers build better, more reliable, and more understandable AI systems. It also points the way towards developing the next generation of Transformer models that are less susceptible to this problem.
Abstract
As the foundational architecture of modern machine learning, Transformers have driven remarkable progress across diverse AI domains. Despite their transformative impact, a persistent challenge across various Transformers is Attention Sink (AS), in which a disproportionate amount of attention is focused on a small subset of specific yet uninformative tokens. AS complicates interpretability, significantly affects training and inference dynamics, and exacerbates issues such as hallucinations. In recent years, substantial research has been dedicated to understanding and harnessing AS. However, a comprehensive survey that systematically consolidates AS-related research and offers guidance for future advancements remains lacking. To address this gap, we present the first survey on AS, structured around three key dimensions that define the current research landscape: Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation. Our work provides a pivotal contribution by clarifying key concepts and guiding researchers through the evolution and trends of the field. We envision this survey as a definitive resource, empowering researchers and practitioners to effectively manage AS within the current Transformer paradigm, while simultaneously inspiring innovative advancements for the next generation of Transformers. The paper list of this work is available at https://github.com/ZunhaiSu/Awesome-Attention-Sink.