Selective Attention Improves Transformer
Yaniv Leviathan, Matan Kalman, Yossi Matias
2024-10-07

Summary
This paper introduces Selective Attention, a simple change to the standard attention mechanism that improves Transformer models by reducing attention to parts of the context that are no longer needed, instead of letting every token attend to everything that came before it.
What's the problem?
In standard Transformer models, every token attends to every earlier token in the context, including elements that are no longer relevant. This clutters the attention with unneeded information, which can degrade the model's performance, and it forces the model to keep the entire context in memory, raising memory and compute costs during inference.
What's the solution?
The authors propose Selective Attention, which lets the model reduce attention to elements of the context it no longer needs. The change is simple and adds no parameters to the standard attention mechanism. Transformers equipped with Selective Attention match the language modeling performance of standard Transformers that have roughly twice as many heads and parameters in their attention modules, and they can operate with a much smaller attention context buffer, cutting inference memory substantially, while reaching the same validation perplexity.
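At a high level, the idea is to let each new token flag earlier tokens as no longer needed and to down-weight those tokens in later attention steps. This summary does not spell out the exact mechanism, so the PyTorch sketch below only illustrates that general idea under assumed details (a single head, selection scores taken from the non-negative attention logits, and an accumulated penalty subtracted from the logits before the softmax); it is not the authors' exact formulation.

```python
# Illustrative sketch only: single-head causal attention in which each new token
# can down-weight earlier tokens for all *future* queries, by accumulating a
# "no longer needed" penalty that is subtracted from the attention logits.
# The precise selection rule in the paper may differ; this shows the general idea.
import math
import torch
import torch.nn.functional as F


def selective_attention_sketch(q, k, v):
    """q, k, v: [batch, seq_len, dim] for a single head (causal decoding)."""
    b, n, d = q.shape
    logits = q @ k.transpose(-2, -1) / math.sqrt(d)              # [b, n, n]
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=q.device))
    logits = logits.masked_fill(~causal, float("-inf"))

    # Hypothetical selection scores: reuse the non-negative attention logits as
    # "query i says key j is no longer needed" signals (a token never masks itself).
    s = F.relu(logits).masked_fill(~causal, 0.0)                 # [b, n, n]
    s = s.masked_fill(torch.eye(n, dtype=torch.bool, device=q.device), 0.0)

    # Accumulate over past queries so the penalty only affects future queries:
    # penalty[i, j] = sum of selection scores assigned to key j by queries < i.
    penalty = s.cumsum(dim=-2).roll(1, dims=-2)
    penalty[:, 0, :] = 0.0

    weights = torch.softmax(logits - penalty, dim=-1)            # penalized softmax
    return weights @ v                                           # [b, n, d]


# Tiny usage example with random tensors.
q, k, v = (torch.randn(2, 8, 16) for _ in range(3))
out = selective_attention_sketch(q, k, v)
print(out.shape)  # torch.Size([2, 8, 16])
```

Because the penalty is computed from logits the model already produces, a scheme like this adds no new parameters, which matches the parameter-free property described in the abstract.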
Why it matters?
This research matters because it shows a way to make Transformer models more efficient without sacrificing quality. With Selective Attention, models can reach the same quality while using less memory and compute at inference time, making them more practical for real-world applications. This could translate into better and cheaper performance in tasks like language translation, text generation, and other settings where handling long context is crucial.
Abstract
Unneeded elements in the attention's context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention improves language modeling performance in a variety of model sizes and context lengths. For example, a range of transformers trained with the language modeling objective on C4 with selective attention perform equivalently to standard transformers with ~2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention's context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers with 100M parameters trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention than those without selective attention, at the same validation perplexity.
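The context-buffer savings cited above come from being able to evict elements that selective attention has effectively masked out. As a rough, hypothetical illustration only (the paper's actual eviction rule and memory accounting are not described in this summary), pruning a key/value cache by an accumulated "no longer needed" score might look like this:

```python
# Hypothetical illustration of shrinking the attention's context (KV) buffer:
# evict cached entries whose accumulated selection penalty exceeds a threshold.
# The paper's actual pruning rule and budget are not given in this summary.
import torch


def prune_kv_cache(keys, values, penalty, threshold=5.0):
    """keys, values: [seq_len, dim]; penalty: [seq_len] accumulated selection scores."""
    keep = penalty < threshold                 # boolean mask of tokens worth keeping
    return keys[keep], values[keep]


# Tiny usage example with random tensors.
keys, values = torch.randn(10, 16), torch.randn(10, 16)
penalty = torch.rand(10) * 10                  # pretend accumulated penalties
small_k, small_v = prune_kv_cache(keys, values, penalty)
print(keys.shape[0], "->", small_k.shape[0])   # e.g. 10 -> 6
```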