
Value Residual Learning For Alleviating Attention Concentration In Transformers

Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Zhenzhong Lan

2024-10-25


Summary

This paper presents Value Residual Learning, a new method designed to alleviate a problem called attention concentration in Transformer models, which can improve how effectively they process information.

What's the problem?

Transformers are powerful models that use self-attention to understand relationships between words in a sequence. However, when many attention layers are stacked, the attention in deeper layers tends to become overly focused on just a few parts of the input (this is called attention concentration), which makes it harder for those layers to learn useful representations and hurts overall performance.
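
To give a concrete sense of what "concentrated" attention looks like, here is a small, purely illustrative Python sketch (not taken from the paper, which may measure concentration differently): it computes the entropy of attention weights, where lower entropy means the attention mass is piled onto just a few tokens.

```python
import torch

def mean_attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Mean entropy of row-wise attention distributions.
    attn has shape (batch, heads, query_len, key_len), rows sum to 1.
    Lower entropy = attention mass concentrated on fewer tokens."""
    return -(attn * (attn + 1e-9).log()).sum(dim=-1).mean()

# A spread-out (uniform) attention map vs. a concentrated (peaked) one.
uniform = torch.full((1, 1, 4, 4), 0.25)
peaked = torch.tensor([[[[0.97, 0.01, 0.01, 0.01]] * 4]])
print(mean_attention_entropy(uniform))  # ~1.39 (ln 4: fully spread out)
print(mean_attention_entropy(peaked))   # ~0.17 (highly concentrated)
```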

What's the solution?

The authors propose a new model called ResFormer, which alleviates attention concentration by adding a residual connection from the value embeddings of the first layer to those of all later layers, so early-layer information stays directly accessible deeper in the network. They also introduce a variant called SVFormer, in which all layers share the first layer's value embeddings, reducing the key-value (KV) cache by nearly 50%. These changes improve how well the model learns and performs on various tasks without requiring extra computational resources.
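
To show roughly how the value-residual idea can be wired up, here is a minimal single-head PyTorch sketch. It is a simplification under assumptions (no multi-head splitting, normalization, MLP blocks, or weighting of the residual), not the authors' implementation: every layer after the first adds the first layer's value embeddings to its own values before attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional, Tuple

class ValueResidualAttention(nn.Module):
    """Simplified single-head attention with a value residual:
    later layers add the first layer's value embeddings to their own."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(
        self, x: torch.Tensor, v_first: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        if v_first is None:
            v_first = v        # first layer: its values become the residual source
        else:
            v = v + v_first    # later layers: add the first layer's values
        out = F.scaled_dot_product_attention(q, k, v)
        return out, v_first

# Thread the first layer's values through a small stack of layers.
dim, depth = 64, 4
blocks = nn.ModuleList([ValueResidualAttention(dim) for _ in range(depth)])
x = torch.randn(2, 16, dim)    # (batch, seq_len, dim)
v_first = None
for block in blocks:
    x, v_first = block(x, v_first)
print(x.shape)                 # torch.Size([2, 16, 64])
```

For the SVFormer variant, one would go a step further and have later layers reuse `v_first` directly instead of computing their own values, which is what lets roughly half of the KV cache be skipped at inference time.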

Why it matters?

This research is important because it improves both the efficiency and the effectiveness of Transformer models, making them better at handling complex information. By addressing attention concentration, the proposed methods can lead to better performance in natural language processing, image recognition, and other applications that rely on deep learning.

Abstract

Transformers can capture long-range dependencies using self-attention, allowing tokens to attend to all others directly. However, stacking multiple attention layers leads to attention concentration. One natural way to address this issue is to use cross-layer attention, allowing information from earlier layers to be directly accessible to later layers. However, this approach is computationally expensive. To address this problem, we propose the Transformer with residual value (ResFormer), which approximates cross-layer attention by adding a residual connection from the values of the first layer to all subsequent layers. Based on this method, one variant is the Transformer with single layer value (SVFormer), where all layers share the same value embedding from the first layer, reducing the KV cache by nearly 50%. Comprehensive empirical evidence demonstrates that ResFormer mitigates the attention concentration problem in deeper layers and enhances representation across most layers, outperforming the vanilla Transformer, DenseFormer, and NeuTRENO in training error as well as downstream tasks. SVFormer trains significantly faster than the vanilla Transformer and performs better than other methods like GQA and CLA, with performance influenced by sequence length and cumulative learning rate.
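
As a back-of-the-envelope check on the "nearly 50%" KV-cache figure, the sketch below counts cached elements when values are stored once and shared versus stored per layer; the layer count, sequence length, and dimension used here are arbitrary illustrative numbers, not values from the paper.

```python
def kv_cache_elems(num_layers: int, seq_len: int, dim: int,
                   shared_values: bool = False) -> int:
    """Count cached key/value elements; values are cached once if shared."""
    key_elems = num_layers * seq_len * dim
    value_elems = (1 if shared_values else num_layers) * seq_len * dim
    return key_elems + value_elems

vanilla = kv_cache_elems(num_layers=32, seq_len=4096, dim=4096)
shared = kv_cache_elems(num_layers=32, seq_len=4096, dim=4096, shared_values=True)
print(shared / vanilla)  # ~0.52 -> roughly the "nearly 50%" reduction mentioned above
```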