Attention Residuals
Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y. Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men
2026-03-17
Summary
This paper introduces Attention Residuals (AttnRes), a new way to handle how information flows through the layers of large language models (LLMs), aiming to improve their performance and stability during training.
What's the problem?
Current LLMs use a simple method called 'residual connections' to pass information between layers. While effective, this method adds up all the layer outputs with equal, fixed weights, regardless of how important each one is. As the model gets deeper, the residual stream keeps growing, so the signal from any single layer gets progressively diluted, making it harder for the model to learn and leading to uneven output magnitudes and gradients across layers. It's like trying to hear someone whisper at a concert: the signal gets drowned out.
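To make the dilution concrete, here is a minimal numpy sketch (not from the paper) of a standard PreNorm-style residual stream, where each layer's output is modeled as a random unit-norm vector added with fixed unit weight; the stream's norm grows with depth, shrinking each layer's relative contribution:

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 24

# Standard residual stream: h_{l+1} = h_l + f_l(h_l).
# Each layer's output f_l is modeled as a random unit-norm vector,
# so we can watch the stream's magnitude grow with depth.
h = rng.standard_normal(d)
h /= np.linalg.norm(h)

norms = []
for _ in range(depth):
    f = rng.standard_normal(d)
    f /= np.linalg.norm(f)      # each layer contributes with fixed unit weight
    h = h + f                   # uniform accumulation: no learned weighting
    norms.append(np.linalg.norm(h))

# In high dimension the unit outputs are nearly orthogonal, so the stream
# norm grows roughly like sqrt(depth), while each layer still adds only a
# unit-norm vector -- its relative contribution shrinks with depth.
print(norms[0], norms[-1])
```

Running this shows the final stream norm is several times the first, even though every layer contributed the same unit-norm amount.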
What's the solution?
The researchers propose 'Attention Residuals' (AttnRes). Instead of simply summing layer outputs, AttnRes uses a 'selective' approach: each layer applies softmax attention over the outputs of *previous* layers, giving learned, input-dependent weights to the ones that are most relevant. To make this practical for very large models, they developed 'Block AttnRes,' which partitions layers into blocks and attends over block-level representations, cutting the memory and communication overhead of attending over every preceding layer. They also optimized pipeline communication during training (with caching and a two-phase computation strategy) to minimize slowdowns.
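The selective aggregation can be sketched in a few lines of numpy. This is a hypothetical illustration of the idea, not the paper's code: the query/key projections `W_q`, `W_k`, the scaling, and the stand-in sublayer are all assumptions; the point is that each layer forms content-dependent softmax weights over the stacked outputs of earlier layers instead of summing them with unit weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 8

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Assumed learned projections (random here, trained in practice).
W_q = rng.standard_normal((d, d)) / np.sqrt(d)
W_k = rng.standard_normal((d, d)) / np.sqrt(d)

outputs = [rng.standard_normal(d)]   # layer 0 output (e.g. the embedding)
h = outputs[0]
for _ in range(1, depth):
    past = np.stack(outputs)         # (l, d): all preceding layer outputs
    q = W_q @ h                      # input-dependent query for this layer
    k = past @ W_k.T                 # keys from earlier layers
    w = softmax(k @ q / np.sqrt(d))  # learned, content-dependent weights
    h_in = w @ past                  # selective aggregation of earlier layers
    f = rng.standard_normal(d)       # stand-in for this layer's sublayer output
    h = h_in + f
    outputs.append(h)

# The attention weights sum to 1, so the aggregation is a convex combination
# of earlier representations rather than an ever-growing unit-weight sum.
print(w.sum())
```

Block AttnRes follows the same pattern but replaces `past` with a small set of block-level representations, so the attention cost no longer scales with the full layer count.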
Why it matters?
This work is important because it addresses a fundamental limitation in how LLMs are built. By allowing layers to selectively focus on earlier information, AttnRes maintains a stronger signal throughout the network, leading to more stable training, more uniform output magnitudes and gradients across depth, and better performance on downstream tasks. The improvements held consistently across model sizes in scaling-law experiments, and in a large-scale pre-training run of a 48B-parameter (3B activated) Kimi Linear model on 1.4T tokens.
Abstract
Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.