Knocking-Heads Attention

Zhanchao Zhou, Xiaodong Chen, Haoxing Chen, Zhenzhong Lan, Jianguo Li

2025-10-28

Summary

This paper introduces a new way to improve how large language models pay attention to different parts of the input, called 'Knocking-Heads Attention' or KHA.

What's the problem?

Current large language models use 'multi-head attention', which is like having multiple brains working in parallel. While this is powerful, each 'brain' (or head) becomes less capable as you add more of them. Existing methods simply concatenate the outputs of these heads without letting them interact and share information during the attention process, limiting their overall effectiveness.

What's the solution?

The researchers developed KHA, which lets the attention heads 'knock' on each other – exchanging information – *before* the scaled dot-product attention step. They do this with a shared projection matrix applied across all heads, initialized as a diagonal matrix. Because of this initialization, each head starts out working independently, exactly as in standard attention; as training progresses, the heads learn to build a more integrated understanding of the input. Importantly, this adds only a minimal amount of extra parameters and computation.
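To make the mechanism concrete, here is a minimal NumPy sketch of the general idea: a shared matrix, initialized as the identity (diagonal), mixes the concatenated head features before the usual scaled dot-product attention. This is an illustrative reading of the abstract, not the authors' implementation; the function names and the exact placement of the mixing step are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def knocking_heads_attention(q, k, v, W_knock):
    """q, k, v: (heads, seq, d_head).
    W_knock: (heads*d_head, heads*d_head), a projection shared across all
    heads. Initialized diagonal so heads start specialized/independent and
    learn cross-head mixing during training (hypothetical placement)."""
    H, T, d = q.shape

    def mix(x):
        # Concatenate all heads per token, apply the shared projection,
        # then split back into heads: this is the cross-head "knocking".
        flat = x.transpose(1, 0, 2).reshape(T, H * d)
        mixed = flat @ W_knock
        return mixed.reshape(T, H, d).transpose(1, 0, 2)

    q, k, v = mix(q), mix(k), mix(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # scaled dot-product
    out = softmax(scores) @ v                        # (heads, seq, d_head)
    return out.transpose(1, 0, 2).reshape(T, H * d)  # concat head outputs

H, T, d = 4, 8, 16
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((H, T, d)) for _ in range(3))
W_knock = np.eye(H * d)  # diagonal init: equivalent to plain MHA at step 0
out = knocking_heads_attention(q, k, v, W_knock)
print(out.shape)  # (8, 64)
```

With `W_knock` at its identity initialization the mixing step is a no-op, so the model behaves like standard multi-head attention at the start of training; the off-diagonal entries that training introduces are what give heads a path to influence one another.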

Why it matters?

This new attention mechanism leads to better and more stable training of large language models. When tested by training a 6.1-billion-parameter mixture-of-experts model (1.01B activated) on one trillion tokens, KHA outperformed existing attention methods on various downstream tasks, meaning it helps models learn more effectively and achieve better results. This could lead to more powerful and reliable AI systems in the future.

Abstract

Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens individual head capacity, and existing attention mechanisms - whether standard MHA or its variants like grouped-query attention (GQA) and grouped-tied attention (GTA) - simply concatenate outputs from isolated heads without strong interaction. To address this limitation, we propose knocking-heads attention (KHA), which enables attention heads to "knock" on each other - facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally-initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to baseline attention mechanisms, KHA brings superior and more stable training dynamics, achieving better performance across downstream tasks.