MoH: Multi-Head Attention as Mixture-of-Head Attention
Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
2024-10-18

Summary
This paper introduces MoH (Mixture-of-Head Attention), an upgrade to the Transformer's multi-head attention that lets each token activate only the attention heads it actually needs, making attention more efficient without sacrificing accuracy.
What's the problem?
Transformer-based models rely on multi-head attention, which runs several attention heads in parallel so the model can focus on different parts of the input. Standard multi-head attention, however, activates every head for every token and weights them all equally, even though not all heads contribute equally to understanding the data. This wastes computation and can hold back performance.
What's the solution?
To solve this issue, the authors propose Mixture-of-Head attention (MoH), which treats each attention head as an expert in the sense of Mixture-of-Experts (MoE). A router lets each token select only the heads most relevant to it, and instead of simply adding up the outputs from all heads, MoH combines the selected heads with a weighted sum (see the sketch below). This improves inference efficiency without adding parameters, while matching or even exceeding the accuracy of standard multi-head attention. Extensive experiments show that MoH outperforms traditional multi-head attention while using only 50% to 90% of the attention heads.
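The following is a minimal PyTorch sketch of that idea as we understand it from the summary: a small router scores the heads for each token, only the Top-K heads are kept, and their outputs are combined with a weighted rather than plain summation. Class and parameter names (MoHAttention, router, top_k) are our own illustrative choices, and details the paper may add beyond this summary (for example, how the router is trained or regularized, or any always-active shared heads) are omitted.

```python
import torch
import torch.nn as nn


class MoHAttention(nn.Module):
    """Minimal sketch of Mixture-of-Head attention: each token routes to its
    Top-K attention heads, and the selected head outputs are combined with a
    weighted (rather than plain) summation."""

    def __init__(self, dim, num_heads=8, top_k=6):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.top_k = top_k
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.out_proj = nn.Linear(dim, dim)       # output projection W_O
        self.router = nn.Linear(dim, num_heads)   # one gate score per head

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)       # each: (B, H, N, d)

        # Standard scaled dot-product attention, computed per head.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        heads = attn @ v                            # (B, H, N, d)

        # Router: per-token scores over heads; keep only the Top-K heads and
        # normalize their scores into gating weights.
        scores = self.router(x)                     # (B, N, H)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(scores)
        gates.scatter_(-1, topk_idx, topk_scores.softmax(dim=-1))

        # Weighted summation of head outputs: unselected heads get weight 0,
        # so scaling each head before the output projection is equivalent to
        # a weighted sum of the per-head projections.
        heads = heads.permute(0, 2, 1, 3)           # (B, N, H, d)
        out = (heads * gates.unsqueeze(-1)).reshape(B, N, C)
        return self.out_proj(out)
```

For example, with num_heads=8 and top_k=6, each token activates 75% of the heads, the same ratio the paper reports for MoH-LLaMA3-8B.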
Why it matters?
This research is important because it makes attention, the core component of modern Transformers, cheaper to run and more flexible. The authors validate MoH across vision Transformers (ViT), diffusion Transformers (DiT), and large language models, and show that an existing pre-trained model such as LLaMA3-8B can be converted into an MoH model. By improving how attention heads are allocated, MoH can improve performance in applications ranging from image classification and image generation to language modeling, and could help push forward the capabilities of attention-based AI systems.
Abstract
In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
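The abstract's observation that multi-head attention "can be expressed in the summation form" is the step that makes the MoE view possible, so a rough rendering of the two formulas may help. The symbols below (H^i for the i-th head output, W_O^i for the corresponding slice of the output projection, g_i for the per-token routing weight) are our own labels, not necessarily the paper's notation.

```latex
% Standard multi-head attention: concatenating the head outputs and applying
% the output projection W_O equals a sum of per-head projections, where
% W_O^i is the block of rows of W_O that multiplies head i.
\[
\mathrm{MultiHead}(X) = \mathrm{Concat}(H^{1}, \dots, H^{h})\, W_O
                      = \sum_{i=1}^{h} H^{i} W_O^{i}
\]

% MoH replaces the plain sum with a routed, weighted sum; the gate g_i is
% computed per token and is zero for heads that token does not activate.
\[
\mathrm{MoH}(X) = \sum_{i=1}^{h} g_{i}\, H^{i} W_O^{i}
\]
```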