
MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

Kewei Zhang, Ye Huang, Yufan Deng, Jincheng Yu, Junsong Chen, Huan Ling, Enze Xie, Daquan Zhou

2026-01-13


Summary

This paper focuses on making Transformer models, which excel at tasks like understanding language and recognizing images, scale efficiently to much longer inputs. It introduces a new way to handle attention, a key component of these models, that keeps computation fast without sacrificing accuracy.

What's the problem?

Transformer models use something called 'self-attention' which allows them to focus on different parts of the input data. However, this process gets incredibly slow as the amount of data increases because it requires comparing every piece of data to every other piece. Simpler, faster alternatives called 'linear attention' exist, but they often don't perform as well, and attempts to fix this usually add extra complexity that cancels out the speed benefits. The core issue is that these faster methods tend to lose important details and treat everything as equally important, leading to a loss of nuance.
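The contrast above can be made concrete. The following is a minimal numpy sketch (not the paper's code) showing why standard attention is quadratic in the number of tokens while linear attention is not: the standard version builds an N x N weight matrix, whereas the kernelized version reassociates the product as phi(Q) (phi(K)^T V) and never materializes it. The feature map `phi` here (a shifted ReLU) is an illustrative assumption.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: forms an N x N score matrix -> O(N^2) in tokens.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    # Kernel trick: compute phi(K)^T V first, a small (d x d_v) summary of
    # all tokens, so cost grows linearly with sequence length N.
    phi = lambda x: np.maximum(x, 0) + eps   # simple positive feature map (assumption)
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                            # (d, d_v) global summary
    z = Qp @ Kp.sum(axis=0, keepdims=True).T # per-token normalizer
    return (Qp @ kv) / z

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

Note how every query in the linear version reads from the same single `kv` summary: that shared, blended context is exactly the kind of bottleneck that can wash out token-level distinctions.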

What's the solution?

The researchers developed a new technique called 'Multi-Head Linear Attention' or MHLA. Instead of trying to process all the data at once, MHLA breaks down the attention process into smaller, independent parts, or 'heads'. Each head focuses on a different aspect of the data, which helps the model maintain a diverse understanding and avoid the 'global context collapse' problem. Importantly, MHLA still achieves the speed benefits of linear attention because it doesn't require comparing every piece of data to every other piece.
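As an illustrative sketch only (the paper's exact formulation may differ), the idea of dividing heads along the token dimension can be pictured as partitioning the N tokens into groups and running linear attention inside each group, so each group keeps its own context summary instead of all tokens collapsing onto one shared global summary. The helper names below are assumptions for illustration.

```python
import numpy as np

def phi(x, eps=1e-6):
    # Simple positive feature map (illustrative assumption).
    return np.maximum(x, 0) + eps

def linear_attention(Q, K, V):
    # Kernelized attention: never forms the N x N matrix.
    Qp, Kp = phi(Q), phi(K)
    out = Qp @ (Kp.T @ V)
    z = Qp @ Kp.sum(axis=0, keepdims=True).T
    return out / z

def token_multihead_linear_attention(Q, K, V, heads=4):
    # Sketch of token-level multi-head: split the N tokens into `heads`
    # contiguous groups and compute linear attention within each group.
    # Each group maintains its own K^T V summary, preserving diversity,
    # while total cost remains linear in N.
    N = Q.shape[0]
    assert N % heads == 0, "token count must divide evenly for this sketch"
    chunks = [linear_attention(Q[idx], K[idx], V[idx])
              for idx in np.split(np.arange(N), heads)]
    return np.concatenate(chunks, axis=0)

rng = np.random.default_rng(1)
N, d = 16, 4
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
print(token_multihead_linear_attention(Q, K, V, heads=4).shape)
```

Each group's work is independent, so the heads can also be computed in parallel, which is part of why the speed benefit of linear attention is retained.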

Why it matters?

This work is significant because it offers a way to scale up Transformer models to handle much larger datasets without sacrificing performance or efficiency. The improvements across image classification, natural language processing, image generation, and video generation demonstrate that MHLA is a versatile solution with the potential to advance many different areas of artificial intelligence. It allows for faster training and processing of these powerful models, making them more practical for real-world applications.

Abstract

While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, with existing fixes typically re-introducing computational overhead through extra modules (e.g., depthwise separable convolution) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within divided heads along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on NLP, a 12.6% improvement on image generation, and a 41% enhancement on video generation under the same time complexity.