Mixture-of-Depths Attention
Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang
2026-03-17
Summary
This paper focuses on improving large language models by allowing them to effectively use information from all layers, not just the final ones.
What's the problem?
As language models get bigger and add more layers, important information created in the early layers gets lost or weakened as it passes through the deeper layers. It's like a game of telephone: the message gets a little more distorted with each person it passes through. This 'signal degradation' makes it harder for the model to understand and process information accurately.
What's the solution?
The researchers introduced a new technique called 'mixture-of-depths attention,' or MoDA. Essentially, MoDA lets each attention head look at information not only from the current layer but also from earlier, shallower layers, which helps preserve important signals. They also developed a hardware-efficient algorithm so that MoDA runs almost as fast as FlashAttention-2, a widely used fast attention implementation.
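To make the idea concrete, here is a toy single-head sketch of attending over both kinds of key-value pairs. This is an illustration only, not the paper's implementation: it ignores causal masking, multiple heads, and the hardware-efficient kernel, and the function and argument names are invented for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moda_attention(q, kv_current, kv_depth):
    """Toy mixture-of-depths attention (illustrative, not the paper's code).

    q:          (T, d) queries at the current layer.
    kv_current: (keys, values), each (N, d), from the current layer's sequence.
    kv_depth:   (keys, values), each (M, d), carried over from earlier layers.
    Queries attend over the concatenation of both KV sets, so features
    formed in shallow layers stay directly reachable from deep layers.
    """
    k = np.concatenate([kv_current[0], kv_depth[0]], axis=0)  # (N+M, d)
    v = np.concatenate([kv_current[1], kv_depth[1]], axis=0)  # (N+M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])                   # (T, N+M)
    return softmax(scores, axis=-1) @ v                       # (T, d)

# Usage: 4 query positions, 4 current-layer KV pairs, 2 depth KV pairs.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
kv_cur = (rng.standard_normal((4, 8)), rng.standard_normal((4, 8)))
kv_dep = (rng.standard_normal((2, 8)), rng.standard_normal((2, 8)))
out = moda_attention(q, kv_cur, kv_dep)   # shape (4, 8)
```

The only change from plain attention is the concatenation step: each head's attention distribution now covers N + M entries instead of N, which is where the small extra FLOPs cost comes from.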
Why it matters
This work is important because it provides a way to build even larger and more powerful language models without losing crucial information. Their experiments showed that MoDA improves the model's performance on various language tasks with only a small increase in computational cost, suggesting it’s a promising step towards scaling up language models effectively.
Abstract
Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA.
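The abstract's mention of "non-contiguous memory-access patterns" can be illustrated with a small numpy example. The scenario below is hypothetical: it assumes depth KV pairs selected from several preceding layers form a strided view over per-layer buffers, and shows one generic remedy (gathering them into a contiguous buffer before a fused kernel would consume them); the paper's actual algorithm may differ.

```python
import numpy as np

# Hypothetical setup: keys from 3 preceding layers, each (seq=16, d=64).
layer_keys = [np.ones((16, 64), dtype=np.float32) * i for i in range(3)]
packed = np.stack(layer_keys)      # (layers, seq, d), one big buffer

# Selecting one position's key from every layer yields a strided view:
# rows are 16*64 floats apart in memory, not back-to-back.
depth_k = packed[:, 0, :]          # (layers, d), non-contiguous view
assert not depth_k.flags["C_CONTIGUOUS"]

# Gather once into a dense buffer so downstream reads are coalesced.
depth_k_contig = np.ascontiguousarray(depth_k)
assert depth_k_contig.flags["C_CONTIGUOUS"]
assert np.array_equal(depth_k, depth_k_contig)
```

Fast attention kernels such as FlashAttention-2 assume dense, coalesced reads of K and V, which is why resolving this kind of striding matters for reaching the reported 97.3% of its efficiency.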