Efficient Long-context Language Model Training by Core Attention Disaggregation

Yonghao Zhuang, Junda Chen, Bo Pang, Yi Gu, Yibo Zhu, Yimin Jiang, Ion Stoica, Eric Xing, Hao Zhang

2025-10-22

Summary

This paper introduces Core Attention Disaggregation (CAD), a method that speeds up the training of large language models on very long input texts.

What's the problem?

When training large language models on long texts, a key component called 'core attention' becomes a bottleneck. Its computing cost grows quadratically with text length, while the other parts of the model grow only near-linearly. This creates an imbalance: some processors finish their work much later than others, slowing down the whole process. It's like having one person on a team doing far more work than everyone else.
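To make the imbalance concrete, here is a small illustrative sketch (numbers and formulas are simplified approximations, not figures from the paper) comparing how attention compute and feed-forward compute scale with context length:

```python
# Illustrative sketch: why core attention dominates at long context lengths.
# Rough multiply-add counts; constants are simplified, not from the paper.

def attention_flops(seq_len, d_model):
    # QK^T and softmax(...)V each touch a seq_len x seq_len score matrix,
    # so compute grows quadratically in seq_len
    return 2 * seq_len ** 2 * d_model

def mlp_flops(seq_len, d_model, ffn_mult=4):
    # feed-forward layers process each token independently,
    # so compute grows linearly in seq_len
    return 2 * seq_len * d_model * (ffn_mult * d_model)

d = 4096  # assumed hidden size, for illustration only
for n in (8_192, 65_536, 524_288):  # 8k, 64k, 512k tokens
    ratio = attention_flops(n, d) / mlp_flops(n, d)
    print(f"{n:>7} tokens: attention/MLP FLOP ratio ~ {ratio:.1f}")
```

Under these assumptions the ratio grows linearly with context length, so at 512k tokens attention dwarfs the other layers, which is exactly the imbalance the paper targets.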

What's the solution?

CAD solves this by separating the 'core attention' calculation and running it on its own dedicated pool of processors. Because core attention has no trainable parameters (it is stateless) and can be computed efficiently over chunks of text of arbitrary length, the system can dynamically assign tasks to these processors to keep them all busy and balanced. The authors built a system called DistCA that overlaps communication with computation and uses memory in place to make this practical.
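The load-balancing idea can be sketched with a simple greedy scheduler: because core attention is stateless, token-level shards can be packed onto whichever attention server currently has the least work. This is a hypothetical illustration of the scheduling principle; the function names (`assign_shards`, `cost`) and the cost model are assumptions, not the DistCA API.

```python
import heapq

def cost(shard_len, context_len):
    # assumed cost model: attention compute for a shard scales with
    # shard length times the context it attends over
    return shard_len * context_len

def assign_shards(shards, num_servers):
    """Greedy longest-processing-time packing of attention shards.

    shards: list of (shard_len, context_len) tuples
    returns: one task list per server, roughly equalizing total compute
    """
    heap = [(0, s) for s in range(num_servers)]  # (total work, server id)
    heapq.heapify(heap)
    plan = [[] for _ in range(num_servers)]
    for shard in sorted(shards, key=lambda s: -cost(*s)):  # largest first
        work, sid = heapq.heappop(heap)
        plan[sid].append(shard)
        heapq.heappush(heap, (work + cost(*shard), sid))
    return plan
```

Each server would then fuse its assigned shards into one batched kernel call; the paper's second observation is that modern attention kernels stay efficient on such fused batches of variable-length shards.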

Why it matters?

This is important because it makes it practical to train larger and more powerful language models with longer context windows. The paper reports up to 35% higher end-to-end training throughput and eliminates the slowdown caused by straggling processors, making it more feasible to work with these advanced AI systems at scale.

Abstract

We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards with arbitrary lengths. CAD partitions core attention into token-level tasks and dispatches them to dedicated attention servers, which dynamically rebatch tasks to equalize compute without sacrificing kernel efficiency. We implement CAD in a system called DistCA, which uses a ping-pong execution scheme to fully overlap communication with computation and in-place execution on attention servers to reduce memory use. On 512 H200 GPUs and context lengths up to 512k tokens, DistCA improves end-to-end training throughput by up to 1.35x, eliminates data and pipeline parallel stragglers, and achieves near-perfect compute and memory balance.
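For readers unfamiliar with the term, the "core attention" the abstract isolates is just softmax(QK^T)V. A minimal NumPy reference version shows why it is stateless: there are no trainable parameters, so the computation can run on any device that receives Q, K, and V.

```python
import numpy as np

def core_attention(Q, K, V):
    """Core attention softmax(QK^T / sqrt(d)) V, with the usual scaling.

    Q: (n, d) queries; K: (m, d) keys; V: (m, d) values. No parameters.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # (n, m) scaled logits
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n, d) output
```

Since each softmax row sums to one, feeding a constant V returns that constant, a quick sanity check that the computation depends only on its inputs.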