ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention
Yuhong Chou, Zehao Liu, Ruijie Zhu, Xinyi Wan, Tianjian Li, Congying Chu, Qian Liu, Jibin Wu, Zejun Ma
2025-07-04
Summary
This paper introduces ZeCO, a new sequence parallelism method that makes training very large language models on extremely long input sequences more efficient. ZeCO removes the delays normally caused by communication between devices that split a sequence among themselves during training.
What's the problem?
When a long sequence is split across many devices during training, existing sequence parallelism methods force devices to exchange large amounts of intermediate data and wait on each other. This communication overhead slows training down and limits how well performance scales as more devices are added.
What's the solution?
The researchers created ZeCO, which is built on a new collective communication primitive called All-Scan. All-Scan transmits only the compact intermediate state each device actually needs, and it overlaps this communication with ongoing computation so devices rarely sit idle. The result is near-zero communication delay and close-to-linear scaling as more devices are added.
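A key property that makes this possible in linear attention is that the information one device must hand to the next is a fixed-size state matrix, independent of how long each device's chunk is. The sketch below illustrates that property only; the names `linear_attention_chunk`, the chunk sizes, and the sequential loop are illustrative assumptions, not ZeCO's actual All-Scan implementation, which additionally overlaps the state transfer with computation across devices.

```python
import numpy as np

def linear_attention_chunk(q, k, v, s_in):
    """Process one device's chunk of a linear-attention scan.

    s_in is the running (d x d) prefix state produced by earlier
    devices; only this fixed-size state ever crosses a device
    boundary, no matter how long the chunk is.
    """
    s = s_in.copy()
    out = np.empty_like(v)
    for t in range(q.shape[0]):
        s += np.outer(k[t], v[t])   # state update: S += k v^T
        out[t] = q[t] @ s           # output: o_t = q_t^T S
    return out, s                   # s is handed to the next device

rng = np.random.default_rng(0)
d, chunk, n_dev = 4, 8, 3
q = rng.normal(size=(n_dev * chunk, d))
k = rng.normal(size=(n_dev * chunk, d))
v = rng.normal(size=(n_dev * chunk, d))

# Simulated sequence parallelism: each "device" processes its chunk,
# passing only the (d x d) state forward.
state = np.zeros((d, d))
outs = []
for dev in range(n_dev):
    sl = slice(dev * chunk, (dev + 1) * chunk)
    o, state = linear_attention_chunk(q[sl], k[sl], v[sl], state)
    outs.append(o)
out_parallel = np.concatenate(outs)

# Reference: the same scan run on a single device.
out_ref, _ = linear_attention_chunk(q, k, v, np.zeros((d, d)))
assert np.allclose(out_parallel, out_ref)
```

Because the handed-off state is a d × d matrix rather than anything proportional to sequence length, the communication volume per device stays constant as sequences grow, which is what allows the transfer to be hidden behind computation.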
Why it matters?
This matters because it allows huge AI models to be trained on ultra-long texts much faster and more efficiently. Faster training means better models can be built more quickly, helping advance natural language processing and other AI applications.
Abstract
ZeCO, a new sequence parallelism method, enables efficient training of large language models with ultra-long sequences by eliminating communication overhead.