
LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid

Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng

2025-02-13

Summary

This paper introduces LASP-2, a new method that makes AI models which process long sequences of information, like text or data, faster and more efficient by improving how they handle communication and computation across multiple devices.

What's the problem?

When AI models work with very long sequences, like a huge document or dataset, they need to split the work across many computers. Current methods for doing this aren't efficient because they either require too much back-and-forth communication between devices or can't handle large sequences well. This slows down training and limits how big the models can get.

What's the solution?

The researchers created LASP-2, which reorganizes how communication and computation happen in these models. It relies on a single collective communication operation called AllGather, applied to small intermediate memory states whose size does not depend on the sequence length, which reduces the time and memory needed for devices to share information. They also extended this approach to hybrid models that combine linear and standard attention mechanisms, making it even more versatile. LASP-2 was tested on a model called Linear-Llama3 and showed significant speed improvements over older methods.
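To make the idea concrete, here is a toy sketch of chunk-parallel causal linear attention with a single AllGather-style exchange, in the spirit of what the summary describes. This is a minimal NumPy simulation, not the paper's implementation: the "devices" are just list entries, a plain Python list stands in for the real AllGather collective, and all sizes and names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, W = 8, 4, 4  # tokens per chunk, head dimension, number of "devices"

# Each "device" holds one chunk of Q, K, V (toy data).
Q = [rng.standard_normal((T, d)) for _ in range(W)]
K = [rng.standard_normal((T, d)) for _ in range(W)]
V = [rng.standard_normal((T, d)) for _ in range(W)]

# Step 1: each device computes its local memory state M_i = K_i^T V_i,
# a d x d matrix whose size is independent of the sequence length.
local_states = [K[i].T @ V[i] for i in range(W)]

# Step 2: one AllGather gives every device all W states
# (here a list copy stands in for the real collective).
gathered = list(local_states)

# Step 3: each device sums the states of preceding chunks (inter-chunk
# contribution) and adds its own causal intra-chunk attention.
outputs = []
for i in range(W):
    prefix = sum(gathered[:i]) if i > 0 else np.zeros((d, d))
    intra = np.tril(Q[i] @ K[i].T) @ V[i]  # causal part within the chunk
    outputs.append(Q[i] @ prefix + intra)

# Sanity check against full-sequence causal linear attention.
Qf, Kf, Vf = (np.concatenate(x) for x in (Q, K, V))
ref = np.tril(Qf @ Kf.T) @ Vf
assert np.allclose(np.concatenate(outputs), ref)
```

The point of the sketch is that the only data crossing device boundaries is the small d x d state per chunk, so communication cost stays constant as the sequence grows.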

Why it matters?

This matters because it allows AI models to process much larger sequences of data more quickly and efficiently. By improving both speed and scalability, LASP-2 can help researchers build more advanced AI systems without needing as much computing power. This could lead to better performance in tasks like language understanding, data analysis, and other applications that rely on processing long sequences of information.

Abstract

Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which results in lower computation parallelism and limits their scalability for longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models with very-long input sequences. Compared to the previous work LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers and reorganizes the whole communication-computation workflow of LASP. In this way, only a single AllGather collective communication is needed on intermediate memory states, whose sizes are independent of the sequence length, leading to significant improvements of both communication and computation parallelism, as well as their overlap. Additionally, we extend LASP-2 to LASP-2H by applying similar communication redesign to standard attention modules, offering an efficient SP solution for hybrid models that blend linear and standard attention layers. Our evaluation on a Linear-Llama3 model, a variant of Llama3 with linear attention replacing standard attention, demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention, with a sequence length of 2048K across 64 GPUs. The code is released as a part of: https://github.com/OpenSparseLLMs/Linear-MoE.
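The "right-product-first" feature the abstract mentions can be shown in a few lines. In the simplified non-causal, single-head case below (toy NumPy data, not the paper's code), computing (QK^T)V materializes a T x T matrix that grows quadratically with sequence length, while the mathematically equivalent Q(K^T V) keeps only a d x d state, which is what makes linear-time training and constant-memory inference possible.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 1024, 64  # sequence length, head dimension (toy sizes)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# Left-product order: (Q K^T) V builds a T x T attention matrix,
# costing O(T^2 d) time and O(T^2) memory.
out_left = (Q @ K.T) @ V

# Right-product-first order: Q (K^T V) keeps only a d x d state,
# costing O(T d^2) time and O(d^2) memory beyond the inputs.
out_right = Q @ (K.T @ V)

# Matrix multiplication is associative, so both orders agree.
assert np.allclose(out_left, out_right)
```

Because the d x d state, rather than the T x T matrix, is what devices would need to exchange, this reordering is also what lets sequence-parallel schemes like LASP-2 communicate sequence-length-independent states.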