Star Attention: Efficient LLM Inference over Long Sequences
Shantanu Acharya, Fei Jia, Boris Ginsburg
2024-11-26

Summary
This paper introduces Star Attention, a method that lets large language models (LLMs) process long sequences of text faster and with less memory by restructuring how attention is computed over the input.
What's the problem?
When LLMs analyze long pieces of text, they become slow and memory-hungry because self-attention compares every token with every other token, so its cost grows quadratically with the length of the input. This makes it difficult for these models to efficiently understand long inputs and generate responses from them.
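A quick back-of-the-envelope (illustrative token counts, not figures from the paper) shows how fast that quadratic cost grows:

```python
# The attention score matrix has n x n entries, so the work grows quadratically
# with sequence length n. The token counts below are illustrative assumptions.
for n in (8_000, 32_000, 128_000):
    print(f"{n:>7} tokens -> {n * n:.1e} score entries")
```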
What's the solution?
Star Attention tackles this with a two-phase approach. In the first phase, the long context is split into blocks, and each block is processed with local attention, in parallel across hosts, so the model only attends to tokens within the same block. In the second phase, the query and response tokens attend to all of the tokens cached during the first phase, giving them a global view of the context when generating the answer. This greatly reduces memory use and speeds up inference by up to 11 times while preserving 95-100% of the model's accuracy.
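To make the two phases concrete, here is a minimal single-layer NumPy sketch. The block size, tensor shapes, and variable names are illustrative assumptions, and details such as causal masking and multi-layer caching are omitted; it shows the attention pattern, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

d, block = 64, 128                                   # toy sizes (assumptions)
rng = np.random.default_rng(0)
ctx_q, ctx_k, ctx_v = rng.standard_normal((3, 4 * block, d))   # long context
query_q = rng.standard_normal((16, d))               # query/response tokens

# Phase 1: split the context into blocks; each block attends only to itself
# (local attention), so the blocks can be processed on different hosts in
# parallel. The per-block keys/values are kept as the KV cache.
kv_cache = []
for s in range(0, ctx_k.shape[0], block):
    blk = slice(s, s + block)
    _local_out = attention(ctx_q[blk], ctx_k[blk], ctx_v[blk])  # would feed later layers in a real model
    kv_cache.append((ctx_k[blk], ctx_v[blk]))

# Phase 2: the query/response tokens attend to ALL cached tokens (global attention).
k_all = np.concatenate([k for k, _ in kv_cache])
v_all = np.concatenate([v for _, v in kv_cache])
out = attention(query_q, k_all, v_all)
print(out.shape)                                     # (16, 64)
```

Because each block's attention in phase 1 is independent of the others, that phase scales linearly with context length instead of quadratically, which is where the memory and speed savings come from.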
Why it matters?
This research is important because it allows LLMs to handle longer texts efficiently, which is crucial for applications like summarizing long articles or analyzing extensive documents. Because Star Attention integrates with most Transformer-based LLMs trained with standard global attention, it can make such long-context applications faster and cheaper to run in practice.
Abstract
Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
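The abstract describes sharding attention across hosts while keeping communication low. One standard way to realize such a sequence-global phase is an online-softmax merge: each host attends over its own key/value shard and returns only a partial output plus a log-sum-exp statistic, which are then combined into the exact global result. The NumPy sketch below illustrates that merge; the shapes and helper names are assumptions, and it is not the paper's code.

```python
import numpy as np

def partial_attention(q, k, v):
    """Attention over one key/value shard; returns the shard output and its log-sum-exp."""
    scores = q @ k.T / np.sqrt(q.shape[-1])              # (n_q, n_k)
    m = scores.max(axis=-1, keepdims=True)
    lse = m + np.log(np.exp(scores - m).sum(axis=-1, keepdims=True))
    return np.exp(scores - lse) @ v, lse                 # (n_q, d), (n_q, 1)

def merge(partials):
    """Combine per-shard results into the exact attention over all shards."""
    outs, lses = zip(*partials)
    lse_all = np.stack(lses)                             # (n_shards, n_q, 1)
    lse_global = np.log(np.exp(lse_all).sum(axis=0))     # global softmax normalizer
    weights = np.exp(lse_all - lse_global)               # per-shard correction factors
    return sum(w * o for w, o in zip(weights, outs))

rng = np.random.default_rng(0)
d, n_q = 64, 8
shards = [rng.standard_normal((2, 256, d)) for _ in range(4)]   # one (K, V) cache shard per host
q = rng.standard_normal((n_q, d))

# Each host computes its partial result locally; only (output, lse) is exchanged.
merged = merge([partial_attention(q, k, v) for k, v in shards])

# Reference: exact attention over the fully concatenated cache.
k_all = np.concatenate([k for k, _ in shards])
v_all = np.concatenate([v for _, v in shards])
reference, _ = partial_attention(q, k_all, v_all)
print(np.allclose(merged, reference))                    # True
```

Because only the small per-host outputs and normalizers cross the network, the full key/value cache never has to be gathered on a single host, which is consistent with the abstract's goal of minimizing communication overhead.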