
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han

2025-02-21

Summary

This paper introduces LServe, a system that makes serving long-context large language models (LLMs) much faster. It does this by skipping attention computation on less important tokens, unifying several hardware-friendly sparse attention patterns into a single framework that covers both the prefilling and decoding stages.

What's the problem?

Serving LLMs on long inputs is expensive in two ways. During prefilling (processing the prompt), the cost of attention grows quadratically with sequence length, so very long prompts take a long time to read. During decoding (generating tokens), the model must keep a key-value (KV) cache entry for every past token, which consumes a lot of memory and slows down every generation step. It's like having to re-read an entire book, while keeping every page open on your desk, just to write the next sentence.

What's the solution?

The researchers built LServe, which combines two kinds of sparsity in one framework. First, static sparsity: half of the attention heads are converted into nearly free "streaming" heads that attend only to a small fixed set of tokens, in both prefilling and decoding. Second, dynamic sparsity: the KV cache is organized into pages, and a hierarchical, query-centric selection policy keeps only a constant number of the most relevant pages for each query, no matter how long the context is. Because unimportant tokens are skipped block by block in a hardware-friendly way, these two optimizations multiply rather than merely add up.

Why it matters?

This matters because long-context models are only useful in practice if they can be served affordably. LServe speeds up prefilling by up to 2.9x and decoding by 1.3-2.1x compared to vLLM, while preserving long-context accuracy. That makes applications like analyzing long documents, searching large codebases, or holding extended conversations far more practical, and the code is released as open source at https://github.com/mit-han-lab/omniserve.

Abstract

Large language models (LLMs) have shown remarkable potential in processing long sequences, yet efficiently serving these long-context models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is released at https://github.com/mit-han-lab/omniserve.
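To make the "query-centric KV page selection" idea concrete, here is a minimal NumPy sketch of one plausible scheme: the key cache is split into fixed-size pages, each page is summarized by per-channel min/max key values, pages are scored by an upper bound on their query-key dot products, and only the top-scoring pages are kept for attention. The function name, the exact scoring rule, and all parameters are illustrative assumptions, not LServe's actual implementation.

```python
import numpy as np

def select_kv_pages(query, keys, page_size=16, num_pages_kept=4):
    """Query-centric KV page selection (illustrative sketch, not LServe's code).

    query: (head_dim,) query vector for the current decoding step.
    keys:  (num_tokens, head_dim) cached key vectors.
    Returns the sorted indices of the pages to keep for attention.
    """
    num_tokens, _ = keys.shape
    num_pages = (num_tokens + page_size - 1) // page_size
    scores = np.empty(num_pages)
    for p in range(num_pages):
        page = keys[p * page_size:(p + 1) * page_size]
        k_min = page.min(axis=0)  # per-channel key summaries for this page
        k_max = page.max(axis=0)
        # Upper bound on q.k over the page: per channel, pick k_max when the
        # query component is non-negative, else k_min, then sum.
        scores[p] = np.where(query >= 0, query * k_max, query * k_min).sum()
    # Keep a constant number of the highest-scoring pages, in order.
    kept = np.sort(np.argsort(scores)[-num_pages_kept:])
    return kept

# Toy usage: 64 cached tokens with head_dim 8 -> 4 pages, keep the best 2.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((64, 8))
pages = select_kv_pages(q, K, page_size=16, num_pages_kept=2)
```

Because the number of kept pages is constant, the per-step attention cost stops growing with context length, which is the property the abstract highlights.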