SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator
Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang
2024-12-17

Summary
This paper introduces SepLLM, a framework that speeds up large language models (LLMs) by compressing each segment of text into the separator token that follows it, improving efficiency while maintaining comparable performance.
What's the problem?
Large language models are powerful but costly to run: attention scales quadratically with sequence length, and the key-value (KV) cache grows with every token, so long inputs mean slow inference and high memory use. At the same time, the authors observe that seemingly meaningless separator tokens (such as punctuation) receive disproportionately high attention scores, which suggests that much of the text between separators is redundant for the model to keep around.
What's the solution?
SepLLM builds on the observation that the information in each segment of text can be condensed into the separator token that closes it. By keeping the separator tokens and eliminating the redundant tokens in between, the model processes and caches far fewer tokens, which reduces memory use and speeds up inference. The framework also includes efficient kernels to accelerate training with the same sparse pattern. In experiments with the Llama-3-8B backbone, SepLLM reduced the KV cache by over 50% on the GSM8K-CoT benchmark while maintaining comparable performance, and in streaming settings it handled sequences of 4 million tokens or more.
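To make the idea concrete, here is a minimal sketch of what separator-based KV-cache pruning could look like. This is not the authors' implementation: the separator set, the number of initial tokens kept, and the size of the recent window are illustrative assumptions.

```python
# Minimal sketch of separator-aware KV-cache pruning in the spirit of SepLLM
# (not the authors' code). Separator set and window sizes are assumptions.

SEPARATORS = {".", ",", ";", ":", "!", "?", "\n"}  # illustrative separator set

def keep_mask(tokens, n_initial=3, n_recent=8):
    """Return a boolean list over positions: True = keep this token's KV entry."""
    n = len(tokens)
    keep = [False] * n
    for i in range(min(n_initial, n)):        # keep the first few tokens
        keep[i] = True
    for i in range(max(0, n - n_recent), n):  # keep a window of recent tokens
        keep[i] = True
    for i, tok in enumerate(tokens):          # separators stand in for their segment
        if tok in SEPARATORS:
            keep[i] = True
    return keep

tokens = "The cat sat on the mat . The dog barked ! It ran away ?".split()
mask = keep_mask(tokens)
kept = [t for t, k in zip(tokens, mask) if k]
print(f"kept {len(kept)} of {len(tokens)} tokens:", kept)
```

In a real decoder, the evicted positions would be dropped from the KV cache after each segment closes, so the cache size stops growing with the full context length.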
Why it matters?
This research is important because it makes large language models more accessible and efficient, allowing them to be used in more applications without requiring as much computational power. By improving how these models process information, SepLLM can help advance natural language processing technology in various fields such as chatbots, translation services, and content generation.
Abstract
Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference speed, due to their quadratic complexity. In this work, we have identified a key pattern: certain seemingly meaningless special tokens (i.e., separators) contribute disproportionately to attention scores compared to semantically meaningful tokens. This observation suggests that information of the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss. Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens. Additionally, we implement efficient kernels for training acceleration. Experimental results across training-free, training-from-scratch, and post-training settings demonstrate SepLLM's effectiveness. Notably, using the Llama-3-8B backbone, SepLLM achieves over 50% reduction in KV cache on the GSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in streaming settings, SepLLM effectively processes sequences of up to 4 million tokens or more while maintaining consistent language modeling capabilities.
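As a rough illustration of how this segment-compression idea can be expressed as a sparse attention mask during training, the sketch below restricts each query position to the first few tokens, earlier separator tokens, and a small local window, on top of the usual causal mask. The function name, window sizes, and fixed separator positions are hypothetical and not taken from the paper.

```python
# Hedged sketch of a SepLLM-style sparse attention mask (illustrative only).
import torch

def sepllm_style_mask(sep_positions, seq_len, n_initial=3, n_neighbor=4):
    """Return a (seq_len, seq_len) boolean mask; True = attention allowed."""
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    allowed = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    allowed[:, :n_initial] = True                         # initial tokens
    for j in sep_positions:                               # separator columns
        allowed[:, j] = True
    for i in range(seq_len):                              # local neighborhood
        allowed[i, max(0, i - n_neighbor + 1): i + 1] = True

    return causal & allowed

mask = sepllm_style_mask(sep_positions=[5, 11], seq_len=16)
print(mask.int())
```

A mask like this keeps the attention pattern sparse and data-dependent (it follows wherever the separators fall), which is the kind of structure the paper's custom training kernels are designed to exploit.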