From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens
Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng
2025-03-04
Summary
This paper introduces TOKENSWIFT, a new system that lets AI language models generate very long pieces of text much faster than before, without losing quality.
What's the problem?
Current AI language models take a very long time to create ultra-long texts, up to 100,000 tokens (roughly the length of a novel). This is because the model's weights must be reloaded for every new token, a growing cache of intermediate information has to be managed, and the model often ends up repeating itself.
What's the solution?
The researchers created TOKENSWIFT, which tackles these problems in three main ways. First, it reduces how often the model's weights need to be reloaded by generating several tokens per step instead of one. Second, it manages the cached information more efficiently. Third, it prevents the AI from repeating itself too much. Together, these improvements make generation more than three times faster across language models of different sizes (1.5 to 14 billion parameters) and attention architectures.
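To make the first idea concrete, here is a minimal toy sketch of the general draft-and-verify pattern behind this kind of speedup: a cheap drafter guesses several tokens ahead, and the expensive model then checks the whole guess in one pass, so its weights are loaded far less often. All function names and the toy "models" below are illustrative assumptions, not the paper's actual implementation.

```python
def target_next_token(context):
    """Stand-in for the large (expensive) model: a deterministic toy rule."""
    return (sum(context) * 31 + 7) % 100

def draft_tokens(context, k):
    """Stand-in for a cheap drafter that guesses k tokens ahead.
    Here it happens to follow the same rule as the target, so drafts
    are always accepted; a real drafter is only approximately right."""
    out, ctx = [], list(context)
    for _ in range(k):
        token = target_next_token(ctx)
        out.append(token)
        ctx.append(token)
    return out

def generate(prompt, n_tokens, k=4):
    """Generate n_tokens after prompt, counting expensive model passes."""
    seq = list(prompt)
    steps = 0  # each step is one (costly) verification pass of the big model
    while len(seq) - len(prompt) < n_tokens:
        draft = draft_tokens(seq, k)
        steps += 1
        # Verify the draft: accept the longest prefix that matches
        # what the target model itself would have produced.
        accepted, ctx = [], list(seq)
        for token in draft:
            if token != target_next_token(ctx):
                break
            accepted.append(token)
            ctx.append(token)
        if not accepted:
            # Always make progress with at least one real token.
            accepted = [target_next_token(seq)]
        seq.extend(accepted)
    return seq[len(prompt):][:n_tokens], steps
```

With a perfect drafter and k=4, generating 16 tokens takes only 4 expensive passes instead of 16, and the output is identical to plain one-token-at-a-time generation, which is the "lossless" part of the claim.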
Why it matters?
This matters because it can save hours of time when creating long texts with AI. It could make AI writing tools much more practical for tasks that need a lot of text, like writing books or long reports. The researchers also shared their code, which means other scientists can use and improve this technology, potentially leading to even faster and better AI writing tools in the future.
Abstract
Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at https://github.com/bigai-nlco/TokenSwift.