Qwen2.5-1M Technical Report
An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu
2025-01-28

Summary
This paper introduces Qwen2.5-1M, a new series of AI language models that can handle extremely long texts of up to 1 million tokens (words or word pieces) at once. These models are much better at understanding and working with long pieces of information than their predecessors.
What's the problem?
Previous AI language models were limited in how much text they could process at one time, usually around 128,000 tokens. This made it hard for them to understand very long documents or conversations, which matters for tasks like analyzing entire books or lengthy legal documents.
What's the solution?
The researchers created Qwen2.5-1M using specialized training techniques. They synthesized long texts for the model to practice on, gradually increased the sequence length during pre-training, and fine-tuned the model in multiple stages. They also built an inference framework that lets the model read and process long texts quickly without excessive computing power. This includes techniques such as sparse attention, which focuses only on the most relevant parts of the text, and chunked prefill, which breaks long inputs into smaller chunks.
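To make the "gradually increase the sequence length" idea concrete, here is a minimal Python sketch of a progressive context-length schedule. The function name and all numbers (starting length, target length, steps per stage) are illustrative assumptions, not the actual Qwen2.5-1M training configuration.

```python
# Minimal sketch (not the authors' code) of a progressive context-length
# curriculum: pre-training proceeds in stages, each stage doubling the
# maximum sequence length until the 1M-token target is reached. The start
# length, target, and steps per stage are illustrative placeholders.

def progressive_context_schedule(start_len=4096, target_len=1_048_576,
                                 steps_per_stage=1000):
    """Yield (stage, max_seq_len, steps) tuples, doubling the context
    length each stage until the target length is reached."""
    stage, seq_len = 0, start_len
    while seq_len < target_len:
        yield stage, seq_len, steps_per_stage
        seq_len *= 2
        stage += 1
    yield stage, target_len, steps_per_stage  # final stage at full length

for stage, seq_len, steps in progressive_context_schedule():
    # In a real pipeline, rebuild the data loader with the new maximum
    # sequence length here and continue pre-training for `steps` steps.
    print(f"stage {stage}: {steps} steps at max_seq_len={seq_len}")
```

The point of such a curriculum is that most training happens at cheaper, shorter lengths, with only the final stages paying the full cost of million-token sequences.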
Why does it matter?
This research matters because it allows AI to work with much longer pieces of text, which opens up new possibilities for how we can use AI in real-world situations. For example, it could help lawyers review long legal documents, assist researchers in analyzing entire scientific papers, or help writers create longer, more coherent stories. The fact that the researchers are sharing their work openly also means that other scientists and companies can build on this technology, potentially leading to even more advanced AI systems in the future.
Abstract
We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series has significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand the model context lengths by at least four times without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that Qwen2.5-1M models are greatly improved on long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini on long-context tasks and supports contexts eight times longer.
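As a rough illustration of the sparse-attention idea mentioned in the abstract, the NumPy sketch below attends only to the key/value blocks judged most relevant to the current query instead of to every cached token. The block-selection rule (mean-key similarity), block size, and top-k value are assumptions made for illustration; they are not the exact criterion used in Qwen2.5-1M's inference framework.

```python
# Minimal sketch of block-sparse attention: instead of attending to all n
# cached tokens, pick the few key/value blocks most relevant to the query.
import numpy as np

def block_sparse_attention(q, K, V, block_size=64, top_k_blocks=4):
    """q: (d,), K/V: (n, d). Attend only to the top_k_blocks whose mean
    key is most similar to q, approximating full attention at lower cost."""
    n, d = K.shape
    n_blocks = (n + block_size - 1) // block_size
    # Score each block by the dot product of q with the block's mean key.
    block_scores = np.array([
        q @ K[i * block_size:(i + 1) * block_size].mean(axis=0)
        for i in range(n_blocks)
    ])
    keep = np.sort(np.argsort(block_scores)[-top_k_blocks:])  # most relevant blocks
    idx = np.concatenate([
        np.arange(i * block_size, min((i + 1) * block_size, n)) for i in keep
    ])
    scores = (K[idx] @ q) / np.sqrt(d)          # attention logits over kept tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V[idx]                     # weighted sum of kept values

# Toy usage: 4096 cached tokens, but only 4 blocks of 64 keys are attended to.
rng = np.random.default_rng(0)
K = rng.normal(size=(4096, 128))
V = rng.normal(size=(4096, 128))
out = block_sparse_attention(rng.normal(size=128), K, V)
print(out.shape)  # (128,)
```

In this toy setup the attention cost scales with the number of kept tokens (256) rather than the full cache (4096), which is the kind of saving that makes million-token prefill tractable when combined with chunked processing and engine-level optimizations.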