QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management
Weizhou Shen, Ziyi Yang, Chenliang Li, Zhiyuan Lu, Miao Peng, Huashan Sun, Yingcheng Shi, Shengyi Liao, Shaopeng Lai, Bo Zhang, Dayiheng Liu, Fei Huang, Jingren Zhou, Ming Yan
2025-12-16
Summary
This paper introduces QwenLong-L1.5, a new AI model designed to excel at understanding and reasoning over very long texts, longer than what most current models can handle.
What's the problem?
Current AI models struggle with long-context reasoning: they have trouble making connections when the relevant details are spread across a huge amount of text. Training models to handle long contexts is also unstable and difficult, and even with large context windows, there is a limit to how much information they can process effectively.
What's the solution?
The researchers tackled this problem in three main ways. First, they created a system to automatically generate challenging training questions that require the AI to find and connect information from different parts of long documents. Second, they improved the training process itself using a new method that keeps the training stable and helps the AI learn more effectively. Finally, they added a 'memory' component to the model, allowing it to handle texts exceeding 4 million tokens by combining quick processing with a system for revisiting and recalling information as needed.
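The memory component described above can be pictured as an iterative loop: the model reads the long input chunk by chunk, folds question-relevant facts into a running memory, and then reasons over that compressed memory in a final pass. The sketch below is a minimal illustration of this idea; `call_model`, the prompts, and the chunk size are all hypothetical stand-ins, not the paper's actual implementation.

```python
def call_model(prompt: str) -> str:
    # Placeholder for an LLM inference call; here it just echoes the
    # tail of the prompt so the sketch is runnable without a model.
    return prompt[-200:]

def answer_with_memory(document: str, question: str,
                       chunk_size: int = 100_000) -> str:
    memory = ""  # running compressed summary of evidence seen so far
    for start in range(0, len(document), chunk_size):
        chunk = document[start:start + chunk_size]
        # Fold the new chunk into memory, keeping only question-relevant facts.
        memory = call_model(
            f"Memory so far:\n{memory}\n\n"
            f"New text:\n{chunk}\n\n"
            f"Update the memory with facts relevant to: {question}"
        )
    # Final single-pass reasoning over the compressed memory.
    return call_model(f"Memory:\n{memory}\n\nAnswer the question: {question}")
```

Because each pass only ever sees one chunk plus the compressed memory, the total input length can exceed the model's context window by an arbitrary factor, at the cost of extra inference calls.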
Why it matters?
This work is important because it pushes the boundaries of what AI can understand. QwenLong-L1.5 performs as well as, or even better than, leading models like GPT-5 and Gemini-2.5-Pro on tasks requiring long-context reasoning. This improvement isn't just limited to these specific tasks; it also makes the AI better at things like scientific reasoning, using tools that rely on memory, and having more extended and coherent conversations.
Abstract
We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool use, and extended dialogue.
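The abstract's second contribution, task-balanced sampling with task-specific advantage estimation, can be sketched as follows: draw an equal number of prompts from each task per batch, then normalize rewards within each task group so that tasks with different reward scales do not bias the policy update. This is a minimal illustration of the general idea; the function names, batch layout, and normalization constant are assumptions, not the paper's AEPO algorithm.

```python
import random
from statistics import mean, pstdev

def task_balanced_batch(pool, batch_size):
    """Sample an equal number of prompts from each task.

    `pool` maps task names to lists of prompts, e.g. {"qa": [...], "math": [...]}.
    """
    tasks = sorted(pool)
    per_task = batch_size // len(tasks)
    batch = []
    for task in tasks:
        batch += [(task, p) for p in random.sample(pool[task], per_task)]
    return batch

def task_specific_advantages(rewards):
    """Normalize rewards within each task: A = (r - mean_task) / (std_task + eps).

    `rewards` maps task names to lists of scalar rewards for that task's rollouts.
    Per-task normalization keeps a high-reward task from dominating the update.
    """
    advantages = {}
    for task, rs in rewards.items():
        mu, sd = mean(rs), pstdev(rs)
        advantages[task] = [(r - mu) / (sd + 1e-6) for r in rs]
    return advantages
```

For example, a task whose rewards are all identical contributes zero advantage (no gradient signal), while within any other task the advantages are centered around zero regardless of that task's absolute reward scale.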