Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao
2025-10-23
Summary
This paper introduces a new series of AI models called Ring-linear, specifically Ring-mini-linear-2.0 and Ring-flash-linear-2.0, designed to be faster and more efficient at processing long pieces of text.
What's the problem?
Large AI models are really good at understanding and generating text, but they require a lot of computing power and memory, especially when dealing with long documents or conversations. This makes them expensive to run and limits their practical use. In particular, existing models struggle with the 'long context' problem: processing very long inputs efficiently.
What's the solution?
The researchers built the Ring-linear models around a hybrid of two 'attention' mechanisms: linear attention and softmax attention. Linear attention compresses past context into a fixed-size state instead of a growing key-value cache, so the amount of data moved and the number of calculations stay nearly constant as inputs get longer, while a smaller share of softmax attention layers preserves quality. They also developed a high-performance FP8 operator library called linghe to further speed up training. By systematically sweeping the ratio between the two attention types, they identified the best-performing model structure.
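To see why the hybrid cuts I/O, here is a minimal NumPy sketch (not the paper's implementation) contrasting one decode step of softmax attention, whose key-value cache grows with every token, with one step of unnormalized linear attention, which carries only a fixed-size state:

```python
import numpy as np

def softmax_attention_step(q, K_cache, V_cache):
    """One decode step of standard softmax attention.
    The KV cache grows with every token, so memory and I/O
    scale with context length."""
    scores = K_cache @ q / np.sqrt(q.shape[0])   # (t,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_cache                           # (d,)

def linear_attention_step(q, k, v, S):
    """One decode step of (unnormalized) linear attention.
    The state S = sum_t outer(k_t, v_t) has fixed size d x d,
    so per-token cost is constant regardless of context length."""
    S = S + np.outer(k, v)                       # update running state
    return q @ S, S                              # output (d,), new state

d, T = 8, 32
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, d))

# Softmax attention must keep all T past keys/values around ...
K_cache = rng.standard_normal((T, d))
V_cache = rng.standard_normal((T, d))
out_sm = softmax_attention_step(q, K_cache, V_cache)

# ... while linear attention carries only a d x d state.
S = np.zeros((d, d))
out_lin, S = linear_attention_step(q, k, v, S)

print(out_sm.shape, out_lin.shape, S.shape)  # (8,) (8,) (8, 8)
```

Because the linear-attention state never grows, most layers can use it cheaply, and a few softmax layers are kept where full pairwise attention matters.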
Why it matters?
These models represent a significant step toward making powerful AI more accessible. They cut inference cost to as little as 1/10th that of a comparable 32 billion parameter dense model, run at less than half the cost of the original Ring series, and train 50% more efficiently thanks to the FP8 operator library. This means these advanced capabilities can reach more applications, such as analyzing long legal documents, summarizing extensive research papers, or holding longer and more complex conversations with AI assistants, all while maintaining top-level performance.
Abstract
In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library, linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
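The ratio exploration the abstract mentions can be sketched as a layer-layout search: stack mostly linear-attention layers and insert a full softmax-attention layer every so often. The 8-layer period below is an illustrative assumption, not the published Ring-linear configuration.

```python
def build_hybrid_layout(n_layers, softmax_every=8):
    """Return the attention type for each layer of a hybrid stack:
    mostly linear-attention layers, with a softmax-attention layer
    inserted periodically. `softmax_every` is the knob a ratio
    sweep would tune (the period here is an assumed placeholder)."""
    return ["softmax" if (i + 1) % softmax_every == 0 else "linear"
            for i in range(n_layers)]

layout = build_hybrid_layout(16, softmax_every=8)
print(layout.count("linear"), layout.count("softmax"))  # 14 2
```

A sweep over `softmax_every` (evaluating quality vs. inference cost at each setting) is one simple way to search for the kind of optimal linear-to-softmax ratio the report describes.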