Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang

2025-10-31

Summary

This paper introduces Kimi Linear, a new attention architecture for artificial intelligence models that aims to be faster and more memory-efficient than current methods while matching or exceeding their quality.

What's the problem?

Current AI models, especially those dealing with language, often use something called 'full attention'. This lets the model compare every part of the input with every other part when making decisions, but the cost of those comparisons grows quadratically with input length, so it becomes incredibly slow and memory-intensive on long pieces of text or complex tasks. Essentially, it's like re-reading everything you've ever read before answering each question: it gets overwhelming quickly.
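The cost difference is easy to see with back-of-envelope arithmetic. The toy functions below are illustrative (names and numbers are our own, not from the paper): full attention materializes an n-by-n score matrix per head, while a linear-attention recurrent state is a fixed-size matrix regardless of context length.

```python
def full_attention_scores(n: int) -> int:
    """Entries in the n x n attention score matrix (grows quadratically)."""
    return n * n

def linear_state_size(d_k: int, d_v: int) -> int:
    """Entries in a linear-attention RNN state (constant in context length)."""
    return d_k * d_v

# Doubling the context quadruples the full-attention score entries...
assert full_attention_scores(2048) == 4 * full_attention_scores(1024)
# ...while the linear-attention state stays the same size at any length.
assert linear_state_size(128, 128) == linear_state_size(128, 128)
```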

What's the solution?

The researchers developed Kimi Linear, which uses a 'linear attention' approach: instead of comparing every token against every other, it compresses past information into a fixed-size memory that is updated token by token. Its core component, 'Kimi Delta Attention' (KDA), adds a finer-grained gating mechanism that controls, channel by channel, what this limited memory keeps and forgets. They also designed a specialized chunkwise algorithm that makes the calculations run efficiently on modern hardware. They built a model with 3 billion activated parameters (48 billion total) that interleaves KDA layers with standard attention layers, and tested it against models using full attention, showing it performed better while using significantly less memory.
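The delta-rule idea the paper builds on can be sketched in a few lines. The function below is an illustrative toy, not the paper's implementation (names and shapes are our own): the memory is a small matrix, each token corrects it toward the new value, and a per-channel gate `alpha` gives finer-grained forgetting than the single scalar decay a Gated DeltaNet-style module would use.

```python
import numpy as np

def gated_delta_step(S, k, v, beta, alpha):
    """One recurrent step of a gated delta-rule update (illustrative toy).

    S     : (d_k, d_v) fixed-size state matrix -- the RNN 'memory'
    k, v  : key (d_k,) and value (d_v,) vectors for the current token
    beta  : scalar write strength in [0, 1]
    alpha : (d_k,) per-channel forget gates in [0, 1]; a single scalar
            here would correspond to coarser, DeltaNet-style gating
    """
    S = alpha[:, None] * S                   # channel-wise forgetting
    pred = S.T @ k                           # what the memory currently recalls for k
    return S + beta * np.outer(k, v - pred)  # delta-rule correction toward v

# With a unit-norm key, full write strength, and no forgetting,
# one step stores v exactly: querying the state with k recovers it.
S = np.zeros((2, 2))
k, v = np.array([1.0, 0.0]), np.array([0.5, -0.5])
S = gated_delta_step(S, k, v, beta=1.0, alpha=np.ones(2))
assert np.allclose(S.T @ k, v)
```

The key point of the sketch is that the state `S` never grows with the sequence, which is why linear attention avoids the per-token KV cache of full attention.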

Why does it matter?

This work is important because it offers a drop-in replacement for the standard 'full attention' mechanism used in most large language models. Kimi Linear can handle longer inputs while using up to 75% less KV-cache memory and decoding up to six times faster at million-token contexts, making it more practical for real-world applications like chatbots, translation, and other tasks that require understanding large amounts of text. The researchers also open-sourced their kernels, model checkpoints, and code so others can build upon their work.

Abstract

We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
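The abstract's "up to 75%" KV-cache reduction follows directly from the layerwise hybrid if one assumes (our assumption, for illustration) a 3:1 ratio of KDA to MLA layers, with KDA layers keeping only a fixed-size state and storing no per-token KV cache:

```python
def kv_cache_fraction(kda_layers: int, mla_layers: int) -> float:
    """Fraction of a full-attention model's KV cache that remains,
    assuming only the MLA layers cache per-token keys/values
    (illustrative back-of-envelope arithmetic, not the paper's code)."""
    total = kda_layers + mla_layers
    return mla_layers / total

# With 3 KDA layers for every MLA layer, only 1/4 of the layers cache
# per-token KV -- a 75% reduction, consistent with the abstract's figure.
assert kv_cache_fraction(3, 1) == 0.25
```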