MiMo-V2-Flash Technical Report
Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, Gang Xie, Hailin Zhang, Hanglong Lv, Hanyu Li, Heyu Chen, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianyu Wei, Jiebao Xiao, Jinhao Dong
2026-01-07
Summary
This paper introduces MiMo-V2-Flash, a new large language model designed to be both powerful and fast at reasoning and at acting as an intelligent agent.
What's the problem?
Existing large language models often need an enormous number of parameters to reach top performance, which makes them slow and expensive to run. Building a model that competes with the best while staying smaller and faster is a real challenge. It is also hard to get these models to fully absorb specialized skills, such as those learned through reinforcement learning.
What's the solution?
The researchers created MiMo-V2-Flash, which combines two kinds of attention, one that focuses on nearby tokens and one that looks at the whole context, so it can process information efficiently. They pre-trained it on a massive amount of text and then applied a new technique called Multi-Teacher On-Policy Distillation, in which specialized 'teacher' models give detailed, token-by-token feedback on the model's own outputs to help MiMo-V2-Flash become an expert. They also reuse the model's multi-token prediction module for speculative decoding at inference time, which speeds up how quickly it generates text.
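To make the distillation idea concrete, below is a minimal sketch of on-policy, token-level distillation from domain-specialized teachers. It assumes Hugging Face-style model interfaces (generate, .logits), and the routing function, the single-teacher-per-batch simplification, and the KL direction are illustrative assumptions rather than the paper's implementation.

```python
# Sketch: multi-teacher on-policy distillation with a dense, per-token signal.
# All names (route_to_teacher, the interfaces, the loss form) are assumptions.
import torch
import torch.nn.functional as F

def mopd_step(student, teachers, route_to_teacher, prompts, optimizer, max_new_tokens=256):
    """One on-policy distillation step: the student samples its own responses,
    and a domain-matched teacher supplies a per-token training signal."""
    # 1) On-policy rollout: the student generates responses to the prompts.
    with torch.no_grad():
        rollouts = student.generate(prompts, max_new_tokens=max_new_tokens, do_sample=True)

    # 2) Pick a specialized teacher for this batch (e.g., math, code, agentic tasks).
    #    Using one teacher per batch is a simplification for the sketch.
    teacher = teachers[route_to_teacher(prompts)]

    # 3) Score the student's own tokens under both models.
    student_logits = student(rollouts).logits        # [batch, seq, vocab]
    with torch.no_grad():
        teacher_logits = teacher(rollouts).logits    # same shape

    # 4) Dense token-level signal: per-token KL between the teacher and student
    #    distributions on the student-generated sequence. In practice the prompt
    #    positions would be masked out; omitted here for brevity.
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    per_token_kl = (t_logp.exp() * (t_logp - s_logp)).sum(-1)   # [batch, seq]

    loss = per_token_kl.mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The key contrast with ordinary distillation is step 1: the teacher grades sequences the student actually produces, so every generated token receives feedback rather than only a final outcome score.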
Why it matters?
MiMo-V2-Flash is significant because it reaches performance comparable to much larger models, a step towards making powerful AI more accessible and practical to run. The new training technique could also help AI models learn and master complex skills more reliably. Finally, by releasing the model weights, the researchers are encouraging further research and development in the field.
Abstract
We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, using a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP) at a native 32k context length, which is subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. In this framework, domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense, token-level rewards, enabling the student model to fully master the teachers' expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves an acceptance length of up to 3.6 tokens and a 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
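The sketch below illustrates the hybrid attention layout described above: SWA layers with a 128-token window interleaved with global-attention layers. The total layer count is an arbitrary placeholder, and the reading of the 5:1 ratio as "five SWA layers per global layer" is an assumption for illustration, not the released architecture.

```python
# Sketch of a 5:1 SWA/global interleave with a 128-token sliding window.
# NUM_LAYERS and the exact interleaving pattern are illustrative assumptions.
import torch

NUM_LAYERS = 48          # placeholder depth, for illustration only
WINDOW = 128             # sliding window size stated in the abstract
HYBRID_RATIO = 5         # assumed: 5 SWA layers for every global-attention layer

def layer_types(num_layers=NUM_LAYERS):
    """Every (HYBRID_RATIO + 1)-th layer uses global attention; the rest use SWA."""
    return ["global" if (i + 1) % (HYBRID_RATIO + 1) == 0 else "swa"
            for i in range(num_layers)]

def attention_mask(seq_len, layer_type):
    """Boolean mask: True where query position i may attend to key position j."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    causal = j <= i
    if layer_type == "global":
        return causal
    # SWA layer: additionally restrict attention to the most recent WINDOW tokens.
    return causal & (i - j < WINDOW)

if __name__ == "__main__":
    print(layer_types(12))        # ['swa', 'swa', 'swa', 'swa', 'swa', 'global', ...]
    m = attention_mask(256, "swa")
    print(m[200].sum().item())    # 128 visible keys for a mid-sequence query
```

Because only every sixth layer attends globally, most layers keep per-token attention cost and KV-cache reads bounded by the 128-token window, which is where the efficiency of the hybrid design comes from.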