MiniMax-01: Scaling Foundation Models with Lightning Attention
MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li
2025-01-15

Summary
This paper introduces MiniMax-01, a new AI model that can understand and process much longer pieces of text than other top AI models. It uses a technique called 'lightning attention' together with a 'Mixture of Experts' design, where many specialized sub-models work as a team, to handle huge amounts of information efficiently.
What's the problem?
Current AI models are really smart, but they can only work with relatively short pieces of text at a time. This limits how much information they can understand and process at once, which can be a problem for tasks that need a lot of context, like understanding entire books or long conversations.
What's the solution?
The researchers created MiniMax-01, which uses a method called 'lightning attention' that lets the model pay attention across very long texts without the usual slowdown of standard attention. They also used a 'Mixture of Experts' approach, where different parts of the AI specialize in different kinds of input and only a few of them are used for each piece of text. This allows MiniMax-01 to handle up to 4 million tokens at once during inference, far more than other leading AI models. They also made a version, MiniMax-VL-01, that can understand both text and images together.
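To make the 'lightning attention' idea more concrete, here is a minimal sketch of linear attention, the family of methods it builds on: instead of comparing every token with every other token (which costs quadratically in text length), the keys and values are compressed into a fixed-size summary that each query reads from, so cost grows only linearly. This is an illustrative sketch, not the paper's actual implementation; the function name, tensor shapes, and the elu-based feature map are assumptions chosen for clarity.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Non-causal linear attention sketch: linear cost in sequence length.

    q, k, v: (batch, seq_len, heads, head_dim). Illustrative only; the
    feature map (elu + 1) is a common choice, assumed here, not taken
    from the MiniMax-01 paper.
    """
    q = F.elu(q) + 1  # positive feature map so the normalizer stays positive
    k = F.elu(k) + 1
    # Compress all keys/values into one (head_dim x head_dim) summary per head.
    kv = torch.einsum("bshd,bshe->bhde", k, v)
    # Per-query normalizer against the summed keys.
    z = 1.0 / (torch.einsum("bshd,bhd->bsh", q, k.sum(dim=1)) + 1e-6)
    # Each query reads the shared summary instead of attending to every token.
    return torch.einsum("bshd,bhde,bsh->bshe", q, kv, z)

# Example: an 8k-token sequence; memory and compute scale with seq_len,
# not seq_len squared, which is what makes million-token contexts feasible.
q = torch.randn(1, 8192, 8, 64)
k = torch.randn(1, 8192, 8, 64)
v = torch.randn(1, 8192, 8, 64)
out = linear_attention(q, k, v)  # shape (1, 8192, 8, 64)
```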
Why it matters?
This matters because it could make AI much more useful for tasks that need a lot of context. Imagine an AI that could read an entire textbook and answer questions about it, or one that could understand a whole day's worth of conversation. This could lead to better AI assistants, more advanced research tools, and smarter systems for things like customer service or data analysis. By making their work public, the researchers are also helping other scientists build on their ideas, which could speed up progress in AI research.
Abstract
We introduce the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering a 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
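The abstract's parameter numbers (456 billion total, 45.9 billion activated per token) follow from Mixture-of-Experts routing: each token is sent to only a small subset of the 32 expert sub-networks, so most parameters sit idle for any given token. The sketch below shows generic top-k routing; the expert count of 32 comes from the abstract, while the top-k value, layer sizes, and class name are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Sketch of top-k Mixture-of-Experts routing (hypothetical sizes)."""

    def __init__(self, d_model=1024, d_ff=4096, num_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (tokens, d_model). The router scores every expert per token,
        # but only the top_k experts actually run, so activated parameters
        # are a small fraction of the total parameter count.
        probs = self.router(x).softmax(dim=-1)                # (tokens, num_experts)
        weights, idx = probs.topk(self.top_k, dim=-1)         # (tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True) # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)     # (selected, 1)
                    out[mask] += w * expert(x[mask])
        return out

# Example: 16 tokens routed through 32 experts, 2 active per token.
moe = TopKMoE()
y = moe(torch.randn(16, 1024))  # shape (16, 1024)
```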