
Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding

2025-01-23


Summary

This paper introduces Kimi k1.5, a new AI model that uses reinforcement learning to improve its performance across a wide range of tasks. It's like teaching a computer to learn from its own experiences, similar to how humans learn by trial and error.

What's the problem?

Current AI models, called large language models (LLMs), are really good at understanding and generating text, but they're limited by the amount of data available to train them. They also struggle with tasks that require decision-making and reasoning. It's like having a super-smart student who knows a lot of facts but has trouble applying that knowledge to solve new problems.

What's the solution?

The researchers developed Kimi k1.5, which uses reinforcement learning to help the model learn and improve from feedback rather than only from fixed training data. They used techniques such as long-context scaling and improved policy optimization to help the AI handle longer chains of reasoning and make better decisions, and they trained it to work with both text and images. The team tested Kimi k1.5 on various challenging tasks, including math problems and coding challenges, and found that it performed very well, sometimes even better than other top AI models.
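To make the core idea more concrete, here is a tiny, self-contained sketch of learning by trial and error from a reward signal. This is not Kimi k1.5's actual training code: the candidate answers, the single-question setup, and the REINFORCE-style update are simplifying assumptions used only to show how rewarding correct answers shifts a model's behavior.

```python
# Toy illustration of trial-and-error learning from a reward signal
# (REINFORCE-style policy gradient). Purely a sketch, not the paper's method:
# sample an answer, check it against a known-correct result, and nudge the
# policy toward answers that earned a reward.
import numpy as np

rng = np.random.default_rng(0)

candidate_answers = ["12", "15", "18", "21"]   # hypothetical answer choices
correct_answer = "18"                          # verifiable ground truth
logits = np.zeros(len(candidate_answers))      # the "policy" parameters
learning_rate = 0.5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(200):
    probs = softmax(logits)
    choice = rng.choice(len(candidate_answers), p=probs)
    reward = 1.0 if candidate_answers[choice] == correct_answer else 0.0

    # Policy-gradient update: raise the log-probability of the sampled
    # answer in proportion to the reward it received.
    grad = -probs
    grad[choice] += 1.0
    logits += learning_rate * reward * grad

print(softmax(logits))  # probability mass concentrates on the correct answer
```

In the real system the "policy" is the full language model and the reward comes from checking its final answers on tasks like math and coding problems, but the feedback loop has the same basic shape.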

Why it matters?

This matters because it shows a new way to make AI smarter and more capable. By using reinforcement learning, AI models like Kimi k1.5 can potentially keep improving themselves beyond the limits of their initial training data. This could lead to AI that's better at solving complex problems, understanding different types of information (like text and images together), and adapting to new situations. In the future, this could mean smarter AI assistants, more advanced problem-solving tools, and AI that can help with complicated tasks in fields like science, engineering, and education.

Abstract

Language model pretraining with next token prediction has proved effective for scaling compute but is limited by the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simple yet effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH 500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude 3.5 Sonnet by a large margin (up to +550%).
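One way to picture the abstract's long2short idea, using long chain-of-thought training to obtain strong short chain-of-thought behavior, is a reward that values correctness but discourages unnecessarily long responses. The sketch below is only an illustration of that intuition; the function name, the linear penalty, and the specific weights are assumptions, not the formulation the paper's long2short methods actually use.

```python
# Hedged sketch of a length-penalized reward: keep the verifiable
# correctness signal but discourage overly long chains of thought.
# The weighting and normalization here are illustrative assumptions.
def length_penalized_reward(is_correct: bool,
                            num_tokens: int,
                            max_tokens: int = 4096,
                            penalty_weight: float = 0.5) -> float:
    """Reward = correctness minus a penalty proportional to response length."""
    correctness = 1.0 if is_correct else 0.0
    length_penalty = penalty_weight * min(num_tokens / max_tokens, 1.0)
    return correctness - length_penalty

# A correct 512-token answer scores higher than a correct 4096-token answer,
# nudging the model toward concise reasoning without sacrificing correctness.
print(length_penalized_reward(True, 512))    # 0.9375
print(length_penalized_reward(True, 4096))   # 0.5
print(length_penalized_reward(False, 512))   # -0.0625
```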