DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen
2025-03-19

Summary
This paper introduces DAPO, a new open-source AI system that helps Large Language Models (LLMs) learn to reason better through a technique called reinforcement learning.
What's the problem?
The methods used to train the most advanced reasoning LLMs are often kept secret, making it difficult for other researchers to reproduce the results or build upon them.
What's the solution?
The researchers developed a new algorithm called Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) and made their entire training system open-source. They also shared the key techniques that make their algorithm successful, along with the training code and the dataset they used.
Why does it matter?
This work is important because it promotes open research and allows others to replicate and improve upon state-of-the-art reasoning results in LLMs. By providing the code, algorithm, and dataset, it makes it easier for the AI community to advance the field.
Abstract
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
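To make the "Decoupled Clip" part of the algorithm's name more concrete, here is a minimal, hedged sketch of a token-level clipped surrogate loss with separate lower and upper clip bounds. The function name, tensor shapes, and default epsilon values are illustrative assumptions, not the authors' released implementation (which lives in the open-sourced verl-based training code).

```python
# Illustrative sketch only -- not the authors' code. It shows the
# "decoupled clip" idea: the importance ratio is clipped with separate
# lower/upper bounds (eps_low, eps_high), and the surrogate is averaged
# over response tokens. Names, shapes, and defaults are assumptions.
import torch

def decoupled_clip_loss(logprobs, old_logprobs, advantages, mask,
                        eps_low=0.2, eps_high=0.28):
    """Token-level clipped surrogate loss with an asymmetric clip range.

    All inputs are [batch, seq_len] tensors; mask is 1.0 on response
    tokens and 0.0 on padding.
    """
    ratio = torch.exp(logprobs - old_logprobs)           # per-token importance ratio
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    # Average over all response tokens in the batch rather than per sample.
    return -(surrogate * mask).sum() / mask.sum().clamp(min=1.0)
```

The "Dynamic Sampling" part of the name refers to a data-side technique (over-sampling and filtering out prompts whose sampled answers are all correct or all wrong, so every batch carries a useful gradient signal), which this loss-level sketch does not cover.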