DCPO: Dynamic Clipping Policy Optimization

Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, Rihui Xin

2025-09-03

Summary

This paper introduces a new method, Dynamic Clipping Policy Optimization (DCPO), to improve how large language models learn through a process called reinforcement learning. It focuses on making the learning process more effective when the model receives feedback, or 'rewards', for its responses.

What's the problem?

Current reinforcement learning methods for large language models, like GRPO, often struggle because the signals used to update the model can drop to zero. This happens because each token's probability change is clipped within fixed limits, and because identical rewards within a group of responses are standardized away to zero, so many generated responses contribute nothing to learning. Essentially, the model doesn't get clear enough instructions on how to improve.
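To make the "zero signal" problem concrete, here is a minimal sketch (assuming GRPO-style group standardization, with illustrative numbers) of how a group of identical rewards produces all-zero advantages and therefore no gradient:

```python
# Minimal illustration (assumed GRPO-style group standardization):
# when every response in a group receives the same reward, the
# standardized advantages are all zero, so the update gives no signal.
rewards = [1.0, 1.0, 1.0, 1.0]  # e.g. all four sampled responses judged correct

mean = sum(rewards) / len(rewards)
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5

advantages = [0.0 if std == 0 else (r - mean) / std for r in rewards]
print(advantages)  # [0.0, 0.0, 0.0, 0.0] -> no learning signal from this group
```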

What's the solution?

DCPO solves this by dynamically adjusting how much the model can change its behavior when learning. Instead of fixed clipping limits, it widens or narrows the limits for each token based on how likely the model originally considered that token, which encourages exploration of less likely tokens. It also standardizes rewards using statistics accumulated across training steps rather than a single batch, so responses that would otherwise cancel out (for example, a group with identical rewards) still provide a useful learning signal. Together, these changes allow better exploration of different responses and more efficient use of the feedback the model receives.
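The following is a minimal, hypothetical PyTorch sketch of the dynamic-clipping idea. The widening rule `eps_base + alpha * (1 - prior)` and the parameter names are illustrative assumptions, not the paper's exact formula; the point is only that tokens the old policy considered unlikely get a wider clipping window, so they can still receive a meaningful update.

```python
import torch

def dynamically_clipped_objective(logp_new, logp_old, advantages,
                                  eps_base=0.2, alpha=0.1):
    """Hypothetical sketch of a PPO/GRPO-style surrogate with per-token
    clipping bounds that depend on the token's prior probability.

    eps_base: base clipping width, as in standard PPO/GRPO.
    alpha:    illustrative coefficient widening the bound for
              low-probability tokens (not the paper's exact schedule).
    """
    ratio = torch.exp(logp_new - logp_old)   # token-level probability ratio
    prior = torch.exp(logp_old)              # token probability under the old policy

    # Widen the clipping window when the old policy assigned low probability,
    # so rare tokens are not immediately clipped out of the update.
    eps = eps_base + alpha * (1.0 - prior)

    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Maximize the surrogate, so return its negative as a loss.
    return -torch.minimum(unclipped, clipped).mean()
```

The `advantages` tensor here would come from the reward-standardization step; a sketch of the cumulative version appears after the abstract below.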

Why it matters?

This research is important because it significantly improves the performance of large language models in tasks requiring reasoning. DCPO achieves better results on several challenging benchmarks compared to existing methods, meaning the models become smarter and more reliable. It also makes the learning process faster and more efficient, which is crucial for developing even more powerful AI systems.

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of large language models. However, existing approaches such as GRPO often suffer from zero gradients. This problem arises primarily due to fixed clipping bounds for token-level probability ratios and the standardization of identical rewards, which can lead to ineffective gradient updates and underutilization of generated responses. In this work, we propose Dynamic Clipping Policy Optimization (DCPO), which introduces a dynamic clipping strategy that adaptively adjusts the clipping bounds based on token-specific prior probabilities to enhance token-level exploration, and a smooth advantage standardization technique that standardizes rewards across cumulative training steps to improve the response-level effective utilization of generated responses. DCPO achieved state-of-the-art performance on four benchmarks based on four different models. In particular, DCPO achieved an Avg@1 of 46.7 under greedy decoding and an Avg@32 of 38.8 under 32 times sampling on the AIME24 benchmark, surpassing both DAPO (36.7/31.6) and GRPO (36.7/32.1) on the Qwen2.5-Math-7B model. On the AIME25 benchmark based on Qwen2.5-14B, DCPO achieves a performance of (23.3/19.0), surpassing GRPO (13.3/10.5) and DAPO (20.0/15.3). Furthermore, DCPO achieved an average 28% improvement in the nonzero advantage over GRPO in four models, doubled the training efficiency over DAPO, and significantly reduced the token clipping ratio by an order of magnitude compared to both GRPO and DAPO, while achieving superior performance. These results highlight DCPO's effectiveness in leveraging generated data more efficiently for reinforcement learning in large language models.
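As a companion to the abstract's "smooth advantage standardization", here is a hypothetical running standardizer that accumulates reward statistics across training steps (using Welford's online algorithm), so a batch of identical rewards can still yield nonzero advantages. The class name, the update rule, and the usage below are illustrative assumptions; the paper's exact accumulation scheme may differ.

```python
class SmoothAdvantageStandardizer:
    """Illustrative running standardizer: accumulates reward statistics
    across training steps instead of re-standardizing each batch alone."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)
        self.eps = eps

    def update(self, rewards):
        for r in rewards:
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)

    def standardize(self, rewards):
        self.update(rewards)
        std = (self.m2 / max(self.count - 1, 1)) ** 0.5
        return [(r - self.mean) / (std + self.eps) for r in rewards]


standardizer = SmoothAdvantageStandardizer()
standardizer.standardize([1.0, 1.0, 0.0])        # step 1: mixed rewards
print(standardizer.standardize([1.0, 1.0, 1.0])) # step 2: identical rewards,
                                                 # yet advantages are nonzero
```

Because the statistics persist across steps, a group whose rewards are all equal is compared against the cumulative history rather than only against itself, which is how such responses can still contribute a gradient.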