DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation
Speed Zhu, Jianwei Cai, Guang Chen, Lulu Wu, Saiyong Yang, Wiggin Zhou
2025-11-11
Summary
This paper focuses on improving how AI models learn to solve competitive programming problems, like those on LeetCode and Codeforces, using Reinforcement Learning with Verifiable Rewards (RLVR). Unlike RLHF, which rewards outputs based on human preference judgments, RLVR rewards the model by automatically checking its output, in this case by running the generated code against test cases.
What's the problem?
While recent AI models are getting better at reasoning, most RLVR progress has centered on math problems. Competitive programming, which demands a different kind of problem-solving, has received far less attention. In particular, the paper argues that curating good training data for these problems and deciding how to schedule training over them are major open challenges. Existing approaches often collapse into repeating the same solutions or truncate their responses before finishing.
What's the solution?
The researchers developed a two-stage training process. First, starting from a strong base model (after supervised fine-tuning), they trained it with reinforcement learning on a large, uniformly distributed collection of programming problems, using a relatively short response window to encourage exploration and curb repetition and truncation. Second, they continued training on a small set of *really* hard problems, sampling many more attempts per problem (64 rollouts instead of 8) and continuously retaining the instances the model struggled with most. They used an optimization algorithm called Group Relative Policy Optimization (GRPO) and validated the method on a model called Qwen2.5-32B.
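The core idea behind GRPO is that it scores each attempt relative to the other attempts sampled for the same problem, rather than relying on a learned value model. A minimal sketch of that group-relative normalization (an illustration, not the authors' code; the helper name and epsilon are assumptions):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its group's statistics.

    `rewards` holds the scalar rewards of all rollouts sampled for one
    prompt (e.g., 8 per prompt in stage one, 64 in stage two). The
    advantage is the reward's z-score within the group; `eps` guards
    against a zero standard deviation when all rollouts tie.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four rollouts for one prompt, with pass/fail test-case rewards.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that beat their group's average get a positive advantage and are reinforced; the rest are discouraged, so the signal adapts automatically to each problem's difficulty.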
Why it matters?
This work shows that RLVR can be highly effective for competitive-programming code generation, achieving results comparable to the best existing models. More importantly, it distills a clear set of guidelines for curating training data, encouraging exploration, and designing a curriculum that helps the model learn effectively. This is valuable because it advances the field of AI code generation and could lead to more powerful tools for programmers.
Abstract
Recent reasoning-first models (e.g., OpenAI o1, DeepSeek R1) have spurred a resurgence of interest in RLVR. Nevertheless, advances are dominated by mathematics (e.g., AIME), with competitive-programming code generation underexplored and data curation receiving less attention than RL algorithm design. We investigate how to construct RLVR datasets (i.e., RL prompts) and present practical training techniques that yield strong performance on competitive-programming code generation. Our pipeline begins with supervised fine-tuning (SFT) distilled from strong open-source models, augmented with general-purpose and reasoning-intensive data. RL then follows a two-stage process with executable, testcase-driven rewards: first, training on a large, uniformly distributed set of competitive-programming problems using Group Relative Policy Optimization (GRPO) with 8 rollouts per prompt and a relatively short response-generation window (e.g., 32k during SFT and 24k in this stage) to expand entropy and mitigate repetition and truncation; second, we perform Pre-GRPO: updating on a small, high-quality set of challenging problems with a large rollout budget (64 rollouts per prompt) under a hard-focus curriculum that continuously retains the most difficult instances throughout training. We implement our method on Qwen2.5-32B and evaluate on LeetCode and Codeforces weekly contests to avoid data leakage. The resulting model achieves state-of-the-art performance among models of similar scale and is comparable to leading systems such as DeepSeek v3.1 and Doubao-1.5-Thinking. We also examine scaling trends and observe strong RL scaling on an internal large-scale MoE model. Our study distills concise best practices for data curation, entropy expansion, and curriculum design in RLVR for competitive-programming code generation.
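The hard-focus curriculum in the abstract continuously retains the most difficult instances during stage two. A minimal sketch of one plausible retention rule, assuming per-problem rollout pass rates as the difficulty signal (the function name and `keep_fraction` knob are illustrative; the paper does not specify this exact rule):

```python
def retain_hardest(pass_rates, keep_fraction=0.5):
    """Keep the problems whose rollouts pass their test cases least often.

    `pass_rates` maps a problem id to the fraction of its rollouts
    (e.g., 64 per prompt in stage two) that passed all test cases.
    `keep_fraction` controls how aggressively the set shrinks each round.
    """
    ranked = sorted(pass_rates, key=pass_rates.get)  # hardest first
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

kept = retain_hardest({"p1": 0.9, "p2": 0.1, "p3": 0.5, "p4": 0.0})
# keeps the two problems with the lowest pass rates
```

Re-ranking after every round keeps the training set focused on whatever the current policy still fails at, which is the stated intent of the hard-focus curriculum.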