ACECODER: Acing Coder RL via Automated Test-Case Synthesis
Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, Wenhu Chen
2025-02-05
Summary
This paper introduces ACECODER, a method that uses reinforcement learning (RL) and automated test-case synthesis to improve how AI models write and debug code. It makes coder models more capable and efficient by teaching them through tests.
What's the problem?
Most AI coder models rely on supervised fine-tuning, but this approach leaves the potential of reinforcement learning largely untapped because there is not enough reliable reward data to tell the models when their code is good. This limits how effectively they can learn and how well they perform on complex coding tasks.
What's the solution?
The researchers developed ACECODER, which creates large-scale automated test cases from existing code data. These test cases help train reward models using a technique called Bradley-Terry loss, allowing coder models to learn better through reinforcement learning. They tested this method on various coding benchmarks and showed significant improvements in accuracy, with some smaller models performing as well as much larger ones. They also demonstrated that their RL approach could improve performance in just 80 optimization steps.
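The core idea of pairing sampled programs by test-case pass rate and scoring them with a Bradley-Terry loss can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `margin` threshold, the pairing rule, and the scalar rewards are all hypothetical placeholders.

```python
import math

def make_preference_pairs(samples, margin=0.4):
    """Build (chosen, rejected) pairs from sampled programs and their
    test-case pass rates. A pair is kept only when the pass-rate gap
    exceeds `margin` (a hypothetical threshold to filter noisy pairs).
    `samples` is a list of (program_id, pass_rate) tuples."""
    pairs = []
    for i, (pid_a, rate_a) in enumerate(samples):
        for pid_b, rate_b in samples[i + 1:]:
            if rate_a - rate_b >= margin:
                pairs.append((pid_a, pid_b))   # a is chosen, b rejected
            elif rate_b - rate_a >= margin:
                pairs.append((pid_b, pid_a))   # b is chosen, a rejected
    return pairs

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Standard Bradley-Terry objective on a single preference pair:
    -log(sigmoid(r_chosen - r_rejected)). Minimizing it pushes the
    reward model to score the chosen program above the rejected one."""
    diff = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# Toy example: three sampled programs with different pass rates.
samples = [("prog_a", 0.9), ("prog_b", 0.2), ("prog_c", 0.5)]
pairs = make_preference_pairs(samples)
```

In a real training run the scalar rewards come from the reward model being trained, and the loss is averaged over many pairs per batch; the sketch above only shows the pairing rule and the per-pair objective.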
Why it matters?
This research is important because it shows how reinforcement learning can make AI coder models more powerful and efficient. By using automated test cases, ACECODER helps these models write better code and debug more effectively, which could lead to faster software development and fewer coding errors in real-world applications.
Abstract
Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/models in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipeline that generates extensive (question, test-cases) pairs from existing code data. Using these test cases, we construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. This yields an average improvement of 10 points for Llama-3.1-8B-Ins and 5 points for Qwen2.5-Coder-7B-Ins through best-of-32 sampling, making the 7B model on par with the 236B DeepSeek-V2.5. Furthermore, we conduct reinforcement learning with both reward models and test-case pass rewards, leading to consistent improvements across HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow the R1-style training to start from Qwen2.5-Coder-base directly and show that our RL training can improve the model on HumanEval-plus by over 25% and MBPP-plus by 6% in merely 80 optimization steps. We believe our results highlight the huge potential of reinforcement learning in coder models.
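The best-of-32 sampling mentioned above has a simple shape: sample many candidate programs for a question, score each with the trained reward model, and keep the highest-scoring one. A minimal sketch, where `reward_model` is a hypothetical stand-in for the trained scorer:

```python
def best_of_n(candidates, reward_model):
    """Return the candidate program with the highest reward-model score.
    `candidates` is a list of program strings sampled from the policy;
    `reward_model` maps a program string to a scalar score (hypothetical
    interface, the real model conditions on the question as well)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_model)

# Toy usage: a fake reward model that prefers shorter programs.
toy_reward = lambda prog: -len(prog)
best = best_of_n(["x = 1 + 1", "x = 2"], toy_reward)
```

With n = 32 samples per question, this selection step alone accounts for the 5 to 10 point gains reported in the abstract, before any RL fine-tuning of the policy itself.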