Reward Models Enable Scalable Code Verification by Trading Accuracy for Throughput
Gabriel Orlanski, Nicholas Roberts, Aws Albarghouthi, Frederic Sala
2025-06-16
Summary
This paper shows how reward models can make verifying AI-generated code much faster by trading a small amount of accuracy for throughput. Instead of relying only on slow, thorough test suites to pick the best code solution, the authors use a faster, less accurate checker to quickly remove bad solutions before doing the more expensive detailed ranking. This speeds up the whole process dramatically while only slightly lowering accuracy.
What's the problem?
Checking computer programs generated by AI can be very slow because the most accurate method, running a full test suite on every candidate, takes a lot of time. The common assumption is that the full tests should always be used whenever they are available, with little thought given to how much slower that makes the process. This limits how quickly AI systems can produce and verify code solutions at scale.
What's the solution?
The authors introduce a generate-prune-then-rank pipeline: a fast but less accurate outcome reward model first prunes away many of the likely incorrect code solutions, and only the remaining programs are ranked using the full test suite. Because far fewer programs need full testing, the process runs over 11 times faster while staying within about 8% of the full test suite's accuracy.
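The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `score_fn` stands in for the fast reward model, `run_tests` for the full test suite, and `keep_fraction` is a hypothetical pruning knob.

```python
def generate_prune_then_rank(candidates, score_fn, run_tests, keep_fraction=0.25):
    """Prune candidates with a cheap scorer, then rank the survivors
    with an expensive verifier (e.g., a full test suite)."""
    # 1. Cheap pass: score every candidate with the fast reward model.
    scored = sorted(candidates, key=score_fn, reverse=True)
    # 2. Prune: keep only the top fraction for expensive verification.
    k = max(1, int(len(scored) * keep_fraction))
    survivors = scored[:k]
    # 3. Expensive pass: rank survivors by the thorough verifier.
    return sorted(survivors, key=run_tests, reverse=True)

# Toy usage with stand-in scorers: candidates are ints, the "reward model"
# is a noisy proxy for quality, and the "test suite" is the true objective.
best = generate_prune_then_rank(
    candidates=list(range(100)),
    score_fn=lambda p: p + (p % 7),   # cheap, imperfect score
    run_tests=lambda p: p,            # expensive, exact score
)[0]
```

The speedup comes from step 3 only ever seeing `keep_fraction` of the candidates; the accuracy cost comes from the cheap scorer occasionally pruning a correct solution in step 2.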
Why it matters?
Faster code verification lets AI systems generate and check correct programs at scale, which is important for practical use in software development and automation. By explicitly balancing speed against accuracy, this approach makes it easier to build scalable AI coding tools and to use compute for verification more efficiently.
Abstract
The standard paradigm for solving coding tasks via large language models (LLMs) is to generate-then-rank programs, where the latter step uses a verifier in the ranking process. The growing consensus is that a comprehensive verifier (e.g., a full test suite) should be prioritized over an outcome reward model (ORM) whenever possible, with little consideration given to the trade-offs involved. We aim to challenge this assumption by systematically exploring the tradeoff between speed and accuracy. We find that ORMs play a crucial role in scaling verification through trading accuracy for speed, even when a comprehensive verifier is available. Their value becomes especially apparent when used in a generate-prune-then-rank approach, where a faster but less accurate verifier removes incorrect solutions prior to ranking -- leading to a system that is 11.65x faster while only being 8.33% less accurate than the full test suite. We analyze the generate-prune-then-rank approach and show that it works by filtering out incorrect but highly ranked solutions. These findings enable the design of scalable and accurate program ranking systems.