Outcome-Refining Process Supervision for Code Generation

Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang

2024-12-24

Summary

This paper introduces Outcome-Refining Process Supervision (ORPS), a new approach for improving how AI models generate code, especially on complex programming tasks that require deep algorithmic reasoning.

What's the problem?

Large language models are good at generating code, but they often struggle with complicated tasks that need careful, step-by-step reasoning. Traditional methods judge the model only by its final output, which doesn't tell it whether the intermediate steps that led there were sound. Existing process supervision tries to fix this with learned reward models that grade each step, but training those models requires expensive annotated data, and their judgments can be unreliable.

What's the solution?

The authors propose treating the refinement of outcomes itself as the process to be supervised. Their framework uses execution signals (feedback from actually running the generated code) to ground the evaluation of each reasoning step, and a tree-structured search that keeps multiple candidate solutions alive at once so the model can compare and refine them. Tested across several models and datasets, the approach yields significant gains in correctness and efficiency without requiring extra training data for a reward model. A simplified sketch of this search loop appears below.
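
To make the idea concrete, here is a minimal Python sketch of what execution-grounded, tree-style search over candidate programs could look like. This is not the authors' ORPS implementation: the names `Candidate`, `execution_signal`, `generate_refinements`, and `refine_with_search` are hypothetical, and the LLM call that proposes refinements is left as a placeholder.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class Candidate:
    code: str                                     # candidate program source
    score: float = 0.0                            # fraction of tests passed
    history: list = field(default_factory=list)   # refinement/reasoning trace


def execution_signal(code: str, tests: list[tuple[str, str]]) -> float:
    """Run the candidate on each (stdin, expected_stdout) test; return pass rate."""
    passed = 0
    for stdin_data, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin_data, capture_output=True, text=True, timeout=2,
            )
            if result.returncode == 0 and result.stdout.strip() == expected.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # timeouts count as failures
    return passed / len(tests) if tests else 0.0


def generate_refinements(candidate: Candidate, k: int) -> list[Candidate]:
    """Placeholder for an LLM call that proposes k refined programs,
    conditioned on the current code and its execution feedback."""
    raise NotImplementedError("plug in a model here")


def refine_with_search(initial: list[Candidate],
                       tests: list[tuple[str, str]],
                       beam_width: int = 3,
                       depth: int = 4) -> Candidate:
    """Keep the best `beam_width` candidates each round and expand them."""
    beam = initial
    for _ in range(depth):
        for cand in beam:                          # ground each step in execution
            cand.score = execution_signal(cand.code, tests)
        best = max(beam, key=lambda c: c.score)
        if best.score == 1.0:                      # all tests pass: stop early
            return best
        survivors = sorted(beam, key=lambda c: c.score, reverse=True)[:beam_width]
        beam = [child
                for cand in survivors
                for child in generate_refinements(cand, k=beam_width)]
    for cand in beam:                              # final scoring pass
        cand.score = execution_signal(cand.code, tests)
    return max(beam, key=lambda c: c.score)
```

The design choice the paper argues for is visible even in this toy version: candidates are ranked by a concrete signal (test pass rate) rather than by a separately trained reward model, and several partial solutions are kept alive instead of committing to a single chain of reasoning.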

Why it matters?

This research is important because it gives AI models a structured way to learn from their own outputs as they work. By refining the reasoning process and the final code together, the approach leads to better performance on coding tasks and makes AI coding tools more useful for developers.

Abstract

Large Language Models have demonstrated remarkable capabilities in code generation, yet they often struggle with complex programming tasks that require deep algorithmic reasoning. While process supervision through learned reward models shows promise in guiding reasoning steps, it requires expensive training data and suffers from unreliable evaluation. We propose Outcome-Refining Process Supervision, a novel paradigm that treats outcome refinement itself as the process to be supervised. Our framework leverages concrete execution signals to ground the supervision of reasoning steps, while using tree-structured exploration to maintain multiple solution trajectories simultaneously. Experiments demonstrate that our approach enables even smaller models to achieve high success accuracy and performance metrics on competitive programming tasks, and creates more reliable verification than traditional reward models without requiring training PRMs. Our approach achieves significant improvements across 5 models and 3 datasets: an average 26.9% increase in correctness and 42.2% in efficiency. The results suggest that providing a structured reasoning space with concrete verification signals is crucial for solving complex programming tasks. We open-source all our code and data at: https://github.com/zhuohaoyu/ORPS