From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
Yuling Shi, Songsong Wang, Chengcheng Wan, Xiaodong Gu
2024-10-03

Summary
This paper introduces MGDebugger, a new debugging tool that fixes errors in LLM-generated code by breaking the code down into smaller subfunctions and addressing issues at multiple levels of granularity.
What's the problem?
Large language models (LLMs) can generate code, but that code often contains subtle errors that prevent it from working correctly. Existing debugging systems treat the generated program as a single monolithic unit, which makes it hard to isolate and fix specific problems, from small syntax mistakes to larger logical flaws.
What's the solution?
MGDebugger improves the debugging process by decomposing problematic code into a hierarchical tree of smaller subfunctions. This lets it analyze and fix bugs step by step, starting from the smallest pieces and working up to the whole program (see the sketch below). It also uses an LLM-simulated Python executor to trace how the code runs and identify where things go wrong. In extensive testing, MGDebugger outperformed existing methods, improving accuracy by 18.9% over seed generations on HumanEval and achieving a 97.6% repair success rate on HumanEvalFix.
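
To make the bottom-up process concrete, here is a minimal Python sketch of hierarchical, bottom-up debugging. The Subfunction structure and the run_tests and repair callables are illustrative assumptions, not the paper's actual interfaces; in MGDebugger, repair would be an LLM call and run_tests would use the simulated executor.

    # Hypothetical sketch of bottom-up hierarchical debugging (not the paper's code).
    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class Subfunction:
        """One node in the hierarchical decomposition of the buggy program."""
        name: str
        source: str
        children: List["Subfunction"] = field(default_factory=list)

    def debug_bottom_up(
        node: Subfunction,
        run_tests: Callable[[Subfunction], bool],  # assumed: tests one unit, e.g. via a simulated executor
        repair: Callable[[Subfunction], str],      # assumed: returns a candidate fixed body, e.g. from an LLM
        max_attempts: int = 3,
    ) -> Subfunction:
        """Fix the leaves first, then re-test each parent with its repaired children."""
        # Recurse into children so low-level bugs are resolved before high-level ones.
        node.children = [debug_bottom_up(c, run_tests, repair, max_attempts)
                         for c in node.children]
        # Iteratively repair this subfunction until its tests pass or attempts run out.
        for _ in range(max_attempts):
            if run_tests(node):
                break
            node.source = repair(node)
        return node

The key design choice this sketch captures is ordering: because children are repaired before their parents are re-tested, a failure at a higher level can be attributed to that level's own logic rather than to unresolved bugs below it.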
Why it matters?
This research is important because it enhances the reliability of code generated by AI systems. By making it easier to debug complex code, MGDebugger can help developers save time and reduce frustration when working with AI-generated programs, leading to more effective use of technology in software development.
Abstract
While large language models have made significant strides in code generation, the pass rate of the generated code is often bottlenecked by subtle errors, frequently requiring human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger that isolates, identifies, and resolves bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To test each subfunction effectively, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% improvement in accuracy over seed generations on HumanEval and a 97.6% repair success rate on HumanEvalFix. Furthermore, MGDebugger effectively fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.
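
For intuition, the sketch below shows the kind of line-by-line variable-state trace the LLM-simulated executor is meant to produce. This illustration uses Python's real tracing hook (sys.settrace) rather than an LLM, and the trace_variable_states helper and buggy_sum example are assumptions for demonstration only.

    # Illustrative only: a real tracer standing in for the paper's LLM-simulated executor.
    import sys

    def trace_variable_states(func, *args):
        """Run `func` and record (line number, local variables) before each line executes."""
        trace = []

        def tracer(frame, event, arg):
            if event == "line" and frame.f_code is func.__code__:
                trace.append((frame.f_lineno, dict(frame.f_locals)))  # snapshot locals
            return tracer

        sys.settrace(tracer)
        try:
            func(*args)
        finally:
            sys.settrace(None)  # always restore normal execution
        return trace

    def buggy_sum(nums):
        total = 1  # bug: the accumulator should start at 0
        for n in nums:
            total += n
        return total

    for lineno, state in trace_variable_states(buggy_sum, [1, 2, 3]):
        print(lineno, state)  # the trace shows total starting at 1, exposing the bug

MGDebugger's executor obtains a comparable trace by prompting an LLM to simulate execution rather than by actually running the code, which is what lets it inspect each subfunction's behavior during bottom-up debugging.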