Learning to Generate Unit Tests for Automated Debugging
Archiki Prasad, Elias Stengel-Eskin, Justin Chih-Yao Chen, Zaid Khan, Mohit Bansal
2025-02-04

Summary
This paper introduces UTGen and UTDebug, new methods for teaching AI models to generate better unit tests and to use those tests to debug code more effectively. It focuses on generating tests that reveal errors in faulty code while also correctly predicting the expected outputs for those tests.
What's the problem?
Writing good unit tests is hard for AI models because of a trade-off: test inputs that are most likely to expose bugs in faulty code are also the ones whose correct outputs are hardest to predict without access to a correct solution. This trade-off makes it difficult for AI models to rely on their own unit tests as feedback while debugging.
What's the solution?
The researchers developed UTGen, which teaches AI models to generate unit test inputs that expose errors along with the correct expected outputs, using only the task description and the code under test. They also created UTDebug, a debugging pipeline that feeds UTGen's tests back to the model as it repairs code. Because model-generated tests can be noisy, UTDebug spends extra compute at test time to improve the accuracy of predicted outputs, and it checks each proposed fix against multiple generated tests, rolling back edits that do not improve the pass rate so the model does not overfit to incorrect feedback (a sketch of this loop appears after the abstract below).
Why it matters?
This research matters because it helps AI models find and fix bugs in code automatically. By improving how AI creates and uses unit tests, it can make software development faster and more reliable. The improvements reported in the study, such as boosting debugging accuracy by over 12% on a harder benchmark, could lead to more efficient coding practices and fewer bugs in software.
Abstract
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to a large language model (LLM) as it iteratively debugs faulty code, motivating automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors when given faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs based on task descriptions and candidate code. We integrate UTGen into UTDebug, a robust debugging pipeline that uses generated tests to help LLMs debug effectively. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), UTDebug (i) scales UTGen via test-time compute to improve UT output prediction, and (ii) validates and back-tracks edits based on multiple generated UTs to avoid overfitting. We show that UTGen outperforms UT generation baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When used with UTDebug, we find that feedback from UTGen's unit tests improves the pass@1 accuracy of Qwen-2.5 7B on HumanEvalFix and our own harder debugging split of MBPP+ by over 3% and 12.35% (respectively) over other LLM-based UT generation baselines.
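To make the pipeline concrete, the following minimal Python sketch shows how a UTDebug-style validate-and-backtrack loop could be organized. It is an illustration of the ideas described in the abstract, not the authors' implementation: the llm object and its generate_test_input, predict_expected_output, and propose_fix methods are hypothetical placeholders, the majority-vote output prediction is just one plausible way to spend test-time compute, and the test harness assumes each task is solved by a function named solve and that predicted outputs are hashable.

from collections import Counter

def run_test(code_str, test_input, expected):
    # Execute the candidate code and compare its result on test_input to the
    # predicted expected output. Assumes the task is solved by a function
    # named `solve` (a simplification made for this sketch).
    namespace = {}
    try:
        exec(code_str, namespace)
        return namespace["solve"](test_input) == expected
    except Exception:
        return False

def predict_output(llm, task, test_input, n_samples=5):
    # Spend extra test-time compute on output prediction: sample several
    # candidate expected outputs and keep the majority vote (one plausible
    # instantiation of the test-time scaling step, assumed here).
    candidates = [llm.predict_expected_output(task, test_input) for _ in range(n_samples)]
    return Counter(candidates).most_common(1)[0][0]

def utdebug(llm, task, code, n_tests=3, max_rounds=3):
    # Generate unit tests: error-revealing inputs plus predicted expected outputs.
    tests = []
    for _ in range(n_tests):
        test_input = llm.generate_test_input(task, code)
        tests.append((test_input, predict_output(llm, task, test_input)))

    def score(candidate):
        # Number of generated tests the candidate passes.
        return sum(run_test(candidate, i, o) for i, o in tests)

    best_code, best_score = code, score(code)
    for _ in range(max_rounds):
        if best_score == len(tests):
            break  # all generated tests pass; stop editing
        failing = [(i, o) for i, o in tests if not run_test(best_code, i, o)]
        edited = llm.propose_fix(task, best_code, failing)
        # Validate and back-track: keep the edit only if it passes more
        # generated tests than the current best candidate; otherwise revert.
        edited_score = score(edited)
        if edited_score > best_score:
            best_code, best_score = edited, edited_score
    return best_code

The design choice mirrored here is that an edit is kept only if it passes strictly more of the generated tests than the current best candidate; since the predicted expected outputs may themselves be wrong, this validation-and-backtracking step keeps a single noisy test from steering the repair in the wrong direction.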