Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure
Zheyuan Yang, Zexi Kuang, Xue Xia, Yilun Zhao
2025-06-18
Summary
This paper introduces TestCase-Eval, a benchmark that measures how well large language models (LLMs) generate test cases for algorithm problems, meaning the input scenarios used to check whether a solution works correctly.
What's the problem?
Test cases generated by AI models often fail to cover enough distinct situations or to reveal mistakes in code, and some violate the problem's input constraints. This makes it hard to trust AI-generated tests when checking whether code is correct.
What's the solution?
The researchers built TestCase-Eval, a benchmark of 500 algorithm problems paired with many human-written solutions, to measure two abilities: fault coverage, how broadly an LLM's generated test cases span possible failure scenarios, and fault exposure, whether the tests catch the error in a specific incorrect solution. They evaluated 19 state-of-the-art models, showed where each succeeds and fails, and drew insights for improving future AI test-case generation.
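The idea of scoring generated tests against incorrect solutions can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual metric: the function name `fault_exposure_rate` and the toy task are invented here, and real evaluation would run submitted programs rather than Python lambdas.

```python
# Hypothetical sketch in the spirit of TestCase-Eval's fault-exposure idea:
# a buggy solution is "exposed" if at least one generated test input makes
# its output differ from the reference solution's output.

def fault_exposure_rate(test_inputs, reference, buggy_solutions):
    """Fraction of buggy solutions exposed by at least one test input."""
    if not buggy_solutions:
        return 0.0
    exposed = sum(
        1
        for buggy in buggy_solutions
        if any(buggy(x) != reference(x) for x in test_inputs)
    )
    return exposed / len(buggy_solutions)

# Toy task: compute x * 2.
reference = lambda x: x * 2
buggy_solutions = [
    lambda x: x + 2,   # wrong for every input except x = 2
    lambda x: x * 2,   # actually correct, so it can never be exposed
]

# With only x = 2, the first bug coincides with the reference and hides.
print(fault_exposure_rate([2], reference, buggy_solutions))      # 0.0
# Adding x = 5 exposes the first buggy solution (7 != 10).
print(fault_exposure_rate([2, 5], reference, buggy_solutions))   # 0.5
```

The toy shows why a single "easy" test input can let a bug slip through, which is exactly what the benchmark's exposure metric is designed to penalize.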
Why it matters?
Good test cases help programmers find errors and improve their code. By understanding how well AI can create such test cases, we can build better tools that help programmers write more reliable, error-free software.
Abstract
TestCase-Eval is a benchmark for evaluating LLMs in generating comprehensive and targeted test cases for algorithm problems.