Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure
Zheyuan Yang, Zexi Kuang, Xue Xia, Yilun Zhao
2025-06-18
Summary
This paper introduces TestCase-Eval, a benchmark that measures how well large language models (LLMs) generate test cases for algorithm problems, meaning the input scenarios used to check whether a solution works correctly.
What's the problem?
Test cases generated by AI models often fail to cover enough distinct situations or to reveal mistakes in code, and some violate the problem's input constraints. This makes it hard to trust AI-generated tests when checking whether code is correct.
What's the solution?
The researchers built TestCase-Eval, a benchmark of 500 algorithm problems paired with many human-written solutions, to measure two abilities: fault coverage, how broadly an LLM's generated test cases span possible failure scenarios, and fault exposure, whether the tests catch the error in a specific incorrect solution. They evaluated 19 state-of-the-art models, showed where each succeeds and fails, and drew insights for improving future AI test-case generation.
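The idea of scoring generated tests against incorrect solutions can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual metric: the function name `fault_exposure_rate` and the toy task are invented here, and real evaluation would run submitted programs rather than Python lambdas.

```python
# Hypothetical sketch in the spirit of TestCase-Eval's fault-exposure idea:
# a buggy solution is "exposed" if at least one generated test input makes
# its output differ from the reference solution's output.

def fault_exposure_rate(test_inputs, reference, buggy_solutions):
    """Fraction of buggy solutions exposed by at least one test input."""
    if not buggy_solutions:
        return 0.0
    exposed = sum(
        1
        for buggy in buggy_solutions
        if any(buggy(x) != reference(x) for x in test_inputs)
    )
    return exposed / len(buggy_solutions)

# Toy task: compute x * 2.
reference = lambda x: x * 2
buggy_solutions = [
    lambda x: x + 2,   # wrong for every input except x = 2
    lambda x: x * 2,   # actually correct, so it can never be exposed
]

# With only x = 2, the first bug coincides with the reference and hides.
print(fault_exposure_rate([2], reference, buggy_solutions))      # 0.0
# Adding x = 5 exposes the first buggy solution (7 != 10).
print(fault_exposure_rate([2, 5], reference, buggy_solutions))   # 0.5
```

The toy shows why a single "easy" test input can let a bug slip through, which is exactly what the benchmark's exposure metric is designed to penalize.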
Why it matters?
Good test cases help programmers find errors and improve their code. By understanding how well AI can create such test cases, we can build better tools that help programmers write more reliable, error-free software.
Abstract
TestCase-Eval is a benchmark for evaluating LLMs in generating comprehensive and targeted test cases for algorithm problems.