Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation
Yi Cui
2025-05-14

Summary
This paper introduces a new way to test how well large language models can write code: the models are given test cases as prompts, similar to how real programmers use tests to guide their coding.
What's the problem?
Most current methods for evaluating AI-generated code don't focus enough on whether the code actually works as intended when it is tested, and they don't reflect how software is built in practice, where tests play a central role.
What's the solution?
The researchers created a benchmark in which language models are given test cases and asked to write code that passes those tests. This setup checks whether the models can interpret what the code is supposed to do and learn from the examples provided in the prompt (see the sketch below).
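As a rough illustration of this kind of evaluation loop (not the paper's actual harness), the Python sketch below builds a prompt from a test file, writes the model's completion out as the implementation, and runs the tests against it. The generate argument is a placeholder for any LLM API call, and the file names solution.py and test_solution.py are assumptions made for the example.

    import pathlib
    import subprocess
    import tempfile

    def build_prompt(test_source: str) -> str:
        # The test file itself serves as the specification the model must satisfy.
        return ("Write a Python module named solution.py that makes all of the "
                "following pytest tests pass. Return only the code.\n\n" + test_source)

    def evaluate(test_source: str, generate) -> bool:
        # `generate` is a stand-in for any LLM completion call (an assumption).
        code = generate(build_prompt(test_source))
        workdir = pathlib.Path(tempfile.mkdtemp())
        (workdir / "solution.py").write_text(code)
        (workdir / "test_solution.py").write_text(test_source)
        # The task counts as solved only if every test passes.
        result = subprocess.run(["pytest", "-q"], cwd=workdir, capture_output=True)
        return result.returncode == 0

Running such a loop once per task and reporting the fraction of tasks whose tests all pass would give the simple pass-rate metric a test-driven benchmark of this kind reports.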
Why does it matter?
It helps improve AI models' ability to write useful, correct code, making them more helpful to programmers and safer to use in real-world software development.
Abstract
A new benchmark evaluates large language models in test-driven development tasks using test cases as prompts, emphasizing functionality interpretation and in-context learning.