Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol

Roham Koohestani, Philippe de Bekker, Maliheh Izadi

2025-03-12

Summary

This paper is about building better tests (benchmarks) for checking how well AI models handle software-engineering tasks like coding and bug fixing, along with tools to find these benchmarks and to improve their quality.

What's the problem?

AI coding benchmarks are scattered across many tasks and papers, making it hard to pick the right one, and many existing benchmarks are flawed or not built to a common standard.

What's the solution?

The researchers built BenchScout, a semantic search tool for finding relevant benchmarks, and BenchFrame, a method for fixing benchmark flaws. They demonstrated BenchFrame by upgrading the HumanEval coding benchmark to make it harder and more accurate.
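To give an intuition for how a semantic benchmark search like BenchScout can work, here is a minimal sketch: benchmark descriptions and a user query are mapped to embedding vectors, and candidates are ranked by cosine similarity. The toy three-dimensional vectors and benchmark labels below are illustrative assumptions, not the paper's actual data or implementation (which clusters contexts from the associated studies).

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings" of benchmark descriptions (hypothetical values).
benchmarks = {
    "HumanEval (code generation)": [0.9, 0.1, 0.0],
    "Defects4J (bug fixing)":      [0.1, 0.9, 0.1],
    "CodeSearchNet (code search)": [0.2, 0.2, 0.9],
}

# Hypothetical embedding of the query "generate Python functions".
query = [0.8, 0.2, 0.1]

# Rank benchmarks by similarity to the query, most relevant first.
ranked = sorted(benchmarks,
                key=lambda name: cosine(query, benchmarks[name]),
                reverse=True)
```

In a real system the embeddings would come from a text-embedding model rather than being hand-written, but the ranking step is the same.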

Why does it matter?

This helps developers fairly compare AI models for coding tasks, ensuring they work well in real-world projects and pushing AI to handle tougher challenges.

Abstract

Benchmarks are essential for consistent evaluation and reproducibility. The integration of Artificial Intelligence into Software Engineering (AI4SE) has given rise to numerous benchmarks for tasks such as code generation and bug fixing. However, this surge presents challenges: (1) scattered benchmark knowledge across tasks, (2) difficulty in selecting relevant benchmarks, (3) the absence of a uniform standard for benchmark development, and (4) limitations of existing benchmarks. In this paper, we review 173 studies and identify 204 AI4SE benchmarks. We classify these benchmarks, analyze their limitations, and expose gaps in practices. Based on our review, we created BenchScout, a semantic search tool to find relevant benchmarks, using automated clustering of the contexts from associated studies. We conducted a user study with 22 participants to evaluate BenchScout's usability, effectiveness, and intuitiveness, which received average scores of 4.5, 4.0, and 4.1 out of 5. To advance benchmarking standards, we propose BenchFrame, a unified method to enhance benchmark quality. As a case study, we applied BenchFrame to the HumanEval benchmark and addressed its main limitations. This led to HumanEvalNext, featuring (1) corrected errors, (2) improved language conversion, (3) expanded test coverage, and (4) increased difficulty. We then evaluated ten state-of-the-art code language models on HumanEval, HumanEvalPlus, and HumanEvalNext. On HumanEvalNext, models showed a pass@1 score reduction of 31.22% and 19.94% compared to HumanEval and HumanEvalPlus, respectively.
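For readers unfamiliar with the pass@1 metric reported above: pass@k is the probability that at least one of k sampled solutions passes all tests for a problem. The standard unbiased estimator (introduced with the original HumanEval benchmark) is 1 - C(n-c, k)/C(n, k), where n samples are drawn per problem and c of them pass. A small sketch, with illustrative numbers rather than the paper's data:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples drawn per problem, c = samples that passed."""
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset
        # must contain a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 5 pass -> pass@1 = 0.5.
score = pass_at_k(n=10, c=5, k=1)
```

For k = 1 this reduces to c/n, the fraction of samples that pass; benchmark-level pass@1 is the mean of this quantity over all problems.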