GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

Shufan Jiang, Chios Chen, Zhiyang Chen

2026-04-08

Summary

This paper investigates how well artificial intelligence, specifically large language models, can find bugs in software without human help. It focuses on game development as a good example of a complex software area.

What's the problem?

Finding bugs in software is hard, especially while the software is actually running and interacting with its environment, which is a very different task from writing code from scratch. Current AI models are good at *making* code, but much weaker at *finding* mistakes in existing, running code. Until now, there was no standardized way to test how well these AI models could actually discover bugs in a realistic setting.

What's the solution?

The researchers created a new testing ground called GBQA, which includes 30 different games containing 124 known bugs verified by human experts. They built these games and injected the bugs using a multi-agent system that can produce test cases at scale. They also designed an AI agent that plays the games and tries to find the bugs, using a method where the AI thinks and acts step by step and remembers what it has already tried. They then evaluated several powerful AI models on this benchmark.
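The agent loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration of a multi-round ReAct-style loop with a memory of past attempts; all names here (`Memory`, `react_bug_hunt`, the `env`/`query_llm` interfaces) are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a multi-round ReAct-style bug-hunting loop
# with a memory mechanism. Interfaces are illustrative, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Remembers actions already tried and bugs found so far."""
    tried: list = field(default_factory=list)
    bugs: list = field(default_factory=list)

    def summary(self) -> str:
        return f"tried={self.tried}, bugs={self.bugs}"

def react_bug_hunt(env, query_llm, max_rounds: int = 10) -> list:
    """Repeat a think-act-observe cycle, accumulating suspected bugs."""
    memory = Memory()
    observation = env.reset()
    for _ in range(max_rounds):
        # Reason: ask the model for the next action, given the current
        # observation and everything tried so far (the memory mechanism).
        action = query_llm(observation, memory.summary())
        memory.tried.append(action)
        # Act: execute the chosen action in the running game.
        observation, suspected_bug = env.step(action)
        # Observe: record any anomaly the model flags as a bug.
        if suspected_bug:
            memory.bugs.append(suspected_bug)
    return memory.bugs
```

Feeding the memory summary back into each prompt is what enables the long-horizon exploration the paper describes: the model can avoid repeating actions and build on earlier observations across many rounds.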

Why it matters?

This work is important because it shows that even the best AI models still struggle to find bugs on their own. The GBQA benchmark provides a way to measure progress in this area and encourages further research into making AI better at autonomous software testing, which could eventually lead to more reliable software.

Abstract

The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for large language models (LLMs). In this paper, we take game development as a representative domain and introduce the Game Benchmark for Quality Assurance (GBQA), a benchmark containing 30 games and 124 human-verified bugs across three difficulty levels, to evaluate whether LLMs can autonomously detect software bugs. The benchmark is constructed using a multi-agent system that develops games and injects bugs in a scalable manner, with human experts in the loop to ensure correctness. Moreover, we provide a baseline interactive agent equipped with a multi-round ReAct loop and a memory mechanism, enabling long-horizon exploration of game environments for bug detection across different LLMs. Extensive experiments on frontier LLMs demonstrate that autonomous bug discovery remains highly challenging: the best-performing model, Claude-4.6-Opus in thinking mode, identifies only 48.39% of the verified bugs. We believe GBQA provides an adequate testbed and evaluation criterion, and that further progress on it will help close the gap in autonomous software engineering.