PlayCoder: Making LLM-Generated GUI Code Playable

Zhiyuan Peng, Wei Tao, Xin Yin, Chenhao Ying, Yuan Luo, Yiwen Guo

2026-04-22

Summary

This paper investigates how well large language models, which are good at writing code, can create interactive graphical user interface (GUI) applications like games and other programs with buttons and windows.

What's the problem?

Current methods for testing code generation focus on whether the code simply *works* against specific test cases. This falls short for GUIs, because a GUI is defined by how a user interacts with it over time: a program might pass an initial test but break when buttons are clicked in a particular order. Existing benchmarks aren't designed to evaluate these interactive flows or the state logic underlying a GUI.

What's the solution?

The researchers created a new benchmark called PlayEval, which includes 43 real-world GUI applications written in Python, TypeScript, and JavaScript. They also developed a new success metric, called Play@k, which checks whether at least one of k attempts by the model can be played through end-to-end without logical errors. To test these applications automatically, they built PlayTester, an LLM-based agent that plays the generated GUI like a user would and flags logic violations. Finally, they introduced PlayCoder, a system that uses multiple AI agents to repeatedly generate, test, and repair GUI code in a closed loop until it plays correctly.
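The summary doesn't give Play@k's exact formula, but assuming it mirrors the standard unbiased pass@k estimator (generate n candidates, count the c that play through cleanly, estimate the chance that at least one of k sampled candidates succeeds), it could be sketched like this:

```python
from math import comb

def play_at_k(n: int, c: int, k: int) -> float:
    """Estimate Play@k, assuming the pass@k-style unbiased estimator:
    n  = candidates generated per task,
    c  = candidates that play end-to-end without logical errors,
    k  = budget of attempts being scored.
    Returns the probability that at least one of k sampled candidates
    (drawn without replacement from the n) is playable."""
    if n - c < k:
        # Too few failures to fill a sample of k, so success is certain.
        return 1.0
    # 1 minus the probability that all k sampled candidates fail.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 10 candidates, 2 playable, scored at k = 3.
print(round(play_at_k(10, 2, 3), 3))  # → 0.533
```

Averaging this per-task estimate over all 43 PlayEval applications would give a benchmark-level Play@k score; the specific estimator here is an assumption, not taken from the paper.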

Why it matters?

This work is important because it shows that even though large language models often generate GUI code that *compiles* (meaning it has no basic syntax errors), they frequently make logical errors that cause the application to malfunction during actual use. PlayEval and PlayCoder provide tools to better evaluate and improve the ability of these models to create functional, usable GUI applications, a crucial step toward using AI to build more complex interactive software.

Abstract

Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains insufficiently studied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications because these systems are interactive, event-driven, and require correct state transitions across sequences of user actions. Their evaluation therefore should consider interaction flows and UI logic rather than only pass/fail outcomes. To study this problem, we introduce PlayEval, a repository-aware benchmark built from 43 multilingual GUI applications in Python, TypeScript, and JavaScript. Unlike prior GUI benchmarks that are difficult to adapt to desktop environments, PlayEval covers six major GUI application categories and directly supports code-generation evaluation. We further propose Play@k, a metric that measures whether at least one of *k* generated candidates can be played end-to-end without logical errors. To support reliable evaluation, we develop PlayTester, an LLM-based agent that performs task-oriented GUI playthroughs and detects logic violations automatically. Experiments on 10 state-of-the-art code LLMs show that, despite high compilation rates, they achieve near-zero Play@3, revealing major weaknesses in generating logically correct GUI applications. To address this limitation, we present PlayCoder, a multi-agent, repository-aware framework that generates, evaluates, and iteratively repairs GUI application code in a closed loop. PlayCoder substantially improves both functional correctness and semantic alignment for open-source and closed-source models, reaching up to 38.1% Exec@3 and 20.3% Play@3. Case studies further show that it can uncover silent logic bugs missed by traditional metrics and fix them through targeted edits.