GEBench: Benchmarking Image Generation Models as GUI Environments
Haodong Li, Jingwei Wu, Quan Sun, Guopeng Li, Juanxi Tian, Huanyu Zhang, Yanlin Lai, Ruichuan An, Hongbo Peng, Yuhong Dai, Chenxi Li, Chunmei Qing, Jia Wang, Ziyang Meng, Zheng Ge, Xiangyu Zhang, Daxin Jiang
2026-02-10
Summary
This paper introduces GEBench, a new way to test whether image generation models can act as realistic, functional computer interfaces: predicting how the windows and buttons you see on your phone or computer should change when you interact with them.
What's the problem?
Currently, we're pretty good at getting AI to *draw* what a computer screen might look like after you click a button, but we haven't had good tools to check whether the AI makes logical changes over a series of actions. Existing tests mostly measure how pretty the picture is, not whether the interface actually *works* as you'd expect when you interact with it over time.
What's the solution?
The researchers created GEBench, a collection of 700 carefully curated scenarios across five task categories that test a model's ability to show a sequence of GUI changes based on user instructions, covering single-step interactions, multi-step trajectories, and pointing to the right location on the screen. They also developed a new scoring system, GE-Score, that rates each generated transition on five things: does the model achieve the goal, does the interaction make sense, is the content consistent, does the interface look believable, and is the image quality good. They then evaluated current models using this benchmark.
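The summary does not describe how GE-Score is computed in practice, but the five dimensions suggest a simple aggregation: rate every generated transition on each dimension, then average per dimension over the trajectory. The sketch below is a minimal illustration of that idea; the 1-5 scale, the `Transition` structure, and the caller-supplied `judge` function (for example, a vision-language-model prompt) are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Sequence

# The five GE-Score dimensions named in the paper.
DIMENSIONS = [
    "goal_achievement",    # did the generated screen reach the instructed goal?
    "interaction_logic",   # does the change follow logically from the user action?
    "content_consistency", # is unrelated content preserved across steps?
    "ui_plausibility",     # does the result look like a believable GUI?
    "visual_quality",      # overall image fidelity
]

@dataclass
class Transition:
    """One step of a trajectory: screen before, user action, screen after."""
    prev_frame: object   # e.g., an image path or array (illustrative)
    action: str          # e.g., "click the settings icon"
    next_frame: object

def score_trajectory(
    transitions: Sequence[Transition],
    judge: Callable[[Transition, str], float],
) -> Dict[str, float]:
    """Rate every transition on every dimension, then average per dimension."""
    return {
        dim: mean(judge(t, dim) for t in transitions)
        for dim in DIMENSIONS
    }

# Usage with a dummy judge that always returns the top score; a real judge
# would compare the two frames against the action and instruction.
if __name__ == "__main__":
    demo = [Transition("frame_0.png", "tap the wifi toggle", "frame_1.png")]
    print(score_trajectory(demo, judge=lambda t, dim: 5.0))
```

Averaging per dimension rather than collapsing everything into one number keeps the failure modes visible, for instance a model that draws plausible screens but ignores the instructed action would score high on visual quality and low on goal achievement.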
Why it matters?
This work matters because it shows that while current models can generate individual screen states well, they struggle to maintain a consistent and logical flow when asked to simulate a longer interaction. Pinpointing specific weaknesses, such as interpreting icons, rendering text correctly, and locating elements within the interface, helps researchers focus on the improvements needed to build truly functional and realistic generative GUI environments.
Abstract
Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single-step transitions, they struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high-fidelity generative GUI environments. The code is available at: https://github.com/stepfun-ai/GEBench.