WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models
Xinping Lei, Xinyu Che, Junqi Xiong, Chenchen Zhang, Yukai Huang, Chenyu Zhou, Haoyang Huang, Minghao Liu, Letian Zhu, Hongyi Ye, Jinhua Hao, Ken Deng, Zizheng Zhan, Han Li, Dailin Li, Yifan Yao, Ming Sun, Zhaoxiang Zhang, Jiaheng Liu
2026-04-21
Summary
This paper introduces a new way to test how well large language models can build websites from scratch, edit existing ones, and fix broken code. Rather than checking only whether the code is correct, it also measures how the resulting site *looks* and how it responds to user interaction.
What's the problem?
Currently, testing these AI coding assistants focuses on whether the code they write actually *runs* without errors. However, building a good website involves more than just working code; it needs to look good, be easy to use, and respond correctly to user actions. Existing tests don't really measure these important aspects of web development, leaving a gap in understanding how well these models truly perform.
What's the solution?
The researchers created a benchmark called WebCompass. This benchmark gives the AI models tasks involving creating websites from text descriptions, editing existing websites based on instructions (sometimes with images or videos showing what needs to be changed), and fixing broken websites. Importantly, they don't just check if the code runs; they have the AI *use* the website it created in a real web browser, automatically testing how it looks and behaves, similar to how a human tester would. They also use other AI models to help judge the quality of the edits and repairs.
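The core idea of the Agent-as-a-Judge loop (explore the rendered page, synthesize targeted test cases, execute them, and score the result) can be sketched in miniature. This is an illustrative stand-in, not the paper's implementation: the real system drives an actual browser through the Model Context Protocol, whereas here a stdlib HTML parser plays the role of page exploration, and the heuristic checks in `synthesize_test_cases` are invented for demonstration.

```python
from html.parser import HTMLParser

# Tags a judging agent might treat as interactive (assumption for this sketch)
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}

class InteractiveElementFinder(HTMLParser):
    """Collect interactive elements from generated HTML.

    Stand-in for the real agent's browser-based exploration of the page."""
    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            self.elements.append((tag, dict(attrs)))

def synthesize_test_cases(elements):
    """Turn discovered elements into simple acceptance checks.

    Purely illustrative heuristics; the paper's agent iteratively
    generates much richer, behavior-level test cases."""
    cases = []
    for tag, attrs in elements:
        if tag == "a":
            cases.append(("link has an href target", bool(attrs.get("href"))))
        elif tag == "button":
            cases.append(("button declares a handler or type",
                          "onclick" in attrs or "type" in attrs))
        else:
            cases.append((f"{tag} is addressable",
                          bool(attrs.get("name") or attrs.get("id"))))
    return cases

def judge(html_source):
    """Explore -> synthesize tests -> execute -> return pass rate in [0, 1]."""
    finder = InteractiveElementFinder()
    finder.feed(html_source)
    cases = synthesize_test_cases(finder.elements)
    passed = sum(ok for _, ok in cases)
    return passed / len(cases) if cases else 0.0

page = """
<button type="submit">Save</button>
<a href="/about">About</a>
<input name="email">
"""
print(judge(page))  # 1.0: all three synthesized checks pass
```

In the actual benchmark this loop runs against a live browser session, so the "checks" are real interactions (clicks, form submissions, navigation) rather than static attribute inspection; the sketch only conveys the explore/synthesize/score control flow.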
Why it matters?
This work is important because it provides a more realistic and comprehensive way to evaluate AI coding assistants for web development. By focusing on the entire web development lifecycle – creating, editing, and fixing – and including visual quality and user interaction, it helps identify the strengths and weaknesses of these models and guides future improvements, ultimately leading to better AI tools for building websites.
Abstract
Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. Recognizing that real-world web coding is an iterative cycle of generation, editing, and repair, WebCompass spans three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Through a multi-stage, human-in-the-loop pipeline, we curate instances covering 15 generation domains, 16 editing operation types, and 11 repair defect types, each annotated at Easy/Medium/Hard levels. For evaluation, we adopt a checklist-guided LLM-as-a-Judge protocol for editing and repair, and propose a novel Agent-as-a-Judge paradigm for generation that autonomously executes generated websites in a real browser, explores interactive behaviors via the Model Context Protocol (MCP), and iteratively synthesizes targeted test cases, closely approximating human acceptance testing. We evaluate representative closed-source and open-source models and observe that: (1) closed-source models remain substantially stronger and more balanced; (2) editing and repair exhibit distinct difficulty profiles, with repair preserving interactivity better but remaining execution-challenging; (3) aesthetics is the most persistent bottleneck, especially for open-source models; and (4) framework choice materially affects outcomes, with Vue consistently challenging while React and Vanilla/HTML perform more strongly depending on task type.