Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

Zehai He, Wenyi Hong, Zhen Yang, Ziyang Pan, Mingdao Liu, Xiaotao Gu, Jie Tang

2026-04-02

Summary

This paper introduces a new way to test how well artificial intelligence can build websites from just images, going beyond simple code generation to full, interactive website creation.

What's the problem?

While AI models are getting better at writing code, there hasn't been a good, comprehensive test to see if they can actually build complete websites from start to finish, based on a visual design like a screenshot. Existing tests focus on small parts of the process, not the whole thing, and don't really reflect the complexity of real-world web development.

What's the solution?

The researchers created a benchmark called Vision2Web, which includes 193 different website-building tasks, ranging from turning a single image into code to building a full website with multiple pages and interactive features. They also developed an automated verification system that checks whether the AI-built websites actually work: a GUI agent interacts with each site to test its behavior, while an AI judge compares the result against the original design.
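The paper does not publish its scoring code, but the two-component idea can be illustrated with a rough sketch: a GUI agent reports pass/fail on interaction test cases, an AI judge gives a visual-fidelity score, and the two are combined. The data classes, function names, and the 50/50 weighting below are all illustrative assumptions, not the paper's actual method.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    description: str   # e.g. "clicking a navbar link routes to the right page"
    passed: bool       # outcome reported by the GUI agent verifier

def agent_pass_rate(cases: list[TestCase]) -> float:
    """Fraction of interaction test cases the GUI agent verified as working."""
    if not cases:
        return 0.0
    return sum(c.passed for c in cases) / len(cases)

def combined_score(pass_rate: float, judge_score: float,
                   w_agent: float = 0.5) -> float:
    """Blend the agent's pass rate with the AI judge's visual-fidelity score
    (both in [0, 1]). The equal weighting is an illustrative assumption."""
    return w_agent * pass_rate + (1 - w_agent) * judge_score

cases = [TestCase("navbar link routes to /about", True),
         TestCase("form submission shows a confirmation", False)]
print(combined_score(agent_pass_rate(cases), judge_score=0.8))  # 0.65
```

A real pipeline would drive a browser (e.g. via an automation tool) and query a vision-language model instead of taking these values as givens, but the aggregation step would look similar.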

Why it matters?

This work is important because it provides a realistic and challenging test for AI website builders. The results show that even the best AI models still struggle with the more complex aspects of web development, highlighting areas where further research is needed to create truly automated website creation tools.

Abstract

Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development, spanning static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises a total of 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple vision-language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.