WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation

Mingde Xu, Zhen Yang, Wenyi Hong, Lihang Pan, Xinyue Fan, Yan Wang, Xiaotao Gu, Bin Xu, Jie Tang

2025-11-13


Summary

This paper introduces WebVIA, a new system designed to automatically create working websites from design images, focusing on making those websites actually *do* things, not just look pretty.

What's the problem?

Currently, turning a design for a website (like a mockup in Photoshop) into actual code is a tedious and repetitive task for developers. Recent AI models can generate the basic structure of a website from an image, but they only create static pages – meaning they don't have buttons that work, forms that submit, or any other interactive elements. They can't handle the 'doing' part of a website.

What's the solution?

The researchers created WebVIA, which works in three steps. First, an 'exploration agent' interacts with the website as a user would, taking screenshots of each state it reaches and recording what can be clicked and how the page changes in response. Second, a dedicated AI model, WebVIA-UI2Code, uses these screenshots to generate the code for an interactive website. Finally, a 'validation module' checks whether the generated code actually behaves as expected. The researchers also fine-tuned the model specifically for this task, which improved it over the base models it was built on.
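The three steps can be pictured with a small toy sketch. This is only an illustration under loud assumptions: the page is modeled as a hand-written state table (`TOY_UI`) instead of a real browser, `generate_code` is a placeholder that copies the observed transitions rather than calling any model, and all function names are hypothetical, not taken from the WebVIA code.

```python
from collections import deque

# Hypothetical stand-in for a live web page: each UI state maps
# clickable actions to the state they lead to.
TOY_UI = {
    "home": {"click_menu": "menu_open", "click_login": "login_form"},
    "menu_open": {"click_close": "home"},
    "login_form": {"click_cancel": "home"},
}


def explore(ui, start):
    """Step 1 (exploration): breadth-first walk that records every
    (state, action, next_state) transition reachable from `start`."""
    seen = {start}
    queue = deque([start])
    transitions = []
    while queue:
        state = queue.popleft()
        for action, next_state in sorted(ui.get(state, {}).items()):
            transitions.append((state, action, next_state))
            if next_state not in seen:
                seen.add(next_state)
                queue.append(next_state)
    return transitions


def generate_code(transitions):
    """Step 2 (generation): placeholder for the UI2Code model. Here the
    'generated app' is just a transition table built from the trace."""
    app = {}
    for state, action, next_state in transitions:
        app.setdefault(state, {})[action] = next_state
    return app


def validate(app, transitions):
    """Step 3 (validation): replay each recorded interaction against the
    generated app and return the ones that do not match."""
    return [
        (state, action, expected)
        for state, action, expected in transitions
        if app.get(state, {}).get(action) != expected
    ]


transitions = explore(TOY_UI, "home")
app = generate_code(transitions)
failures = validate(app, transitions)
print(len(transitions), "transitions,", len(failures), "failures")
```

In the real system the exploration step would drive a browser and capture screenshots, and validation would run the generated HTML/CSS/JavaScript; the sketch only shows how the recorded interactions from step one can double as the test cases for step three.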

Why does it matter?

This work is important because it moves us closer to a future where AI can automate a significant part of web development. By creating websites that are not only visually accurate to the design but also fully functional, WebVIA could save developers a lot of time and effort, and potentially allow people with no coding experience to build their own interactive websites.

Abstract

User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. While recent Vision-Language Models (VLMs) automate UI-to-Code generation, they generate only static HTML/CSS/JavaScript layouts lacking interactivity. To address this, we propose WebVIA, the first agentic framework for interactive UI-to-Code generation and validation. The framework comprises three components: 1) an exploration agent to capture multi-state UI screenshots; 2) a UI2Code model that generates executable interactive code; 3) a validation module that verifies the interactivity. Experiments demonstrate that WebVIA-Agent achieves more stable and accurate UI exploration than general-purpose agents (e.g., Gemini-2.5-Pro). In addition, our fine-tuned WebVIA-UI2Code models exhibit substantial improvements in generating executable and interactive HTML/CSS/JavaScript code, outperforming their base counterparts across both interactive and static UI2Code benchmarks. Our code and models are available at https://zheny2751-dotcom.github.io/webvia.github.io/.
