WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

Zimu Lu, Yunqiao Yang, Houxing Ren, Haotian Hou, Han Xiao, Ke Wang, Weikang Shi, Aojun Zhou, Mingjie Zhan, Hongsheng Li

2025-05-13

Summary

This paper introduces WebGen-Bench, a new set of tests designed to see how well AI models can create complete, working websites from nothing but a written description.

What's the problem?

The problem is that while AI models are getting better at writing code, building a full website from scratch, with many files and interactive features, is still a big challenge for them. Until now, there hasn't been a good way to measure or compare how well different models handle this complex task.

What's the solution?

The researchers created WebGen-Bench, a benchmark that checks whether AI agents can generate all the code needed for a multi-file website and whether the resulting site actually works as described. When they tested a range of models, they found that open-source models actually did better than some of the closed, proprietary ones.
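To make the idea of "checking whether the site actually works" concrete, here is a minimal sketch of how a benchmark like this might score one generated website: each task comes with functional test cases, an automated agent marks each one as passed or failed, and accuracy is the pass rate. All names and the grading scheme here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of WebGen-Bench-style scoring (assumed design, not
# the paper's real code): a generated site is graded by the fraction of
# functional test cases an automated UI-testing agent marks as passed.

from dataclasses import dataclass

@dataclass
class TestCase:
    description: str   # e.g. "clicking 'Add to cart' updates the cart count"
    passed: bool       # verdict from the automated testing agent

def score_website(cases: list[TestCase]) -> float:
    """Return the fraction of test cases the generated site passes."""
    if not cases:
        return 0.0
    return sum(c.passed for c in cases) / len(cases)

# Example: a site that passes 2 of 3 functional checks scores ~0.67.
cases = [
    TestCase("homepage renders a navigation bar", True),
    TestCase("search box filters the product list", True),
    TestCase("checkout form rejects an empty email", False),
]
print(round(score_website(cases), 2))  # → 0.67
```

Averaging these per-site scores over every task in the benchmark would give the kind of overall accuracy number used to compare models.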

Why does it matter?

This matters because it helps developers and researchers know which AI models are best for building real websites, and it can speed up the process of making new web tools and services. It also encourages more progress in open-source AI, which anyone can use and improve.

Abstract

WebGen-Bench is a benchmark that measures LLM-based agents' ability to generate multi-file website codebases; in its evaluation, open-source models achieved higher accuracy than proprietary ones.