ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Shihui Hu, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Haotian Zhu, Yuanxing Zhang, Yuhao Jiang, Yue Zhang, Zenan Xu, Bohui Zhai, Guoxiang He, Hebin Li, Jie Zhao, Le Zhang, Lingyun Tan
2025-07-08
Summary
This paper introduces ArtifactsBench, a benchmark that evaluates AI-generated code by how well it produces visual and interactive results, not just by whether it runs. A large multimodal language model acts as an automated judge that assesses the quality of the rendered code artifacts.
What's the problem?
Current evaluations of AI-generated code mostly check whether the code compiles or passes functional tests, but they do not systematically measure the quality of the resulting visuals or interactive behavior, which is what matters for many real applications such as web pages and UI components.
What's the solution?
The researchers created ArtifactsBench to test both the functionality and the visual-interactive quality of code outputs. Generated artifacts are rendered, and a multimodal language model that can interpret text and images judges how well the code works and how visually appealing and usable the result is against the task requirements. The authors report that this automated judging closely matches human preferences.
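The judging loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the function names, rubric dimensions, and the stubbed judge call are all assumptions for the sake of the example; a real system would render the artifact and send the screenshots to a multimodal LLM.

```python
# Hypothetical sketch of an MLLM-as-judge scoring loop in the spirit of
# ArtifactsBench. All names (RUBRIC, build_judge_prompt, call_mllm_judge)
# are illustrative assumptions, not the paper's actual API.

RUBRIC = ["functionality", "visual_quality", "interactivity"]

def build_judge_prompt(task: str, screenshots: list) -> str:
    """Compose a prompt asking the multimodal judge to score each
    rubric dimension from 0 to 10, given the task description and
    the rendered screenshots of the artifact."""
    dims = ", ".join(RUBRIC)
    return (
        f"Task: {task}\n"
        f"You are shown {len(screenshots)} screenshots of the rendered artifact.\n"
        f"Score each of [{dims}] from 0 to 10 and justify briefly."
    )

def aggregate_scores(scores: dict) -> float:
    """Average the per-dimension scores into one artifact score."""
    return sum(scores[d] for d in RUBRIC) / len(RUBRIC)

def call_mllm_judge(prompt: str) -> dict:
    """Stub: a real system would call a multimodal LLM here and
    parse its per-dimension scores from the response."""
    return {"functionality": 8, "visual_quality": 7, "interactivity": 9}

if __name__ == "__main__":
    prompt = build_judge_prompt(
        "Build an interactive bar chart", ["frame0.png", "frame1.png"]
    )
    scores = call_mllm_judge(prompt)
    print(aggregate_scores(scores))  # 8.0
```

The key design choice this illustrates is that the judge sees rendered screenshots rather than raw source code, so visual and interactive quality can be scored directly.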
Why it matters?
This matters because it pushes AI code generation beyond merely runnable code: models are rewarded for producing better, more engaging user experiences, which is key for applications like games, websites, and design tools.
Abstract
ArtifactsBench is a new benchmark for evaluating the visual and interactive quality of code-generated artifacts using a Multimodal LLM-as-Judge, achieving high consistency with human preferences.