InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

Qiyao Wang, Haoran Hu, Longze Chen, Hongbo Wang, Hamid Alinejad-Rokny, Yuan Lin, Min Yang

2026-05-01

Summary

This paper investigates how well new AI models can build websites based on instructions from regular people who aren't programmers, and finds they struggle with unclear requests.

What's the problem?

Currently, AI is getting good at writing code, but it assumes the instructions it receives are clear and well-defined. In reality, people who don't know how to code often give vague, repetitive, or even contradictory instructions when asking for a website to be built. This leads to the AI blindly trying to follow instructions that don't make sense, resulting in a broken or incorrect website – this is called 'blind execution'. Existing tests don't accurately reflect this real-world problem.

What's the solution?

The researchers created a new testing environment called InteractWeb-Bench. It simulates typical users with different levels of clarity in their requests, including users who are ambiguous, redundant, or inconsistent. The AI agent has four actions available: it can ask clarifying questions (Clarify), write the website code (Implement), check its work against the rendered result (Verify), and hand in the finished site for review (Submit). This lets the researchers see how the AI handles messy, real-world instructions and whether it can improve through interaction.
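
To make that interaction protocol concrete, here is a minimal sketch of one benchmark episode, assuming hypothetical agent, user, and sandbox interfaces. Only the four action names (Clarify, Implement, Verify, Submit) come from the paper; every class and method below is illustrative, not the benchmark's actual API:

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Action(Enum):
    """The unified action space described in the paper."""
    CLARIFY = auto()    # ask the simulated user a question
    IMPLEMENT = auto()  # write or edit website code
    VERIFY = auto()     # render the site and inspect the visual result
    SUBMIT = auto()     # finalize and hand off for evaluation


@dataclass
class Episode:
    instruction: str              # possibly ambiguous, redundant, or contradictory
    transcript: list = field(default_factory=list)


def run_episode(agent, user, sandbox, episode, max_steps=20):
    """Drive one interactive website-generation episode.

    `agent`, `user`, and `sandbox` are hypothetical interfaces
    standing in for the benchmark's real components.
    """
    for _ in range(max_steps):
        action, payload = agent.decide(episode.instruction, episode.transcript)
        if action is Action.CLARIFY:
            # The simulated user answers in persona (possibly still vaguely).
            answer = user.respond(payload)
            episode.transcript.append(("clarify", payload, answer))
        elif action is Action.IMPLEMENT:
            sandbox.write_code(payload)
            episode.transcript.append(("implement", payload))
        elif action is Action.VERIFY:
            # Visual feedback: render the current site and record a screenshot.
            screenshot = sandbox.render()
            episode.transcript.append(("verify", screenshot))
        elif action is Action.SUBMIT:
            return sandbox.snapshot()  # final website for review
    return sandbox.snapshot()  # step budget exhausted without submitting
```

An agent stuck in blind execution would, in this loop, jump straight to Implement and Submit without ever choosing Clarify or Verify; the benchmark is designed to expose exactly that pattern.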

Why it matters?

This research is important because it highlights a major limitation of current AI models. If AI is going to truly help non-programmers build websites, it needs to be able to understand and adapt to unclear instructions. This work points the way towards building AI that can actively engage with users to refine their ideas and create the website they actually want, rather than just failing silently when faced with ambiguity.

Abstract

With the advancement of multimodal large language models (MLLMs) and coding agents, website development has shifted from manual programming to agent-based project-level code synthesis. Existing benchmarks rely on idealized assumptions, namely well-structured, information-rich inputs and static execution settings. In contrast, real-world development is constrained by a critical bottleneck: the semantic misalignment between ambiguous, low-quality instructions from non-expert users and model understanding, which results in a failure mode that we term blind execution. To address this gap, we introduce InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert low-code user conditions. InteractWeb-Bench introduces four types of user agents and persona-driven instruction perturbations to systematically simulate diverse user behaviors, including ambiguity, redundancy, and contradiction, grounded in requirements-engineering defect taxonomies. We develop an interactive execution environment for agents, featuring a unified action space comprising Clarify, Implement, Verify, and Submit, enabling iterative intent refinement, code synthesis, and visual feedback-based validation. Extensive experiments and analysis reveal that frontier MLLM-based agents remain trapped in blind execution, exposing limitations in intent recognition and adaptive interaction.
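
To illustrate the persona-driven perturbations the abstract describes, a single perturbation step might look like the sketch below. Only the three defect categories (ambiguity, redundancy, contradiction) come from the paper; the rewrite directives, the `llm.rewrite` helper, and the example persona are invented for illustration:

```python
import random

# Defect categories named in the abstract; the directives are illustrative,
# not the paper's actual prompt templates.
PERTURBATIONS = {
    "ambiguity": "Leave key details (layout, colors, pages) unspecified or vague.",
    "redundancy": "Restate the same requirement several times in different words.",
    "contradiction": "Add a requirement that conflicts with an earlier one.",
}


def perturb_instruction(llm, clean_instruction, persona, defect=None):
    """Turn a clean website spec into a persona-flavored, defective request.

    `llm.rewrite` is a hypothetical helper that rewrites the instruction
    according to the persona and the chosen defect directive.
    """
    defect = defect or random.choice(list(PERTURBATIONS))
    return llm.rewrite(
        instruction=clean_instruction,
        persona=persona,  # e.g., "impatient small-business owner, no coding background"
        directive=PERTURBATIONS[defect],
    )
```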