ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding

Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, Wiggin Zhou, Bo Zhou

2025-10-14

Summary

This paper introduces ReLook, a new system designed to help AI models get better at creating code for websites and apps, specifically the part users *see* and interact with – the 'front-end'. It focuses on making sure the code actually *works* visually, not just that it's technically correct.

What's the problem?

Large language models are really good at writing the behind-the-scenes code that makes things function, but they struggle with front-end development. Unlike other types of code, front-end code is judged by how it *looks* on a screen and how well users can interact with it. It's hard for an AI to know if the website looks right just by reading the code; it needs to 'see' the result. Existing methods don't reliably ensure the generated code actually renders a working, visually correct interface.

What's the solution?

The researchers created ReLook, a system in which an AI agent repeatedly generates code and then receives feedback based on screenshots of what that code actually renders – a 'generate-diagnose-refine' loop. A powerful multimodal AI model looks at each screenshot, assigns a score, and suggests specific, visually grounded changes; code that fails to render at all receives a score of zero, which forces the agent to produce working pages. Importantly, the system only accepts revisions that improve the score (a rule the authors call Forced Optimization), so the code never gets worse from one try to the next. During actual use, the external 'checker' is removed to keep things fast, and the AI instead refines its code through a lightweight self-edit cycle.
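The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `generate`, `render`, and `critic_score` are hypothetical stand-ins for the policy LLM, a browser renderer, and the multimodal critic, and the acceptance rule mirrors the paper's "only improving revisions are kept" idea.

```python
def forced_optimization_loop(generate, render, critic_score, max_rounds=4):
    """Sketch of a generate-diagnose-refine loop with a strict
    acceptance rule: a revision is kept only if the critic score improves.
    A render of None models an invalid/broken page, which scores zero."""
    best_code = generate(feedback=None)
    shot = render(best_code)
    # Zero-reward rule: an invalid render scores 0, anchoring renderability.
    best_score = critic_score(shot) if shot is not None else 0.0
    for _ in range(max_rounds):
        candidate = generate(feedback=shot)   # revise using visual feedback
        shot_c = render(candidate)
        score_c = critic_score(shot_c) if shot_c is not None else 0.0
        if score_c > best_score:  # accept only strict improvements
            best_code, best_score, shot = candidate, score_c, shot_c
    return best_code, best_score


# Toy demo with canned candidates and scores (purely illustrative):
codes = iter(["v1", "broken", "v2", "v3", "v2"])
scores = {"v1": 0.3, "v2": 0.6, "v3": 0.9}
generate = lambda feedback=None: next(codes)
render = lambda code: None if code == "broken" else code
critic_score = lambda shot: scores[shot]

best, score = forced_optimization_loop(generate, render, critic_score)
# keeps "v3" (score 0.9); the broken render and the late "v2" are rejected
```

Because rejected revisions are simply discarded, the kept trajectory is monotonically improving by construction, which is what prevents the behavioral collapse the paper describes.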

Why it matters?

This work is important because it significantly improves the ability of AI to create functional and visually appealing website and app interfaces. By focusing on what the user sees and using visual feedback, ReLook overcomes a major limitation of current AI code generation tools, potentially making it easier to build websites and apps with AI assistance.

Abstract

While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement learning framework that empowers an agent to close a robust generate-diagnose-refine loop by invoking a multimodal LLM (MLLM) as a tool. During training, the agent uses the MLLM-in-the-loop both as a visual critic, scoring code with screenshots, and as a source of actionable, vision-grounded feedback; a strict zero-reward rule for invalid renders anchors renderability and prevents reward hacking. To prevent behavioral collapse, we introduce Forced Optimization, a strict acceptance rule that admits only improving revisions, yielding monotonically better trajectories. At inference, we decouple the critic and run a lightweight, critic-free self-edit cycle, keeping latency comparable to base decoding while retaining most of the gains. Across three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation, highlighting the benefits of agentic perception, visual rewards, and training-inference decoupling.