
MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces

Reuben A. Luera, Ryan Rossi, Franck Dernoncourt, Samyadeep Basu, Sungchul Kim, Subhojyoti Mukherjee, Puneet Mathur, Ruiyi Zhang, Jihyung Kil, Nedim Lipka, Seunghyun Yoon, Jiuxiang Gu, Zichao Wang, Cindy Xiong Bearfield, Branislav Kveton

2025-10-15

Summary

This paper explores whether multimodal large language models (MLLMs), AI models that can understand both text and images, can help designers evaluate user interfaces early in the design process, before formal user testing.

What's the problem?

Designing good user interfaces usually requires extensive user testing, which is expensive and time-consuming, especially during early exploration when designers are still comparing many candidate ideas. Designers need a quick way to get feedback on their designs *before* investing heavily in formal user studies. Prior AI work focused on predicting user behavior in narrow domains, such as clicks or conversions on e-commerce sites; this paper instead asks whether AI can judge more subjective qualities of a design, like how visually appealing or easy to use it is, across many different types of interfaces.

What's the solution?

The researchers compared the judgments of three multimodal AI models (GPT-4o, Claude, and Llama) against actual human preferences collected on a crowdsourcing platform. Each model was shown the same 30 user interface designs that people had already rated. The goal was to measure how well the models could predict what humans liked and disliked about each design. The models were asked both to rate individual interfaces and to compare them head-to-head, and their answers were checked against human judgments on specific qualities like clarity and aesthetics.
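To make the setup concrete, here is a minimal sketch of how this kind of benchmark could be run for one model. It assumes the OpenAI Python SDK for GPT-4o and SciPy for rank correlation; the file names, rating scale, factor list, and prompt wording are illustrative assumptions, not the paper's actual protocol.

```python
import base64
import random

from openai import OpenAI          # official OpenAI Python SDK
from scipy.stats import spearmanr  # rank correlation as one alignment measure

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative UI quality factors; the paper's exact factor list may differ.
DIMENSIONS = ["aesthetics", "clarity", "usability"]

def rate_ui(image_path: str, dimension: str) -> float:
    """Ask GPT-4o to rate one UI screenshot on one dimension (1-7 scale)."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Rate the {dimension} of this user interface on a "
                         f"scale from 1 (poor) to 7 (excellent). "
                         f"Reply with a single number only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return float(response.choices[0].message.content.strip())

# Hypothetical screenshot files for the 30 benchmarked interfaces.
ui_paths = [f"ui_{i:02d}.png" for i in range(30)]

# Stand-in for the crowdsourced mean human rating per UI and dimension;
# in a real run these would come from the human study, not random numbers.
human_scores = {dim: [random.uniform(1, 7) for _ in ui_paths]
                for dim in DIMENSIONS}

for dim in DIMENSIONS:
    model_scores = [rate_ui(path, dim) for path in ui_paths]
    rho, p_value = spearmanr(model_scores, human_scores[dim])
    print(f"{dim}: Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

Per-dimension rank correlation is only one possible alignment measure; the same loop could be adapted to pairwise comparisons by prompting the model with two screenshots and checking whether it picks the design humans preferred.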

Why it matters?

This research is important because if AI can reliably give useful feedback on designs early on, it could save designers a lot of time and money. It could help them quickly identify the most promising ideas and avoid wasting effort on designs that users probably won't like. While the AI isn't perfect and doesn't always agree with humans, it shows potential as a tool to *supplement* traditional user research, not replace it.

Abstract

In an ideal design pipeline, user interface (UI) design is intertwined with user research to validate decisions, yet studies are often resource-constrained during early exploration. Recent advances in multimodal large language models (MLLMs) offer a promising opportunity to act as early evaluators, helping designers narrow options before formal testing. Unlike prior work that emphasizes user behavior in narrow domains such as e-commerce with metrics like clicks or conversions, we focus on subjective user evaluations across varied interfaces. We investigate whether MLLMs can mimic human preferences when evaluating individual UIs and comparing them. Using data from a crowdsourcing platform, we benchmark GPT-4o, Claude, and Llama across 30 interfaces and examine alignment with human judgments on multiple UI factors. Our results show that MLLMs approximate human preferences on some dimensions but diverge on others, underscoring both their potential and limitations in supplementing early UX research.