VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations
Yupeng Xie, Zhiyang Zhang, Yifan Wu, Sirong Lu, Jiayi Zhang, Zhaoyang Yu, Jinlin Wang, Sirui Hong, Bang Liu, Chenglin Wu, Yuyu Luo
2025-10-29
Summary
This paper focuses on how well artificial intelligence, specifically large language models that can understand both text and images, can judge the quality of data visualizations like charts and graphs. It introduces a new way to test these AI models and then builds a better AI specifically for evaluating visualizations.
What's the problem?
Evaluating whether a data visualization is 'good' is tricky because it must be faithful to the data, easy to understand, and visually appealing all at once. AI models that are good at judging regular images turn out to be much weaker at judging visualizations, and there was no standard benchmark for measuring how well they perform on this task. In short, existing AI models' quality judgments of charts often fail to match those of human experts.
What's the solution?
The researchers created a large collection of visualizations, called VisJudge-Bench, that were rated by experts. They then tested several advanced AI models on this collection and found they weren't very accurate. To fix this, they developed a new AI model, called VisJudge, specifically designed to assess visualization quality. VisJudge performs much better than the general-purpose AI models, getting closer to the opinions of human experts.
Why it matters?
This work is important because as we create more and more data visualizations, we need ways to automatically check their quality. If AI can reliably evaluate visualizations, it can help designers create better charts and graphs, ensuring information is communicated clearly and effectively. This is crucial for making informed decisions based on data.
Abstract
Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully represented, clearly communicated, and aesthetically designed. However, evaluating visualization quality is challenging: unlike natural images, it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. Although multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world scenarios, covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Systematic testing on this benchmark reveals that even the most advanced MLLMs (such as GPT-5) still exhibit significant gaps compared to human experts in judgment, with a Mean Absolute Error (MAE) of 0.551 and a correlation with human ratings of only 0.429. To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment. Experimental results demonstrate that VisJudge significantly narrows the gap with human judgment, reducing the MAE to 0.442 (a 19.8% reduction) and increasing the consistency with human experts to 0.681 (a 58.7% improvement) compared to GPT-5. The benchmark is available at https://github.com/HKUSTDial/VisJudgeBench.