Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks
João Palmeiro, Diogo Duarte, Rita Costa, Pedro Bizarro
2025-10-08
Summary
This paper investigates how well current AI models, like those from OpenAI and Google, can understand and interpret information presented in scatterplots, a very common type of graph used to show relationships between data points.
What's the problem?
Currently, there aren't good standardized tests that specifically measure how well AI can handle tasks related to scatterplots. Existing benchmarks don't focus on the unique challenges of reading visual data in this format, so it's hard to know how reliable these AI models are when asked to analyze it. This makes it difficult to assess their ability to, for example, identify clusters of data points or spot unusual ones (outliers).
What's the solution?
The researchers created a large, artificial dataset of over 18,000 scatterplots with different designs and data patterns. They then tested OpenAI’s models and Google’s Gemini 2.5 Flash by giving them a few examples (called 'N-shot prompting') and asking them to perform tasks like counting clusters, finding the centers of those clusters, and identifying outliers. They measured how accurately the AI could complete these tasks.
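The paper's six data generators and 17 chart designs aren't reproduced here, but a minimal sketch of how one such synthetic scatterplot with annotated clusters and outliers might be produced (the cluster counts, spreads, and ranges below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(42)

def make_scatterplot_data(n_clusters=3, points_per_cluster=50, n_outliers=5):
    """Generate 2D points grouped into Gaussian clusters, plus uniform outliers.

    Returns the points, the cluster centers, and the outlier coordinates,
    mirroring the kinds of ground-truth annotations the benchmark relies on
    (cluster counts, cluster centers, outlier positions).
    """
    centers = rng.uniform(-10, 10, size=(n_clusters, 2))
    clusters = [rng.normal(loc=c, scale=1.0, size=(points_per_cluster, 2))
                for c in centers]
    outliers = rng.uniform(-15, 15, size=(n_outliers, 2))
    points = np.vstack(clusters + [outliers])
    return points, centers, outliers

points, centers, outliers = make_scatterplot_data()
print(points.shape)  # (155, 2): 3 clusters x 50 points + 5 outliers
```

Rendering such data under varying chart designs (aspect ratios, color schemes) and keeping the generation parameters as labels is what makes the benchmark's tasks automatically gradable.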
Why it matters?
This work is important because it provides a way to objectively evaluate AI's ability to understand scatterplots. The results show that while AI can be quite accurate at counting clusters and finding outliers, it struggles to pinpoint exactly where those clusters are located. The results also suggest that the way a scatterplot is designed, particularly its aspect ratio and coloring, can affect how well the AI performs, meaning we need to be careful about how we present data to these models.
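The distinction between counting and localization can be made concrete with simple metrics. This is a hypothetical sketch, not the paper's evaluation code: count accuracy is exact-match per chart, and localization precision/recall greedily matches predicted coordinates to ground-truth coordinates within a distance tolerance (the tolerance value here is an assumption):

```python
def count_accuracy(predicted_counts, true_counts):
    """Fraction of charts where the predicted cluster count is exactly right."""
    correct = sum(p == t for p, t in zip(predicted_counts, true_counts))
    return correct / len(true_counts)

def point_precision_recall(predicted, truth, tol=1.0):
    """Greedily match predicted points to ground-truth points within a
    distance tolerance; return (precision, recall)."""
    unmatched = list(truth)
    matches = 0
    for px, py in predicted:
        for t in unmatched:
            if ((px - t[0]) ** 2 + (py - t[1]) ** 2) ** 0.5 <= tol:
                unmatched.remove(t)  # each truth point can match only once
                matches += 1
                break
    precision = matches / len(predicted) if predicted else 0.0
    recall = matches / len(truth) if truth else 0.0
    return precision, recall

print(count_accuracy([3, 2, 4], [3, 3, 4]))  # 2 of 3 charts correct
prec, rec = point_precision_recall([(0, 0), (5, 5)], [(0.2, 0.1), (9, 9)])
print(prec, rec)  # only the first prediction matches: 0.5 0.5
```

Under metrics like these, a model can score 90%+ on counting while its precision and recall for coordinates stay near 50%, which is the pattern the paper reports.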
Abstract
AI models are increasingly used for data analysis and visualization, yet benchmarks rarely address scatterplot-specific tasks, limiting insight into performance. To address this gap for one of the most common chart types, we introduce a synthetic, annotated dataset of over 18,000 scatterplots from six data generators and 17 chart designs, and a benchmark based on it. We evaluate proprietary models from OpenAI and Google using N-shot prompting on five distinct tasks derived from annotations of cluster bounding boxes, their center coordinates, and outlier coordinates. OpenAI models and Gemini 2.5 Flash, especially when prompted with examples, are viable options for counting clusters and, in Flash's case, outliers (90%+ Accuracy). However, the results for localization-related tasks are unsatisfactory: Precision and Recall are near or below 50%, except for Flash in outlier identification (65.01%). Furthermore, the impact of chart design on performance appears to be a secondary factor, but it is advisable to avoid scatterplots with wide aspect ratios (16:9 and 21:9) or those colored randomly. Supplementary materials are available at https://github.com/feedzai/biy-paper.