
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

Jonathan Roberts, Kai Han, Samuel Albanie

2024-08-22


Summary

This paper introduces GRAB, a new benchmark designed to evaluate how well large multimodal models can analyze graphs, such as plots of functions and data series, which are essential for understanding complex data relationships.

What's the problem?

Many existing benchmarks for evaluating multimodal models are not challenging enough to push these models to their limits. This is especially true for graph analysis, where models must accurately interpret data points and the relationships between them.

What's the solution?

The authors created GRAB, a fully synthetic dataset of 2,170 questions covering four tasks and 23 graph properties. Because every graph is generated programmatically, the questions are noise-free and the ground-truth answers are known exactly, letting researchers test their models in a controlled setting. The authors evaluated 20 multimodal models on GRAB and found that even the best-performing model scored only 21.7%, indicating that there is still much room for improvement.
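To make the construction concrete, here is a minimal sketch of how one such synthetic question could be generated. The function name `make_slope_question`, the parameter ranges, and the use of numpy and matplotlib are illustrative assumptions, not the authors' actual pipeline; the point is that the ground truth is computed from the rendered data itself, so no human annotation is needed.

```python
# Minimal sketch of a GRAB-style synthetic question (assumed pipeline,
# not the authors' code): render a plot, then compute the exact answer
# from the same data that was drawn.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt


def make_slope_question(seed: int, out_path: str = "question.png"):
    rng = np.random.default_rng(seed)
    slope = rng.uniform(-3.0, 3.0)
    intercept = rng.uniform(-5.0, 5.0)
    x = np.linspace(0.0, 10.0, 50)
    y = slope * x + intercept + rng.normal(0.0, 0.3, size=x.shape)

    fig, ax = plt.subplots(figsize=(4, 3))
    ax.scatter(x, y, s=10)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)

    # The ground truth is the best-fit slope of the plotted points,
    # computed exactly rather than annotated by a human.
    truth = float(np.polyfit(x, y, 1)[0])
    question = "To one decimal place, what is the slope of the best-fit line?"
    return out_path, question, round(truth, 1)


if __name__ == "__main__":
    path, question, answer = make_slope_question(seed=0)
    print(question, "->", answer)
```

Varying the property being asked about (mean, intercept, correlation, and so on) and the plot parameters yields large numbers of questions with exact labels, which is what allows the benchmark to be both large and noise-free.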

Why it matters?

This research is significant because it provides a rigorous way to assess and improve the performance of advanced models in graph analysis. By identifying where these models succeed and where they struggle, researchers can develop better tools for analyzing complex data, with applications in fields such as social network analysis, biology, and finance.

Abstract

Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area in which LMMs show potential is graph analysis, specifically the tasks an analyst might typically perform when interpreting figures, such as estimating the mean, intercepts, or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark fit for current and future frontier LMMs. Our benchmark is entirely synthetic, ensuring high-quality, noise-free questions. GRAB comprises 2,170 questions, covering four tasks and 23 graph properties. We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest-performing model attaining a score of just 21.7%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB to encourage progress in this important, growing domain.
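For readers wondering how free-text answers from 20 different models could be compared against exact ground truth, here is a hedged sketch of one possible scoring rule. The relative-tolerance matching below is an illustrative assumption; the paper's official scoring protocol may differ.

```python
import re

# Hedged sketch of a scoring rule for GRAB-style answers (an assumption
# for illustration; the benchmark's official protocol may differ).
def score_answer(prediction: str, truth: float, tol: float = 0.05) -> bool:
    """True if the model's free-text answer contains a number within a
    relative tolerance of the exact ground truth."""
    for token in re.findall(r"-?\d+(?:\.\d+)?", prediction):
        if abs(float(token) - truth) <= tol * max(abs(truth), 1.0):
            return True
    return False


print(score_answer("The slope is roughly 2.1.", 2.1))       # True
print(score_answer("I cannot tell from the figure.", 2.1))  # False
```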