
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

Ryo Kamoi, Yusen Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, Rui Zhang

2024-12-03

Summary

This paper introduces VisOnlyQA, a new dataset designed to evaluate how well large vision language models (LVLMs) perceive geometric and numerical information in images, specifically in scientific figures.

What's the problem?

Large vision language models often misread the visual content of images, and these perception errors carry over into the answers they give. Few benchmarks specifically test whether a model can perceive geometric and numerical details in a figure without also requiring reasoning or outside knowledge, so it is hard to know where these models actually fall short in real-world use.

What's the solution?

To address this, the researchers created VisOnlyQA, an evaluation set of 1,200 multiple-choice questions spanning 12 tasks on four categories of scientific figures. The questions ask for geometric and numerical information that can be read directly from the figure, so they isolate visual perception from reasoning and external knowledge. The dataset also comes with 70,000 synthetic training instances. In the experiments, 20 existing LVLMs, including GPT-4o and Gemini 1.5 Pro, performed poorly on these tasks even though human accuracy is nearly perfect; fine-tuning on the synthetic data helped, but only for certain tasks and certain models.
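As a rough illustration of what "perception-only" multiple-choice evaluation looks like in practice, the sketch below scores a model's answers by exact match against the gold option label. The `MCQuestion` fields and the `ask_model` call are placeholders for illustration, not the dataset's actual schema or the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass
class MCQuestion:
    """One multiple-choice item: an image, a question, candidate options,
    and the gold option label. Field names are illustrative only."""
    image_path: str
    question: str
    options: list[str]   # e.g. ["(a) 30 degrees", "(b) 45 degrees", "(c) 60 degrees"]
    answer: str          # gold option label, e.g. "(b)"


def accuracy(predictions: list[str], questions: list[MCQuestion]) -> float:
    """Exact-match accuracy of predicted option labels against gold labels."""
    correct = sum(
        pred.strip().lower() == q.answer.strip().lower()
        for pred, q in zip(predictions, questions)
    )
    return correct / len(questions)


# Usage: `ask_model` stands in for any LVLM call (GPT-4o, Gemini 1.5 Pro, an
# open model, ...) that returns the option label chosen for an image/question pair.
# predictions = [ask_model(q.image_path, q.question, q.options) for q in questions]
# print(f"Visual-perception accuracy: {accuracy(predictions, questions):.1%}")
```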

Why it matters?

This research is important because it highlights the limitations of current large vision language models in understanding visual information. By providing a specialized dataset like VisOnlyQA, researchers can better evaluate and improve these models' capabilities, which is crucial for applications in fields like science, education, and technology where accurate image interpretation is essential.

Abstract

Errors in understanding visual information in images (i.e., visual perception errors) remain a major source of mistakes in Large Vision Language Models (LVLMs). While further analysis is essential, there is a deficiency in datasets for evaluating the visual perception of LVLMs. In this work, we introduce VisOnlyQA, a new dataset designed to directly evaluate the visual perception capabilities of LVLMs on questions about geometric and numerical information in scientific figures. Our dataset enables us to analyze the visual perception of LVLMs for fine-grained visual information, independent of other capabilities such as reasoning. The evaluation set of VisOnlyQA includes 1,200 multiple-choice questions in 12 tasks on four categories of figures. We also provide synthetic training data consisting of 70k instances. Our experiments on VisOnlyQA highlight the following findings: (i) 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, work poorly on the visual perception tasks in VisOnlyQA, while human performance is nearly perfect. (ii) Fine-tuning on synthetic training data demonstrates the potential for enhancing the visual perception of LVLMs, but observed improvements are limited to certain tasks and specific models. (iii) Stronger language models improve the visual perception of LVLMs. In summary, our experiments suggest that both training data and model architectures should be improved to enhance the visual perception capabilities of LVLMs. The datasets, code, and model responses are provided at https://github.com/psunlpgroup/VisOnlyQA.
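For readers who want to try the benchmark themselves, below is a minimal sketch of loading the evaluation data with the Hugging Face `datasets` library. The dataset identifier and split name are assumptions made for illustration; the GitHub repository linked above documents the actual download instructions.

```python
from datasets import load_dataset

# Assumed Hugging Face dataset ID and split name -- check the GitHub
# repository above for the actual identifiers and usage instructions.
DATASET_ID = "ryokamoi/VisOnlyQA_Eval_Real"   # assumption, not confirmed
eval_set = load_dataset(DATASET_ID, split="train")

print(eval_set.column_names)   # inspect the schema (image, question, options, answer, task, ...)
print(len(eval_set))           # the paper's evaluation set has 1,200 questions across 12 tasks
```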