CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen
2024-06-27

Summary
This paper introduces CharXiv, an evaluation suite designed to test how well Multimodal Large Language Models (MLLMs) understand charts drawn from real scientific papers. It highlights the weaknesses of current models in interpreting complex visual data and aims to provide a more accurate measure of their capabilities.
What's the problem?
Many existing datasets used to evaluate MLLMs focus on simple, homogeneous charts with template-based questions. This gives a false sense of how well these models perform, because real-world charts are far more complex. When tested with slightly different charts or questions, model performance can drop by up to 34.5%, revealing gaps in their understanding.
What's the solution?
CharXiv addresses this issue by providing a comprehensive set of 2,323 diverse and challenging charts sourced from arXiv papers. It includes two types of questions: descriptive questions that ask about basic chart elements, and reasoning questions that require synthesizing information across multiple visual elements in a chart. All charts and questions were handpicked, curated, and verified by human experts to ensure quality. Testing a range of MLLMs on the reasoning questions shows that even the strongest proprietary model (GPT-4o, at 47.1% accuracy) and the strongest open-source model (InternVL Chat V1.5, at 29.2%) fall far short of human performance, which reaches 80.5%.
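To make the evaluation setup concrete, below is a minimal sketch of how one might score a model separately on the two question types. The field names (chart_image, question, answer, question_type) and the model.answer and grade_answer helpers are hypothetical placeholders for illustration, not the benchmark's actual data format or API; see the project page for the official data and evaluation code.

```python
from collections import defaultdict

def evaluate(model, examples, grade_answer):
    """Compute per-question-type accuracy for a chart-understanding model.

    Assumptions (not CharXiv's real schema):
      - `examples` is an iterable of dicts with hypothetical fields
        'chart_image', 'question', 'answer', and 'question_type'
        ('descriptive' or 'reasoning').
      - `model.answer(image, question)` returns the model's response string.
      - `grade_answer(response, reference)` returns True if the response
        matches the reference answer.
    """
    correct = defaultdict(int)
    total = defaultdict(int)

    for ex in examples:
        qtype = ex["question_type"]  # 'descriptive' or 'reasoning'
        response = model.answer(ex["chart_image"], ex["question"])
        total[qtype] += 1
        if grade_answer(response, ex["answer"]):
            correct[qtype] += 1

    # Report accuracy separately, since descriptive and reasoning
    # questions probe different capabilities.
    return {qtype: correct[qtype] / total[qtype] for qtype in total}
```

Reporting the two scores separately mirrors the paper's framing: a model can do reasonably well at reading off basic chart elements while still failing on questions that require combining information across visual elements.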
Why it matters?
This research is important because it reveals significant shortcomings in how AI models understand charts, which are crucial for tasks like analyzing scientific data or financial reports. By providing a more realistic evaluation framework, CharXiv can help guide future improvements in MLLMs, making them better at interpreting complex visual information and ultimately enhancing their usefulness in real-world applications.
Abstract
Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can degrade performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: https://charxiv.github.io/