ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering
Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso
2025-10-08
Summary
This paper introduces ChartAgent, a new system designed to help computers understand charts and answer questions about them, even when the charts don't have helpful labels or text already attached.
What's the problem?
Current AI models that can answer questions about charts struggle when those charts aren't prepped with a lot of text explanations. They often rely on finding keywords instead of actually *reading* the visual information in the chart, like the height of bars or the size of pie slices. This means they have trouble with charts that require careful visual analysis and calculations.
What's the solution?
ChartAgent works differently by mimicking how humans understand charts. Instead of just looking at the whole chart at once, it breaks down a question into smaller visual tasks. It then 'interacts' with the chart image itself, like drawing circles around important parts, cropping out specific sections (like a single slice of a pie chart), and identifying axes. It uses a set of tools specifically designed for charts to complete these tasks step-by-step, building up to an answer.
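The step-by-step idea can be sketched as a tiny tool-using loop. This is a minimal, hypothetical illustration, not the paper's actual implementation: the chart representation, the tool names (`localize_axes`, `crop_bar`), and the fixed plan are all assumptions made up for this example.

```python
# Hypothetical sketch of ChartAgent-style reasoning: decompose a question
# into visual subtasks and execute each with a chart-specific tool.
# The "chart" here is a toy stand-in for a real chart image.

chart = {
    "bars": {"A": 120, "B": 180, "C": 90},  # measured bar heights, in pixels
    "y_axis": {"pixels_per_unit": 0.6},     # axis calibration (assumed known)
}

def localize_axes(chart):
    """Subtask: recover the y-axis scale so pixel heights map to data values."""
    return chart["y_axis"]["pixels_per_unit"]

def crop_bar(chart, label):
    """Subtask: isolate one bar's region and measure its pixel height."""
    return chart["bars"][label]

def answer_query(chart, label):
    """Answer 'what is the value of bar <label>?' by running a plan of
    visual subtasks, one tool call per step."""
    plan = [("localize_axes", None), ("crop_bar", label), ("convert", None)]
    state = {}
    for step, arg in plan:
        if step == "localize_axes":
            state["px_per_unit"] = localize_axes(chart)
        elif step == "crop_bar":
            state["px_height"] = crop_bar(chart, arg)
        elif step == "convert":
            # Turn the pixel measurement into a data value using the axis scale.
            state["value"] = state["px_height"] / state["px_per_unit"]
    return state["value"]

print(answer_query(chart, "B"))  # 180 px / 0.6 px-per-unit = 300.0
```

In the real system the plan would come from a multimodal LLM and the tools would operate on the actual chart image (cropping pixels, detecting axes), but the control flow — decompose, call a tool, fold the result back into the reasoning state — is the same shape.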
Why it matters?
This research is important because it shows a significant improvement in chart understanding, especially for complex charts without a lot of text. It’s a step towards AI that can truly *see* and interpret visual data, not just search for keywords. Plus, the ChartAgent system can be added to existing AI models to make them better at chart-based questions, and it works well with different types of charts and different levels of difficulty.
Abstract
Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, i.e., those requiring precise visual interpretation rather than reliance on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.