START: Spatial and Textual Learning for Chart Understanding

Zhuoming Liu, Xiaofeng Gao, Feiyang Niu, Qiaozi Gao, Liu Liu, Robinson Piramuthu

2025-12-16

Summary

This paper introduces a new approach, called START, to help AI models better understand charts and graphs, which is important for tasks like analyzing research papers and reports.

What's the problem?

Current AI models struggle with charts because charts have two key parts: how things are visually arranged (the layout) and the actual data they represent (the numbers and labels). Most models focus on one or the other, but understanding *both* is crucial for truly understanding the chart's meaning. Existing datasets and ways to test chart understanding also don't fully capture how well a model grasps the visual structure of a chart.

What's the solution?

The researchers developed START, which stands for Spatial and Textual learning for chART understanding. It trains the model in two main ways: first, it teaches the AI to connect specific parts of the chart image to the data they represent (chart-element grounding); second, it has the AI translate the chart image into executable code that reproduces the chart, which forces it to recover the underlying data. To support this training, the researchers also created a new dataset, START-Dataset, and a new benchmark, CS-Bench, to better test how well AI models understand chart layouts. Their data pipeline uses an existing multimodal model to convert real chart images into code, then uses a large language model to refine that code and extract the positions of chart elements.
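To make the two ideas concrete, here is a minimal, hypothetical sketch of how a chart's "textual" side (executable chart code) can yield its "spatial" side (element positions). All names and layout rules below are illustrative assumptions for this summary, not the paper's actual pipeline or API:

```python
# Hypothetical illustration: a bar chart represented as executable code
# (textual property), from which per-element bounding boxes (spatial
# property) can be derived -- the kind of pairing START-style grounding
# and chart-to-code training rely on.

def bar_chart_code(categories, values):
    """Return a 'chart code' spec that a renderer could execute."""
    return {"type": "bar", "categories": categories, "values": values}

def ground_elements(chart, width=100.0, height=50.0):
    """Compute a bounding box (x0, y0, x1, y1) for each bar on a canvas."""
    n = len(chart["values"])
    vmax = max(chart["values"])
    slot = width / n                     # horizontal slot per bar
    boxes = {}
    for i, (cat, v) in enumerate(zip(chart["categories"], chart["values"])):
        x0 = i * slot + slot * 0.1       # 10% padding on each side
        x1 = (i + 1) * slot - slot * 0.1
        h = height * v / vmax            # bar height scaled to tallest value
        boxes[cat] = (round(x0, 1), 0.0, round(x1, 1), round(h, 1))
    return boxes

chart = bar_chart_code(["A", "B"], [2, 4])
boxes = ground_elements(chart)
# Each bar is now grounded: its label and value are tied to a region
# of the image, e.g. boxes["A"] == (5.0, 0.0, 45.0, 25.0)
```

Because the positions are derived from the code itself, this kind of pipeline can produce grounding annotations automatically, without manual labeling of chart images.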

Why it matters?

This work is important because it significantly improves AI's ability to understand charts, leading to better performance on tasks that require analyzing visual information. The new dataset and benchmark will also help researchers develop even more advanced AI models in the future, making AI more useful for real-world applications like scientific research and data analysis.

Abstract

Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property) -- grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START, the Spatial and Textual learning for chART understanding. Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM's understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset generated with a novel data-generation pipeline that first leverages an MLLM to translate real chart images into executable chart code, recovering the underlying data representation while preserving the visual distribution of real-world charts. We then evolve the code with a Large Language Model (LLM) to ascertain the positions of chart elements that capture the chart's visual structure, addressing challenges that existing methods cannot handle. To evaluate a model's ability to understand chart spatial structures, we propose the Chart Spatial understanding Benchmark (CS-Bench), filling a critical gap in comprehensive chart understanding evaluation. Leveraging spatial and textual learning, START delivers consistent gains across model sizes and benchmarks over the base models and surpasses prior state-of-the-art by a clear margin. Code, data and models will be publicly available.