
Distill Visual Chart Reasoning Ability from LLMs to MLLMs

Wei He, Zhiheng Xi, Wanxu Zhao, Xiaoran Fan, Yiwen Ding, Zifei Shan, Tao Gui, Qi Zhang, Xuanjing Huang

2024-10-25


Summary

This paper introduces Code-as-Intermediary Translation (CIT), a data synthesis method that improves the visual reasoning abilities of multimodal large language models (MLLMs) by automatically generating a dataset of charts and question-and-answer pairs.

What's the problem?

To answer complex questions about charts, MLLMs need to recognize the key information in visual data and then reason over it. However, collecting and annotating complex charts and questions by hand is expensive and time-consuming, and it is hard to guarantee the quality of the annotated answers, which limits how effectively models can learn these skills.

What's the solution?

The authors developed CIT, which uses chart-plotting code as an intermediary between the visual and textual modalities: a text-only LLM writes the code that draws each chart, so the same code, read as text, lets the LLM generate questions and answers about the rendered image. With this pipeline they automatically built ReachQA, a dataset of 3,000 reasoning-intensive charts and 20,000 question-and-answer pairs. Fine-tuning models on this dataset improves both their ability to recognize visual information and their reasoning skills. A minimal sketch of the idea follows.
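The sketch below is only an illustration of the code-as-intermediary idea, not the paper's released pipeline: the plotting-code string stands in for LLM output, and `call_llm` is a hypothetical helper.

```python
# Sketch of a CIT-style synthesis step (illustrative, hypothetical helper names).
# A text-only LLM writes chart-plotting code; executing the code renders the
# chart image, and the same code (as text) lets the LLM write Q&A pairs
# without ever "seeing" the image.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed

# Pretend this string was produced by an LLM prompted for a reasoning-intensive chart.
plotting_code = """
import matplotlib.pyplot as plt
years = [2019, 2020, 2021, 2022, 2023]
revenue = [1.2, 0.9, 1.8, 2.4, 3.1]
costs = [1.0, 1.1, 1.3, 1.6, 2.0]
fig, ax = plt.subplots()
ax.plot(years, revenue, marker="o", label="Revenue")
ax.plot(years, costs, marker="s", label="Costs")
ax.set_xlabel("Year")
ax.set_ylabel("Billions USD")
ax.legend()
fig.savefig("chart_0001.png", dpi=150)
"""

# 1) Execute the code to obtain the visual modality (the chart image).
exec(plotting_code)

# 2) Because the chart is fully described by its code, a text-only LLM can be
#    prompted with that code to produce Q&A pairs about the rendered image.
qa_prompt = (
    "Here is the code that produced a chart:\n"
    f"{plotting_code}\n"
    "Write a multi-step reasoning question about the chart and its answer."
)
# qa_pair = call_llm(qa_prompt)  # hypothetical LLM call, not part of the paper's release
```

The key design point is that the code is a lossless textual description of the chart, so a text-only LLM can both create the visual data and reason about it without any image input.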

Why it matters?

This research is important because it provides a cost-effective way to generate high-quality training data for teaching AI models how to understand and reason about visual information. By enhancing the capabilities of MLLMs, this work can lead to better performance in applications like data analysis, education, and any task that requires interpreting charts.

Abstract

Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs). Recent studies highlight that these abilities consist of two main parts: recognizing key information from visual inputs and conducting reasoning over it. Thus, a promising approach to enhance MLLMs is to construct relevant training data focusing on the two aspects. However, collecting and annotating complex charts and questions is costly and time-consuming, and ensuring the quality of annotated answers remains a challenge. In this paper, we propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs. The code serves as an intermediary that translates visual chart representations into textual representations, enabling LLMs to understand cross-modal information. Specifically, we employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs to enhance both recognition and reasoning abilities. Experiments show that when fine-tuned with our data, models not only perform well on chart-related benchmarks, but also demonstrate improved multimodal reasoning abilities on general mathematical benchmarks like MathVista. The code and dataset are publicly available at https://github.com/hewei2001/ReachQA.
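For readers who want a concrete picture of how such synthesized data could feed into fine-tuning, the sketch below packs one (chart, question, answer) triple into a visual-instruction-tuning record. The conversation schema is an assumption borrowed from common LLaVA-style formats, not the official ReachQA layout.

```python
# Illustrative packing of one synthesized example into a training record.
# Field names ("image", "conversations", "from", "value") are assumptions,
# not the official ReachQA schema.
import json

record = {
    "image": "chart_0001.png",  # rendered by the plotting code sketched above
    "conversations": [
        {"from": "human", "value": "<image>\nBy how much did revenue exceed costs in 2023?"},
        {"from": "gpt", "value": "Revenue was 3.1B and costs were 2.0B, so revenue exceeded costs by 1.1B."},
    ],
}

# Append the record to a JSONL training file, one example per line.
with open("reachqa_style_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```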