Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, Roy Ka-Wei Lee
2024-06-27

Summary
This paper introduces Math-LLaVA, a multimodal model designed to improve how large language models (LLMs) understand and solve mathematical problems that involve images as well as text. It also presents MathV360K, a dataset of diverse images and questions built to strengthen the model's ability to reason about math in different visual contexts.
What's the problem?
While LLMs are good at solving text-based math problems, they often struggle when a problem depends on visual information, like diagrams and equations shown in images. Existing open-source datasets that pair images with questions contain only a small number of question-answer pairs per image, so models trained on them cannot fully exploit the visual information available. This gap makes it hard for multimodal LLMs to use images effectively to improve their mathematical reasoning skills.
What's the solution?
To address this issue, the authors built the MathV360K dataset: 40,000 high-quality images with question-answer pairs selected from 24 existing datasets, plus 320,000 newly synthesized question-answer pairs that broaden the range of mathematical problems covered. They then fine-tuned LLaVA-1.5 on MathV360K to produce Math-LLaVA. This fine-tuning yields a 19-point improvement over LLaVA-1.5 on MathVista's minitest split, reaching performance comparable to GPT-4V, and it also improves generalization on the MMMU benchmark.
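To make the data-construction step more concrete, the sketch below shows one way the collected and synthesized question-answer pairs might be merged into a single LLaVA-style instruction-tuning file. This is a minimal illustration under assumed conventions, not the authors' actual pipeline: the file names (seed_qa.json, synthetic_qa.json, mathv360k_train.json) and the record field layout are hypothetical placeholders.

```python
import json
import random

def to_llava_record(image_path, question, answer, record_id):
    """Wrap one image/question/answer triple in a LLaVA-style
    conversation record (hypothetical field layout)."""
    return {
        "id": record_id,
        "image": image_path,
        "conversations": [
            {"from": "human", "value": f"<image>\n{question}"},
            {"from": "gpt", "value": answer},
        ],
    }

def build_training_file(seed_path, synthetic_path, out_path, seed=0):
    """Merge seed QA pairs with synthesized ones, shuffle, and write
    a single instruction-tuning JSON file."""
    with open(seed_path) as f:
        seed_pairs = json.load(f)        # e.g. the 40K image-grounded QA pairs
    with open(synthetic_path) as f:
        synthetic_pairs = json.load(f)   # e.g. the 320K synthesized QA pairs

    records = [
        to_llava_record(p["image"], p["question"], p["answer"], i)
        for i, p in enumerate(seed_pairs + synthetic_pairs)
    ]
    random.Random(seed).shuffle(records)

    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)

if __name__ == "__main__":
    build_training_file("seed_qa.json", "synthetic_qa.json",
                        "mathv360k_train.json")
```

A file assembled along these lines could then be fed to a standard LLaVA-1.5 visual instruction-tuning script, which is the role MathV360K plays for Math-LLaVA.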
Why it matters?
This research is important because it demonstrates how combining diverse datasets can enhance the capabilities of AI models in understanding complex mathematical concepts. By improving the way LLMs process both text and visual information, Math-LLaVA can help advance educational tools, tutoring systems, and applications that require sophisticated mathematical reasoning.
Abstract
Large language models (LLMs) have demonstrated impressive reasoning capabilities, particularly in textual mathematical problem-solving. However, existing open-source image instruction fine-tuning datasets, containing limited question-answer pairs per image, do not fully exploit visual information to enhance the multimodal mathematical reasoning capabilities of Multimodal LLMs (MLLMs). To bridge this gap, we address the lack of high-quality, diverse multimodal mathematical datasets by collecting 40K high-quality images with question-answer pairs from 24 existing datasets and synthesizing 320K new pairs, creating the MathV360K dataset, which enhances both the breadth and depth of multimodal mathematical questions. We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K. This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5, achieving a 19-point increase and comparable performance to GPT-4V on MathVista's minitest split. Furthermore, Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark. Our research highlights the importance of dataset diversity and synthesis in advancing MLLMs' mathematical reasoning abilities. The code and data are available at: https://github.com/HZQ950419/Math-LLaVA.