MAVIS: Mathematical Visual Instruction Tuning
Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, Peng Gao, Hongsheng Li
2024-07-13

Summary
This paper introduces MAVIS, a mathematical visual instruction tuning paradigm designed to improve how multi-modal large language models (MLLMs) solve mathematical problems that involve visual elements like diagrams and charts. It focuses on strengthening the models' abilities to encode, understand, and reason about these visuals.
What's the problem?
While MLLMs are good at handling general tasks that involve both text and images, they struggle with math-related problems that require interpreting diagrams. This includes difficulties in visual encoding of math diagrams, aligning diagrams with language, and performing mathematical reasoning. These shortcomings highlight the need for better training data and methods specifically tailored for visual mathematics.
What's the solution?
MAVIS addresses these issues through three progressive training stages. First, it curates MAVIS-Caption, a dataset of 558,000 diagram-caption pairs, and uses it to fine-tune a math-specific vision encoder (CLIP-Math) with contrastive learning, improving how the model encodes diagrams. Next, it aligns this visual encoder with a large language model (LLM) through a projection layer trained on the same caption data. Finally, it instruction-tunes the model on MAVIS-Instruct, a collection of 900,000 annotated visual math problems with complete chain-of-thought rationales, so the model learns to reason through problems step by step while focusing on the visual aspects.
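The first stage uses contrastive learning on diagram-caption pairs, in the style of CLIP. A minimal sketch of such a symmetric contrastive (InfoNCE) loss is below; it is an illustration of the general technique, not code from the MAVIS release, and the batch size, embedding dimension, and temperature are assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Matching diagram/caption pairs sit on the diagonal of the
    similarity matrix; every other entry in the batch is a negative.
    Temperature 0.07 is a common CLIP default, assumed here.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    logits = image_embs @ text_embs.T / temperature  # (B, B)
    labels = np.arange(len(logits))  # the i-th diagram matches the i-th caption

    def cross_entropy(logits, labels):
        # Row-wise log-softmax, then pick the logit of the true pairing.
        shifted = logits - logits.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # Average the diagram->caption and caption->diagram directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Training pushes matched diagram-caption pairs together and mismatched pairs apart, which is what tailors the encoder's features to mathematical diagrams.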
Why it matters?
This research is important because it fills a gap in how AI models handle mathematical visuals, making them more effective at solving real-world math problems. By improving these capabilities, MAVIS can enhance educational tools, assist in complex problem-solving tasks, and contribute to advancements in AI applications that require understanding of both text and visuals.
Abstract
Multi-modal Large Language Models (MLLMs) have recently emerged as a significant focus in academia and industry. Despite their proficiency in general multi-modal scenarios, the mathematical problem-solving capabilities in visual contexts remain insufficiently explored. We identify three key areas within MLLMs that need to be improved: visual encoding of math diagrams, diagram-language alignment, and mathematical reasoning skills. This draws forth an urgent demand for large-scale, high-quality data and training pipelines in visual mathematics. In this paper, we propose MAVIS, the first MAthematical VISual instruction tuning paradigm for MLLMs, involving a series of mathematical visual datasets and specialized MLLMs. Targeting the three issues, MAVIS contains three progressive training stages from scratch. First, we curate MAVIS-Caption, consisting of 558K diagram-caption pairs, to fine-tune a math-specific vision encoder (CLIP-Math) through contrastive learning, tailored for improved diagram visual encoding. Second, we utilize MAVIS-Caption to align the CLIP-Math with a large language model (LLM) by a projection layer, enhancing vision-language alignment in mathematical domains. Third, we introduce MAVIS-Instruct, including 900K meticulously collected and annotated visual math problems, which is adopted to finally instruct-tune the MLLM for robust mathematical reasoning skills. In MAVIS-Instruct, we incorporate complete chain-of-thought (CoT) rationales for each problem, and minimize textual redundancy, thereby concentrating the model towards the visual elements. Data and Models are released at https://github.com/ZrrSkywalker/MAVIS
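The second stage's diagram-language alignment hinges on a projection layer that maps vision-encoder features into the LLM's embedding space, as is standard in LLaVA-style MLLMs. A minimal sketch follows; the dimensions (1024 for the vision encoder, 4096 for the LLM) and the single-linear-layer design are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

class ProjectionLayer:
    """Hypothetical linear projector from vision-encoder patch features
    to the LLM's token-embedding space. Dimensions are assumptions."""

    def __init__(self, vision_dim=1024, llm_dim=4096, seed=0):
        rng = np.random.default_rng(seed)
        # Scaled random init; in practice this layer is learned during alignment.
        self.W = rng.normal(scale=vision_dim ** -0.5, size=(vision_dim, llm_dim))
        self.b = np.zeros(llm_dim)

    def __call__(self, patch_features):
        # (num_patches, vision_dim) -> (num_patches, llm_dim).
        # The projected features are then consumed by the LLM alongside
        # the text token embeddings.
        return patch_features @ self.W + self.b
```

During alignment, only this projector (and optionally the encoder) is trained on diagram-caption pairs, teaching the LLM to read the visual tokens before the instruction-tuning stage.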