
MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

Weikang Shi, Aldrich Yu, Rongyao Fang, Houxing Ren, Ke Wang, Aojun Zhou, Changyao Tian, Xinyu Fu, Yuxuan Hu, Zimu Lu, Linjiang Huang, Si Liu, Rui Liu, Hongsheng Li

2025-10-17


Summary

This paper introduces MathCanvas, a system designed to help Large Language Models (LLMs) solve math problems that typically require diagrams, such as geometry. It gives these models the ability not just to understand text, but also to create and use visual aids to think through problems step by step.

What's the problem?

LLMs are very good at working with text and reasoning about things described in words, but they struggle with math problems that demand visual thinking. Domains like geometry rely heavily on diagrams, yet current AI systems either depend on rigid external tools to produce them or fail to generate the right diagrams at the right moments to actually solve the problem. In short, they lack the ability to naturally weave visual thinking into their problem-solving process.

What's the solution?

The researchers built MathCanvas in two main stages. First, they pre-trained the model on a large corpus of 15.2 million pairs: 10 million caption-to-diagram pairs plus 5.2 million step-by-step diagram-editing trajectories. This taught the model how to *create* and *modify* diagrams. Then, they fine-tuned the model on a new dataset of 219,000 math problems with step-by-step solutions that interleave text and visuals, teaching it *when* to draw a diagram and *how* to use it to make progress on a problem. They also created MathCanvas-Bench, a challenging set of 3,000 problems, to measure how well the system works.
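The resulting reasoning loop can be pictured as: the model alternates text steps with diagram steps, and each generated diagram is fed back into the context for the next step. The sketch below illustrates this idea only; all names (`solve_with_visual_cot`, `ToyModel`, the `Step` fields) are hypothetical stand-ins, not the paper's actual code or API.

```python
from dataclasses import dataclass

@dataclass
class Step:
    kind: str          # "text", "draw", or "answer"
    content: str = ""
    instruction: str = ""

class ToyModel:
    """Stand-in model that replays a fixed reasoning trace, for illustration."""
    def __init__(self):
        self._script = [
            Step("text", content="Let O be the circle's center."),
            Step("draw", instruction="draw circle O with chord AB"),
            Step("answer", content="AB = 6"),
        ]
    def next_step(self, context):
        return self._script.pop(0)
    def render(self, instruction, base=None):
        # A real unified LMM would return pixels; here we return a tag.
        return f"<diagram: {instruction}>"

def solve_with_visual_cot(model, problem, max_steps=8):
    """Alternate text reasoning with diagram generation until an answer."""
    context = [("text", problem)]
    canvas = None
    for _ in range(max_steps):
        step = model.next_step(context)
        if step.kind == "draw":
            canvas = model.render(step.instruction, base=canvas)
            context.append(("image", canvas))  # diagram re-enters the context
        elif step.kind == "text":
            context.append(("text", step.content))
        else:                                  # "answer": stop and report
            return step.content, context
    return None, context

answer, trace = solve_with_visual_cot(ToyModel(), "Find chord AB.")
```

The key design point this sketch captures is that diagrams are *intrinsic* to the loop: the model itself decides when to draw, and its own drawings become inputs to later reasoning steps, rather than being produced by a separate external tool.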

Why it matters?

This work is important because it's a big step towards creating AI that can solve complex math problems more like humans do. By giving AI the ability to visually reason, it opens the door to solving a wider range of problems and making AI more useful in fields like education and engineering. The tools and datasets they created are also available for other researchers to build upon, accelerating progress in this area.

Abstract

While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, demonstrating excellent generalization to other public math benchmarks. Our work provides a complete toolkit (framework, datasets, and benchmark) to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: https://mathcanvas.github.io/