MMFormalizer: Multimodal Autoformalization in the Wild

Jing Xiong, Qi Han, Yunta Hsieh, Hui Shen, Huajian Xin, Chaofan Tao, Chenyang Zhao, Hengyuan Zhang, Taiqiang Wu, Zhen Zhang, Haochen Wang, Zhongwei Wan, Lingpeng Kong, Ngai Wong

2026-01-12

Summary

This paper introduces MMFormalizer, a new system designed to automatically translate math problems described in everyday language *and* images into a format computers can understand and solve, going beyond just text-based problems.

What's the problem?

Traditionally, autoformalization (the process of converting human-readable math into computer-readable formal statements) has focused on text. However, many real-world math and physics problems involve visual elements and require inferring hidden information, like mass or energy, that isn't explicitly stated. Existing systems struggle with this 'multimodal' aspect: they can't effectively combine text and images to fully understand and formalize a problem.

What's the solution?

MMFormalizer tackles this by combining text understanding with 'grounding,' which means connecting words and concepts to specific things it 'sees' in images. It builds up formal solutions step-by-step, starting with basic visual elements and then adding more complex ideas. Importantly, each step is checked against the visual evidence to make sure it's valid and based on real-world physics or mathematical rules. The system also knows when to stop adding complexity, ensuring the solution isn't overcomplicated.
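The loop described above can be sketched in code. This is a toy illustration, not the paper's actual implementation: the `Primitive`, `Axiom`, and `formalize` names, and the use of dimension strings as a stand-in for visual/dimensional grounding, are all assumptions made for clarity.

```python
# Toy sketch of recursive grounding with axiom composition and adaptive
# termination. All names and data structures here are illustrative
# assumptions, not MMFormalizer's real API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Primitive:
    name: str        # a basic visual element, e.g. "block" or "arrow"
    dimension: str   # physical dimension grounded from the image, e.g. "kg"

@dataclass(frozen=True)
class Axiom:
    name: str             # human-readable rule, e.g. "momentum p = m*v"
    inputs: tuple         # dimensions the rule requires as evidence
    output: str           # dimension of the quantity the rule derives

def formalize(primitives, axioms, max_depth=5):
    """Recursively compose propositions from grounded primitives.
    A rule fires only when all its inputs are already evidenced
    (dimensional grounding); the loop stops as soon as a pass adds
    nothing new (adaptive termination), avoiding overcomplication."""
    known = {p.dimension for p in primitives}
    derived = []
    for _ in range(max_depth):
        new = set()
        for ax in axioms:
            # each step is checked against existing evidence before use
            if all(d in known for d in ax.inputs) and ax.output not in known:
                new.add(ax.output)
                derived.append(ax.name)
        if not new:   # no abstraction gained support -> terminate
            break
        known |= new
    return derived, known

# Example: an image grounds a mass and a velocity; two rules then
# derive momentum and kinetic energy, after which recursion halts.
prims = [Primitive("block", "kg"), Primitive("arrow", "m/s")]
axs = [
    Axiom("momentum p = m*v", ("kg", "m/s"), "kg*m/s"),
    Axiom("kinetic energy E = p*v/2", ("kg*m/s", "m/s"), "J"),
]
steps, dims = formalize(prims, axs)
```

Note how the second rule cannot fire until the first has grounded momentum; this mirrors the step-by-step build-up from basic visual elements to more complex ideas.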

Why it matters?

This work is significant because it's the first system capable of automatically formalizing problems from areas like classical mechanics, relativity, quantum mechanics, and thermodynamics, which often rely on both visual and textual information. It opens the door to using AI to not just solve math problems, but to *understand* the underlying physics and geometry, and it provides a framework for building more powerful AI systems that can reason about the physical world.

Abstract

Autoformalization, which translates natural language mathematics into formal statements to enable machine reasoning, faces fundamental challenges in the wild due to the multimodal nature of the physical world, where physics requires inferring hidden constraints (e.g., mass or energy) from visual elements. To address this, we propose MMFormalizer, which extends autoformalization beyond text by integrating adaptive grounding with entities from real-world mathematical and physical domains. MMFormalizer recursively constructs formal propositions from perceptually grounded primitives through recursive grounding and axiom composition, with adaptive recursive termination ensuring that every abstraction is supported by visual evidence and anchored in dimensional or axiomatic grounding. We evaluate MMFormalizer on a new benchmark, PhyX-AF, comprising 115 curated samples from MathVerse, PhyX, Synthetic Geometry, and Analytic Geometry, covering diverse multimodal autoformalization tasks. Results show that frontier models such as GPT-5 and Gemini-3-Pro achieve the highest compile and semantic accuracy, with GPT-5 excelling in physical reasoning, while geometry remains the most challenging domain. Overall, MMFormalizer provides a scalable framework for unified multimodal autoformalization, bridging perception and formal reasoning. To the best of our knowledge, this is the first multimodal autoformalization method capable of handling classical mechanics (derived from the Hamiltonian), as well as relativity, quantum mechanics, and thermodynamics. More details are available on our project page: MMFormalizer.github.io