OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, Lidong Bing
2025-11-24
Summary
This paper introduces OpenMMReasoner, a fully open training recipe for improving how AI models reason over images and text together, a capability known as multimodal reasoning.
What's the problem?
AI models are getting better at understanding images and text separately, but combining the two for complex reasoning is still a challenge. A big part of the problem is that published systems rarely explain exactly *how* their training data was collected, filtered, and organized, which makes it hard for other researchers to reproduce results or build on prior work. In short, the field lacks clear, repeatable recipes for training multimodal reasoning systems.
What's the solution?
The researchers created a completely transparent two-stage training process. First, they built a large dataset of 874,000 examples, each checked for accuracy, and used it for supervised fine-tuning so the model learns core reasoning skills (an example of such checking is sketched below). Then they applied reinforcement learning on a smaller, more diverse dataset of 74,000 examples to sharpen those skills and make the model more reliable. Finally, they released all the data, code, and training steps publicly.
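To make the "checked for accuracy" step concrete, here is a minimal sketch of one common way cold-start reasoning data is validated: keep a generated reasoning trace only if its extracted final answer matches the reference answer. The `\boxed{}` answer convention, the field names `reasoning_trace` and `reference_answer`, and the exact-match rule are assumptions for illustration, not the paper's actual validation pipeline.

```python
# Illustrative answer-consistency filter for cold-start SFT data.
# Assumption: each sample stores a reasoning trace that ends with a
# LaTeX-style \boxed{...} answer plus a separate reference answer.
import re


def extract_final_answer(trace: str) -> str | None:
    """Pull the answer out of a \\boxed{...} span, a common CoT convention."""
    match = re.search(r"\\boxed\{([^}]*)\}", trace)
    return match.group(1).strip() if match else None


def validate_sample(sample: dict) -> bool:
    """Keep a sample only if its trace's final answer matches the reference."""
    predicted = extract_final_answer(sample["reasoning_trace"])
    return predicted is not None and predicted == sample["reference_answer"].strip()


samples = [
    {"reasoning_trace": "Area is 1/2 * 3 * 4, so \\boxed{6}", "reference_answer": "6"},
    {"reasoning_trace": "Counting the objects gives \\boxed{5}", "reference_answer": "4"},
]
validated = [s for s in samples if validate_sample(s)]
print(len(validated))  # 1: only the trace whose final answer matches survives
```

Real pipelines typically add normalization (numeric tolerance, unit handling) before comparing answers; the strict string match here is only the simplest possible check.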
Why it matters?
This work is important because it provides a solid, well-documented foundation for future research in multimodal reasoning. By sharing everything openly, the authors make it easy for other researchers to build on and improve these models. Their method achieves an average 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, and the results show that the quality of the training data and the design of the training process are crucial for success.
Abstract
Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-sourced all our code, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.
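As a rough illustration of how a two-stage SFT-then-RL recipe like the one in the abstract could be wired up, here is a minimal sketch using the Hugging Face TRL library. The dataset file names, the `answer` column, the exact-match reward, the hyperparameters, and the choice of GRPO as the RL algorithm are all assumptions made for this sketch; the authors' released code at the GitHub link above is the authoritative implementation.

```python
# Minimal two-stage sketch: supervised fine-tuning followed by RL with a
# verifiable reward. Everything below is illustrative, not the paper's setup.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, GRPOConfig, GRPOTrainer

MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"  # baseline model named in the abstract

# Stage 1: SFT on the validated cold-start data (hypothetical file name).
sft_data = load_dataset("json", data_files="cold_start_874k.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=MODEL,
    train_dataset=sft_data,
    args=SFTConfig(output_dir="sft-ckpt", num_train_epochs=1),
)
sft_trainer.train()


# Stage 2: RL on the smaller, more diverse set, rewarding correct answers.
def accuracy_reward(completions, answer, **kwargs):
    # Assumes string completions and an "answer" column in the dataset:
    # reward 1.0 if the reference answer appears in the completion, else 0.0.
    return [1.0 if ans in comp else 0.0 for comp, ans in zip(completions, answer)]


rl_data = load_dataset("json", data_files="rl_74k.jsonl", split="train")
rl_trainer = GRPOTrainer(
    model="sft-ckpt",
    reward_funcs=accuracy_reward,
    train_dataset=rl_data,
    args=GRPOConfig(output_dir="rl-ckpt"),
)
rl_trainer.train()
```

The substring-match reward is deliberately crude; in practice, rule-based verifiers usually parse and normalize the model's final answer before scoring, much like the validation filter sketched earlier.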