Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge

Hao Liang, Ruitao Wu, Bohan Zeng, Junbo Niu, Wentao Zhang, Bin Dong

2025-09-17

Summary

This paper focuses on improving how AI systems can solve problems that require understanding both images and text at the same time, which is a tough challenge for even the most advanced AI models.

What's the problem?

Current AI models, such as GPT-o3, reason well over text alone but struggle when they must combine information from images and words. As a result, they have trouble with tasks that require 'multimodal reasoning': drawing conclusions from more than one type of data.

What's the solution?

The researchers built a caption-assisted reasoning framework that uses captions, or textual descriptions, to connect the visual information in an image with the accompanying text. Essentially, it translates what it 'sees' in a picture into words, so the AI can reason about both together. The system won first place in the SeePhys challenge at the ICML 2025 AI for Math Workshop and also worked well on MathVerse, a separate benchmark for geometric reasoning.
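
The caption-then-reason idea described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the summary does not specify the pipeline's details, so the names `Problem`, `build_prompt`, `solve_with_caption`, and the two stub models are all hypothetical, and the prompt wording is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Problem:
    question: str    # problem statement as text
    image_path: str  # path to the accompanying figure


def build_prompt(question: str, caption: str) -> str:
    """Merge the question and the image caption into one text-only prompt."""
    return (
        f"Figure description:\n{caption.strip()}\n\n"
        f"Question:\n{question.strip()}\n\n"
        "Reason step by step, then state the final answer."
    )


def solve_with_caption(
    problem: Problem,
    captioner: Callable[[str], str],  # hypothetical: image path -> caption
    reasoner: Callable[[str], str],   # hypothetical: text prompt -> answer
) -> str:
    """Caption the image first, then reason over caption and question."""
    caption = captioner(problem.image_path)
    return reasoner(build_prompt(problem.question, caption))


# Stand-in models so the sketch runs without any external service.
def fake_captioner(image_path: str) -> str:
    return "A 2 kg block rests on a frictionless incline at 30 degrees."


def echo_reasoner(prompt: str) -> str:
    # A real reasoner would return a solution; this stub returns the prompt
    # so the assembled input is easy to inspect.
    return prompt


demo = Problem("What is the block's acceleration?", "incline.png")
result = solve_with_caption(demo, fake_captioner, echo_reasoner)
```

Passing the captioner and reasoner as plain callables keeps the sketch independent of any particular model or API. Whether the reasoning model also receives the original image alongside the caption is not stated in this summary, so the sketch passes text only.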

Why does it matter?

This work is important because it moves AI closer to understanding the world the way humans do: by combining what we see with what we know. Better multimodal reasoning will allow AI to tackle more complex real-world problems in areas such as science, education, and robotics.

Abstract

Multimodal reasoning remains a fundamental challenge in artificial intelligence. Despite substantial advances in text-based reasoning, even state-of-the-art models such as GPT-o3 struggle to maintain strong performance in multimodal scenarios. To address this gap, we introduce a caption-assisted reasoning framework that effectively bridges visual and textual modalities. Our approach achieved 1st place in the ICML 2025 AI for Math Workshop & Challenge 2: SeePhys, highlighting its effectiveness and robustness. Furthermore, we validate its generalization on the MathVerse benchmark for geometric reasoning, demonstrating the versatility of our method. Our code is publicly available at https://github.com/OpenDCAI/SciReasoner.