ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, Yu Cheng
2025-11-03
Summary
This paper introduces ThinkMorph, a new AI model designed to improve how computers reason about both images and text together, a process called multimodal reasoning.
What's the problem?
Current AI models struggle to truly connect what they 'see' in an image with what they 'read' in text, and it is not clear how best to structure a model's thought process when it must draw on both kinds of information to solve a problem. Existing models often treat images and text as different ways of saying the same thing, rather than as complementary tools for reasoning.
What's the solution?
The researchers created ThinkMorph by fine-tuning a unified model on 24,000 high-quality examples of step-by-step reasoning that interleaves text and images. ThinkMorph doesn't just look at an image and then describe it; it learns to actively manipulate the visual content in its 'thoughts' while keeping its verbal reasoning logical and consistent. In effect, it generates a series of text and image steps that build on each other to reach a solution.
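The interleaved reasoning loop described here can be sketched in miniature. This is an illustrative toy, not ThinkMorph's actual implementation: the `generate_text_thought` and `generate_image_thought` stubs are hypothetical stand-ins for the model's two generation modes, and real image steps would produce edited images rather than strings.

```python
from dataclasses import dataclass

@dataclass
class Step:
    modality: str   # "text" or "image"
    content: str    # verbal reasoning, or a description of a visual edit

def generate_text_thought(trace):
    # Placeholder: a real model would condition on the full trace so far.
    return Step("text", f"verbal step {len(trace) + 1}")

def generate_image_thought(trace):
    # Placeholder: a real model would emit a manipulated image here.
    return Step("image", f"visual edit {len(trace) + 1}")

def interleaved_chain_of_thought(question, max_steps=6):
    """Build a trace of alternating, complementary text and image steps."""
    trace = []
    for i in range(max_steps):
        # Alternate modalities so each step advances the other,
        # rather than restating the same content in both forms.
        if i % 2 == 0:
            trace.append(generate_text_thought(trace))
        else:
            trace.append(generate_image_thought(trace))
    return trace

trace = interleaved_chain_of_thought("Which region contains the answer?")
print([s.modality for s in trace])
# → ['text', 'image', 'text', 'image', 'text', 'image']
```

The key design point mirrored here is the paper's principle that text and image thoughts should be complementary rather than isomorphic: each modality contributes a step the other could not.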
Why it matters?
ThinkMorph performs significantly better than previous models on tasks that require visual understanding and reasoning, averaging a 34.7% gain over its base model on vision-centric benchmarks, and it matches or surpasses larger, proprietary models on new, unseen tasks. It even demonstrates emergent skills, such as visual manipulations it wasn't specifically trained for and adaptive switching between reasoning modes. This suggests that building AI that can seamlessly integrate vision and language in this way is a promising path toward more intelligent and adaptable systems.
Abstract
Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.