Thinking with Generated Images
Ethan Chern, Zhulin Hu, Steffi Chern, Siqi Kou, Jiadi Su, Yan Ma, Zhijie Deng, Pengfei Liu
2025-05-29
Summary
This paper introduces a new way for AI models to solve problems: instead of only looking at images they are given or reasoning in text alone, the model creates and thinks through its own images during the reasoning process. As it works through a problem, it can generate pictures, critique them, and improve its answers by combining visual and textual thinking.
What's the problem?
Before this work, most AI models could only process images that were given to them or reason step-by-step using just text. This limited their ability to solve complex problems that require visual imagination or multi-step visual reasoning, because they couldn't create or refine their own visual ideas along the way.
What's the solution?
The researchers developed a system where the AI can spontaneously make its own images as part of its thought process. As it tries to solve a problem, it creates intermediate visual steps, sets visual subgoals, and even critiques and improves its own generated images. This lets the AI break down tough visual tasks into smaller parts, check its own work, and make better decisions by switching between text and images as needed.
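The loop described above can be sketched in a few lines of Python. This is only an illustrative skeleton of the generate–critique–refine pattern, assuming an interleaved text-and-image reasoning trace; the function names (`generate_image`, `critique`, `solve`) and the string-based "images" are placeholders, not the paper's actual implementation.

```python
def generate_image(prompt: str) -> str:
    """Stand-in for a multimodal model emitting an intermediate visual step."""
    return f"<image for: {prompt}>"

def critique(image: str) -> str:
    """Stand-in for the model's textual self-critique of its own image."""
    return f"critique of {image}"

def solve(problem: str, subgoals: list[str], max_rounds: int = 1) -> list[str]:
    """Alternate between textual and visual steps toward an answer.

    For each visual subgoal, the model draws an intermediate image,
    critiques it in text, and redraws it based on that feedback.
    """
    trace = [f"text: restate problem -> {problem}"]
    for goal in subgoals:
        image = generate_image(goal)
        trace.append(f"image: {image}")
        for _ in range(max_rounds):
            feedback = critique(image)          # textual self-check
            trace.append(f"text: {feedback}")
            image = generate_image(f"{goal}, revised per feedback")
            trace.append(f"image: {image}")     # improved visual step
    trace.append("text: final answer synthesized from trace")
    return trace
```

For example, `solve("spatial puzzle", ["sketch the layout"])` produces a trace that alternates text and image steps, ending with a textual answer built from the refined visuals.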
Why it matters?
This approach makes AI much better at complicated tasks that need both visual and logical thinking, like interpreting diagrams, solving spatial puzzles, or imagining new designs. It opens up new possibilities for AI to help in fields like science, architecture, and sports strategy, where being able to visualize and refine ideas is crucial, and it makes AI more creative and human-like in its problem solving.
Abstract
Thinking with Generated Images allows large multimodal models to generate and critique intermediate visual steps, enhancing visual reasoning capabilities and achieving significant improvements in complex scenarios.