I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
Zhenxing Mi, Kuan-Chieh Wang, Guocheng Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, Dan Xu
2025-02-18
Summary
This paper introduces ThinkDiff, a new way to make AI image generators smarter by teaching them to understand and reason about both text and images together. It's like giving the AI a brain upgrade so it can create more meaningful and logical pictures based on complex instructions.
What's the problem?
Current AI image generators are good at making pretty pictures, but they're not great at understanding the deeper meaning or logic behind what they're creating. They mostly focus on making the image look right pixel by pixel, without really 'thinking' about what the image means or how different parts should logically fit together.
What's the solution?
The researchers created ThinkDiff, which uses a clever shortcut to teach image-making AIs to think more like language-understanding AIs. Instead of trying to directly teach the image AI to reason, they use a language model's decoder as a go-between: because the image generator and the language model read the same kind of text features, teaching the vision system to "speak" to the language decoder also teaches it to speak to the image generator. This lets the image AI learn to understand and reason about both text and images without needing tons of specially made training data. When they tested ThinkDiff on a tough benchmark called CoBSAT, accuracy on reasoning-based image generation jumped from 19.2% to 46.3%.
Why it matters?
This matters because it could make AI image generators much more useful and creative. Instead of just making pretty pictures, they could create images that tell stories, solve visual puzzles, or illustrate complex ideas. This could be huge for fields like education, advertising, or any area where we need to communicate ideas visually. It's a big step towards AI that can truly understand and create meaningful visual content.
Abstract
This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the LLM decoder shares the same input feature space with diffusion decoders that use the corresponding LLM encoder for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.
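The core idea in the abstract, that aligning VLM features with the LLM decoder transfers to the diffusion decoder because both consume the same LLM-encoder feature space, can be illustrated with a toy numerical sketch. This is not the paper's implementation: the dimensions, the hidden linear map, and the least-squares "aligner" are all stand-ins for the learned aligner network trained with a text loss, and the two decoders are random linear maps that merely share an input space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; real models are far larger).
d_vlm, d_shared, n = 16, 8, 512

# Paired data: VLM token features and the LLM-encoder features for the
# same inputs (here related by a hidden linear map plus small noise).
vlm_feats = rng.normal(size=(n, d_vlm))
hidden_map = rng.normal(size=(d_vlm, d_shared))
llm_enc_feats = vlm_feats @ hidden_map + 0.01 * rng.normal(size=(n, d_shared))

# Proxy task: fit an aligner mapping VLM features into the shared space
# that the LLM decoder reads (ordinary least squares stands in for the
# paper's learned aligner trained against the LLM decoder).
aligner, *_ = np.linalg.lstsq(vlm_feats, llm_enc_feats, rcond=None)
aligned = vlm_feats @ aligner

# Two "decoders" that both consume the shared feature space: the LLM
# decoder (used during training) and the diffusion decoder (used at
# inference). Both are dummy linear maps here.
W_llm_dec = rng.normal(size=(d_shared, 4))
W_diff_dec = rng.normal(size=(d_shared, 4))

# Because both decoders share an input space, features aligned against
# one produce near-identical outputs when fed to the other.
err_llm = np.abs(aligned @ W_llm_dec - llm_enc_feats @ W_llm_dec).mean()
err_diff = np.abs(aligned @ W_diff_dec - llm_enc_feats @ W_diff_dec).mean()
print(f"LLM-decoder gap: {err_llm:.4f}, diffusion-decoder gap: {err_diff:.4f}")
```

The point of the sketch is the last two lines: the aligner is fitted only against the shared space, yet the output gap is small under either decoder, mirroring why vision-language training can serve as a proxy for diffusion alignment.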