REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xiangyu Zhang, Daxin Jiang, Xianfang Zeng, Gang Yu
2025-12-01
Summary
This paper focuses on improving how well image editing models understand and follow instructions, ultimately making the edited images more accurate and aligned with what the user wants.
What's the problem?
Current image editing models use a powerful component called a multimodal large language model (MLLM) to understand instructions and images, but this component is kept frozen rather than actively trained during the editing process. This limits the model's ability to reason through complex instructions and can lead to inaccurate or unintended changes when editing images.
What's the solution?
The researchers developed a new framework that 'unlocks' the reasoning abilities of the MLLM. They introduced two key ideas: 'thinking,' where the model uses its existing knowledge to interpret instructions, and 'reflection,' where the model reviews its edits, corrects mistakes, and decides when to stop editing. This creates a loop: the model thinks, edits, reflects, and repeats until the image is right. They tested this approach by adding it to existing models such as Step1X-Edit and Qwen-Image-Edit.
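The thinking-editing-reflection loop described above can be sketched in pseudocode-style Python. Note that the function names (`think`, `edit`, `reflect`) and the stand-in string manipulations are purely illustrative assumptions; the paper's actual model interfaces are not shown here:

```python
# Conceptual sketch of a thinking-editing-reflection loop.
# All function bodies are hypothetical stand-ins, not the paper's code.

MAX_ROUNDS = 3  # reflection also serves to identify the stopping round

def think(instruction: str, image: str) -> str:
    # Stand-in for the MLLM interpreting an abstract instruction
    # into a concrete editing plan using its world knowledge.
    return f"plan: apply '{instruction}' to {image}"

def edit(image: str, plan: str) -> str:
    # Stand-in for the diffusion decoder executing the plan.
    return image + "+edited"

def reflect(image: str, instruction: str) -> bool:
    # Stand-in for the MLLM reviewing the result; returns True
    # once the edit is judged to satisfy the instruction.
    return "+edited" in image

def edit_with_reasoning(image: str, instruction: str) -> str:
    for _ in range(MAX_ROUNDS):
        plan = think(instruction, image)   # thinking
        image = edit(image, plan)          # editing
        if reflect(image, instruction):    # reflection: stop or retry
            break
    return image

print(edit_with_reasoning("photo.png", "make the sky look stormy"))
# -> photo.png+edited
```

The key design point is that reflection both corrects unintended manipulations (by feeding the reviewed result back into the next thinking step) and decides when to terminate, rather than always running a fixed number of editing passes.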
Why does it matter?
This work is important because it significantly improves the performance of image editing models. By allowing the model to reason about instructions and reflect on its work, the edits become much more accurate and reliable, outperforming previous methods on standard image editing benchmarks. This means better and more controllable image editing tools in the future.
Abstract
Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of the MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Building on these mechanisms, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of the MLLM to interpret abstract instructions, while the reflection mechanism reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements on ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from Step1X-Edit (ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).