MIRA: Multimodal Iterative Reasoning Agent for Image Editing

Ziyun Zeng, Hang Hua, Jiebo Luo

2025-11-28

Summary

This paper introduces a new way to help computers understand and follow complex image-editing instructions written in natural language, the way you would describe an edit to another person.

What's the problem?

Currently, when you tell a computer to edit an image using words, it often struggles with complicated requests. It has trouble understanding how different parts of the instruction relate to each other, or understanding what you mean when you refer to specific things *in* the image. This leads to edits that don't quite match what you intended, or don't make sense in the context of the picture.

What's the solution?

The researchers created something called MIRA, which stands for Multimodal Iterative Reasoning Agent. Think of it as a little 'thinking' module you can add to existing image editing programs. Instead of trying to figure out the whole edit at once, MIRA breaks down the instruction into small, step-by-step actions. After each step, it looks at the image to see how it changed and then decides what to do next, just like a person would. They also created a large dataset of image editing instructions to help MIRA learn how to do this effectively.
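To make the loop concrete, here is a toy sketch of the iterative decompose-edit-check idea described above. Everything here is a stand-in for illustration: the "image" is just a list of applied edits, and `propose_next_edit` / `apply_edit` are hypothetical placeholders, not the paper's actual models or API (a real system would call an editing backend such as Flux.1-Kontext at the action step).

```python
# Toy sketch of a MIRA-style perception-reasoning-action loop.
# The "image" is modeled as a list of applied atomic edits; the
# reasoning and action functions below are illustrative placeholders.

def propose_next_edit(state, atomic_steps):
    """Reasoning step (placeholder): pick the first atomic edit
    not yet reflected in the current image state."""
    for step in atomic_steps:
        if step not in state:
            return step
    return None  # agent judges the goal is satisfied

def apply_edit(state, step):
    """Action step (placeholder): a real agent would hand this
    atomic instruction to an image-editing model here."""
    return state + [step]

def iterative_edit(atomic_steps, max_steps=10):
    """Apply atomic edits one at a time, inspecting the updated
    state (perception) before deciding on the next step."""
    state = []  # the evolving "image"
    for _ in range(max_steps):
        step = propose_next_edit(state, atomic_steps)
        if step is None:
            break
        state = apply_edit(state, step)
    return state

# A compositional instruction decomposed into atomic edits:
result = iterative_edit(
    ["remove the car", "add a sunset", "sharpen the sky"]
)
```

The key design point the paper argues for is visible even in this toy version: each edit is chosen only after observing the result of the previous one, rather than committing to a single prompt or static plan up front.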

Why does it matter?

This work is important because it makes image editing with words much more accurate and reliable. MIRA performs as well as, or even better than, some of the best (and often closed-source) image editing systems currently available, but it's designed to be easily added to existing open-source tools, making powerful image editing accessible to more people.

Abstract

Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.