RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing

Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu, Bohao Peng, Bei Yu, Dong Yu, Jiaya Jia

2025-12-19

Summary

This paper introduces a new method called RePlan for editing images based on text instructions, especially when the instructions are complicated and the images themselves are busy or unclear.

What's the problem?

Current image editing models struggle when given detailed instructions for images that have a lot going on or where it's hard to pinpoint exactly what needs to be changed. They have trouble understanding *where* to make changes and *how* to make them accurately in complex scenes, a challenge the authors call 'Instruction-Visual Complexity'.

What's the solution?

RePlan works in two main steps: planning and executing. First, a 'planner' breaks the text instruction down into smaller, more manageable steps and identifies the specific regions of the image that need to change. Then, an 'editor' applies those changes using a training-free technique called attention-region injection, which steers the model toward the planner's target regions so that multiple edits can happen in parallel, without repeated rounds of inpainting. The planner itself is improved with reinforcement learning (GRPO), trained on only about 1,000 instruction-only examples, which makes it better at reasoning through and following complex instructions.
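The plan-then-execute flow described above can be sketched in a few lines. This is a hypothetical illustration of the interface, not the paper's actual code: `RegionEdit`, `plan`, and `apply_edits` are made-up names, and the planner's output here is a hard-coded example of the kind of region-grounded decomposition the real vision-language planner would produce.

```python
from dataclasses import dataclass

@dataclass
class RegionEdit:
    region: tuple      # bounding box (x0, y0, x1, y1) of the target area
    instruction: str   # the decomposed sub-instruction for that area

def plan(instruction: str) -> list[RegionEdit]:
    # In RePlan, a vision-language planner reasons step by step and
    # grounds each sub-edit to an image region. We fake its output here
    # purely to illustrate the format.
    return [
        RegionEdit((10, 10, 60, 60), "recolor the mug red"),
        RegionEdit((80, 20, 140, 90), "remove the sticker"),
    ]

def apply_edits(image: dict, edits: list[RegionEdit]) -> dict:
    # The editor applies all region edits in one pass (parallel,
    # training-free) rather than inpainting one region at a time.
    for e in edits:
        image[e.region] = e.instruction  # placeholder for the real diffusion edit
    return image

edits = plan("make the mug red and remove the sticker")
result = apply_edits({}, edits)
```

The key design point is that the planner's output is an explicit list of (region, sub-instruction) pairs, so the editor never has to guess *where* each change belongs.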

Why it matters?

This research is important because it makes image editing with text instructions much more reliable and precise, even in difficult situations. It also introduces a new benchmark dataset, IV-Edit, to specifically test these kinds of complex editing tasks, pushing the field forward and allowing for better evaluation of future models. Ultimately, it brings us closer to being able to easily edit images just by telling a computer what we want.

Abstract

Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io
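One plausible way to picture the training-free attention-region injection is as a mask on cross-attention: each sub-edit's text tokens are suppressed at pixels outside that edit's target region, so its influence stays local while shared prompt tokens remain global. The function below is a minimal sketch under that assumption; the shapes, names, and the additive-bias mechanism are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def inject_region_mask(attn_logits: np.ndarray,
                       region_mask: np.ndarray,
                       edit_token_ids: list[int]) -> np.ndarray:
    """Restrict an edit's text tokens to its region.

    attn_logits:    (num_pixels, num_tokens) cross-attention logits
    region_mask:    (num_pixels,) bool, True inside the edit's region
    edit_token_ids: token columns belonging to this sub-edit
    """
    masked = attn_logits.copy()
    outside = ~region_mask
    # Push this edit's tokens to ~zero attention weight outside its region;
    # other tokens are left untouched everywhere.
    for t in edit_token_ids:
        masked[outside, t] = -1e9
    # Softmax over tokens, per pixel.
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Because the bias is applied directly to attention logits at inference time, several region masks can be injected in the same diffusion pass, which is what makes parallel multi-region editing possible without any extra training.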