Meta-CoT: Enhancing Granularity and Generalization in Image Editing

Shiyi Zhang, Yiji Cheng, Tiankai Hang, Zijin Yin, Runze He, Yu Xu, Wenxun Dai, Yunlong Lin, Chunyu Wang, Qinglin Lu, Yansong Tang

2026-04-29

Summary

This paper introduces a method called Meta-CoT that improves how well AI models edit images from written instructions. It structures the model's step-by-step 'thinking' about an edit in a more fine-grained, organized way, and trains the model so that this reasoning generalizes to kinds of edits it has never seen before.

What's the problem?

Current AI models that edit images by reasoning step by step (Chain-of-Thought) are improving, but it is unclear how best to structure that reasoning or how to train a model to handle a wide variety of editing requests. In particular, models struggle to pin down exactly what needs to change in an image and to transfer that understanding to new, different editing tasks. They need to break complex edits into smaller, manageable steps and to learn from a core set of editing skills that covers the rest.

What's the solution?

The researchers propose Meta-CoT, which decomposes image editing at two levels. First, any edit is represented as a triplet: a 'task' (what to do), a 'target' (which part of the image to change), and the 'understanding' needed to carry it out. The model generates step-by-step reasoning for each element and walks through the operation on every target. Second, they identify five fundamental 'meta-tasks' from which a wide range of edits can be composed; training on just these five, together with the task and target specifics, lets the model generalize to many editing requests it was never trained on. They also add a 'consistency reward' that checks whether the model's reasoning actually matches the changes it makes to the image.
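To make the first level of that decomposition concrete, here is a minimal Python sketch. The `EditTriplet` structure and `decompose_edit` helper are hypothetical illustrations of the (task, target, understanding) triplet described above, not the authors' implementation, and the paper does not name the five meta-tasks through this interface.

```python
from dataclasses import dataclass

@dataclass
class EditTriplet:
    """First-level decomposition of an editing intention, per the paper:
    (task, target, required understanding ability)."""
    task: str           # what to do (ultimately one of five meta-tasks)
    target: str         # which part of the image to change
    understanding: str  # the CoT reasoning needed to carry out the edit

def decompose_edit(task: str, targets: list[str]) -> list[EditTriplet]:
    """Hypothetical helper: traverse the editing operation over all
    detected targets, producing one triplet (and hence one task-specific
    CoT slot) per target."""
    return [
        EditTriplet(
            task=task,
            target=t,
            understanding=f"step-by-step reasoning for applying '{task}' to '{t}'",
        )
        for t in targets
    ]

# Example: an instruction like "make every apple golden" might decompose
# into one 'recolor' triplet per apple instance found in the image.
for triplet in decompose_edit("recolor", ["apple_1", "apple_2"]):
    print(triplet)
```

The point of the structure is that each triplet gives the model one well-scoped reasoning step per target, instead of one monolithic thought about the whole instruction.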

Why it matters?

This work matters because it makes AI image editing noticeably more accurate and flexible. The overall 15.8% improvement across 21 editing tasks is a substantial step forward. More importantly, the model can now handle edits it was not specifically trained on, making it more adaptable and useful in real-world applications. This could lead to better image editing tools and more capable multi-modal AI systems overall.

Abstract

Unified multi-modal understanding/generative models have shown improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: what forms of CoT and training strategy can jointly enhance both the understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation with two key properties: (1) Decomposability. We observe that any editing intention can be represented as a triplet - (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing editing operations on all targets. This decomposition enhances the model's understanding granularity of editing operations and guides it to learn each element of the triplet during training, substantially improving the editing capability. (2) Generalizability. In the second decomposition level, we further break down editing tasks into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model's editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective utilization of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks, and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model are released at https://shiyi-zh0408.github.io/projectpages/Meta-CoT/
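The abstract does not give a formula for the CoT-Editing Consistency Reward, so the sketch below is only one plausible shape for it: scoring agreement between an embedding of the model's CoT and the direction of change between source and edited image embeddings. The cosine formulation and all function names here are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors (epsilon guards against
    division by zero)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def cot_editing_consistency_reward(cot_embedding: np.ndarray,
                                   src_embedding: np.ndarray,
                                   edited_embedding: np.ndarray) -> float:
    """Hypothetical consistency reward: how well the direction of change
    in image-embedding space (edited - source) agrees with an embedding
    of the model's own CoT reasoning. The paper only states that the
    reward aligns editing behavior with CoT; this cosine form is assumed."""
    edit_direction = edited_embedding - src_embedding
    return cosine(cot_embedding, edit_direction)

# Toy usage with random stand-in embeddings (a real system would obtain
# all three from a vision-language encoder).
rng = np.random.default_rng(0)
cot, src, out = rng.normal(size=(3, 512))
print(cot_editing_consistency_reward(cot, src, out))
```

Whatever its exact form, a reward like this gives the training loop a signal that penalizes edits the model's own reasoning does not account for.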