MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues
Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Shuailei Ma, Ka Leong Cheng, Wen Wang, Qingyan Bai, Yuxuan Zhang, Yanhong Zeng, Yixuan Li, Xing Zhu, Yujun Shen, Qifeng Chen
2025-12-03
Summary
This paper introduces MagicQuill V2, a new system for editing images using artificial intelligence. It aims to combine the best parts of two different approaches: the ability of AI to create realistic images and the precise control offered by traditional image editing software.
What's the problem?
Current AI image generators, like diffusion transformers, are really good at creating images from scratch, but they struggle when you want to make specific changes. They usually take one big instruction, which makes it hard to tell the AI exactly *what* to change, *where* to change it, and *how* it should look. It's like trying to give one instruction that covers everything at once: it gets messy and doesn't give you enough control.
What's the solution?
MagicQuill V2 solves this by breaking down image editing into layers. Think of it like stacking transparent sheets, each controlling a different aspect of the image. There's a layer for *what* you want to add (content), a layer for *where* you want to put it (spatial), a layer for its *shape* (structural), and a layer for its *colors* (color). The researchers also built a pipeline that generates training data to teach the AI how these layers interact, plus a dedicated component for precise, local edits, like removing an object from a picture.
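To make the layered idea concrete, here is a minimal sketch of how the four kinds of cues could be kept as separate, independently editable layers and then flattened into a single conditioning input. All class and field names here are hypothetical illustrations, not the actual MagicQuill V2 API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical data model for the "what / where / shape / color" layers.

@dataclass
class ContentLayer:      # WHAT to create, e.g. a text prompt
    prompt: str

@dataclass
class SpatialLayer:      # WHERE to place it, e.g. a bounding box
    bbox: Tuple[int, int, int, int]  # (x, y, width, height)

@dataclass
class StructuralLayer:   # HOW it is shaped, e.g. a path to a sketch/edge map
    sketch: str

@dataclass
class ColorLayer:        # its PALETTE, e.g. a list of hex colors
    palette: Tuple[str, ...]

@dataclass
class LayeredEdit:
    content: ContentLayer
    spatial: Optional[SpatialLayer] = None
    structure: Optional[StructuralLayer] = None
    color: Optional[ColorLayer] = None

    def to_conditioning(self) -> dict:
        """Flatten only the layers the user actually set into one
        conditioning dict that a control module could consume."""
        cond = {"content": self.content.prompt}
        if self.spatial is not None:
            cond["spatial"] = self.spatial.bbox
        if self.structure is not None:
            cond["structure"] = self.structure.sketch
        if self.color is not None:
            cond["color"] = self.color.palette
        return cond

edit = LayeredEdit(
    content=ContentLayer("a red vintage car"),
    spatial=SpatialLayer(bbox=(120, 300, 256, 128)),
    color=ColorLayer(palette=("#b22222", "#ffffff")),
)
print(edit.to_conditioning())
```

The point of the sketch is that each intention lives in its own layer: you can swap the color palette or move the bounding box without rewriting the content prompt, which is exactly the disentanglement a single monolithic instruction cannot offer.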
Why does it matter?
This is important because it gives users much more direct and intuitive control over AI image generation. Instead of struggling to write the perfect prompt, you can simply adjust different layers to get the exact result you want, making AI image editing more accessible and powerful for creators.
Abstract
We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and appearance. To overcome this, our method deconstructs creative intent into a stack of controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive control over the generative process.