Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model
Qi Mao, Lan Chen, Yuchao Gu, Mike Zheng Shou, Ming-Hsuan Yang
2025-04-09
Summary
This paper introduces UnifyEdit, a tuning-free method that lets AI edit images based on text descriptions while keeping the original image's structure intact and making sure the changes match the text instructions.
What's the problem?
Current AI image editors tend to either change too much of the original image (over-editing) or follow the text instructions too weakly (under-editing), producing results that look wrong or don't match what was asked.
What's the solution?
UnifyEdit applies two attention-based constraints: a self-attention constraint that protects the original image's layout, and a cross-attention constraint that keeps the edit aligned with the text. An adaptive scheduler then balances the two, adjusting how much each constraint matters at every step depending on the edit type.
Why does it matter?
This helps create AI tools that can edit photos accurately, like changing a dog’s color without messing up its pose, or swapping objects in a room while keeping the background intact.
Abstract
Balancing fidelity and editability is essential in text-based image editing (TIE), where failures commonly lead to over- or under-editing issues. Existing methods typically rely on attention injections for structure preservation and leverage the inherent text alignment capabilities of pre-trained text-to-image (T2I) models for editability, but they lack explicit and unified mechanisms to properly balance these two objectives. In this work, we introduce UnifyEdit, a tuning-free method that performs diffusion latent optimization to enable a balanced integration of fidelity and editability within a unified framework. Unlike direct attention injections, we develop two attention-based constraints: a self-attention (SA) preservation constraint for structural fidelity, and a cross-attention (CA) alignment constraint to enhance text alignment for improved editability. However, simultaneously applying both constraints can lead to gradient conflicts, where the dominance of one constraint results in over- or under-editing. To address this challenge, we introduce an adaptive time-step scheduler that dynamically adjusts the influence of these constraints, guiding the diffusion latent toward an optimal balance. Extensive quantitative and qualitative experiments validate the effectiveness of our approach, demonstrating its superiority in achieving a robust balance between structure preservation and text alignment across various editing tasks, outperforming other state-of-the-art methods. The source code will be available at https://github.com/CUC-MIPG/UnifyEdit.