ConsistEdit: Highly Consistent and Precise Training-free Visual Editing

Zixin Yin, Ling-Hao Chen, Lionel Ni, Xili Dai

2025-10-21

Summary

This paper introduces a new method, ConsistEdit, for editing images and videos based on text prompts. It builds upon recent advancements in image generation technology to create a more reliable and precise editing process.

What's the problem?

Current text-guided image and video editing tools often struggle to make strong changes while still keeping the overall image consistent and realistic. When editing over multiple steps or regions, errors can build up. Also, many methods treat the entire image as one unit, making it hard to change specific details like texture without affecting other parts of the image.

What's the solution?

The researchers analyzed a new image generation architecture called MM-DiT and discovered key insights into how its attention mechanisms work. They then developed ConsistEdit, which carefully controls these attention mechanisms. It restricts attention control to the visual tokens alone, guides the editing process with masks, and manipulates the query, key, and value components of attention in different ways so that edits stay consistent with the original image while accurately reflecting the text prompt. Importantly, it works across all stages of image creation without needing manual adjustments.
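To make the idea concrete, here is a minimal NumPy sketch of the general pattern the paragraph describes: mask-guided blending of queries and keys between a source pass and an edit pass, with the values taken from the edit pass. All names (`edited_attention`, the `mask` convention) are hypothetical illustrations, not the paper's actual implementation, and real MM-DiT attention involves joint text-vision tokens, multiple heads, and many layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def edited_attention(q_src, k_src, v_src, q_edit, k_edit, v_edit, mask):
    """Conceptual sketch (not the paper's code): outside the editable
    region (mask == 0), reuse the source pass's queries and keys to
    preserve structure; inside it (mask == 1), use the edit pass's.
    Values come from the edit pass so the prompt change is realized.

    q_*, k_*, v_*: (num_tokens, dim) arrays from two diffusion passes.
    mask: (num_tokens,) array with 1 marking the editable region.
    """
    m = mask[:, None]  # broadcast over the feature dimension
    q = m * q_edit + (1 - m) * q_src  # pre-attention fusion of queries
    k = m * k_edit + (1 - m) * k_src  # pre-attention fusion of keys
    v = v_edit                        # values follow the edit prompt
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))  # standard scaled dot-product
    return attn @ v
```

With the mask all ones this reduces to ordinary attention on the edit pass; with it all zeros, the attention map is computed entirely from the source pass's structure, illustrating how a single knob can trade off structural consistency against editing strength.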

Why it matters?

ConsistEdit represents a significant step forward in image and video editing because it's more reliable, consistent, and allows for finer control than previous methods. It can handle complex edits over multiple steps and regions, and it allows users to adjust how much the structure of the image is changed, opening up possibilities for more creative and precise image manipulation.

Abstract

Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing capabilities for existing generation models. However, current approaches struggle to simultaneously deliver strong editing strength while preserving consistency with the source. This limitation becomes particularly critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating text and vision modalities. These advancements pave the way for overcoming challenges that previous methods failed to resolve. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanisms. Building on these, we propose ConsistEdit, a novel attention control method specifically tailored for MM-DiT. ConsistEdit incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without handcrafted adjustments, significantly enhancing reliability and consistency, which enables robust multi-round and multi-region editing. Furthermore, it supports progressive adjustment of structural consistency, enabling finer control.