
ProEdit: Inversion-based Editing From Prompts Done Right

Zhi Ouyang, Dian Zheng, Xiao-Ming Wu, Jian-Jian Jiang, Kun-Yu Lin, Jingke Meng, Wei-Shi Zheng

2025-12-29

Summary

This paper introduces a new method, ProEdit, for editing images and videos based on text instructions without needing to retrain any models. It focuses on improving how edits are made to ensure the changes actually reflect what the user wants, rather than just sticking too closely to the original image.

What's the problem?

Current image and video editing techniques that use 'inversion' – mapping an image back into a model's noise representation so it can be regenerated with edits – often struggle to make significant changes. During sampling, they rely too heavily on information from the original image. So if you ask for a different shirt color, the model may only partially change it, and if you ask it to add another person, the result may look unconvincing, because the process stays anchored to the original image's details.

What's the solution?

ProEdit tackles this problem in two main ways. First, it uses 'KV-mix', which blends the attention key/value features of the source image and the edit target within the region being edited, loosening the source image's grip on that region while keeping the background consistent. Second, it uses 'Latents-Shift', which perturbs the inverted source latent specifically in the edited area, further reducing the original image's influence and allowing more dramatic and accurate changes. In short, both pieces give the sampling process more freedom to create something new where an edit is requested, without disturbing the rest of the image.
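The two ideas can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the mixing ratio `alpha`, the perturbation `strength`, and the tensor shapes (tokens × feature dim for KV features, with a per-token binary edit mask) are all assumptions made for illustration.

```python
import numpy as np

def kv_mix(kv_src, kv_tgt, edit_mask, alpha=0.5):
    """KV-mix sketch: inside the edited region, blend source and target
    key/value features; outside it, keep pure source features so the
    background stays consistent. `alpha` (hypothetical) sets how much
    target information enters the edited region.

    kv_src, kv_tgt: (tokens, dim) arrays; edit_mask: (tokens,) of 0/1.
    """
    mixed = alpha * kv_tgt + (1.0 - alpha) * kv_src
    # Broadcast the per-token mask over the feature dimension.
    return np.where(edit_mask[:, None].astype(bool), mixed, kv_src)

def latents_shift(latent, edit_mask, strength=0.3, rng=None):
    """Latents-Shift sketch: add noise to the inverted source latent only
    inside the edited region, weakening its pull on sampling there while
    leaving the background latent untouched."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.standard_normal(latent.shape)
    return latent + strength * edit_mask[:, None] * noise
```

Usage: with a mask marking the first two of four tokens as edited, `kv_mix` returns blended features for those tokens and untouched source features for the rest, while `latents_shift` perturbs only the masked rows of the latent.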

Why it matters?

This research is important because it makes image and video editing much more effective and flexible. By allowing for more substantial and accurate edits without needing to retrain complex models, ProEdit opens up possibilities for more creative control and better results in applications like photo editing, video manipulation, and content creation. Plus, it’s designed to work with existing editing tools, making it easy to integrate and use.

Abstract

Inversion-based visual editing provides an effective and training-free way to edit an image or a video based on user instructions. Existing methods typically inject source image information during the sampling process to maintain editing consistency. However, this sampling strategy overly relies on source information, which negatively affects the edits in the target image (e.g., failing to change the subject's attributes like pose, number, or color as instructed). In this work, we propose ProEdit to address this issue in both the attention and the latent aspects. In the attention aspect, we introduce KV-mix, which mixes KV features of the source and the target in the edited region, mitigating the influence of the source image on the editing region while maintaining background consistency. In the latent aspect, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on the sampling. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves SOTA performance. In addition, our design is plug-and-play and can be seamlessly integrated into existing inversion and editing methods, such as RF-Solver, FireFlow and UniEdit.