SpotEdit: Selective Region Editing in Diffusion Transformers

Zhibin Qin, Zhenxiong Tan, Zeqing Wang, Songhua Liu, Xinchao Wang

2025-12-30

Summary

This paper introduces SpotEdit, a training-free method for editing images with diffusion transformers that updates only the regions being edited, making the process more efficient while preserving the quality of untouched areas.

What's the problem?

Current image editing techniques using diffusion transformers rework the entire image at every denoising step, even when only a small part needs changing. This is wasteful because it spends processing power on areas that are already correct, and it can even subtly degrade the parts of the picture that were supposed to stay the same. The core question is whether it is really necessary to regenerate every region of an image during editing.

What's the solution?

SpotEdit tackles this by only updating the regions of the image that are actually being edited. It does this with two components. First, SpotSelector identifies the stable, unchanged parts of the image by measuring how perceptually similar they are to the original, and reuses the original image's features for those areas, skipping their computation entirely. Second, SpotFusion adaptively blends the edited parts with the reused stable features, so the final result looks natural and coherent, like a single, seamless picture.
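To make the idea concrete, here is a minimal sketch of selective token updating. This is not the paper's implementation: the function names, the use of cosine similarity as a stand-in for perceptual similarity, and the threshold value are all illustrative assumptions.

```python
import numpy as np

def spot_select(cond_feats, current_feats, threshold=0.9):
    """Hypothetical SpotSelector-style check: mark tokens whose features
    are close to the conditional (original) image as 'stable'.

    cond_feats, current_feats: (num_tokens, dim) arrays.
    Returns a boolean mask; True = stable, safe to skip recomputation.
    """
    # Cosine similarity per token (a stand-in for perceptual similarity).
    num = (cond_feats * current_feats).sum(axis=1)
    denom = (np.linalg.norm(cond_feats, axis=1)
             * np.linalg.norm(current_feats, axis=1) + 1e-8)
    return (num / denom) >= threshold

def selective_update(cond_feats, current_feats, update_fn, threshold=0.9):
    """Run the expensive update only on unstable (edited) tokens;
    reuse the cached conditional features for stable ones."""
    stable = spot_select(cond_feats, current_feats, threshold)
    out = cond_feats.copy()              # reuse cached features where stable
    if (~stable).any():
        out[~stable] = update_fn(current_feats[~stable])
    return out, stable
```

The efficiency win comes from `update_fn` (standing in for a transformer forward pass) seeing only the edited tokens instead of the full token sequence.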

Why it matters?

SpotEdit is important because it makes image editing with diffusion models much faster and more efficient. By focusing only on the areas that need to be changed, it saves computing resources and also helps maintain the quality of the parts of the image that weren't meant to be altered, leading to better and more precise edits.

Abstract

Diffusion Transformer models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.
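The abstract's "dynamic fusion mechanism" can be sketched as a per-token soft blend between reused conditional features and freshly edited tokens, weighted by similarity. The function name and the `lo`/`hi` thresholds below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def spot_fusion(cond_feats, edited_feats, sim, lo=0.5, hi=0.95):
    """Hypothetical SpotFusion-style blend.

    cond_feats, edited_feats: (num_tokens, dim) arrays.
    sim: (num_tokens,) per-token similarity scores in [0, 1].
    Highly similar tokens lean on the conditional features; dissimilar
    tokens lean on the edited ones; in between, blend smoothly so the
    boundary between edited and reused regions stays coherent.
    """
    w = np.clip((sim - lo) / (hi - lo), 0.0, 1.0)[:, None]
    return w * cond_feats + (1.0 - w) * edited_feats
```

A smooth per-token weight, rather than a hard copy/paste mask, is one plausible way to avoid visible seams between the regenerated and reused regions.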