LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence
Zixin Yin, Xili Dai, Duomin Wang, Xianfang Zeng, Lionel M. Ni, Gang Yu, Heung-Yeung Shum
2025-09-16

Summary
This paper introduces LazyDrag, a method for drag-based image editing with diffusion models that makes edits both more accurate and more capable.
What's the problem?
Current drag-based editing techniques rely on the model *guessing* which parts of the image correspond to your drag, using its internal 'attention'. This implicit matching is unreliable, so existing methods weaken the editing process to compensate and run costly extra optimization at use time to get acceptable results. That limits what you can realistically edit and how good it looks, especially when you want to create new content inside the image or make complex changes.
What's the solution?
LazyDrag solves this by building an explicit 'map' that spells out the connection between where you drag and how the image should change. Instead of the model trying to figure it out, LazyDrag *tells* it exactly what moves where. This allows stronger, more precise edits without extra optimization at use time, and it lets the model keep its full generative abilities, for example to fill in newly revealed areas or add objects from a text prompt. It also supports edits that combine moving and resizing, and multiple edits in a row.
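The paper's exact construction is not detailed in this summary, but as a rough illustration, the following minimal sketch (in NumPy, with hypothetical names such as `build_correspondence_map`, `mask`, `handle`, and `target`) shows what such an explicit map could look like: every edited latent position is told which source position to copy from, and the vacated area is flagged for inpainting, so nothing is left for attention to guess.

```python
import numpy as np

def build_correspondence_map(mask, handle, target, scale=1.0):
    """Illustrative sketch: map each edited latent position back to a source position.

    mask:   (H, W) bool array marking the draggable region (hypothetical input).
    handle: (y, x) drag start point in latent coordinates.
    target: (y, x) drag end point.
    scale:  optional uniform scale applied around the handle point.

    Returns an (H, W, 2) array of source coordinates, with NaN where the
    edited image has no counterpart in the source (i.e. must be inpainted).
    """
    H, W = mask.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
    corr = np.stack([ys, xs], axis=-1)          # identity: unedited positions copy themselves

    # Positions covered by the moved (and optionally scaled) region pull their
    # content from the original region: invert the move-and-scale transform.
    src_y = (ys - target[0]) / scale + handle[0]
    src_x = (xs - target[1]) / scale + handle[1]
    inside = (src_y >= 0) & (src_y < H) & (src_x >= 0) & (src_x < W)
    moved = inside & mask[np.clip(src_y, 0, H - 1).astype(int),
                          np.clip(src_x, 0, W - 1).astype(int)]
    corr[moved] = np.stack([src_y[moved], src_x[moved]], axis=-1)

    # The vacated area (old location, no longer covered) has no source: mark for inpainting.
    vacated = mask & ~moved
    corr[vacated] = np.nan
    return corr
```

Unedited positions keep an identity mapping, while NaN entries mark the region the model must fill in generatively; a real implementation would operate at the model's latent resolution and handle overlaps between the moved and vacated regions more carefully.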
Why does it matter?
LazyDrag matters because it significantly improves both the quality and the complexity of edits possible with diffusion models. It opens the door to more natural and intuitive image editing, letting users make detailed changes, such as adding objects or altering features, with greater control and realism. It also sets a new state of the art for this kind of editing on DragBench and suggests a new direction for future editing tools.
Abstract
The reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, forcing a fundamental compromise: weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits the generative capabilities of diffusion models, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. Concretely, our method generates an explicit correspondence map from the user's drag inputs and uses it as a reliable reference to strengthen attention control. This reliable reference enables a stable, full-strength inversion process, a first for the drag-based editing task; it obviates the need for TTO and unlocks the generative capability of the model. As a result, LazyDrag naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening a dog's mouth and inpainting its interior, generating new objects such as a "tennis ball", or, for ambiguous drags, making context-aware changes like moving a hand into a pocket. LazyDrag also supports multi-round workflows with simultaneous move and scale operations. Evaluated on DragBench and validated by VIEScore and human evaluation, our method outperforms baselines in drag accuracy and perceptual quality. LazyDrag not only establishes new state-of-the-art performance, but also points toward a new editing paradigm.
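The abstract describes the correspondence map as a reliable reference that boosts attention control, without giving implementation details. The sketch below (PyTorch, with hypothetical names like `guide_attention_with_correspondence`, `corr_idx`, and `valid`) illustrates one plausible way such guidance could enter an attention layer: keys and values cached from the full-strength inversion of the source image are attended to with an added bonus at the explicitly matched token, so each edited position looks where the map points rather than where implicit matching happens to land.

```python
import torch

def guide_attention_with_correspondence(q_edit, k_src, v_src, corr_idx, valid, blend=1.0):
    """Speculative sketch of attention guidance from an explicit correspondence map.

    q_edit:   (N, d) queries from the image being edited (N latent tokens).
    k_src:    (N, d) keys cached from the full-strength inversion of the source image.
    v_src:    (N, d) values cached from the same inversion.
    corr_idx: (N,) long tensor; corr_idx[i] is the source token position i should draw from
              (a hypothetical flattened form of the map sketched earlier).
    valid:    (N,) bool tensor; False where the map says "no source, inpaint here".
    blend:    how strongly to trust the explicit match over vanilla attention.
    """
    d = q_edit.shape[-1]
    logits = q_edit @ k_src.T / d ** 0.5                 # standard cross-attention scores

    # Explicit guidance: add a large bonus at the matched source token so the
    # edited position attends where the correspondence map points.
    bonus = torch.zeros_like(logits)
    rows = torch.arange(len(corr_idx))[valid]
    bonus[rows, corr_idx[valid]] = 10.0 * blend
    attn = torch.softmax(logits + bonus, dim=-1)
    return attn @ v_src                                  # unmatched positions fall back to plain attention
```

This is only one way to realize "attention control from a reliable reference"; the paper's actual mechanism inside the Multi-Modal Diffusion Transformer may differ.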