Group Relative Attention Guidance for Image Editing

Xuanpu Zhang, Xuesong Niu, Ruidong Chen, Dan Song, Jianhao Zeng, Penghui Du, Haoxiang Cao, Kai Wu, An-an Liu

2025-10-29

Summary

This paper focuses on controlling how much an image is changed when editing it with a Diffusion-in-Transformer (DiT) model, a relatively new type of image generation model.

What's the problem?

Current image editing tools built on DiT models struggle to give users precise control over *how much* an image is changed. You can tell the model *what* to change, but not easily control the strength or intensity of that change, so edits often come out either too subtle or too drastic.

What's the solution?

The researchers discovered a hidden pattern in the DiT model's attention mechanism: the Query and Key tokens share a common bias vector that captures the model's built-in idea of how to edit, while each token's deviation (delta) from that bias encodes the specific changes you want to make. They developed a technique called Group Relative Attention Guidance (GRAG) that reweights these deltas, adjusting how much attention the model pays to these 'change signals' and allowing smooth, fine-grained control over the editing process. It's a simple addition to existing editing methods, requiring only a few lines of code (sketched below).
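
The core operation can be sketched in just a few lines. The snippet below is an illustrative PyTorch sketch, not the authors' released code: the function name, the mean-over-tokens bias estimate, and the boolean group mask are assumptions made for clarity (the paper's repository holds the actual implementation).

```python
import torch

def grag_reweight(tokens: torch.Tensor, group_mask: torch.Tensor, scale: float) -> torch.Tensor:
    """Rescale each token's deviation (delta) from a shared bias vector.

    tokens:     (batch, num_tokens, dim) Query or Key tokens at one layer.
    group_mask: (num_tokens,) bool mask for the token group to reweight
                (e.g. the input-image tokens in MM-Attention).
    scale:      >1 amplifies that group's edit signal, <1 suppresses it;
                scale == 1 leaves the attention computation unchanged.
    """
    # Assumption: estimate the layer-dependent bias as the mean over tokens.
    bias = tokens.mean(dim=1, keepdim=True)            # (batch, 1, dim)
    delta = tokens - bias                              # content-specific signal
    weights = torch.ones(tokens.shape[1], dtype=tokens.dtype, device=tokens.device)
    weights[group_mask] = scale                        # per-token reweighting
    return bias + weights[None, :, None] * delta
```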

Why it matters?

This work matters because it makes image editing with these powerful DiT models more user-friendly and effective. GRAG gives users a dial for editing strength, with smoother and more precise control than the commonly used Classifier-Free Guidance, leading to higher-quality and more customized results.

Abstract

Recently, image editing based on Diffusion-in-Transformer models has undergone rapid development. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes the content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance (GRAG), a simple yet effective method that reweights the delta values of different tokens to modulate the focus of the model on the input image relative to the editing instruction, enabling continuous and fine-grained control over editing intensity without any tuning. Extensive experiments conducted on existing image editing frameworks demonstrate that GRAG can be integrated with as few as four lines of code, consistently enhancing editing quality. Moreover, compared to the commonly used Classifier-Free Guidance, GRAG achieves smoother and more precise control over the degree of editing. Our code will be released at https://github.com/little-misfit/GRAG-Image-Editing.
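
To make the integration concrete, here is a hedged usage sketch of where such a reweighting could sit inside a simplified, single-head MM-Attention forward pass, reusing the grag_reweight sketch above. The projection names (to_q, to_k, to_v), the image-token mask, and the example scale of 1.5 are all assumptions; the actual four-line integration may differ, so consult the linked repository.

```python
import torch.nn.functional as F

def mm_attention_with_grag(x, to_q, to_k, to_v, image_mask, grag_scale=1.5):
    """Hypothetical single-head MM-Attention forward with GRAG applied.

    x:          (batch, num_tokens, dim) concatenated text + image tokens.
    to_q/k/v:   the block's existing linear projections (assumed names).
    image_mask: (num_tokens,) bool mask marking the input-image tokens.
    grag_scale: editing-intensity knob; vary it for continuous control.
    """
    # Reweight Query and Key deltas; Value tokens are left untouched.
    q = grag_reweight(to_q(x), image_mask, grag_scale)
    k = grag_reweight(to_k(x), image_mask, grag_scale)
    v = to_v(x)
    # Standard scaled dot-product attention over the reweighted tokens.
    return F.scaled_dot_product_attention(q, k, v)
```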