Toward the Frontiers of Reliable Diffusion Sampling via Adversarial Sinkhorn Attention Guidance
Kwanyoung Kim
2025-11-13
Summary
This paper introduces a new technique called Adversarial Sinkhorn Attention Guidance, or ASAG, to improve the quality of images generated by diffusion models, which are a type of AI that creates images from text or other inputs.
What's the problem?
Current methods for improving diffusion model outputs, like classifier-free guidance, work by intentionally making the model *worse* at generating images without any specific instructions. This is done with heuristic tricks, such as mixing in the identity or blurring the conditioning signal, but there isn't a solid, principled reason why these tricks work, and they're typically designed by trial and error. Essentially, they're hacking the system instead of understanding and improving it.
What's the solution?
ASAG takes a different approach by focusing on how the model *pays attention* to different parts of the image during creation. It uses a mathematical concept called optimal transport to deliberately weaken misleading connections within the model's attention mechanism. By subtly disrupting how the model relates different image elements, ASAG encourages it to create more realistic and coherent images, both when given specific instructions and when generating images freely. It does this by adding a cost to the attention process, making the model less likely to focus on irrelevant details.
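The core idea can be sketched in code: treat the query-key similarity as a (negative) transport cost, add an extra cost that penalizes the strongest alignments, and run a few Sinkhorn iterations to turn the result into a normalized attention plan. This is a minimal illustrative sketch; the function name, the exact form of the adversarial cost, and all hyperparameters (`eps`, `adv_scale`, `n_iters`) are assumptions, not the paper's precise formulation.

```python
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log-sum-exp along one axis.
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def sinkhorn_attention(Q, K, V, n_iters=20, eps=0.05, adv_scale=0.1):
    """Sketch of Sinkhorn-normalized self-attention with an adversarial cost.

    The adversarial term below (inflating the cost of the most similar
    query-key pairs) is an illustrative assumption, not ASAG's exact cost.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # standard attention logits
    cost = -scores                           # logits viewed as negative transport cost
    # Adversarial perturbation (assumption): raise the cost of the strongest
    # alignments so that misleading high-similarity links are weakened.
    cost = cost + adv_scale * np.maximum(scores, 0.0)
    # Entropic-regularized Sinkhorn iterations in log space, uniform marginals.
    log_T = -cost / eps
    u = np.zeros(log_T.shape[0])
    v = np.zeros(log_T.shape[1])
    for _ in range(n_iters):
        u = -_logsumexp(log_T + v[None, :], axis=1)
        v = -_logsumexp(log_T + u[:, None], axis=0)
    P = np.exp(log_T + u[:, None] + v[None, :])  # approximately doubly stochastic plan
    P = P / P.sum(axis=1, keepdims=True)         # exact row-stochastic attention weights
    return P @ V
```

Because the Sinkhorn plan is nonnegative with rows summing to one, each output token remains a convex combination of the value vectors, so the perturbation reshapes *where* attention mass goes rather than breaking the attention mechanism outright.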
Why it matters?
This method is important because it provides a more principled and reliable way to improve diffusion models. It's easy to add to existing models without needing to retrain them, and it consistently leads to better image quality, more control over the generation process, and improved performance in applications like IP-Adapter and ControlNet, which allow for even more specific image editing.
Abstract
Diffusion models have demonstrated strong generative performance when using guidance methods such as classifier-free guidance (CFG), which enhance output quality by modifying the sampling trajectory. These methods typically improve a target output by intentionally degrading another, often the unconditional output, using heuristic perturbation functions such as identity mixing or blurred conditions. However, these approaches lack a principled foundation and rely on manually designed distortions. In this work, we propose Adversarial Sinkhorn Attention Guidance (ASAG), a novel method that reinterprets attention scores in diffusion models through the lens of optimal transport and intentionally disrupts the transport cost via the Sinkhorn algorithm. Instead of naively corrupting the attention mechanism, ASAG injects an adversarial cost within self-attention layers to reduce pixel-wise similarity between queries and keys. This deliberate degradation weakens misleading attention alignments and leads to improved conditional and unconditional sample quality. ASAG shows consistent improvements in text-to-image diffusion, and enhances controllability and fidelity in downstream applications such as IP-Adapter and ControlNet. The method is lightweight, plug-and-play, and improves reliability without requiring any model retraining.