
GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation

Phillip Y. Lee, Taehoon Yoon, Minhyuk Sung

2024-10-29


Summary

This paper introduces GrounDiT, a training-free method that improves text-to-image generation by using a technique called noisy patch transplantation to control where objects appear in an image, based on user-defined bounding boxes.

What's the problem?

Creating images from text descriptions is challenging when users want specific objects placed in designated regions of the image. Prior methods that offer this kind of control often struggle to position objects accurately within their bounding boxes, producing images that do not match the user's intended layout. Training-free approaches typically steer generation by backpropagating custom loss functions through the noisy image, which gives only coarse control over individual boxes, while training-based alternatives require costly fine-tuning.

What's the solution?

GrounDiT offers a training-free approach to spatial grounding in image generation. It exploits the flexibility of the Diffusion Transformer (DiT) architecture to generate a noisy patch for each bounding box that fully encodes its target object, giving fine-grained control over where each object appears. The method relies on a property the authors call 'semantic sharing': when a smaller patch is jointly denoised alongside a generatable-size image, the two become 'semantic clones'. Each patch is denoised in its own branch and then transplanted into its bounding-box region of the noisy image at every timestep (see the sketch below), so each object is placed exactly where the user specified.
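To make the control flow concrete, here is a minimal Python sketch of the per-timestep transplantation loop. It is an illustration under stated assumptions, not the authors' implementation: `denoise_step` is a stand-in for one DiT reverse-diffusion step, and the box coordinates, latent sizes, and function names are hypothetical.

```python
# Hedged sketch of noisy patch transplantation (not the authors' code).
import torch

def denoise_step(latent: torch.Tensor, prompt: str, t: int) -> torch.Tensor:
    """Placeholder for one reverse-diffusion step of a DiT model."""
    return latent - 0.01 * torch.randn_like(latent)  # stand-in update, ignores prompt/t

def grounded_generation(prompt, boxes, phrases, steps=50, size=64):
    """boxes: list of (x0, y0, x1, y1) in latent coordinates; phrases: one text per box."""
    latent = torch.randn(1, 4, size, size)            # global noisy image latent
    patches = [torch.randn(1, 4, y1 - y0, x1 - x0)    # one noisy patch per bounding box
               for (x0, y0, x1, y1) in boxes]

    for t in reversed(range(steps)):
        # 1. Denoise the global latent with the full prompt.
        latent = denoise_step(latent, prompt, t)

        # 2. Denoise each patch in its own branch, conditioned on its phrase.
        #    (In GrounDiT this branch is denoised jointly with a generatable-size
        #    image so the two become "semantic clones"; that joint step is omitted here.)
        for i, ((x0, y0, x1, y1), phrase) in enumerate(zip(boxes, phrases)):
            patches[i] = denoise_step(patches[i], phrase, t)

            # 3. Transplant the denoised patch into its bounding-box region.
            latent[:, :, y0:y1, x0:x1] = patches[i]

    return latent

# Example usage with two hypothetical boxes in a 64x64 latent.
out = grounded_generation("a cat next to a dog", [(0, 0, 32, 32), (32, 32, 64, 64)],
                          ["a cat", "a dog"])
```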

Why it matters?

This research matters because it gives users finer control over AI-generated images. More precise object placement can benefit applications such as graphic design, video game development, and virtual reality, where exact visual layouts are crucial.

Abstract

We introduce a novel training-free spatial grounding technique for text-to-image generation using Diffusion Transformers (DiT). Spatial grounding with bounding boxes has gained attention for its simplicity and versatility, allowing for enhanced user control in image generation. However, prior training-free approaches often rely on updating the noisy image during the reverse diffusion process via backpropagation from custom loss functions, which frequently struggle to provide precise control over individual bounding boxes. In this work, we leverage the flexibility of the Transformer architecture, demonstrating that DiT can generate noisy patches corresponding to each bounding box, fully encoding the target object and allowing for fine-grained control over each region. Our approach builds on an intriguing property of DiT, which we refer to as semantic sharing. Due to semantic sharing, when a smaller patch is jointly denoised alongside a generatable-size image, the two become "semantic clones". Each patch is denoised in its own branch of the generation process and then transplanted into the corresponding region of the original noisy image at each timestep, resulting in robust spatial grounding for each bounding box. In our experiments on the HRS and DrawBench benchmarks, we achieve state-of-the-art performance compared to previous training-free spatial grounding approaches.
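As a rough illustration of the semantic-sharing idea described above, the sketch below jointly processes the tokens of a small patch and a generatable-size latent in a single shared Transformer pass, so the patch can attend to the full image. This is an assumption about what joint denoising could look like in a token-based DiT; `patch_embed`, `dit_blocks`, and all shapes are hypothetical and not taken from the paper.

```python
# Hedged sketch of "semantic sharing" via a shared token sequence (illustrative only).
import torch
import torch.nn as nn

embed_dim, token_size = 256, 2
patch_embed = nn.Conv2d(4, embed_dim, kernel_size=token_size, stride=token_size)
dit_blocks = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=2,
)

def joint_denoise(full_latent: torch.Tensor, patch_latent: torch.Tensor):
    """Tokenize both latents and run them through shared Transformer blocks,
    so the small patch attends to the full image and aligns with it (a 'semantic clone')."""
    full_tokens = patch_embed(full_latent).flatten(2).transpose(1, 2)    # (1, N, C)
    small_tokens = patch_embed(patch_latent).flatten(2).transpose(1, 2)  # (1, M, C)
    tokens = torch.cat([full_tokens, small_tokens], dim=1)               # one shared sequence
    out = dit_blocks(tokens)
    n = full_tokens.shape[1]
    return out[:, :n], out[:, n:]   # outputs for the image branch and the patch branch

full = torch.randn(1, 4, 32, 32)    # generatable-size noisy latent
patch = torch.randn(1, 4, 8, 8)     # smaller noisy patch for one bounding box
img_tokens, patch_tokens = joint_denoise(full, patch)
```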