Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models

Yoad Tewel, Rinon Gal, Dvir Samuel, Yuval Atzmon, Lior Wolf, Gal Chechik

2024-11-12

Summary

This paper introduces Add-it, a new method for adding objects to images based on text instructions without needing any additional training.

What's the problem?

Adding objects to images in a way that looks natural and fits well with the existing scene is a difficult task. Existing methods often struggle to find the right spot for new objects, especially in complex images, and they usually require extensive training to get good results.

What's the solution?

Add-it solves this problem by building on pretrained diffusion models with a technique called weighted extended-attention. This mechanism draws on three key sources of information: the original scene image, the text prompt describing what to add, and the newly generated image itself. By balancing these inputs, Add-it can seamlessly insert objects into images without any extra training, as sketched in the code below. The authors also created a new benchmark, the 'Additing Affordance Benchmark', to evaluate how plausibly objects are placed in images.
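The paper's full mechanism isn't reproduced here, but the core idea can be sketched in a few lines of PyTorch. Below is a minimal, illustrative version of a weighted extended-attention step: the generated image's queries attend jointly to keys and values from the scene image, the text prompt, and the generated image itself, with per-source weights balancing the three. The function name, tensor layout, and weights are assumptions for illustration, not the authors' exact implementation.

```python
import torch

def weighted_extended_attention(
    q_target: torch.Tensor,  # (B, H, N_tgt, d): queries from the generated image
    kv_target: tuple[torch.Tensor, torch.Tensor],  # (K, V) from the generated image itself
    kv_scene: tuple[torch.Tensor, torch.Tensor],   # (K, V) cached from the scene image
    kv_text: tuple[torch.Tensor, torch.Tensor],    # (K, V) from the text-prompt tokens
    w_scene: float = 1.0,    # hypothetical per-source balancing weights
    w_text: float = 1.0,
    w_target: float = 1.0,
) -> torch.Tensor:
    d = q_target.shape[-1]
    # Extend the attention context: keys/values from all three sources are
    # concatenated along the token axis, so the generated image attends to
    # the scene, the prompt, and itself in a single softmax.
    k = torch.cat([kv_scene[0], kv_text[0], kv_target[0]], dim=2)
    v = torch.cat([kv_scene[1], kv_text[1], kv_target[1]], dim=2)
    logits = q_target @ k.transpose(-2, -1) / d ** 0.5  # (B, H, N_tgt, N_all)

    # Apply per-source weights as log-offsets on the logits, which is
    # equivalent to scaling each source's attention mass before renormalizing.
    n_scene = kv_scene[0].shape[2]
    n_text = kv_text[0].shape[2]
    n_tgt = kv_target[0].shape[2]
    weights = torch.cat([
        torch.full((n_scene,), w_scene),
        torch.full((n_text,), w_text),
        torch.full((n_tgt,), w_target),
    ]).to(logits)
    logits = logits + weights.log()

    return torch.softmax(logits, dim=-1) @ v  # (B, H, N_tgt, d)
```

In the real method, the key/value tensors would come from the frozen diffusion model's attention layers, with the scene's keys and values captured during a denoising pass over the source image; here they are simply placeholders to show how the weighted balancing works.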

Why it matters?

This research is important because it allows for quick and effective image editing without the need for complex training processes. This can be very useful in fields like graphic design, gaming, and virtual reality, where adding realistic objects to images is often required. In human evaluations, the method was preferred over competing methods in more than 80% of cases, indicating its effectiveness.

Abstract

Adding objects into images based on text instructions is a challenging task in semantic image editing, requiring a balance between preserving the original scene and seamlessly integrating the new object in a fitting location. Despite extensive efforts, existing models often struggle with this balance, particularly with finding a natural location for adding an object in complex scenes. We introduce Add-it, a training-free approach that extends diffusion models' attention mechanisms to incorporate information from three key sources: the scene image, the text prompt, and the generated image itself. Our weighted extended-attention mechanism maintains structural consistency and fine details while ensuring natural object placement. Without task-specific fine-tuning, Add-it achieves state-of-the-art results on both real and generated image insertion benchmarks, including our newly constructed "Additing Affordance Benchmark" for evaluating object placement plausibility, outperforming supervised methods. Human evaluations show that Add-it is preferred in over 80% of cases, and it also demonstrates improvements in various automated metrics.