
Iterative Object Count Optimization for Text-to-image Diffusion Models

Oz Zafar, Lior Wolf, Idan Schwartz

2024-08-22


Summary

This paper discusses a new approach called Iterative Object Count Optimization that helps text-to-image models accurately generate a specific number of objects in images.

What's the problem?

Text-to-image models often struggle to create the exact number of objects specified in a prompt because they learn from examples that don’t cover every possible count. This can lead to images that don’t match what users expect, especially when the number of objects is important.

What's the solution?

The authors propose optimizing the generated image with a counting loss derived from a specialized counting model. They introduce an iterated online training mode that adjusts the text-conditioning embedding of the prompt and dynamically tunes the counting model's hyperparameters. This helps the model better represent how many objects to include, leading to more accurate image generation.
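The core idea can be sketched in miniature: repeatedly generate, score the result with a counting loss, and update a token embedding by gradient descent. The sketch below is a toy illustration under stated assumptions, not the paper's implementation; `toy_count` and the sigmoid "potential map" are hypothetical stand-ins for the real diffusion model and counting model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def toy_count(token, scale=1.0):
    # Stand-in counting model: aggregates per-location "object potential"
    # into a scalar count. `scale` plays the role of the viewpoint-dependent
    # aggregation hyperparameter mentioned in the abstract.
    return scale * sigmoid(token).sum()

target = 5.0               # desired number of objects from the prompt
token = np.zeros(16)       # toy "counting token" embedding being optimized
lr = 0.05

for step in range(500):
    p = sigmoid(token)
    count = p.sum()
    # Counting loss: squared error between predicted and requested count.
    # Gradient of (count - target)^2 w.r.t. the token, by the chain rule.
    grad = 2.0 * (count - target) * p * (1.0 - p)
    token -= lr * grad

print(round(toy_count(token), 2))  # → 5.0
```

The optimized `token` can then be reused without re-running the loop, which mirrors the paper's point that the counting token, once optimized, generates accurate counts with no further optimization.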

Why it matters?

This research is important because it enhances the ability of AI models to create images that meet specific requirements, which can be useful in various applications like art generation, advertising, and education. By improving how these models count and represent objects, it can lead to better user experiences and more reliable results.

Abstract

We address a persistent challenge in text-to-image models: accurately generating a specified number of objects. Current models, which learn from image-text pairs, inherently struggle with counting, as training data cannot depict every possible number of objects for any given object. To solve this, we propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential. Employing an out-of-the-box counting model is challenging for two reasons: first, the model requires a scaling hyperparameter for the potential aggregation that varies depending on the viewpoint of the objects, and second, classifier guidance techniques require modified models that operate on noisy intermediate diffusion steps. To address these challenges, we propose an iterated online training mode that improves the accuracy of inferred images while altering the text conditioning embedding and dynamically adjusting hyperparameters. Our method offers three key advantages: (i) it can consider non-derivable counting techniques based on detection models, (ii) it is a zero-shot plug-and-play solution facilitating rapid changes to the counting techniques and image generation methods, and (iii) the optimized counting token can be reused to generate accurate images without additional optimization. We evaluate the generation of various objects and show significant improvements in accuracy. The project page is available at https://ozzafar.github.io/count_token.