
Make It Count: Text-to-Image Generation with an Accurate Number of Objects

Lital Binyamin, Yoad Tewel, Hilit Segev, Eran Hirsch, Royi Rassin, Gal Chechik

2024-06-17


Summary

This paper presents a new approach called CountGen, which improves text-to-image generation by accurately controlling the number of objects in the generated images. It addresses the challenge of ensuring that the right number of objects is depicted based on the text prompts provided.

What's the problem?

Text-to-image models often struggle to generate the correct number of objects in an image. This is particularly important for tasks like creating illustrations for children's books or technical documents, where having the right number of items is crucial. The challenge arises because these models need to recognize and differentiate between multiple identical objects, even when they overlap, which can complicate the generation process.

What's the solution?

To solve this problem, the authors developed CountGen, which identifies features within the diffusion model that carry each object instance's identity. During the denoising process, CountGen uses these features to separate and count object instances and checks whether the count matches the prompt. If objects are missing, a trained model predicts the shape and location of each missing object from the layout of the existing ones, and that completed layout is then used to guide the rest of the generation. Because the layout comes from the diffusion model itself rather than an external source, it depends on both the text prompt and the random seed, which improves counting accuracy.
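To make the counting step more concrete, here is a minimal, hypothetical sketch of the general idea: cluster per-pixel features that encode instance identity and count the resulting clusters. This is not the paper's actual implementation; the feature extraction from the UNet, the DBSCAN parameters, and the toy inputs are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def count_instances(features: np.ndarray, foreground: np.ndarray, eps: float = 0.5) -> int:
    """Count object instances by clustering per-pixel identity features.

    features:   (H, W, C) per-pixel features assumed to be pulled from a
                UNet layer mid-denoising (extraction not shown; hypothetical).
    foreground: (H, W) boolean mask of pixels belonging to the object class.
    """
    pixel_feats = features[foreground]            # (N, C) foreground features
    if pixel_feats.shape[0] == 0:
        return 0
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(pixel_feats)
    # DBSCAN marks noise as -1; count the distinct non-noise clusters.
    return len(set(labels) - {-1})

# Toy check: two well-separated groups of foreground features -> count of 2.
H, W, C = 8, 8, 4
feats = np.zeros((H, W, C))
fg = np.zeros((H, W), dtype=bool)
fg[0:3, 0:3] = True                 # pixels of instance 1
fg[5:8, 5:8] = True                 # pixels of instance 2
feats[0:3, 0:3] = [1, 0, 0, 0]      # identity feature of instance 1
feats[5:8, 5:8] = [0, 1, 0, 0]      # identity feature of instance 2
print(count_instances(feats, fg))   # prints 2
```

In a real pipeline the separation would come from the specific diffusion features the authors identify, not from hand-crafted vectors as in this toy example.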

Why it matters?

This research is significant because it enhances the ability of AI models to generate images that meet specific requirements, such as having an exact number of objects. By improving how these models work, CountGen can be applied in various fields like education, publishing, and content creation, making AI-generated images more useful and reliable for real-world applications.

Abstract

Despite the unprecedented success of text-to-image diffusion models, controlling the number of depicted objects using text is surprisingly hard. This is important for various applications, from technical documents to children's books to illustrating cooking recipes. Generating correct object counts is fundamentally challenging because the generative model needs to keep a sense of separate identity for every instance of the object, even if several objects look identical or overlap, and then carry out a global computation implicitly during generation. It is still unknown if such representations exist. To address count-correct generation, we first identify features within the diffusion model that can carry the object identity information. We then use them to separate and count instances of objects during the denoising process and detect over-generation and under-generation. We fix the latter by training a model that predicts both the shape and location of a missing object, based on the layout of existing ones, and show how it can be used to guide denoising with correct object count. Our approach, CountGen, does not depend on an external source to determine object layout, but rather uses the prior from the diffusion model itself, creating prompt-dependent and seed-dependent layouts. Evaluated on two benchmark datasets, we find that CountGen strongly outperforms existing baselines in count accuracy.
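As a rough illustration of the layout-guided denoising described above (not the authors' implementation), the following hypothetical PyTorch snippet shows how a loss between an object's attention map and a target layout mask, which includes the predicted missing instance, could be used to nudge the latent during denoising. The tensor shapes, the way the attention map is derived from the latent, and the step size are invented for illustration.

```python
import torch
import torch.nn.functional as F

def layout_guidance_loss(attn_map: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
    """Penalize mismatch between an object's attention map and a target layout mask.

    attn_map:    (H, W) map for the object token, differentiable w.r.t. the latent.
    target_mask: (H, W) binary mask of where instances should appear,
                 including the predicted missing one.
    Both inputs are hypothetical stand-ins for the quantities CountGen uses.
    """
    return F.mse_loss(attn_map, target_mask)

# Minimal illustration of one guidance step on a latent (toy shapes).
latent = torch.randn(1, 4, 8, 8, requires_grad=True)
# Pretend the object's attention map is a simple function of the latent;
# in a real pipeline it would come from the UNet's attention layers.
attn_map = latent.mean(dim=1).squeeze(0).sigmoid()
target_mask = torch.zeros(8, 8)
target_mask[2:5, 2:5] = 1.0            # region where the missing instance should go
loss = layout_guidance_loss(attn_map, target_mask)
grad, = torch.autograd.grad(loss, latent)
latent_guided = latent - 0.5 * grad    # step the latent toward the target layout
```

The key design point this sketch mirrors is that the guidance signal comes from a layout derived inside the diffusion process itself, so it can stay consistent with the prompt and the random seed rather than being imposed from outside.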