
When Do Diffusion Models Learn to Generate Multiple Objects?

Yujin Jeong, Arnas Uselis, Iro Laina, Seong Joon Oh, Anna Rohrbach

2026-05-04

Summary

This paper investigates why text-to-image AI models, while good at creating realistic images, often struggle when asked to create scenes with multiple objects and specific relationships between them.

What's the problem?

Current AI image generators are unreliable when creating images that contain many objects, and it's not clear *why* they fail: is it because the training data doesn't include enough examples of every object, or is it something more fundamental about how these models learn to compose a scene? Specifically, the paper asks whether models struggle with generating individual objects they have seen during training, possibly under imbalanced data (concept generalization), or with combining objects in new ways they haven't seen before (compositional generalization).

What's the solution?

The researchers created a controlled dataset framework called 'mosaic' (Multi-Object Spatial relations, AttrIbution, Counting), which gives them complete control over the objects, their attributes, and how they're arranged in a scene. They then trained diffusion-based image generators on this data, systematically varying the amount of data and which concept combinations the model saw during training. This let them isolate whether the difficulty came from the complexity of the scene, from imbalance among objects in the data, or from the challenge of accurately 'counting' objects. They found that scene complexity, not concept imbalance, was the dominant hurdle, that counting was uniquely hard to learn with little data, and that the ability to combine concepts breaks down as more combinations are withheld from training.
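The compositional-generalization setup described above can be sketched as a toy data split: every individual concept still appears in training, but some concept *combinations* are held out and seen only at test time. This is a minimal illustrative sketch; the names (`SHAPES`, `COLORS`, `compositional_split`) and parameters are assumptions for illustration, not the paper's actual mosaic implementation.

```python
import itertools
import random

# Illustrative concept vocabularies (assumed, not from the paper).
SHAPES = ["circle", "square", "triangle", "star"]
COLORS = ["red", "green", "blue", "yellow"]

def compositional_split(holdout_fraction, seed=0):
    """Hold out a fraction of (shape, color) combinations from training.

    The held-out combinations appear only at test time, so success on
    them requires composing concepts in ways never seen during training.
    """
    combos = list(itertools.product(SHAPES, COLORS))
    rng = random.Random(seed)
    rng.shuffle(combos)
    n_holdout = int(len(combos) * holdout_fraction)
    held_out = set(combos[:n_holdout])
    train = [c for c in combos if c not in held_out]
    return train, sorted(held_out)

train, held_out = compositional_split(holdout_fraction=0.25)
# Training never contains a held-out combination.
assert not set(train) & set(held_out)
```

Note that with a small random holdout each individual shape and color usually still appears in training, but as the holdout fraction grows, whole concepts can vanish; in the paper's terms, increasing the number of withheld combinations is exactly the knob that makes compositional generalization collapse.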

Why it matters?

This research shows that current AI image generators face fundamental limits when handling complex scenes. It suggests that simply adding more data isn't enough: models need stronger built-in assumptions (inductive biases) and better-designed training data to reliably create images with multiple objects and specific relationships between them.

Abstract

Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.