How far can we go with ImageNet for Text-to-Image generation?

L. Degeorge, A. Ghosh, N. Dufour, D. Picard, V. Kalogeiton

2025-03-03

Summary

This paper presents a way to train AI models that turn text descriptions into images using a much smaller dataset, ImageNet, instead of the billion-scale web datasets that are typically used.

What's the problem?

Current text-to-image AI models need enormous amounts of data and computing power to work well. They use billions of images scraped from the internet, which can lead to issues with data quality, copyright, and inappropriate content. This approach is not sustainable and limits who can develop these technologies.

What's the solution?

The researchers show that ImageNet, a well-curated dataset of 1.2 million images, can be made competitive with far larger collections. They did this by enhancing both the images and the text descriptions: they replaced the short class labels with detailed captions and applied image augmentation techniques that combine and modify images so the model learns more from each one. The resulting model matches or beats models trained on vastly larger datasets, while using a fraction of the parameters and training images.
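To make the idea concrete, here is a minimal toy sketch of the two ingredients described above: enriching a bare class label into a detailed caption, and stochastically augmenting the paired image. Everything here is illustrative — the tiny 2D grids standing in for images, the `recaption` stub (the paper uses learned dense captions), and the choice of a horizontal flip as the augmentation are all assumptions, not the authors' actual pipeline.

```python
import random

# Hypothetical stand-ins for ImageNet samples: each entry pairs an
# "image" (a small 2D grid of pixel values, purely for illustration)
# with its original one-word class label.
dataset = [
    {"image": [[1, 2], [3, 4]], "label": "goldfish"},
    {"image": [[5, 6], [7, 8]], "label": "tabby cat"},
]

def recaption(label):
    """Stub for a captioning model: expand a bare class label into a
    richer text description. (The paper relies on far more detailed,
    model-generated captions.)"""
    return f"A photo of a {label}, shown in natural surroundings."

def hflip(image):
    """Horizontal flip: one simple example of an image augmentation."""
    return [row[::-1] for row in image]

def augment(sample, rng):
    """Build one training pair: maybe-flipped image + enriched caption."""
    image = sample["image"]
    if rng.random() < 0.5:  # apply the flip stochastically
        image = hflip(image)
    return {"image": image, "caption": recaption(sample["label"])}

rng = random.Random(0)  # seeded for reproducibility
augmented = [augment(s, rng) for s in dataset]
print(augmented[0]["caption"])
```

The point of the sketch is only the shape of the data flow: each small sample yields many distinct (image, caption) training pairs, which is how a 1.2M-image dataset can substitute for billions of scraped pairs.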

Why it matters?

This matters because it shows that we don't always need massive amounts of data to create good AI models. By using smarter techniques with smaller, higher-quality datasets, we can make AI development more accessible and environmentally friendly. It could lead to faster progress in AI research and allow more people to participate in developing these technologies without needing huge resources.

Abstract

Recent text-to-image (T2I) generation models have achieved remarkable results by training on billion-scale datasets, following a 'bigger is better' paradigm that prioritizes data quantity over quality. We challenge this established paradigm by demonstrating that strategic data augmentation of small, well-curated datasets can match or outperform models trained on massive web-scraped collections. Using only ImageNet enhanced with well-designed text and image augmentations, we achieve a +2 overall score over SD-XL on GenEval and +5 on DPGBench while using just 1/10th the parameters and 1/1000th the training images. Our results suggest that strategic data augmentation, rather than massive datasets, could offer a more sustainable path forward for T2I generation.