TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, Lijuan Wang, Min Li
2025-02-13
Summary
This paper introduces TextAtlas5M, a new dataset created to help AI models get better at generating images that contain a lot of text, such as ads or infographics.
What's the problem?
Current AI models are good at making images from short text descriptions, but they struggle when asked to create images with longer, more complex text. This is because the datasets used to train these models usually only have examples with short, simple text.
What's the solution?
The researchers created TextAtlas5M, a collection of 5 million images, some generated and some collected, that contain long and complex text. The images span diverse data types. They also built a test set called TextAtlasEval, with 3,000 human-improved examples across 3 data domains, to challenge AI models on this task.
Why it matters?
This matters because it could help make AI better at creating images that look more like the complex, text-heavy visuals we see in real life, like advertisements or informational posters. Even the best AI models right now have trouble with this task, so TextAtlas5M gives researchers a way to measure progress and improve their models. This could lead to more useful and realistic AI-generated images in the future.
Abstract
Text-conditioned image generation has gained significant attention in recent years, and models are processing increasingly long and comprehensive text prompts. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of both text and visuals is essential for conveying complex information. However, despite these advances, generating images that contain long-form text remains a persistent challenge, largely due to the limitations of existing datasets, which often focus on shorter and simpler text. To address this gap, we introduce TextAtlas5M, a novel dataset specifically designed to evaluate long-text rendering in text-conditioned image generation. Our dataset consists of 5 million long-text images, both generated and collected, across diverse data types, enabling comprehensive evaluation of large-scale generative models on long-text image generation. We further curate TextAtlasEval, a human-improved test set of 3,000 examples across 3 data domains, establishing one of the most extensive benchmarks for text-conditioned generation. Evaluations suggest that the TextAtlasEval benchmark presents significant challenges even for the most advanced proprietary models (e.g., GPT4o with DallE-3), while their open-source counterparts show an even larger performance gap. This evidence positions TextAtlas5M as a valuable dataset for training and evaluating future-generation text-conditioned image generation models.