FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li
2025-09-12
Summary
This paper addresses a key issue slowing the progress of open-source AI that creates images from text. It introduces a very large new dataset and a comprehensive benchmark for evaluating these models, aiming to close the gap between open-source systems and the more powerful, but closed-off, commercial ones.
What's the problem?
Currently, open-source text-to-image models aren't as good as those developed by big companies. This is largely because there hasn't been enough high-quality data available to train them, especially data that focuses on complex reasoning skills needed to create images accurately from detailed text descriptions. Also, there wasn't a good, standardized way to *measure* how well these models actually perform on tasks requiring reasoning.
What's the solution?
The researchers created FLUX-Reason-6M, a massive dataset of 6 million images with 20 million detailed descriptions in both English and Chinese. These descriptions aren't just simple labels; they break down *how* the image should be generated, step by step, like a 'chain of thought'. Creating this dataset required a huge amount of computing power: 15,000 days' worth on powerful GPUs. They also built PRISM-Bench, a new benchmark with seven different tests, including a very challenging one that requires models to follow long, complex instructions. This benchmark uses other AI models to judge the generated images on how well they match the text and how visually appealing they are.
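To make the judging idea concrete, here is a minimal sketch of a VLM-as-judge evaluation harness in the spirit of PRISM-Bench: each (prompt, image) pair is scored on prompt-image alignment and aesthetics by a judge model, and scores are averaged per track. The judge below is a stub standing in for a real vision-language model call; the 1-10 scale and the exact scoring protocol are illustrative assumptions, not the paper's published procedure (the track names come from the paper).

```python
# Hypothetical sketch of a VLM-as-judge benchmark harness.
# The judge function is a stub; a real implementation would call a
# vision-language model API. Scale and aggregation are assumptions.
from statistics import mean

# The six dataset characteristics plus the Long Text track (from the paper).
TRACKS = ["Imagination", "Entity", "Text rendering", "Style",
          "Affection", "Composition", "Long Text"]

def stub_judge(prompt: str, image_path: str) -> dict:
    """Placeholder for a VLM judge. Returns alignment and aesthetics
    scores on an assumed 1-10 scale."""
    return {"alignment": 7.0, "aesthetics": 8.0}

def evaluate(samples, judge=stub_judge):
    """samples: list of dicts with 'track', 'prompt', 'image' keys.
    Returns per-track mean alignment and aesthetics scores."""
    by_track = {t: [] for t in TRACKS}
    for s in samples:
        by_track[s["track"]].append(judge(s["prompt"], s["image"]))
    return {
        t: {
            "alignment": mean(r["alignment"] for r in rs),
            "aesthetics": mean(r["aesthetics"] for r in rs),
        }
        for t, rs in by_track.items() if rs
    }

samples = [
    {"track": "Entity", "prompt": "a red fox on snow", "image": "fox.png"},
    {"track": "Long Text", "prompt": "a multi-step scene description",
     "image": "scene.png"},
]
print(evaluate(samples))
```

Swapping `stub_judge` for a real VLM call (with a carefully designed scoring prompt, as the paper describes) turns this skeleton into an automated, human-aligned evaluator.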
Why it matters?
This work is important because it provides the open-source community with resources that were previously only available to large tech companies. By releasing the dataset, the benchmark, and the code, they're hoping to accelerate the development of better, more capable open-source text-to-image models that can understand and respond to complex requests, ultimately making this technology more accessible and useful for everyone.
Abstract
The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and feature explicit Generation Chain-of-Thought (GCoT) descriptions that provide detailed breakdowns of image generation steps. The entire data curation pipeline consumed 15,000 A100 GPU-days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced, human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/ .