
DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo, Jun He, Claire Guo, Junyan Ye, Rongyao Fang, Weijia Li, Rui Liu, Hongsheng Li

2025-12-05

Summary

This paper introduces a new method called DraCo for generating images from text descriptions, building on recent advances in multimodal large language models that can handle both text and images.

What's the problem?

Current methods for generating images from text often struggle with detailed planning and with accurately combining specific features, especially rare attribute combinations. They either treat image creation as a single step or rely on text-based plans that are too abstract to guide the process effectively, leading to images that don't quite match what was asked for.

What's the solution?

DraCo works by first creating a quick, low-resolution draft image as a visual preview, giving the model a concrete picture of the overall structure and layout. Next, the model uses its own understanding ability to check whether the draft actually matches the original text description, identifying any mismatches. Finally, it refines the image, correcting those errors and adding detail through super-resolution. The authors also built a training dataset, DraCo-240K, covering three correction skills (general correction, instance manipulation, and layout reorganization), and a specialized classifier-free guidance strategy, DraCo-CFG, tailored to this interleaved reasoning process.
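The three-stage loop described above can be sketched in pure Python. This is a hypothetical illustration only: the function names (`generate_draft`, `verify_draft`, `refine`) and the dictionary standing in for an image are placeholders I introduce here, not the paper's actual implementation, which runs inside a unified multimodal LLM.

```python
# Illustrative sketch of the Draft-as-CoT loop (not the paper's real code).
# An "image" is mocked as a dict so the control flow is runnable end to end.

def generate_draft(prompt, resolution=(64, 64)):
    """Stage 1: produce a quick, low-resolution draft as a visual preview."""
    return {"resolution": resolution, "content": f"draft of: {prompt}"}

def verify_draft(draft, prompt):
    """Stage 2: use the model's understanding ability to list semantic
    mismatches between the draft and the prompt (empty list = aligned)."""
    return [] if prompt in draft["content"] else ["semantic mismatch"]

def refine(draft, corrections, target_resolution=(1024, 1024)):
    """Stage 3: apply selective corrections, then super-resolve to full size."""
    content = draft["content"]
    for issue in corrections:
        content += f" [corrected: {issue}]"
    return {"resolution": target_resolution, "content": content}

def draco_pipeline(prompt):
    draft = generate_draft(prompt)          # concrete visual plan
    corrections = verify_draft(draft, prompt)  # self-verification step
    return refine(draft, corrections)       # correction + super-resolution

image = draco_pipeline("a purple octopus playing chess")
print(image["resolution"])  # (1024, 1024)
```

The design point the sketch captures is that the draft serves as a *visual* chain-of-thought: verification and refinement operate on a concrete preview rather than on an abstract textual plan.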

Why it matters?

This research is important because it significantly improves the quality and accuracy of images generated from text. DraCo outperforms existing methods on several benchmarks, meaning it can create images that are more faithful to the original descriptions and handle complex requests more effectively, pushing the boundaries of what's possible with AI image generation.

Abstract

Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty of generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.