CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation
Chengzhuo Tong, Mingkun Chang, Shenglong Zhang, Yuran Wang, Cheng Liang, Zhizheng Zhao, Ruichuan An, Bohan Zeng, Yang Shi, Yifan Dai, Ziming Zhao, Guanbin Li, Pengfei Wan, Yuanxing Zhang, Wentao Zhang
2026-01-16
Summary
This paper explores how to improve text-to-image generation by borrowing an idea from video generation models: a technique called Chain-of-Frame (CoF) reasoning.
What's the problem?
Current text-to-image models produce impressive results, but they don't reveal *how* they arrive at an image. Video models, on the other hand, can generate a series of frames that show a step-by-step process, like solving a maze. The problem is that no one has figured out how to apply this step-by-step thinking to text-to-image generation, because it's hard to define what the starting point and intermediate steps should even *be* when you begin with nothing but text.
What's the solution?
The researchers created a new model called CoF-T2I. It generates a sequence of frames, refining the image progressively at each step, and takes the final frame as the output. Think of it like sketching, then adding details, then shading: each frame is a step in the process. To train this model, they also curated a new dataset called CoF-Evol-Instruct, whose trajectories show how an image should evolve from a rough semantic layout to a finished, aesthetically detailed picture. Finally, they encoded each frame independently, which avoids the blurry or jumpy motion artifacts that video-style encoding would otherwise introduce.
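The sketch-then-refine idea can be illustrated with a toy loop. This is purely an illustration of progressive refinement with per-frame independence, not the paper's actual model: the images here are just flat lists of pixel values, and `cof_refine` is a hypothetical helper that interpolates from a coarse draft toward a detailed target.

```python
def cof_refine(draft, target, steps):
    """Toy Chain-of-Frame trajectory: a list of frames that move
    from a coarse 'semantic' draft toward a detailed target image.

    Each frame is computed independently from the two endpoints
    (no frame depends on the previous one), loosely mirroring the
    independent per-frame encoding described in the summary.
    """
    frames = []
    for t in range(1, steps + 1):
        alpha = t / steps  # refinement progress, ends at 1.0
        frame = [(1 - alpha) * d + alpha * g for d, g in zip(draft, target)]
        frames.append(frame)
    return frames

coarse = [0.0, 0.0, 0.0, 0.0]    # rough initial sketch
detailed = [0.2, 0.8, 0.5, 1.0]  # finished, detailed image
trajectory = cof_refine(coarse, detailed, steps=4)
final = trajectory[-1]           # the last frame is taken as the output
```

Here the intermediate frames play the role of explicit reasoning steps, and only the last frame is kept as the generated image.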
Why it matters?
This research shows that video generation techniques can significantly improve text-to-image models. CoF-T2I outperforms its base video model and scores competitively on standard benchmarks (0.86 on GenEval and 7.468 on Imagine-Bench), suggesting that this approach has real potential for producing more accurate and detailed images from text prompts.
Abstract
Recent video generation models have revealed the emergence of Chain-of-Frame (CoF) reasoning, enabling frame-by-frame visual inference. With this capability, video models have been successfully applied to various visual tasks (e.g., maze solving, visual puzzles). However, their potential to enhance text-to-image (T2I) generation remains largely unexplored due to the absence of a clearly defined visual reasoning starting point and interpretable intermediate states in the T2I generation process. To bridge this gap, we propose CoF-T2I, a model that integrates CoF reasoning into T2I generation via progressive visual refinement, where intermediate frames act as explicit reasoning steps and the final frame is taken as the output. To establish such an explicit generation process, we curate CoF-Evol-Instruct, a dataset of CoF trajectories that model the generation process from semantics to aesthetics. To further improve quality and avoid motion artifacts, we enable an independent encoding operation for each frame. Experiments show that CoF-T2I significantly outperforms the base video model and achieves competitive performance on challenging benchmarks, reaching 0.86 on GenEval and 7.468 on Imagine-Bench. These results indicate the substantial promise of video models for advancing high-quality text-to-image generation.