Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng

2025-01-24

Summary

This paper investigates whether Chain-of-Thought (CoT) reasoning, a technique that helps AI models work through complex problems step by step, can also improve AI image generation. It's like teaching a computer to check and refine its work step by step while creating a picture, similar to how a person might plan out a drawing.

What's the problem?

Current large AI models use step-by-step reasoning to handle complex understanding tasks, but it's not clear whether the same strategies can be used to verify and improve the images they generate. It's like knowing how to describe a beautiful sunset in words, but struggling to paint it accurately.

What's the solution?

The researchers studied three things: spending extra computation at generation time so the AI can verify its work, training the model to prefer better images over worse ones using a method called Direct Preference Optimization (DPO), and combining the two approaches. They also built two reward models designed for autoregressive image generation: PARM (Potential Assessment Reward Model), which assesses each generation step, and PARM++, which adds a reflection mechanism so the AI can correct an unsatisfactory image. Using these methods, they improved an existing AI model called Show-o by +24% on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%.
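The verification idea can be illustrated with a best-of-N selection loop: generate several candidates, score each one with a reward model, and keep the highest-scoring result. This is only a minimal sketch of the general technique, not the paper's implementation. The names `best_of_n`, `make_toy_generator`, and `toy_score` are hypothetical, and plain text captions stand in for generated images so the example runs without any model.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def best_of_n(
    prompt: str,
    generate: Callable[[str], T],
    score: Callable[[str, T], float],
    n: int = 4,
) -> T:
    """Sample n candidates and return the one the scorer rates highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


def make_toy_generator(outputs: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an image generator: replays fixed outputs."""
    it = iter(outputs)
    return lambda prompt: next(it)


def toy_score(prompt: str, candidate: str) -> float:
    """Hypothetical stand-in for a reward model: fraction of prompt words present."""
    words = prompt.split()
    present = set(candidate.split())
    return sum(w in present for w in words) / len(words)


generator = make_toy_generator(["blue cube", "red cube", "red ball"])
selected = best_of_n("red cube", generator, toy_score, n=3)
print(selected)
```

In a real system the generator would be an image model and the scorer a learned reward model; raising `n` spends more computation at generation time in exchange for a better chance of finding a good candidate.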

Why does it matter?

This matters because it could make AI-generated images follow text descriptions much more accurately. Imagine describing a scene in words and having an AI create it with all the right details. This could be valuable for artists, designers, and anyone who needs to create visual content quickly. It also suggests that step-by-step reasoning techniques developed for understanding tasks can carry over to image generation, which could lead to new ways of making art and visual media.

Abstract

Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it still remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct the generated unsatisfactory image. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT
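For readers unfamiliar with DPO, the standard per-pair objective can be written in a few lines. The sketch below uses the commonly published DPO loss, -log sigmoid(beta * margin), where the margin compares how much the trained policy and a frozen reference model favor a preferred sample over a rejected one. The function name `dpo_loss`, the scalar log-probability inputs, and the default `beta=0.1` are illustrative assumptions, not details taken from the paper or its released code.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value for illustration only
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more the policy favors each sample than the reference does.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When policy and reference agree, the loss equals log(2).
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# When the policy favors the chosen sample more than the reference does, the loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(neutral, improved)
```

For an autoregressive image generator, each log-probability would be the sum of token log-probabilities over the generated image token sequence; the loss falls as the policy ranks the preferred image above the rejected one more strongly than the reference model does.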
