Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation

Yaqi Li, Peng Chen, Mingyang Han, Bu Pi, Haoxiang Shi, Runzhou Zhao, Yang Yao, Xuan Zhang, Jun Song

2025-08-26

Summary

This paper focuses on improving how AI models generate images from text descriptions, specifically when those descriptions are complex or ambiguous.

What's the problem?

Current AI models that generate images from text often struggle with detailed or vague instructions. They typically receive feedback only *after* the entire image is created, which makes it hard to pinpoint which parts of the process went wrong and how to improve them. It's like learning to play basketball but only being told whether you scored *after* the whole game is over – you never learn which specific actions to adjust.

What's the solution?

The researchers introduced a new method called Visual-Chain of Guidance, or Visual-CoG. It breaks image creation into three stages: semantic reasoning (understanding what the text *means*), process refining (improving the image as it's being built), and outcome evaluation (judging the final result). Crucially, the AI gets a reward signal after *each* of these stages, not just at the end, allowing it to learn and adjust more effectively. The researchers also created a new benchmark, VisCog-Bench, to specifically measure how well the AI understands the meaning of text prompts.
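To make the contrast with final-only feedback concrete, here is a minimal, hypothetical sketch of combining per-stage rewards into one training signal. The function names, the toy keyword-overlap scoring, and the weights are all illustrative assumptions, not the paper's actual reward models:

```python
# Hypothetical sketch: stage-aware rewards vs. final-only guidance.
# All functions and weights below are illustrative, not from the paper.

def semantic_reward(prompt: str, plan: str) -> float:
    """Toy proxy for the semantic-reasoning stage: fraction of prompt
    keywords that appear in the model's interpretation (plan)."""
    words = set(prompt.lower().split())
    return len(words & set(plan.lower().split())) / max(len(words), 1)

def process_reward(partial_quality: float) -> float:
    """Score for the process-refining stage (0..1), e.g. from a critic
    that inspects the partially generated image."""
    return partial_quality

def outcome_reward(final_quality: float) -> float:
    """Score for the outcome-evaluation stage (0..1), e.g. from a reward
    model comparing the finished image against the prompt."""
    return final_quality

def stage_aware_return(prompt: str, plan: str,
                       partial_quality: float, final_quality: float,
                       weights=(0.3, 0.3, 0.4)) -> float:
    """Blend the three stage rewards so every stage receives immediate
    credit, instead of waiting for a single end-of-pipeline score."""
    return (weights[0] * semantic_reward(prompt, plan)
            + weights[1] * process_reward(partial_quality)
            + weights[2] * outcome_reward(final_quality))

# A final-only policy would train on outcome_reward(final_quality) alone.
print(stage_aware_return("a red cube on a blue sphere",
                         "red cube placed on blue sphere",
                         partial_quality=0.8, final_quality=0.9))
```

The point of the sketch is only the shape of the signal: with per-stage terms, a policy that reasons well about the prompt but renders poorly (or vice versa) gets credit for the stage it did well, so learning can attribute success or failure to the right step.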

Why it matters?

This research matters because it makes AI image generation more accurate and better at handling complex requests. The reported gains on the benchmarks suggest that providing feedback throughout the image creation process, rather than only at the end, leads to significantly better results. This could lead to AI tools that produce images more closely matching what people envision, opening up possibilities in art, design, and other creative fields.

Abstract

Despite the promising progress of recent autoregressive models in text-to-image (T2I) generation, their ability to handle multi-attribute and ambiguous prompts remains limited. To address these limitations, existing works have applied chain-of-thought (CoT) to enable stage-aware visual synthesis and employed reinforcement learning (RL) to improve reasoning capabilities. However, most models provide reward signals only at the end of the generation stage. This monolithic final-only guidance makes it difficult to identify which stages contribute positively to the final outcome and may lead to suboptimal policies. To tackle this issue, we propose a Visual-Chain of Guidance (Visual-CoG) paradigm consisting of three stages: semantic reasoning, process refining, and outcome evaluation, with stage-aware rewards providing immediate guidance throughout the image generation pipeline. We further construct a visual cognition benchmark, VisCog-Bench, which comprises four subtasks to evaluate the effectiveness of semantic reasoning. Comprehensive evaluations on GenEval, T2I-CompBench, and the proposed VisCog-Bench show improvements of 15%, 5%, and 19%, respectively, demonstrating the superior performance of the proposed Visual-CoG. We will release all the resources soon.