T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li

2025-05-02

Summary

This paper introduces T2I-R1, a text-to-image model that reasons through the generation process step by step, so the images it produces match the description more accurately and in more detail.

What's the problem?

Many text-to-image generators fail to follow complicated instructions in full or miss important details, so the images they produce don't always match what was asked for.

What's the solution?

The researchers trained the model to reason at two levels: the semantic level, where it plans the overall content the prompt describes, and the token level, where it handles fine details as the image is generated piece by piece. Reinforcement learning rewards better outputs, so the model learns from its mistakes and improves.

Why does it matter?

People can get more precise and creative images from their text prompts, which is useful for art, design, education, and any project that needs high-quality visuals made from written ideas.

Abstract

T2I-R1, a reasoning-enhanced text-to-image generator using RL and bi-level chain-of-thought reasoning, improves performance by 13% on T2I-CompBench and 19% on WISE compared to Janus-Pro.
