Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

Chenxi Zhao, Chen Zhu, Xiaokun Feng, Aiming Hao, Jiashu Zhu, Jiachen Lei, Jiahong Wu, Xiangxiang Chu, Jufeng Yang

2026-04-21

Summary

This paper explores how to generate images from text descriptions using MeanFlow, a one-step generation technique that has previously been successful at creating images from simple category labels like 'dog' or 'cat'.

What's the problem?

While MeanFlow works well with simple categories, it struggles when you try to use detailed text prompts. The issue is that MeanFlow refines the image in very few steps, sometimes only one, so the conditioning information must be *very* clear and distinct for the model to act on it. Powerful text encoders built on large language models (LLMs) don't automatically help: their rich, nuanced features for different prompts are often not distinct enough from one another for MeanFlow's single refinement step to tell the conditions apart.
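One rough way to picture "discriminability" is the pairwise similarity between condition embeddings: one-hot class labels are maximally distinct, while dense text embeddings for related prompts can be nearly parallel. The toy sketch below is purely illustrative (the vectors are synthetic, not from any real encoder or from the paper):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot class conditions ("dog" vs. "cat") are maximally distinct.
dog_class = np.array([1.0, 0.0, 0.0, 0.0])
cat_class = np.array([0.0, 1.0, 0.0, 0.0])

# Hypothetical dense text embeddings for two related prompts: both are
# small perturbations of the same base vector, so they point in nearly
# the same direction -- hard for a one-step generator to separate.
rng = np.random.default_rng(0)
base = rng.standard_normal(4)
prompt_a = base + 0.02 * rng.standard_normal(4)
prompt_b = base + 0.02 * rng.standard_normal(4)

print(cosine(dog_class, cat_class))  # 0.0 -- fully distinct
print(cosine(prompt_a, prompt_b))    # close to 1.0
```

The paper's point, in these terms, is that a good encoder for this setting should keep embeddings of different prompts well separated, the way class labels are by construction.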

What's the solution?

The researchers realized the text representations needed to be highly discriminative for MeanFlow to work. They chose an LLM-based text encoder validated to produce clear, well-separated semantic representations, and adapted the MeanFlow generation process to exploit this kind of conditioning. This allowed them, for the first time, to generate images from text prompts with this method, and they showed the same approach also improves the performance of widely used diffusion models.
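To make the "one-step" part concrete: MeanFlow learns an average-velocity field over an interval, so a single network evaluation maps pure noise to a sample via z_0 = z_1 - u(z_1, 0, 1 | c), where c is the condition embedding. The numpy sketch below shows only this sampling pattern with a stand-in for the learned network; the function names and the linear "model" are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def mean_velocity(z, r, t, text_emb):
    # Stand-in for the learned average-velocity network u(z, r, t | c).
    # A fixed linear map of the noise and the condition, for illustration.
    return 0.5 * z - 0.1 * np.resize(text_emb, z.shape)

def one_step_sample(text_emb, shape, seed=0):
    """MeanFlow-style one-step sampling: z_0 = z_1 - u(z_1, 0, 1 | c).

    Because the average velocity over [0, 1] replaces the many refinement
    steps of a standard diffusion/flow sampler with a single evaluation,
    the condition embedding must already be discriminative on its own.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(shape)  # pure noise at t = 1
    return z1 - mean_velocity(z1, 0.0, 1.0, text_emb)

# Hypothetical 16-dim text embedding, 4x4 "image" for illustration.
img = one_step_sample(np.ones(16), (4, 4))
print(img.shape)  # (4, 4)
```

In a real model, u would be a conditioned neural network and the shape would match the image (or latent) dimensions, but the one-evaluation sampling step is the same.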

Why it matters?

This work is important because it extends MeanFlow, previously limited to simple category labels, to generating images from detailed text descriptions. It provides a practical reference for future research on text-conditioned few-step generation, and the code is publicly available for others to build upon.

Abstract

Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to the limited class labels, text conditions pose greater challenges to the model's understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, due to the extremely limited number of refinement steps in the MeanFlow generation, such as only one step, the text feature representations are required to possess sufficiently high discriminability. This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we leverage a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, resulting in efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on the widely used diffusion model, demonstrating significant generation performance improvements. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.