Self-Evaluation Unlocks Any-Step Text-to-Image Generation
Xin Yu, Xiaojuan Qi, Zhengqi Li, Kai Zhang, Richard Zhang, Zhe Lin, Eli Shechtman, Tianyu Wang, Yotam Nitzan
2025-12-30
Summary
This paper introduces Self-E, a new way to train AI models to create images from text. Unlike most prior approaches, it is trained entirely from scratch, without relying on a pretrained teacher model, and it can generate images quickly while maintaining good quality.
What's the problem?
Traditionally, creating images from text with AI requires many sampling steps to reach a good result. Existing methods either rely on detailed local supervision at every step, or they distill knowledge from a pretrained 'teacher' model. This makes training complex and can limit how fast images can be generated. The goal was a system that learns effectively without step-by-step guidance or a pre-existing teacher, and that generates images quickly without sacrificing quality.
What's the solution?
The researchers developed Self-E, which learns from data like a 'Flow Matching' model but with a key difference: it constantly evaluates its *own* work. As it trains, it generates samples and scores them using its own current score estimates, effectively acting as its own teacher. This self-evaluation bridges the gap between methods that need dense per-step supervision and those that need a pretrained teacher: Self-E requires neither, and it can generate good images even with very few steps.
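To make the idea concrete, here is a minimal sketch of what such a combined objective could look like. This is not the paper's actual algorithm: the `velocity` network is a toy stand-in for the real text-to-image model, and the exact form of the self-evaluation term (`self_eval_loss`) is a hypothetical illustration of "the model scoring its own sample with its current estimates".

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity(x, t, w):
    """Toy linear 'velocity network' standing in for the paper's
    text-to-image model (hypothetical simplification)."""
    return w * x + t

def training_losses(w, x1):
    # --- Flow Matching term: local supervision from data ---
    x0 = rng.standard_normal(x1.shape)      # Gaussian noise endpoint
    t = rng.uniform()                       # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1            # point on the straight-line path
    v_target = x1 - x0                      # conditional velocity target
    fm_loss = np.mean((velocity(xt, t, w) - v_target) ** 2)

    # --- self-evaluation term (sketch): model scores its own sample ---
    # A copy of the current weights plays "teacher"; in a real autograd
    # framework this would be a stop-gradient or EMA copy of the model.
    w_teacher = w
    x_gen = x0 + velocity(x0, 0.0, w)           # one-step generated sample
    v_self = velocity(x_gen, 1.0, w_teacher)    # model's own score estimate
    self_eval_loss = np.mean(v_self ** 2)       # one plausible matching target

    return fm_loss, self_eval_loss

x1 = rng.standard_normal(8)   # stand-in for a training image
fm, se = training_losses(0.1, x1)
```

The point of the sketch is the two-part objective: the first term is instantaneous local learning from data, while the second lets the model's own current estimates supervise its generated samples globally.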
Why does it matter?
This research matters because Self-E is, to the authors' knowledge, the first system trained from scratch that creates high-quality images from text at any number of sampling steps. It can generate images in just a few steps for applications where speed is crucial, or take more steps to produce more detailed images when needed. This unified approach makes text-to-image generation more flexible and scalable, potentially leading to faster and more accessible AI image-creation tools.
Abstract
We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.
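The "any-step inference" property can be illustrated with a standard Euler sampler for a flow/velocity model: the same trained network is integrated from noise to data with however many steps the user chooses. This is a generic sketch of such a sampler, not the paper's specific inference procedure; the constant velocity field used in the demo is a hypothetical toy where every step count gives the same answer.

```python
import numpy as np

def euler_sample(v_fn, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with
    n_steps uniform Euler steps -- one model, any step count."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * v_fn(x, t)   # Euler update along the learned flow
    return x

# Toy constant velocity field v(x, t) = c: every step count integrates
# to exactly x0 + c, so 1-step and 50-step sampling coincide here.
c = 2.5
v_fn = lambda x, t: np.full_like(x, c)
x0 = np.zeros(4)
one_step = euler_sample(v_fn, x0, 1)
fifty_steps = euler_sample(v_fn, x0, 50)
```

With a real learned velocity field the two trajectories would differ; the abstract's claim is that Self-E remains strong at 1-4 steps while improving monotonically as steps increase.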