MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency
Nicolas Dufour, Lucas Degeorge, Arijit Ghosh, Vicky Kalogeiton, David Picard
2025-10-31
Summary
This paper introduces a new way to train AI models that create images from text, aiming for better results that more closely match what people actually want to see.
What's the problem?
Current text-to-image AI models are trained on huge collections of images scraped from the web, which gives them variety but no guarantee that the images are *good* or match what a user specifically asked for. A common fix is to have a separate system 'judge' the generated images and discard the bad ones, but this throws away useful training data, can limit the diversity of the AI's output, and is inefficient. Optimizing for just one 'reward' or preference also makes the AI less flexible.
What's the solution?
Instead of filtering images *after* they're created, the researchers developed a method called MIRO that trains the AI model to directly learn user preferences *during* the image creation process. They do this by feeding the model information from multiple 'reward' systems, essentially teaching it what makes an image appealing from different angles. This allows the AI to generate higher quality images and learn much faster.
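To make the idea concrete, here is a minimal, hypothetical sketch of reward conditioning: scores from several reward models are embedded and appended to the text conditioning, so the generator is trained to see "what the judges thought" as an input. The sinusoidal encoding and the specific function names are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_rewards(scores, dim=8):
    """Map each scalar reward score to a sinusoidal embedding.
    (A common conditioning trick; the paper's exact encoding is an assumption here.)"""
    scores = np.asarray(scores, dtype=float)
    freqs = np.exp(np.linspace(0.0, 4.0, dim // 2))        # geometric-ish frequency ladder
    angles = scores[:, None] * freqs[None, :]               # shape (n_rewards, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (n_rewards, dim)

def build_conditioning(text_emb, reward_scores):
    """Concatenate the text embedding with embeddings of all reward scores,
    forming the conditioning vector the generator is trained on."""
    reward_emb = embed_rewards(reward_scores).reshape(-1)   # flatten all reward embeddings
    return np.concatenate([text_emb, reward_emb])

# Toy usage: three reward models score one training image.
text_emb = rng.normal(size=16)                              # stand-in for a text encoder output
scores = [0.9, 0.2, 0.7]                                    # e.g. aesthetic, preference, alignment
cond = build_conditioning(text_emb, scores)
# At sampling time, the same slots can be set to high target scores
# to steer generation toward preferred images.
```

The key design point this sketch illustrates is that no training image is discarded: low-scoring images still contribute gradient signal, labeled with their (low) reward values, and quality is requested at inference time by conditioning on high scores.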
Why it matters?
This research is important because it improves the quality and efficiency of text-to-image AI. MIRO creates images that people prefer and does so more quickly than existing methods, achieving top performance on standard tests and user evaluations. This could lead to more useful and enjoyable AI image generation tools in the future.
Abstract
Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. This discarding of informative data, together with optimizing for a single reward, tends to harm diversity, semantic fidelity, and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but also significantly speeds up training. Our proposed method, called MIRO, achieves state-of-the-art performance on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).