SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models
Weiyang Jin, Yuwei Niu, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, Xihui Liu
2025-10-15
Summary
This paper introduces a new method called SRUM to improve how well AI models that combine vision and language can *create* images based on text descriptions, even though they're already good at *understanding* images.
What's the problem?
Current AI models are really good at looking at an image and understanding what's in it, or following instructions about an image. However, they often struggle to actually *generate* a new, accurate image from just a text description. It's like they can understand the concept but can't reliably draw it. The core issue is that the 'understanding' part of the model isn't helping the 'generating' part enough.
What's the solution?
The researchers created SRUM, which stands for Self-Rewarding Unified Multimodal Model. It's a system that lets the model essentially critique its own work. The model's 'understanding' component acts like an internal judge, giving feedback to the 'generating' component to help it improve. This happens automatically, without needing humans to label more data. SRUM uses a two-part reward system: one checks the overall image to make sure it makes sense, and another focuses on the details to make sure individual objects are correct.
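The dual-reward idea can be sketched in a few lines of code. This is a minimal, hypothetical illustration only: the function names, the dictionary-based stand-in for a generated image, and the equal 0.5/0.5 weighting are all assumptions for the sketch, not details from the paper, and the "understanding module" is stubbed out with simple string checks rather than a real vision-language model.

```python
# Hypothetical sketch of a global-local self-reward; names and weights
# are illustrative, not taken from the SRUM paper.

def global_reward(image, prompt):
    # Stand-in for the understanding module judging overall semantics:
    # here, a toy check of whether the full prompt appears in the caption.
    return 1.0 if prompt in image["caption"] else 0.0

def local_reward(image, prompt):
    # Stand-in for object-level fidelity: the fraction of prompt words
    # that show up in the image's detected-object list.
    objects = prompt.split()
    found = sum(1 for obj in objects if obj in image["objects"])
    return found / len(objects)

def self_reward(image, prompt, w_global=0.5, w_local=0.5):
    # Combine the coarse (layout/semantics) and fine (per-object) signals
    # into a single scalar used to update the generation module.
    return w_global * global_reward(image, prompt) + w_local * local_reward(image, prompt)

# Toy "generated image" represented as a caption plus detected objects.
image = {"caption": "a red cube on a blue sphere", "objects": ["cube", "sphere"]}
print(self_reward(image, "cube sphere"))  # local reward is perfect, global is not
```

In the real system both rewards would come from the model's own understanding module scoring its generated images, so no extra human labels are needed; the sketch only shows how the two reward scales are blended into one training signal.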
Why does it matter?
This work is important because it shows a way to make these AI models better at image generation without needing more labeled data, which can be expensive and time-consuming to collect. By allowing the model to learn from its own 'understanding' of images, it opens up possibilities for creating more accurate and detailed images from text prompts, and it establishes a new way to improve these types of AI systems.
Abstract
Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding capabilities within a single framework. However, a significant gap exists where a model's strong visual understanding often fails to transfer to its visual generation. A model might correctly understand an image based on user instructions, yet be unable to generate a faithful image from text prompts. This phenomenon directly raises a compelling question: Can a model achieve self-improvement by using its understanding module to reward its generation module? To bridge this gap and achieve self-improvement, we introduce SRUM, a self-rewarding post-training framework that can be directly applied to existing UMMs of various designs. SRUM creates a feedback loop where the model's own understanding module acts as an internal "evaluator", providing corrective signals to improve its generation module, without requiring additional human-labeled data. To ensure this feedback is comprehensive, we designed a global-local dual reward system. To tackle the inherent structural complexity of images, this system offers multi-scale guidance: a global reward ensures the correctness of the overall visual semantics and layout, while a local reward refines fine-grained, object-level fidelity. SRUM leads to powerful capabilities and shows strong generalization, boosting performance on T2I-CompBench from 82.18 to 88.37 and on T2I-ReasonBench from 43.82 to 46.75. Overall, our work establishes a powerful new paradigm for enabling a UMM's understanding module to guide and enhance its own generation via self-rewarding.