Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan
2025-09-12
Summary
This paper presents a new way to think about how AI models can both 'understand' images and then 'generate' new images based on that understanding, framing the two tasks as an encoder-decoder pair, much like the auto-encoders used in machine learning.
What's the problem?
Current systems that handle both images and text often treat understanding (image to text) and generation (text to image) as separate tasks, leading to inconsistencies and limiting how well they work together. Information gained from one process is not effectively used to improve the other, and there has been no good way to measure how 'unified' these systems truly are.
What's the solution?
The researchers created a framework called UAE, which trains an 'understanding' component (encoder) to turn images into detailed text descriptions and a 'generation' component (decoder) to recreate images from those descriptions. A three-stage reinforcement-learning recipe called Unified-GRPO ties the two together: first a cold-start phase gently initializes both parts, then the understanding component is trained to write descriptions that help the generation component reconstruct the image, and finally the generation component is refined to use every detail in those descriptions. They also built a new benchmark, Unified-Bench, to specifically evaluate how well these combined systems perform as a unified whole.
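To make the reconstruction-as-reward idea concrete, here is a minimal sketch of one "round trip" through the loop described above. All names (ToyEncoder, ToyDecoder, semantic_similarity, embed) are illustrative stand-ins under our own assumptions, not the paper's actual models or reward function.

```python
# Toy sketch of the UAE-style loop: image -> caption -> reconstruction -> reward.
# Everything here is a placeholder; the real system uses trained I2T/T2I models
# and a learned semantic similarity signal.
import numpy as np

rng = np.random.default_rng(0)

class ToyEncoder:
    """Understanding (I2T): 'captions' an image. Stand-in returns a fixed string."""
    def caption(self, image: np.ndarray) -> str:
        return "a detailed caption describing the image"  # placeholder output

class ToyDecoder:
    """Generation (T2I): reconstructs an image from a caption. Stand-in returns noise."""
    def generate(self, caption: str, shape=(3, 64, 64)) -> np.ndarray:
        return rng.normal(size=shape)

def embed(image: np.ndarray) -> np.ndarray:
    # Placeholder for a frozen vision encoder (e.g., a CLIP-style image tower).
    return image.reshape(-1)[:512]

def semantic_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between image embeddings, used here as the
    # "reconstruction fidelity" reward.
    ea, eb = embed(a), embed(b)
    return float(ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb) + 1e-8))

image = rng.normal(size=(3, 64, 64))
caption = ToyEncoder().caption(image)            # understanding step (I2T)
reconstruction = ToyDecoder().generate(caption)  # generation step (T2I)
reward = semantic_similarity(image, reconstruction)
print(f"reconstruction reward: {reward:.3f}")

# In UAE, a reward of this kind drives the RL stages: stage 2 updates the
# encoder (better captions -> better reconstructions), and stage 3 updates
# the decoder to exploit every detail of those captions.
```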
Why it matters?
This work is important because it shows that forcing a strong connection between understanding and generation produces AI systems that are better at both. The surprising result was that, as training progressed, the 'understanding' part started writing increasingly detailed descriptions while the 'generation' part simultaneously got better at using those descriptions to create high-quality images, demonstrating a real synergy between the two processes.
Abstract
In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) a cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of unified multimodal models (UMMs). A surprising "aha moment" arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.
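One way to read "reconstruction fidelity as the unified training objective" is the auto-encoder-style formulation sketched below. The notation (encoder E, decoder G, frozen embedding phi, similarity sim) is ours and purely illustrative; the paper's actual reward and losses may differ.

```latex
% Illustrative formalization, not the paper's exact objective:
% the encoder E_\theta maps an image x to a caption c, the decoder G_\psi maps
% the caption back to an image, and both are trained to maximize semantic
% reconstruction fidelity measured through a frozen embedding \phi.
\[
  c = E_\theta(x), \qquad
  \hat{x} = G_\psi(c), \qquad
  \max_{\theta,\psi}\;
  \mathbb{E}_{x \sim \mathcal{D}}
  \Big[\, \mathrm{sim}\big(\phi(x), \phi(\hat{x})\big) \,\Big]
\]
```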