UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision
Ruiyan Han, Zhen Fang, XinYu Sun, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, Yi Cao, Feng Zhao
2026-01-07
Summary
This paper focuses on improving how well AI models that understand both images and text can *create* things based on that understanding, specifically generating images from text prompts.
What's the problem?
Current AI models are really good at looking at a picture and understanding what's happening, or reading a text description and knowing what it means. However, they often struggle to actually *use* that understanding to create a new, high-quality image that accurately reflects the input. It's as if they can comprehend information but have trouble expressing it; the authors call this gap 'Conduction Aphasia'.
What's the solution?
The researchers developed a system called UniCorn that helps these models improve themselves without needing extra training data or someone to supervise the process. UniCorn works by splitting the model into three parts: one that proposes ideas for the image, one that tries to solve the problem of creating the image, and one that judges how good the result is. They then have these parts work together, playing off each other to refine the image generation process. This 'self-play' helps the model learn to better translate its understanding into a visual output.
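To make the three-role idea concrete, here is a minimal, hypothetical Python sketch of one self-play round. The function names and the stubbed scoring are illustrative assumptions, not the paper's actual implementation; in UniCorn a single unified model would play all three roles, whereas here each role is a placeholder so the loop runs end to end.

```python
# Minimal sketch of a Proposer / Solver / Judge self-play round (hypothetical,
# not the paper's API). Each role is stubbed so the example is runnable.
import random

def propose_prompt(model, seed_topic):
    """Proposer: draft a text-to-image task (stubbed)."""
    return f"A detailed scene about {seed_topic}, variant {random.randint(0, 999)}"

def generate_image(model, prompt):
    """Solver: attempt text-to-image synthesis (stubbed as a token string)."""
    return f"<image generated from: {prompt}>"

def score_image(model, prompt, image):
    """Judge: rate how faithfully the image matches the prompt (stubbed)."""
    return random.random()  # stand-in for a learned faithfulness score

def self_play_round(model, seed_topic, num_candidates=4):
    """One round: propose a task, solve it several times, and keep the
    best-judged (prompt, image) pair as self-generated supervision."""
    prompt = propose_prompt(model, seed_topic)
    candidates = [generate_image(model, prompt) for _ in range(num_candidates)]
    scored = [(score_image(model, prompt, img), img) for img in candidates]
    best_score, best_image = max(scored, key=lambda pair: pair[0])
    return {"prompt": prompt, "image": best_image, "score": best_score}

if __name__ == "__main__":
    model = None  # stand-in for one unified multimodal model filling all three roles
    print(self_play_round(model, "a harbor at dawn"))
```

The selected pairs would then serve as training signal for the same model, which is what lets the loop improve generation without any external data or teacher.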
Why it matters?
This work is important because it shows a way to make AI image generators significantly better at creating images that are both accurate to the text prompt and visually appealing. It does this without needing huge amounts of extra data or human intervention, which makes it a scalable and efficient approach to improving AI's overall intelligence and creative abilities.
Abstract
While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles (Proposer, Solver, and Judge), UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text-to-Image-to-Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF (73.8), DPG (86.8), CompBench (88.5), and UniCycle, while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.
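As a rough illustration of the cycle-consistency idea behind UniCycle, the sketch below pushes a prompt through a stubbed Text-to-Image step, captions the result, and scores how much of the original prompt survives the round trip. The function names and the string-similarity metric are assumptions for illustration only; the actual benchmark would rely on the model's real generation and captioning plus a semantic comparison, not surface string matching.

```python
# Hypothetical Text -> Image -> Text cycle-consistency check in the spirit of
# UniCycle. Model calls and the similarity metric are placeholder assumptions.
from difflib import SequenceMatcher

def text_to_image(model, prompt):
    """T2I step (stubbed): render an image from the prompt."""
    return f"<image: {prompt}>"

def image_to_text(model, image):
    """I2T step (stubbed): caption the generated image."""
    return image.removeprefix("<image: ").removesuffix(">")

def cycle_consistency(model, prompt):
    """Score how much of the original prompt survives the T2I -> I2T loop.
    String similarity stands in here for a proper semantic metric."""
    image = text_to_image(model, prompt)
    reconstruction = image_to_text(model, image)
    return SequenceMatcher(None, prompt.lower(), reconstruction.lower()).ratio()

if __name__ == "__main__":
    model = None  # a single unified multimodal model would perform both steps
    print(cycle_consistency(model, "Two red kites above a snowy field"))
```

A model free of Conduction Aphasia should score highly on such a loop, since the details it understood from the prompt would still be recoverable from the image it generated.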