Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation
Yichen Zhang, Da Peng, Zonghao Guo, Zijian Zhang, Xuesong Yang, Tong Sun, Shichu Sun, Yidan Zhang, Yanghao Li, Haiyan Zhao, Wang Xu, Qi Shi, Yangang Sun, Chi Chen, Shuo Wang, Yukun Yan, Xu Han, Qiang Ma, Wei Ke, Liang Wang, Zhiyuan Liu, Maosong Sun
2026-03-16
Summary
This paper introduces Cheers, a new artificial intelligence model designed to both understand what's in images and create new images, all within a single system.
What's the problem?
Traditionally, getting a single AI to be good at *both* understanding images (like identifying objects) and generating images (like creating a picture from a description) is difficult. These tasks require different ways of processing information and representing what an image 'means', making it hard to train one model to do both effectively. Specifically, the fine details needed for realistic image creation can interfere with the broader understanding of the image's content.
What's the solution?
Cheers addresses this by separating an image's 'big picture' meaning from its fine-grained details. It first compresses the image into a compact set of semantic tokens, then uses a large language model (similar to what powers chatbots) to process those tokens. When generating an image, it first creates a basic version from the semantic tokens and then adds the details back through a gating mechanism, ensuring the details enhance, rather than distort, the overall meaning. It also compresses the image information fourfold, making high-resolution encoding and generation more efficient.
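The two-stage decoding idea can be sketched in a few lines. This is a hypothetical toy illustration, not the paper's actual layers: `decode_with_gated_details`, the identity stand-in for the coarse decode, and the scalar `gate_bias` are all assumptions made for clarity. The key point it shows is that a semantics-conditioned gate in (0, 1) scales the detail residuals before they are added back, so details refine rather than overwrite the semantic content.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_with_gated_details(semantics, details, gate_bias):
    """Toy sketch of the decoupled decode (not the paper's exact method):
    stage 1 reconstructs coarse content from semantic tokens alone,
    stage 2 injects gated high-frequency detail residuals."""
    # Stage 1: coarse output from semantics only (identity stands in for
    # the real diffusion/flow-matching decode stage).
    coarse = list(semantics)
    # Stage 2: per-token gate computed from the semantics decides how
    # much of each detail residual is added back.
    gate = [sigmoid(s + gate_bias) for s in semantics]
    return [c + g * d for c, g, d in zip(coarse, gate, details)]
```

With a strongly negative bias the gate closes and the output stays near the coarse, semantics-only reconstruction; with a strongly positive bias the details pass through almost unchanged.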
Why it matters?
This research is important because it yields a more efficient and capable AI for handling both images and text. Cheers performs as well as or better than existing models while using significantly less computing power and training data. It is therefore a step toward making powerful multimodal AI (AI that can understand and generate multiple types of data, such as images and text) more accessible and practical for a wider range of applications.
Abstract
A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model, i.e., a unified multimodal model (UMM). However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize them within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms Tar-1.5B on the popular GenEval and MMBench benchmarks while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.
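To make the 4x token compression claim concrete, here is a minimal sketch of one common way to achieve a 4x reduction: merging each 2x2 neighborhood of patch tokens into a single token. This is an assumption for illustration only; the function name `compress_tokens_2x2` and the concatenation-based merge are hypothetical, and the paper's tokenizer almost certainly uses a learned compression rather than plain feature concatenation.

```python
def compress_tokens_2x2(tokens, grid_h, grid_w):
    """Toy sketch of 4x token compression (not the paper's tokenizer):
    merge each 2x2 block of patch tokens on a grid_h x grid_w grid into
    one token by concatenating their feature vectors."""
    assert grid_h % 2 == 0 and grid_w % 2 == 0, "grid sides must be even"
    merged = []
    for r in range(0, grid_h, 2):          # step over 2x2 blocks
        for c in range(0, grid_w, 2):
            block = []
            for dr in (0, 1):              # gather the four neighbors
                for dc in (0, 1):
                    block.extend(tokens[(r + dr) * grid_w + (c + dc)])
            merged.append(block)
    return merged                          # len(tokens) // 4 tokens
```

A 4x4 grid of 16 patch tokens becomes 4 merged tokens, so the LLM conditions on a quarter as many visual tokens; this is the kind of reduction that makes high-resolution encoding and generation cheaper.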