UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?
Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen, Yijia Fan, Qi Zhang, Yujiang Wang, Lili Qiu, Bo Li, Ziwei Liu, Caihua Shan, Yifan Yang, Yifei Shen
2026-03-04
Summary
This paper investigates whether having AI models *generate* visual content, such as images, actually helps them *understand* images better. It questions whether the recent trend of building unified models that can both understand and generate content is truly leading to improved comprehension.
What's the problem?
Currently, there aren't good ways to systematically test whether the ability to generate content actually improves a model's understanding. Existing benchmarks don't break down the specific situations where generation helps, or even hurts, understanding. Researchers need a way to pinpoint *when* and *why* generation aids comprehension.
What's the solution?
The researchers created a new, detailed benchmark called UniG2U-Bench. This benchmark includes 30 different tasks, grouped into 7 categories, that require models to use visual information in various ways – some tasks need simple observation, while others require complex reasoning about shapes, spaces, or even how images can trick your eyes. They then tested over 30 different AI models on these tasks. They found that, in general, having a model generate an intermediate image before answering (Generate-then-Answer inference) performs worse than having it answer the question directly, and unified models tend to underperform the vision-language models they are built on. However, generation *did* help with tasks involving spatial reasoning, visual illusions, and problems that require multiple steps of thinking.
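To make the two inference modes concrete, here is a minimal, hypothetical sketch of how direct inference and Generate-then-Answer inference could be compared on a benchmark like this. The model interface (`model.answer`, `model.generate_image`) and the scoring loop are illustrative assumptions, not the paper's actual API or evaluation code.

```python
# Illustrative sketch only: `model.answer` and `model.generate_image` are
# hypothetical stand-ins for a unified model's understanding and generation calls.

def direct_answer(model, image, question):
    """Direct inference: the model answers from the input image alone."""
    return model.answer(images=[image], question=question)

def generate_then_answer(model, image, question):
    """Generate-then-Answer (GtA): the model first produces an intermediate
    image (e.g. a transformed or annotated view), then answers using both."""
    intermediate = model.generate_image(
        images=[image],
        prompt=f"Produce a visual aid that helps answer: {question}",
    )
    return model.answer(images=[image, intermediate], question=question)

def evaluate(model, benchmark):
    """Score both modes per subtask to see where generation helps or hurts.
    `benchmark` maps subtask name -> list of (image, question, gold) triples."""
    scores = {"direct": {}, "gta": {}}
    for subtask, examples in benchmark.items():
        direct_hits = sum(direct_answer(model, img, q) == gold
                          for img, q, gold in examples)
        gta_hits = sum(generate_then_answer(model, img, q) == gold
                       for img, q, gold in examples)
        scores["direct"][subtask] = direct_hits / len(examples)
        scores["gta"][subtask] = gta_hits / len(examples)
    return scores
```

Comparing the two per-subtask scores is what surfaces the pattern reported above: GtA tends to lag direct answering overall, but pulls ahead on spatial, illusion, and multi-step subtasks.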
Why it matters?
This research shows that simply making a model able to generate content doesn't automatically make it smarter. It highlights that we need to be more thoughtful about how we train these models, providing them with more diverse examples and exploring new training methods. Understanding when generation helps and when it hurts is crucial for building truly intelligent AI systems that can effectively process and understand the world around them.
Abstract
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.