Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

Songlin Yang, Xianghao Kong, Anyi Rao

2026-04-14

Summary

This paper investigates why combining large language models, which are good at reasoning, with vision models, which are good at creating images, doesn't work as well as expected. The researchers call this issue 'pseudo-unification' because it *looks* like the models are working together, but they aren't truly combining their strengths.

What's the problem?

The main problem is that even though these 'unified multimodal models' are designed to combine the strengths of language and vision, they don't actually transfer the reasoning abilities of language models to the task of generating images. When asked to create images that require reasoning, they often perform poorly and give inconsistent results. Existing methods for understanding *why* this happens either lack insight into what's going on inside the model, or they ignore how the initial prompt affects the final output.

What's the solution?

To figure out what's going wrong, the researchers developed a new way to analyze these models using information theory. Their probing framework measures the entropy, or uncertainty, in how the models encode image and text inputs and how they generate their responses. Applying this analysis to ten different models, they found two key issues. First, the models encode image and text information differently: the two modalities follow different entropy trajectories, with image representations showing less variation than text. Second, when generating text, the models stay high-entropy and explore many possibilities (creative), but when generating images they stay low-entropy and prioritize accuracy (faithful). The models that perform best are those that treat both modalities more consistently.
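The summary doesn't spell out the paper's exact probing procedure, but the core quantity it relies on, the Shannon entropy of a model's per-step output distribution, is standard. Here is a minimal illustrative sketch (the helper names and toy logits are my own, not from the paper) of how high-entropy "creative" steps differ from low-entropy "fidelity" steps:

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one logit vector."""
    logits = logits - logits.max()               # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())

def entropy_trajectory(logit_sequence):
    """Per-step entropy across a generated sequence (text or image tokens)."""
    return [token_entropy(step) for step in logit_sequence]

# Toy contrast: near-uniform logits (many plausible next tokens, high entropy)
# vs. sharply peaked logits (one dominant token, low entropy).
rng = np.random.default_rng(0)
flat = rng.normal(0.0, 0.1, size=1000)           # entropy near ln(1000) ≈ 6.9 nats
peaked = np.zeros(1000)
peaked[0] = 10.0                                 # one token dominates

print(token_entropy(flat) > token_entropy(peaked))  # True
```

Comparing such trajectories between a model's text-generation steps and its image-generation steps is one way to make the "pattern-split response" claim concrete: a large, persistent entropy gap between the two modalities would indicate divergent generation behavior.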

Why it matters?

This research is important because it shows that simply combining language and vision models isn't enough to create true AI synergy. It highlights that a genuine combination requires a consistent flow of information between the different parts of the model, not just sharing the same underlying structure. This understanding can help developers build better multimodal models that can truly reason and create.

Abstract

Unified multimodal models (UMMs) were designed to combine the reasoning ability of large language models (LLMs) with the generation capability of vision models. In practice, however, this synergy remains elusive: UMMs fail to transfer LLM-like reasoning to image synthesis and exhibit divergent response behaviors. We term this phenomenon pseudo-unification. Diagnosing its internal causes is important, but existing probing methods either lack model-internal insight or ignore prompt-response dependencies. To address these limitations, we propose an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs. Applied to ten representative UMMs, our framework reveals that pseudo-unification stems from a dual divergence: (i) Modality-Asymmetric Encoding, where vision and language follow different entropy trajectories, and (ii) Pattern-Split Response, where text generation exhibits high-entropy creativity while image synthesis enforces low-entropy fidelity. Only models that unify both sides (e.g., via contextual prediction) achieve more genuine unification, enabling stronger reasoning-based text-to-image generation even with fewer parameters. Our work provides the first model-internal probing of unification, demonstrating that real multimodal synergy requires consistency in information flow, not just shared parameters.