
LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

Jiachun Jin, Zetong Zhou, Xiao Yang, Hao Zhang, Pengfei Liu, Jun Zhu, Zhijie Deng

2026-04-03

Summary

This paper introduces LatentUM, a new type of unified model designed to both understand and generate content across different kinds of data, such as images and text.

What's the problem?

Current unified models struggle because they use separate visual representations for understanding and for generating images, so passing information between the two requires decoding into pixel data and re-encoding it as an intermediate step. This round-trip is slow and lossy, because the model isn't working directly with the meaning of the image but with its raw pixel values. It's like trying to understand a book by studying the ink on the page instead of reading the words themselves.

What's the solution?

LatentUM solves this by representing all types of data (images, text, and so on) in a single, shared 'semantic space'. Think of it as a common language that every modality speaks. This lets the model reason about and generate visual content directly, without repeatedly converting images to pixels and back, making processing both more efficient and more accurate.
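The idea can be sketched in a few lines. The following toy example is only illustrative and assumes hypothetical encoders and a stand-in predictor; the paper does not specify LatentUM's actual architecture. The key point it shows is that image and text inputs are projected into one shared latent space, so a "next visual state" can be predicted as another latent vector, with no pixel decoding in the loop.

```python
import numpy as np

# Toy sketch of a shared semantic latent space. All names and shapes here
# are hypothetical illustrations, not LatentUM's real components.
rng = np.random.default_rng(0)
D = 8  # dimensionality of the shared latent space

# Modality-specific encoders project inputs into the SAME D-dim space.
W_img = rng.normal(size=(16, D))  # image-patch features -> shared latents
W_txt = rng.normal(size=(32, D))  # text-token embeddings -> shared latents

def encode_image(patches):        # patches: (n, 16)
    return patches @ W_img        # -> (n, D) latents

def encode_text(tokens):          # tokens: (m, 32)
    return tokens @ W_txt         # -> (m, D) latents

def predict_next_latent(sequence):
    # Stand-in for the unified model: pool the interleaved sequence and
    # project it. The output is a new VISUAL latent, never a pixel image.
    W_pred = np.eye(D)            # placeholder for learned weights
    return sequence.mean(axis=0) @ W_pred

# Interleaved cross-modal context: text and image latents share one sequence.
img_latents = encode_image(rng.normal(size=(4, 16)))
txt_latents = encode_text(rng.normal(size=(6, 32)))
sequence = np.concatenate([txt_latents, img_latents])  # shape (10, D)

next_visual = predict_next_latent(sequence)
print(next_visual.shape)  # a D-dim latent; pixel decoding only happens if
                          # a human ever needs to view the image
```

Because everything stays in one space, reflection on a generated image is just another step over the same latent sequence, which is what the abstract means by eliminating pixel-space mediation.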

Why does it matter?

This is important because it allows for more sophisticated AI systems that can not only generate images from text, but also understand complex visual scenes, plan actions in the real world, and even predict what will happen next in a video. By removing the pixel conversion bottleneck, LatentUM achieves better performance on tasks requiring visual reasoning and generation, and opens the door to more advanced applications.

Abstract

Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.