
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu

2026-03-11

Summary

This paper introduces Omni-Diffusion, an AI model that can understand and generate text, images, and speech within a single system, unlike most current models, which rely on the traditional autoregressive approach.

What's the problem?

Most current powerful AI models that handle multiple types of data, such as images and text, are built on a specific architecture called an autoregressive model. While these work well, researchers believe there may be better, more efficient designs. Recent successes with 'diffusion models' in areas like image creation suggest they could be a strong alternative, but they haven't been fully explored for handling multiple data types together.

What's the solution?

The researchers created Omni-Diffusion, a model based entirely on a different kind of AI called a 'mask-based discrete diffusion model'. This model represents every type of data – text, speech, and images – as a shared set of discrete tokens (building blocks), and learns how those tokens relate to each other by repeatedly masking some of them out and predicting what was hidden. Because the approach is unified, it can handle tasks involving any combination of these data types, not just pairs like image and text.
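The masking idea at the heart of this approach can be illustrated in a few lines. The snippet below is a minimal sketch of the forward (corruption) step of a masked discrete diffusion model over a single shared token vocabulary; the token ids, the `MASK` sentinel, and the `corrupt` helper are illustrative assumptions, not the paper's actual implementation.

```python
import random

MASK = -1  # hypothetical mask-token id (not from the paper)

def corrupt(tokens, t):
    """Mask each discrete token independently with probability t,
    mimicking the forward (noising) process of a masked discrete
    diffusion model. Larger t = more masking = 'later' diffusion step."""
    return [MASK if random.random() < t else tok for tok in tokens]

# Hypothetical unified sequence mixing text, image, and speech token
# ids drawn from one shared discrete vocabulary (ids are made up).
sequence = [101, 102, 7001, 7002, 7003, 9001, 9002]

random.seed(0)
t = 0.5  # diffusion "time", i.e. the expected masking ratio
noisy = corrupt(sequence, t)
masked_positions = [i for i, tok in enumerate(noisy) if tok == MASK]
# A trained model would predict the original token at each masked
# position; training minimizes cross-entropy on those positions only,
# regardless of which modality each token came from.
```

Because text, speech, and image tokens all live in the same corrupted sequence, one model and one objective cover every input/output combination, which is what makes the architecture "any-to-any".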

Why it matters?

This work shows that diffusion models are a promising foundation for building the next generation of AI that can seamlessly work with different kinds of information. Omni-Diffusion performs as well as, or even better than, existing models on various tests, suggesting this new architecture could lead to more powerful and versatile AI systems in the future.

Abstract

While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.