Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, Jinbin Bai, Qian Yu, Dengyang Jiang, Yuandong Pu, Haoxing Chen, Le Zhuo, Junjun He, Gen Luo, Tianbin Li, Ming Hu, Jin Ye, Shenglong Ye

2025-10-09

Summary

This paper introduces Lumina-DiMOO, a new, freely available (open-source) model that works with multiple types of data, such as text and images, and can both create new content and understand what it is shown.

What's the problem?

Existing multi-modal models, meaning models that handle multiple types of data, often struggle with efficiency when generating content. Most of them either generate everything one piece at a time (autoregressive), which is slow, or mix autoregressive and diffusion methods (hybrid), which adds complexity and still limits the range of tasks and data types they handle well.

What's the solution?

The researchers built Lumina-DiMOO around a technique called 'fully discrete diffusion modeling'. Instead of producing tokens one at a time, the model starts from a fully masked ("noisy") sequence and refines many positions in parallel over a few steps, guided by the instructions it is given. This makes sampling faster and more flexible than previous approaches, so the same model can create images from text, edit or inpaint existing images, and answer questions about what is in an image. A simplified sketch of this sampling idea is shown below.
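To make the idea concrete, here is a minimal, hypothetical sketch of discrete (masked) diffusion sampling in PyTorch. This is not the paper's implementation; the model interface, mask token id, sequence length, and step schedule are all assumptions made for illustration. The key point is that the model fills in many masked tokens in parallel over a few refinement steps, rather than one token at a time as in an autoregressive model.

```python
import torch

# Illustrative sketch of discrete (masked) diffusion sampling.
# MASK_ID, SEQ_LEN, NUM_STEPS, and the `model` interface are hypothetical.

MASK_ID = 0        # assumed id of the special [MASK] token
SEQ_LEN = 64       # assumed length of the token sequence to generate
NUM_STEPS = 8      # number of refinement steps (fewer steps = faster sampling)

def sample(model, prompt_tokens):
    # Start from a fully masked sequence: the discrete analogue of "pure noise".
    tokens = torch.full((1, SEQ_LEN), MASK_ID, dtype=torch.long)

    for step in range(NUM_STEPS):
        # The model predicts a distribution over the vocabulary for every
        # position, conditioned on the prompt (e.g. a text description).
        logits = model(prompt_tokens, tokens)        # (1, SEQ_LEN, vocab_size)
        probs = logits.softmax(dim=-1)
        confidence, prediction = probs.max(dim=-1)   # best guess per position

        # Only consider positions that are still masked.
        still_masked = tokens == MASK_ID
        confidence = confidence.masked_fill(~still_masked, -1.0)

        # Unmask the most confident fraction of positions this step; the rest
        # stay masked and are refined in later steps.
        num_to_unmask = max(1, int(still_masked.sum()) // (NUM_STEPS - step))
        _, top_positions = confidence.topk(num_to_unmask, dim=-1)
        tokens.scatter_(1, top_positions, prediction.gather(1, top_positions))

    return tokens  # all positions filled in parallel, not one-by-one
```

Because many tokens are committed at each step, the number of forward passes is the step count (here 8) rather than the sequence length, which is the intuition behind the higher sampling efficiency claimed for the discrete-diffusion approach.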

Why it matters?

Lumina-DiMOO is important because it outperforms other open-source unified multi-modal models on multiple benchmarks, and the researchers are releasing both the code and the trained model to the public. This lets other researchers and developers build on the work and advance multi-modal AI, potentially leading to even more powerful and creative applications.

Abstract

We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing fully discrete diffusion modeling to handle inputs and outputs across various modalities. This innovative approach allows Lumina-DiMOO to achieve higher sampling efficiency compared to previous autoregressive (AR) or hybrid AR-Diffusion paradigms and adeptly support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.