LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
Inclusion AI, Tiwei Bie, Haoxing Chen, Tieyuan Chen, Zhenglin Cheng, Long Cui, Kai Gan, Zhicheng Huang, Zhenzhong Lan, Haoquan Li, Jianguo Li, Tao Lin, Qi Qin, Hongjun Wang, Xiaomei Wang, Haoyuan Wu, Yi Xin, Junbo Zhao
2026-04-23
Summary
This paper introduces LLaDA2.0-Uni, an artificial intelligence model that can both understand and generate text and images within a single unified system.
What's the problem?
Existing AI models often handle text and images separately, requiring different systems for each. This makes it difficult for them to truly *understand* the relationship between how things look and how they are described. In addition, generating high-quality images can be slow and inefficient.
What's the solution?
The researchers built LLaDA2.0-Uni around three components. First, a tokenizer (SigLIP-VQ) converts images into sequences of discrete tokens, much like the word tokens a language model already uses. Then, a diffusion-based language model processes the text tokens and image tokens together in a single sequence, filling in masked positions block by block. Finally, a diffusion decoder turns the predicted image tokens back into actual, detailed images. They also sped up generation with prefix-aware inference optimizations in the language model and few-step distillation in the decoder, supported by a multi-stage training pipeline on carefully curated data.
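The three-stage flow described above can be sketched in miniature. This is a hypothetical illustration of the data flow only, not the LLaDA2.0-Uni API: the function names (`tokenize_image`, `backbone`, `decode_image`), the codebook size, and the toy arithmetic inside each stand-in are invented for clarity.

```python
CODEBOOK_SIZE = 4096  # illustrative VQ codebook size, not the paper's actual value

def tokenize_image(image):
    """Stand-in for the SigLIP-VQ tokenizer: maps each image patch to a
    discrete token id from a fixed codebook (here, a trivial modulo)."""
    return [patch % CODEBOOK_SIZE for patch in image]

def backbone(text_tokens, image_tokens):
    """Stand-in for the dLLM backbone: consumes text and image tokens in
    one sequence and emits target image tokens. A real model would run
    block-level masked diffusion; here we just apply a toy transform."""
    return [(t + len(text_tokens)) % CODEBOOK_SIZE for t in image_tokens]

def decode_image(image_tokens):
    """Stand-in for the diffusion decoder: reconstructs pixel data from
    discrete visual tokens (identity here, for illustration)."""
    return list(image_tokens)

# End-to-end sketch: text prompt + input image -> generated image tokens
text = [101, 102, 103]        # toy text token ids
image = [7, 515, 9000, 42]    # toy image patch values
out = decode_image(backbone(text, tokenize_image(image)))
```

The point of the sketch is the interface: once images are discrete tokens, the same sequence model handles both modalities, and only the tokenizer and decoder are image-specific.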
Why does it matter?
This model is important because it represents a step towards AI that can seamlessly work with different types of information, including text, images, and potentially more, in a unified way. This could lead to more powerful and versatile AI systems capable of complex reasoning and creative tasks, such as generating images from text descriptions or editing existing images based on instructions.
Abstract
We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Codes and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.