Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models
Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen
2025-12-17
Summary
This paper introduces a faster way to use a type of artificial intelligence model called Masked Discrete Diffusion Models, or MDMs, which are good at tasks involving different kinds of data like images and text.
What's the problem?
MDMs, while powerful, can be slow because they repeatedly process a lot of unnecessary information during the generation process. Specifically, they keep working with masked parts of the data even when those parts don't really change the final result, wasting computing time.
What's the solution?
The researchers developed a new framework called Sparse-LaViDa that intelligently cuts out these unnecessary masked parts of the data as the model is working. To make sure the final result is still good, they use special 'register tokens' that act as compact stand-ins for the important information in the cut-out parts. They also adjusted how the model learns, using a specialized attention mask during training that mimics this truncated sampling process, so the model behaves consistently during training and when actually generating results.
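The core idea of dropping redundant masked tokens while keeping a few register tokens as stand-ins can be illustrated with a small sketch. This is not the authors' code; the token names, the `keep_masked` window, and the fixed register count are all illustrative assumptions.

```python
# Hypothetical sketch of the truncation idea (not Sparse-LaViDa's actual
# implementation): at each sampling step, keep only a limited number of
# masked positions and replace the rest with a few register tokens that
# stand in for the truncated part of the sequence.

MASK = "<mask>"
REGISTER = "<reg>"

def truncate_masked(tokens, keep_masked=4, num_registers=2):
    """Keep at most `keep_masked` masked tokens; summarize the rest
    with `num_registers` register tokens appended at the end."""
    kept = []
    seen_masked = 0
    for tok in tokens:
        if tok == MASK:
            seen_masked += 1
            if seen_masked <= keep_masked:
                kept.append(tok)  # masked token inside the kept window
        else:
            kept.append(tok)      # decoded tokens are always kept
    truncated = max(0, seen_masked - keep_masked)
    if truncated > 0:
        kept.extend([REGISTER] * num_registers)  # compact stand-ins
    return kept, truncated

# 2 decoded tokens + 7 masked tokens -> 4 masks kept, 3 truncated,
# and 2 register tokens appended in their place.
seq = ["The", "cat"] + [MASK] * 7
short_seq, n_removed = truncate_masked(seq)
```

The shortened sequence is what the model would attend over at that step, which is where the speedup comes from: fewer tokens means less computation per sampling step.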
Why it matters?
This work is important because it makes MDMs significantly faster – up to twice as fast – without sacrificing the quality of the images, text, or other outputs. This speedup makes these powerful models more practical for real-world applications like creating images from text, editing existing images, and even solving math problems.
Abstract
Masked Discrete Diffusion Models (MDMs) have achieved strong performance across a wide range of multimodal tasks, including image understanding, generation, and editing. However, their inference speed remains suboptimal due to the need to repeatedly process redundant masked tokens at every sampling step. In this work, we propose Sparse-LaViDa, a novel modeling framework that dynamically truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. To preserve generation quality, we introduce specialized register tokens that serve as compact representations for the truncated tokens. Furthermore, to ensure consistency between training and inference, we design a specialized attention mask that faithfully matches the truncated sampling procedure during training. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining generation quality.