Simplified and Generalized Masked Diffusion for Discrete Data
Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, Michalis K. Titsias
2024-06-13

Summary
This paper presents a new approach to masked diffusion models for generating discrete data, such as text or images. The authors aim to simplify the training framework and improve the performance of these models compared to prior formulations.
What's the problem?
Existing formulations of masked diffusion are unnecessarily complex, and the relationships between different perspectives on them are unclear, which makes the models difficult to train effectively. This complexity has led to suboptimal parameterizations, training objectives, and ad hoc fixes, so the models perform worse than they could. Meanwhile, autoregressive models, which generate data one token at a time, have been the standard for discrete data but can be inefficient at generation time and limited in their capabilities.
What's the solution?
The authors propose a simplified framework for masked diffusion models grounded in a clear mathematical derivation. They show that the continuous-time training objective reduces to a simple weighted integral of cross-entropy losses, which makes the models easier to train. Their framework also supports generalized masking schedules that depend on the state of the data being processed. In experiments, their models outperform previous discrete diffusion models at generating text and images, achieving better scores on various benchmarks.
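To make the simplified objective concrete, here is a minimal sketch (not the authors' code) of a Monte Carlo estimate of a weighted cross-entropy loss of this kind. It assumes a linear masking schedule alpha_t = 1 - t, under which the weight -alpha'_t / (1 - alpha_t) simplifies to 1/t; the `model(z_t, t)` interface and `mask_id` token are hypothetical names used for illustration.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x, mask_id):
    """Monte Carlo estimate of a continuous-time masked-diffusion loss.

    Assumes a linear masking schedule alpha_t = 1 - t, for which the
    weight -alpha'_t / (1 - alpha_t) simplifies to 1 / t. `model` and
    `mask_id` are hypothetical; x is a (batch, length) LongTensor of tokens.
    """
    batch, length = x.shape
    t = torch.rand(batch, device=x.device).clamp(min=1e-4)  # t ~ U(0, 1]
    # Each token is independently masked with probability 1 - alpha_t = t.
    is_masked = torch.rand(batch, length, device=x.device) < t[:, None]
    z_t = torch.where(is_masked, torch.full_like(x, mask_id), x)

    logits = model(z_t, t)  # (batch, length, vocab)
    # Cross-entropy against the clean tokens, kept only at masked positions.
    ce = F.cross_entropy(logits.transpose(1, 2), x, reduction="none")
    ce = (ce * is_masked).sum(dim=1)  # (batch,)

    return (ce / t).mean()  # per-sample weight 1/t
```

Averaging over uniformly sampled t gives an unbiased estimate of the integral over time, so this objective can be trained with ordinary stochastic gradient descent.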
Why it matters?
This research is important because it makes masked diffusion models more effective and more competitive with other generative models, such as autoregressive ones. By simplifying the training process and improving performance, this work could lead to better applications in areas like natural language processing and image generation, ultimately helping create more accurate and efficient AI systems.
Abstract
Masked (or absorbing) diffusion is actively explored as an alternative to autoregressive models for generative modeling of discrete data. However, existing work in this area has been hindered by unnecessarily complex model formulations and unclear relationships between different perspectives, leading to suboptimal parameterization, training objectives, and ad hoc adjustments to counteract these issues. In this work, we aim to provide a simple and general framework that unlocks the full potential of masked diffusion models. We show that the continuous-time variational objective of masked diffusion models is a simple weighted integral of cross-entropy losses. Our framework also enables training generalized masked diffusion models with state-dependent masking schedules. When evaluated by perplexity, our models trained on OpenWebText surpass prior diffusion language models at GPT-2 scale and demonstrate superior performance on 4 out of 5 zero-shot language modeling tasks. Furthermore, our models vastly outperform previous discrete diffusion models on pixel-level image modeling, achieving 2.78 (CIFAR-10) and 3.42 (ImageNet 64×64) bits per dimension, comparable to or better than autoregressive models of similar sizes.
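For reference, the "simple weighted integral of cross-entropy losses" can be written out as follows. The notation here is an assumption of this summary, chosen to match common masked-diffusion conventions: $\alpha_t$ is the probability that a token remains unmasked at time $t$, $m$ is the mask token, $z_t$ is the partially masked sequence, and $\mu_\theta(z_t, t)^n$ is the model's predicted distribution over the clean token at position $n$:

$$\mathcal{L}(x) \;=\; \int_0^1 \frac{\alpha_t'}{1 - \alpha_t}\; \mathbb{E}_{q(z_t \mid x)}\Big[ \sum_{n:\, z_t^n = m} \log \big\langle x^n,\, \mu_\theta(z_t, t)^n \big\rangle \Big]\, dt.$$

Since $\alpha_t' \le 0$ and each log term is nonpositive, the integrand is a nonnegative weighted sum of cross-entropy losses over the masked positions; for the linear schedule $\alpha_t = 1 - t$ the weight $\alpha_t'/(1 - \alpha_t)$ is simply $-1/t$.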