DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents
Yilun Xu, Gabriele Corso, Tommi Jaakkola, Arash Vahdat, Karsten Kreis
2024-07-04

Summary
This paper introduces DisCo-Diff, a method that improves how diffusion models generate data by combining continuous and discrete latent variables, which makes the learning process easier and more effective.
What's the problem?
The main problem is that traditional diffusion models, which create new data by learning from existing data, often struggle to represent complex information. They typically encode all of the data into a single continuous Gaussian distribution (like one smooth curve), which can be challenging and inefficient, especially for complicated or varied (multimodal) data.
What's the solution?
To solve this issue, the authors introduce DisCo-Diff, which adds discrete latent variables (think of these as distinct categories or 'buckets' for organizing information) alongside the continuous ones. This combination lets the model capture different aspects of the data while simplifying the learning task: the discrete variables reduce the complexity of the mapping from noise to data, making it easier for the model to learn and generate high-quality outputs. The authors tested DisCo-Diff on a range of tasks and found that it consistently outperformed purely continuous baselines.
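To make the setup above concrete, here is a minimal, hedged sketch of a DisCo-Diff-style training step in PyTorch. The module names (`ToyEncoder`, `ToyDenoiser`), the layer sizes, the simple additive noising, and the use of a straight-through Gumbel-softmax for the discrete latents are all illustrative assumptions, not the paper's exact architecture; the point is only the structure: an encoder infers a few discrete latents from clean data, the denoiser is conditioned on them, and both are trained end-to-end.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes: a few discrete latents, each with a small codebook,
# as the summary describes. All names and dimensions here are assumptions.
NUM_LATENTS = 4
CODEBOOK = 8
DATA_DIM = 16

class ToyEncoder(nn.Module):
    """Infers discrete latents from clean data x0."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(DATA_DIM, NUM_LATENTS * CODEBOOK)

    def forward(self, x0):
        logits = self.net(x0).view(-1, NUM_LATENTS, CODEBOOK)
        # Straight-through Gumbel-softmax keeps the encoder differentiable,
        # so encoder and diffusion model can be trained end-to-end.
        return F.gumbel_softmax(logits, tau=1.0, hard=True)

class ToyDenoiser(nn.Module):
    """Predicts the noise, conditioned on the noise level t and latents z."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(NUM_LATENTS * CODEBOOK, DATA_DIM)
        self.net = nn.Linear(2 * DATA_DIM + 1, DATA_DIM)

    def forward(self, xt, t, z):
        cond = self.embed(z.flatten(1))
        return self.net(torch.cat([xt, cond, t[:, None]], dim=1))

def training_step(encoder, denoiser, x0):
    t = torch.rand(x0.shape[0])           # random noise level per sample
    noise = torch.randn_like(x0)
    xt = x0 + t[:, None] * noise          # simple additive corruption
    z = encoder(x0)                       # discrete latents from clean data
    pred = denoiser(xt, t, z)
    return ((pred - noise) ** 2).mean()   # denoising loss

encoder, denoiser = ToyEncoder(), ToyDenoiser()
x0 = torch.randn(32, DATA_DIM)
loss = training_step(encoder, denoiser, x0)
loss.backward()  # gradients flow into both denoiser and encoder
```

Because the denoiser sees which 'bucket' each example falls into, it no longer has to fit one continuous map covering every mode of the data at once, which is the intuition behind the simplified noise-to-data mapping.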
Why it matters?
This research is important because it enhances the capabilities of generative models, which are used in many applications like image synthesis and data generation. By making it easier for these models to learn from complex data, DisCo-Diff can lead to better quality outputs and more efficient processes in fields such as artificial intelligence and machine learning.
Abstract
Diffusion models (DMs) have revolutionized generative learning. They utilize a diffusion process to encode data into a simple Gaussian distribution. However, encoding a complex, potentially multimodal data distribution into a single continuous Gaussian distribution arguably represents an unnecessarily challenging learning problem. We propose Discrete-Continuous Latent Variable Diffusion Models (DisCo-Diff) to simplify this task by introducing complementary discrete latent variables. We augment DMs with learnable discrete latents, inferred with an encoder, and train DM and encoder end-to-end. DisCo-Diff does not rely on pre-trained networks, making the framework universally applicable. The discrete latents significantly simplify learning the DM's complex noise-to-data mapping by reducing the curvature of the DM's generative ODE. An additional autoregressive transformer models the distribution of the discrete latents, a simple step because DisCo-Diff requires only a few discrete variables with small codebooks. We validate DisCo-Diff on toy data, several image synthesis tasks, as well as molecular docking, and find that introducing discrete latents consistently improves model performance. For example, DisCo-Diff achieves state-of-the-art FID scores on the class-conditioned ImageNet-64/128 datasets with an ODE sampler.
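The abstract's second stage, modeling the distribution of the discrete latents with an autoregressive transformer, can be sketched as follows. This is a hedged toy version: the class name `TinyARPrior`, the start-token trick, the two-layer transformer, and all sizes are assumptions for illustration, chosen only to show why the step is cheap when there are just a few latents with small codebooks.

```python
import torch
import torch.nn as nn

# Illustrative sizes matching the "few latents, small codebooks" setting.
NUM_LATENTS, CODEBOOK = 4, 8

class TinyARPrior(nn.Module):
    """Toy autoregressive transformer over a short sequence of discrete latents."""
    def __init__(self, d=32):
        super().__init__()
        self.tok = nn.Embedding(CODEBOOK + 1, d)   # +1 for a start token
        self.pos = nn.Embedding(NUM_LATENTS, d)    # positions 0..NUM_LATENTS-1
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, CODEBOOK)

    def forward(self, tokens):
        # tokens: (batch, length) with the start token prepended
        length = tokens.shape[1]
        h = self.tok(tokens) + self.pos.weight[:length]
        causal = nn.Transformer.generate_square_subsequent_mask(length)
        return self.head(self.body(h, mask=causal))

@torch.no_grad()
def sample_latents(prior, batch=2):
    """Sample the discrete latents one at a time, left to right."""
    tokens = torch.full((batch, 1), CODEBOOK)      # start token
    for _ in range(NUM_LATENTS):
        logits = prior(tokens)[:, -1]              # distribution over next code
        nxt = torch.multinomial(logits.softmax(-1), 1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens[:, 1:]                           # drop the start token

prior = TinyARPrior()
z = sample_latents(prior)
```

With only a handful of positions and a small vocabulary, this prior is orders of magnitude cheaper than autoregressive models over long token sequences, which is why the paper describes this step as simple; at generation time, one would sample the latents like this and then condition the diffusion model's sampler on them.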