The Principles of Diffusion Models
Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, Stefano Ermon
2025-10-30
Summary
This monograph explains the fundamental ideas behind diffusion models, a powerful technique in machine learning for generating data such as images and audio. It shows how different approaches to diffusion models are connected by the same core mathematical principles.
What's the problem?
Creating realistic data, like images, is hard. Traditional methods often struggle to capture the complexity of real-world data, and existing generative models can suffer from unstable training or a lack of output diversity. The core challenge is figuring out how to start with random noise and transform it into something meaningful and realistic.
What's the solution?
Diffusion models solve this by first taking real data and gradually adding noise to it until it becomes pure noise. Then, they learn to *reverse* this process: starting with noise and slowly removing it to reconstruct the original data. The paper explains three ways to think about this reversal: as step-by-step noise removal, as following a gradient toward more likely data, or as a smooth transformation guided by a learned 'velocity field'. Essentially, they learn how to 'undo' the noise process.
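The forward "adding noise" step can be made concrete with a small sketch. The monograph's summary does not fix a particular parameterization, so this assumes the common DDPM-style variance-preserving form with a linear noise schedule; `forward_noise`, `T`, and the schedule values are illustrative choices, not the authors' notation.

```python
import numpy as np

def forward_noise(x0, t, alpha_bar, rng):
    """Sample a noised version of x0 at step t: interpolate the data
    toward pure Gaussian noise (a standard closed form, assumed here)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # illustrative linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)   # cumulative fraction of signal kept

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)          # stand-in for a data sample

x_early, _ = forward_noise(x0, 10, alpha_bar, rng)     # still mostly signal
x_late, _ = forward_noise(x0, T - 1, alpha_bar, rng)   # essentially pure noise
```

By the final step almost no signal remains (`alpha_bar[-1]` is near zero), which is exactly the "real data gradually becomes pure noise" endpoint the reverse process learns to undo.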
Why does it matter?
Diffusion models are becoming incredibly important because they can generate high-quality, diverse data that rivals or even surpasses other methods. Understanding the underlying principles, as this paper does, is crucial for improving these models, making them more efficient, and controlling what they generate. This allows for more creative applications and a deeper understanding of how these powerful tools work.
Abstract
This monograph presents the core principles that have guided the development of diffusion models, tracing their origins and showing how diverse formulations arise from shared mathematical ideas. Diffusion modeling starts by defining a forward process that gradually corrupts data into noise, linking the data distribution to a simple prior through a continuum of intermediate distributions. The goal is to learn a reverse process that transforms noise back into data while recovering the same intermediates. We describe three complementary views. The variational view, inspired by variational autoencoders, sees diffusion as learning to remove noise step by step. The score-based view, rooted in energy-based modeling, learns the gradient of the evolving data distribution, indicating how to nudge samples toward more likely regions. The flow-based view, related to normalizing flows, treats generation as following a smooth path that moves samples from noise to data under a learned velocity field. These perspectives share a common backbone: a time-dependent velocity field whose flow transports a simple prior to the data. Sampling then amounts to solving a differential equation that evolves noise into data along a continuous trajectory. On this foundation, the monograph discusses guidance for controllable generation, efficient numerical solvers, and diffusion-motivated flow-map models that learn direct mappings between arbitrary times. It provides a conceptual and mathematically grounded understanding of diffusion models for readers with basic deep-learning knowledge.