
Locality in Image Diffusion Models Emerges from Data Statistics

Artem Lukoianov, Chenyang Yuan, Justin Solomon, Vincent Sitzmann

2025-09-16


Summary

This paper investigates why diffusion models, a type of image-generating AI, behave differently from what mathematical theory predicts. It focuses on pinpointing where the gap lies between the theoretically 'perfect' way to denoise an image and what actual trained networks, such as UNets, do in practice.
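For readers who want the math: the 'perfect' denoiser mentioned above has a known closed form. In the common variance-exploding setup, where the noisy image is y = x + σε and the training images are x_1, ..., x_N (notation chosen here for illustration, not taken from the paper), the optimal denoiser is the posterior mean:

```latex
D^{*}(y, \sigma) \;=\; \mathbb{E}[x \mid y]
\;=\; \frac{\sum_{i=1}^{N} x_i \,\exp\!\left(-\tfrac{\lVert y - x_i\rVert^2}{2\sigma^2}\right)}
           {\sum_{j=1}^{N} \exp\!\left(-\tfrac{\lVert y - x_j\rVert^2}{2\sigma^2}\right)}
```

In words, it blends training images according to how close each one is to the noisy input, which is exactly why it can only reproduce what it has already seen.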

What's the problem?

Diffusion models have a theoretically perfect 'denoiser' (a formula for removing noise from an image), but sampling with this perfect denoiser just copies images from the training data instead of creating new, realistic ones. Previous research suggested the gap exists because the trained networks (UNets) are convolutional, and convolutions carry built-in assumptions about images, such as nearby pixels being strongly related. The open question is whether those architectural assumptions are really what separates trained models from the theoretical denoiser.
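Here is a minimal Python sketch of that closed-form denoiser (the function name, argument names, and the finite uniform training distribution are illustrative assumptions, not the authors' code):

```python
import numpy as np

def empirical_optimal_denoiser(y, train_images, sigma):
    """Posterior mean E[x | y] when x is drawn uniformly from a finite training set.

    This is the closed-form minimizer of the denoising objective: a
    softmax-weighted average of the training images, weighted by how close
    each one is to the noisy input y.
    """
    X = train_images.reshape(len(train_images), -1)   # (N, D) flattened images
    d2 = ((y.reshape(1, -1) - X) ** 2).sum(axis=1)    # squared distance to each x_i
    logw = -d2 / (2.0 * sigma ** 2)
    w = np.exp(logw - logw.max())                     # numerically stable softmax
    w /= w.sum()
    return (w[:, None] * X).sum(axis=0).reshape(y.shape)

# As sigma shrinks, the softmax collapses onto the single nearest training
# image, so running reverse diffusion with this denoiser reproduces training
# examples rather than generating new ones.
```

For example, passing train_images of shape (N, 28, 28) together with a noisy 28×28 array y returns a denoised 28×28 array that is a blend of the training images.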

What's the solution?

The researchers showed that 'locality' (the tendency for an output pixel to depend mostly on nearby input pixels) isn't actually *caused* by the way UNets are built. Instead, it is already present in the statistics of real-world images. They demonstrated this by fitting a simple, mathematically optimal *linear* denoiser that exhibits the same locality despite containing no convolutions at all, and by showing, both theoretically and experimentally, that the locality comes from correlations between nearby pixels in natural images, as sketched below. They then used this insight to build a better analytical denoiser that more closely matches the behavior of a trained UNet.
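As an illustration of the kind of denoiser involved, here is a minimal sketch of an optimal linear (Wiener-style) denoiser fit purely from dataset statistics; it assumes additive Gaussian noise, and the paper's exact parameterization may differ:

```python
import numpy as np

def fit_optimal_linear_denoiser(train_images, sigma):
    """Best affine denoiser x_hat = W y + b in the least-squares sense.

    For y = x + sigma * noise, the standard linear-MMSE solution is
    W = C (C + sigma^2 I)^{-1} and b = (I - W) mu, where mu and C are the
    mean and pixel covariance of the training images.
    """
    X = train_images.reshape(len(train_images), -1).astype(np.float64)  # (N, D)
    mu = X.mean(axis=0)
    C = np.cov(X, rowvar=False)                                         # (D, D) pixel covariance
    D = C.shape[0]
    W = C @ np.linalg.inv(C + sigma ** 2 * np.eye(D))
    b = (np.eye(D) - W) @ mu
    return W, b

# Row k of W shows how strongly each input pixel influences output pixel k.
# On natural-image datasets these rows concentrate around pixel k's spatial
# location: locality appears even though nothing here is convolutional.
```

The point of the sketch is that W is determined entirely by the pixel covariance of the data, so any locality visible in its rows has to come from the data's statistics rather than from an architectural prior.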

Why it matters?

This work helps us better understand how diffusion models actually work. By showing that locality comes from the data, not the model's design, it provides a more accurate foundation for improving these models and potentially creating even more realistic image generation techniques. It also suggests that we can build better theoretical models to predict and analyze the behavior of complex AI systems.

Abstract

Among generative models, diffusion models are uniquely intriguing due to the existence of a closed-form optimal minimizer of their training objective, often referred to as the optimal denoiser. However, diffusion using this optimal denoiser merely reproduces images in the training set and hence fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images that resemble those generated by a trained UNet. The best-performing method hypothesizes that shift equivariance and locality inductive biases of convolutional neural networks are the cause of the performance gap, hence incorporating these assumptions into its analytical model. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to the deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than the prior expert-crafted alternative.