Distribution Matching Variational AutoEncoder

Sen Ye, Jianning Pei, Mengde Xu, Shuyang Gu, Chunyu Wang, Liwei Wang, Han Hu

2025-12-09

Summary

This paper explores how the way images are compressed into a simpler form, called a 'latent space', affects how well we can generate new images. It introduces a new method for shaping this latent space to improve image generation quality.

What's the problem?

Current methods for compressing images into a latent space, like the autoencoders used in many image generation models, don't explicitly control *what kind* of distribution the compressed codes end up following. They squeeze the image down without asking whether that particular latent structure will be easy for the generative model to work with afterwards. As a result, it's unclear what the best way to represent images in this simplified form actually is, and that uncertainty limits how realistic the generated images can be.

What's the solution?

The researchers developed a new technique called Distribution-Matching VAE (DMVAE). This method allows them to specifically choose the type of distribution used in the latent space, instead of being stuck with the standard, often limiting, Gaussian distribution. They can match the latent space distribution to patterns found in the images themselves, or even to the 'noise' used in other image generation techniques. By experimenting with different distributions, they found that using distributions based on self-supervised learning features worked particularly well.
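The paper's exact objective isn't reproduced in this summary, but the core idea can be sketched in a few lines of code. The snippet below is a minimal, illustrative sketch, assuming a PyTorch-style encoder/decoder and using an RBF-kernel MMD as a stand-in for the distribution-matching constraint; the names (`rbf_mmd`, `dmvae_step`, the weight `lam`) are hypothetical and not taken from the official repository.

```python
# Minimal sketch (not the authors' code): train a VAE-style encoder/decoder with a
# reconstruction loss plus a distribution-matching penalty that pulls the batch of
# latents toward samples from a chosen reference distribution.
import torch
import torch.nn.functional as F

def rbf_mmd(x, y, sigma=1.0):
    """Maximum Mean Discrepancy with an RBF kernel: one simple way to compare
    two sets of samples at the distribution level (illustrative choice only)."""
    def k(a, b):
        d = torch.cdist(a, b) ** 2
        return torch.exp(-d / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def dmvae_step(encoder, decoder, images, reference_latents, lam=1.0):
    """One hypothetical training step. `reference_latents` are samples from the
    target latent distribution (Gaussian noise, diffusion-style noise, or SSL
    features of the same images), assumed to match the latent dimensionality."""
    z = encoder(images)                   # (B, D) latent codes
    recon = decoder(z)                    # reconstructed images
    rec_loss = F.mse_loss(recon, images)  # reconstruction term
    dm_loss = rbf_mmd(z.flatten(1), reference_latents.flatten(1))
    return rec_loss + lam * dm_loss       # fidelity + distribution matching
```

In the self-supervised case, `reference_latents` would simply be frozen SSL features computed from the same batch of images, which is the configuration the authors report works particularly well.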

Why it matters?

This work shows that carefully designing the latent space – the way images are compressed – is crucial for high-quality image generation. It's not just about squeezing the image down, but about choosing a compression style that makes it easier for the generative model to produce realistic images. Their results show a clear improvement in image quality, reaching a gFID of 3.2 on ImageNet after only 64 training epochs, suggesting that focusing on the structure of the latent space is a key step towards even more powerful image generation models.

Abstract

Most visual generative models compress images into a latent space before applying diffusion or autoregressive modeling. Yet, existing approaches such as VAEs and foundation-model-aligned encoders implicitly constrain the latent space without explicitly shaping its distribution, making it unclear which types of distributions are optimal for modeling. We introduce Distribution-Matching VAE (DMVAE), which explicitly aligns the encoder's latent distribution with an arbitrary reference distribution via a distribution matching constraint. This generalizes beyond the Gaussian prior of conventional VAEs, enabling alignment with distributions derived from self-supervised features, diffusion noise, or other prior distributions. With DMVAE, we can systematically investigate which latent distributions are more conducive to modeling, and we find that SSL-derived distributions provide an excellent balance between reconstruction fidelity and modeling efficiency, reaching gFID = 3.2 on ImageNet with only 64 training epochs. Our results suggest that choosing a suitable latent distribution structure (achieved via distribution-level alignment), rather than relying on fixed priors, is key to bridging the gap between easy-to-model latents and high-fidelity image synthesis. Code is available at https://github.com/sen-ye/dmvae.
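As a purely illustrative companion to the sketch above (again with assumed names, not the released implementation), the reference samples in that loss could come from any of the families the abstract lists; swapping the reference is the only change needed to the training step.

```python
# Minimal sketch (assumptions, not the paper's exact setup): three interchangeable
# ways to produce `reference_latents` for the dmvae_step sketch above.
import torch

@torch.no_grad()
def gaussian_reference(batch_size, dim):
    """Conventional VAE-style target: standard Gaussian samples."""
    return torch.randn(batch_size, dim)

@torch.no_grad()
def ssl_reference(ssl_encoder, images):
    """SSL-derived target: features of the same batch from a frozen
    self-supervised encoder; the family the paper finds works best."""
    return ssl_encoder(images)

@torch.no_grad()
def diffusion_noise_reference(images, t=0.5):
    """Diffusion-noise-style target: images blended with Gaussian noise at a
    fixed level, a simple stand-in for a noised-image distribution."""
    flat = images.flatten(1)
    return (1 - t) * flat + t * torch.randn_like(flat)
```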