Latent Diffusion Model without Variational Autoencoder
Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu
2025-10-20
Summary
This paper introduces a new way to create images using diffusion models, a type of AI that generates images from noise. It focuses on improving the process of turning images into a compressed representation that the AI can work with, and then back into a final image.
What's the problem?
Current image-generating AI often pairs a variational autoencoder (VAE) with a diffusion model. While this combination creates high-quality images, it is slow to train, slow to generate new images, and doesn't easily adapt to other vision-related tasks. The core issue is that the compressed representation a VAE produces doesn't clearly separate different visual concepts, making it hard for the AI to learn efficiently and generalize to new situations.
What's the solution?
The researchers developed a new model called SVG that skips the VAE step altogether. Instead, it uses a pre-trained self-supervised model (DINO) that's already good at understanding what's in an image to create a more organized and meaningful compressed representation, while a small, lightweight residual branch captures the fine details needed for faithful reconstruction. By training the diffusion model directly on this better-organized representation, they speed up training, allow for faster image generation, and improve the overall quality of the images.
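The latent construction described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: `frozen_semantic_encoder` stands in for the frozen DINO feature extractor, and all names, dimensions, and the residual-branch form are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_SEM, D_RES = 768, 64, 8  # hypothetical feature sizes

# Fixed random projection standing in for pretrained DINO features: it is
# never updated during training, so its semantic structure is preserved.
W_SEM = rng.standard_normal((D_IN, D_SEM)) / np.sqrt(D_IN)

def frozen_semantic_encoder(x):
    """Stand-in for frozen, self-supervised DINO features."""
    return x @ W_SEM

def residual_branch(x, w_res):
    """Lightweight trainable branch capturing fine-grained detail."""
    return np.tanh(x @ w_res)

def encode(x, w_res):
    """SVG-style latent: semantic features concatenated with a small
    residual channel -- no VAE sampling step is involved."""
    return np.concatenate(
        [frozen_semantic_encoder(x), residual_branch(x, w_res)], axis=-1
    )

# Example: encode a batch of 4 image features into the joint latent.
w_res = rng.standard_normal((D_IN, D_RES)) / np.sqrt(D_IN)  # trainable
x = rng.standard_normal((4, D_IN))
z = encode(x, w_res)
print(z.shape)  # (4, 72): 64 frozen semantic dims + 8 residual dims
```

The key design point the sketch reflects is that the semantic half of the latent is deterministic and frozen, so the diffusion model trains on a space whose discriminative structure is inherited rather than learned from scratch.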
Why it matters?
This work is important because it offers a more efficient and versatile approach to image generation. By creating a system that understands images better from the start, it paves the way for AI that can not only generate realistic images quickly but also apply that understanding to a wider range of visual tasks, ultimately leading to more powerful and adaptable AI systems.
Abstract
Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations.
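To make concrete what "training diffusion models directly on this semantically structured latent space" means, here is a generic epsilon-prediction sketch in NumPy. It shows the standard forward-noising step and denoising loss applied to the SVG-style latent `z` instead of a VAE latent; the cosine schedule, the toy denoiser, and all names are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_alpha_bar(t):
    """Cumulative signal level at time t in [0, 1] (cosine schedule)."""
    return np.cos(0.5 * np.pi * t) ** 2

def noise_latent(z, t):
    """Forward diffusion applied directly to the semantic latent z,
    rather than to a VAE latent."""
    a = cosine_alpha_bar(t)
    eps = rng.standard_normal(z.shape)
    z_t = np.sqrt(a) * z + np.sqrt(1.0 - a) * eps
    return z_t, eps

def denoising_loss(predict_eps, z, t):
    """Standard epsilon-prediction objective, evaluated in latent space."""
    z_t, eps = noise_latent(z, t)
    return float(np.mean((predict_eps(z_t, t) - eps) ** 2))

# Toy "denoiser" that predicts zeros, for illustration only: a real model
# would be a network trained to recover eps from (z_t, t).
z = rng.standard_normal((4, 72))  # batch of SVG-style latents
loss = denoising_loss(lambda z_t, t: np.zeros_like(z_t), z, t=0.5)
```

Because the latent space is already well-separated semantically, the denoiser's job is easier, which is the mechanism behind the faster training and few-step sampling the abstract reports.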