Boosting Latent Diffusion Models via Disentangled Representation Alignment

John Page, Xuesong Niu, Kai Wu, Kun Gai

2026-01-13

Summary

This paper focuses on improving how images are compressed into a simpler form, called a 'latent space', before being generated by AI models like Latent Diffusion Models. It's about making the compression process better so the AI can create higher quality images more efficiently.

What's the problem?

Currently, AI models use a component called a VAE to compress images. Researchers have been trying to improve VAEs by aligning them with larger, more powerful image understanding models. However, the paper argues that VAEs and the image generation models need *different* things from this compressed image representation. Image generators want a broad understanding of the image's content, while VAEs should focus on breaking down the image into its individual characteristics, like color or shape, in a very organized way. Using the same alignment method for both isn't ideal.

What's the solution?

The researchers propose a new VAE called the Semantic disentangled VAE (Send-VAE). It is specifically designed to produce a compressed image representation that neatly separates different image characteristics. It does this by aligning the compressed representation with a pre-trained image understanding model, but through a special non-linear 'mapper' network that translates the information in a way that emphasizes these individual characteristics. This lets the AI both understand the overall image and easily control specific details during generation.
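To make the mapper idea concrete, here is a minimal NumPy sketch (not the authors' code) of the alignment step: a small non-linear mapper projects VAE latents into the feature space of a frozen vision foundation model (VFM), and an alignment loss pulls mapped latents toward the VFM features. All shapes, weights, and function names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mapper(z, W1, W2):
    """Two-layer non-linear mapper: VAE latent -> VFM feature space."""
    h = np.maximum(0.0, z @ W1)  # ReLU hidden layer
    return h @ W2

def alignment_loss(mapped, vfm_feat):
    """1 - cosine similarity, averaged over the batch."""
    m = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    v = vfm_feat / np.linalg.norm(vfm_feat, axis=1, keepdims=True)
    cos = np.sum(m * v, axis=1)
    return float(np.mean(1.0 - cos))

# Illustrative sizes: a batch of VAE latents and frozen VFM features.
batch, z_dim, hid, vfm_dim = 8, 16, 32, 64
z = rng.standard_normal((batch, z_dim))           # VAE latents
vfm_feat = rng.standard_normal((batch, vfm_dim))  # frozen VFM features
W1 = rng.standard_normal((z_dim, hid)) * 0.1      # mapper weights
W2 = rng.standard_normal((hid, vfm_dim)) * 0.1

loss = alignment_loss(mapper(z, W1, W2), vfm_feat)
print(loss)  # scalar in [0, 2]; training would minimize it
```

In the real system this loss would be one term in the VAE's training objective, and the VFM stays frozen so its semantic hierarchy acts purely as a guidance signal.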

Why it matters?

This work is important because it leads to faster training and better image quality. By improving the way images are compressed, the AI can learn more efficiently and produce more realistic and detailed images, achieving state-of-the-art results on standard image generation benchmarks.

Abstract

Latent Diffusion Models (LDMs) generate high-quality images by operating in a compressed latent space, typically obtained through image tokenizers such as Variational Autoencoders (VAEs). In pursuit of a generation-friendly VAE, recent studies have explored leveraging Vision Foundation Models (VFMs) as representation alignment targets for VAEs, mirroring the approach commonly adopted for LDMs. Although this yields certain performance gains, using the same alignment target for both VAEs and LDMs overlooks their fundamentally different representational requirements. We advocate that while LDMs benefit from latents retaining high-level semantic concepts, VAEs should excel in semantic disentanglement, enabling encoding of attribute-level information in a structured way. To address this, we propose the Semantic disentangled VAE (Send-VAE), explicitly optimized for disentangled representation learning through aligning its latent space with the semantic hierarchy of pre-trained VFMs. Our approach employs a non-linear mapper network to transform VAE latents, aligning them with VFMs to bridge the gap between attribute-level disentanglement and high-level semantics, facilitating effective guidance for VAE learning. We evaluate semantic disentanglement via linear probing on attribute prediction tasks, showing strong correlation with improved generation performance. Finally, using Send-VAE, we train flow-based transformers SiTs; experiments show Send-VAE significantly speeds up training and achieves a state-of-the-art FID of 1.21 and 1.75 with and without classifier-free guidance on ImageNet 256x256.
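The abstract's linear-probing evaluation can be sketched as follows: freeze the latents, fit only a linear classifier on an attribute label, and read off accuracy; higher probe accuracy is taken as evidence that attribute information is encoded in a linearly decodable, more disentangled way. The data and attribute below are synthetic assumptions, not the paper's benchmark.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, n_cls = 200, 32, 4

# Synthetic "latents" whose early coordinates carry an attribute signal.
labels = rng.integers(0, n_cls, size=n)
latents = rng.standard_normal((n, d))
latents[np.arange(n), labels] += 3.0  # inject a linearly decodable attribute

# Linear probe: ridge regression to one-hot targets, predict by argmax.
one_hot = np.eye(n_cls)[labels]
X = np.hstack([latents, np.ones((n, 1))])  # add bias column
W = np.linalg.solve(X.T @ X + 1e-3 * np.eye(d + 1), X.T @ one_hot)
preds = np.argmax(X @ W, axis=1)
acc = float(np.mean(preds == labels))
print(acc)  # well above chance (0.25), since the attribute is linear here
```

Because the probe is purely linear, its accuracy reflects how the latents are organized rather than the capacity of the classifier, which is what makes it a reasonable proxy for semantic disentanglement.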