Unified Latents (UL): How to train your latents
Jonathan Heek, Emiel Hoogeboom, Thomas Mensink, Tim Salimans
2026-02-20
Summary
This paper introduces Unified Latents (UL), a new way of creating compressed representations of images and videos that aims to be both high quality and efficient to train.
What's the problem?
Existing methods for compressing images and videos struggle to balance quality against the amount of data needed to store them: they either produce blurry results when highly compressed, or require a lot of storage to preserve fine detail. In particular, methods built on diffusion models, which are known for generating realistic images, can be very expensive to train.
What's the solution?
The researchers developed a system in which an 'encoder' compresses the image or video into a smaller 'latent' representation and a 'decoder' reconstructs it. The key idea is that both the encoder and the decoder are tied to a 'diffusion prior,' which acts as a guide ensuring the latent contains all the information needed for high-quality reconstruction. By matching the noise in the encoder's output to the prior's minimum noise level, they simplified training and achieved better compression rates without sacrificing quality; a minimal sketch of this setup appears below.
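To make the training setup concrete, here is a minimal sketch in JAX. Everything in it is illustrative, not the paper's implementation: the tiny MLPs stand in for the real encoder, prior, and decoder networks, `SIGMA_MIN` plays the role of the prior's minimum noise level that the encoder's output noise is matched to, and the simple variance-exploding noising is one plausible choice of diffusion process.

```python
import jax
import jax.numpy as jnp

# Illustrative hyperparameters (not the paper's values).
SIGMA_MIN = 0.05                   # prior's minimum noise level, matched by the encoder
IMAGE_DIM, LATENT_DIM = 3072, 64   # toy sizes: a flattened 32x32x3 image

def init_mlp(key, d_in, d_hidden, d_out):
    k1, k2 = jax.random.split(key)
    return {'w1': jax.random.normal(k1, (d_in, d_hidden)) * 0.02,
            'b1': jnp.zeros(d_hidden),
            'w2': jax.random.normal(k2, (d_hidden, d_out)) * 0.02,
            'b2': jnp.zeros(d_out)}

def mlp(p, x):
    # Stand-in for the real encoder / prior / decoder backbones.
    return jnp.tanh(x @ p['w1'] + p['b1']) @ p['w2'] + p['b2']

def ul_loss(params, x, key):
    k_enc, k_t, k_zeps, k_s, k_xeps = jax.random.split(key, 5)

    # 1) Encode, then add noise at the prior's minimum noise level, so the
    #    encoder's output lines up with the start of the prior's diffusion.
    z = mlp(params['enc'], x) + SIGMA_MIN * jax.random.normal(
        k_enc, (x.shape[0], LATENT_DIM))

    # 2) Diffusion prior over latents: a denoising loss at noise levels from
    #    SIGMA_MIN upward; this is the term that bounds the latent bitrate.
    t = jax.random.uniform(k_t, (x.shape[0], 1), minval=SIGMA_MIN, maxval=1.0)
    eps_z = jax.random.normal(k_zeps, z.shape)
    z_t = z + t * eps_z                          # simplified forward process
    prior_loss = jnp.mean(
        (mlp(params['prior'], jnp.concatenate([z_t, t], -1)) - eps_z) ** 2)

    # 3) Diffusion decoder: denoise a noised image conditioned on the latent.
    s = jax.random.uniform(k_s, (x.shape[0], 1))
    eps_x = jax.random.normal(k_xeps, x.shape)
    x_s = x + s * eps_x
    recon_loss = jnp.mean(
        (mlp(params['dec'], jnp.concatenate([x_s, z, s], -1)) - eps_x) ** 2)

    # Rate (prior) + distortion (decoder), trained jointly.
    return recon_loss + prior_loss

key = jax.random.PRNGKey(0)
k1, k2, k3, k4, k5 = jax.random.split(key, 5)
params = {'enc':   init_mlp(k1, IMAGE_DIM, 256, LATENT_DIM),
          'prior': init_mlp(k2, LATENT_DIM + 1, 256, LATENT_DIM),
          'dec':   init_mlp(k3, IMAGE_DIM + LATENT_DIM + 1, 256, IMAGE_DIM)}
x = jax.random.normal(k4, (8, IMAGE_DIM))       # toy batch of flattened images
loss, grads = jax.value_and_grad(ul_loss)(params, x, k5)
```

Note that gradients from both the prior term and the reconstruction term flow back into the encoder, which is how the latent ends up 'jointly regularized by a diffusion prior and decoded by a diffusion model' in the sense the abstract describes.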
Why it matters?
This work is important because it enables more efficient storage and transmission of images and videos, and provides better latent spaces for generative models such as Stable Diffusion. It matches or beats existing methods while using less compute for training, making it more practical for real-world applications like video streaming, image editing, and AI-powered content creation.
Abstract
We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder's output noise to the prior's minimum noise level, we obtain a simple training objective that provides a tight upper bound on the latent bitrate. On ImageNet-512, our approach achieves a competitive FID of 1.4 with high reconstruction quality (PSNR), while requiring fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, we set a new state-of-the-art FVD of 1.3.
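For readers who want the shape of that bitrate bound: in a standard variational reading (our illustration; the symbols below are not taken from the paper), the negative log-likelihood splits into a reconstruction term handled by the diffusion decoder and a KL term against the diffusion prior, and it is the KL term that upper-bounds the latent bitrate:

$$
-\log p_\theta(x) \;\le\; \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[-\log p_\theta(x \mid z)\right]}_{\text{distortion: diffusion decoder}} \;+\; \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p_\theta(z)\right)}_{\text{rate: latent bitrate}}
$$

Because the encoder's output noise is fixed at the prior's minimum noise level $\sigma_{\min}$, the KL term can plausibly be rewritten, up to constants, as a weighted denoising loss over the prior's noise levels above $\sigma_{\min}$, which is what would make the single training objective tractable.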