
ε-VAE: Denoising as Visual Decoding

Long Zhao, Sanghyun Woo, Ziyu Wan, Yandong Li, Han Zhang, Boqing Gong, Hartwig Adam, Xuhui Jia, Ting Liu

2024-10-09


Summary

This paper introduces ε-VAE, a model that improves image generation by replacing the standard one-step decoder with a new method called denoising as decoding, which refines images iteratively for higher quality.

What's the problem?

Traditional methods for generating images use an autoencoder, where an encoder compresses the image data into a simpler latent form and a decoder reconstructs the original image in a single step. This single-step reconstruction can lose important details and produce lower-quality images, especially when dealing with complex visual data.
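
To make this standard setup concrete, here is a minimal toy autoencoder sketch in PyTorch: the encoder compresses an image into a compact latent vector, and the decoder reconstructs it in one forward pass. The architecture, layer sizes, and names are illustrative assumptions for this summary, not the paper's actual model.

```python
import torch
import torch.nn as nn

# Toy autoencoder (illustrative, not the paper's architecture):
# the encoder compresses a 64x64 RGB image into a small latent,
# and the decoder reconstructs it in a single forward pass.
class ToyAutoencoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),   # 32 -> 64
        )

    def forward(self, x):
        z = self.encoder(x)    # compress to a compact latent
        return self.decoder(z) # reconstruct in one step

x = torch.randn(1, 3, 64, 64)
recon = ToyAutoencoder()(x)
print(recon.shape)  # torch.Size([1, 3, 64, 64])
```

The single `self.decoder(z)` call is the step the paper targets: all the lost detail must be recovered at once, with no opportunity to refine the output.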

What's the solution?

The authors propose a new approach where, instead of a standard decoder, they use a diffusion process that gradually refines a noisy version of the image into a clearer one. This method allows multiple steps of improvement, each guided by the compressed latents from the encoder. Tested against existing methods, the new model produced higher-quality images in both reconstruction and generation tasks.
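
The sketch below illustrates the denoising-as-decoding idea in a heavily simplified form: decoding starts from pure noise and loops over refinement steps, each conditioned on the encoder's latent. The denoiser architecture, the toy update rule, and all names here are assumptions made for illustration; the paper's actual diffusion decoder and sampling procedure differ.

```python
import torch
import torch.nn as nn

# Hypothetical denoiser: predicts a cleaner image from the current noisy
# image, a timestep, and the encoder's latent code. Purely illustrative.
class LatentConditionedDenoiser(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.latent_proj = nn.Linear(latent_dim, 64 * 64)  # broadcast latent over pixels
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1 + 1, 32, 3, padding=1),  # image + latent map + timestep map
            nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, x_t, t, z):
        b = x_t.shape[0]
        z_map = self.latent_proj(z).view(b, 1, 64, 64)
        t_map = t.view(b, 1, 1, 1).expand(b, 1, 64, 64)
        return self.net(torch.cat([x_t, z_map, t_map], dim=1))

@torch.no_grad()
def decode_by_denoising(denoiser, z, steps=10):
    """Decoding as iterative refinement: start from pure noise and repeatedly
    move toward the denoiser's prediction, conditioned on the latent z.
    The update rule is a toy relaxation, not the paper's sampler."""
    x = torch.randn(z.shape[0], 3, 64, 64)
    for i in reversed(range(steps)):
        t = torch.full((z.shape[0],), i / steps)
        x0_pred = denoiser(x, t, z)      # predicted clean image at this step
        x = x + (x0_pred - x) / (i + 1)  # ease toward the prediction
    return x

z = torch.randn(2, 128)
img = decode_by_denoising(LatentConditionedDenoiser(), z)
print(img.shape)  # torch.Size([2, 3, 64, 64])
```

The key structural difference from the one-step decoder above is the loop: the latent guides every refinement step, so quality can improve with more steps at the cost of extra compute.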

Why does it matter?

This research is significant because it offers a fresh perspective on how to improve image generation techniques. By enhancing the way images are reconstructed, ε-VAE could benefit applications in computer graphics, virtual reality, and any other area where high-quality image generation is crucial.

Abstract

In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space. For high-dimensional visual data, it reduces redundancy and emphasizes key features for high-quality generation. Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input. In this work, we offer a new perspective by proposing denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder. We evaluate our approach by assessing both reconstruction (rFID) and generation quality (FID), comparing it to state-of-the-art autoencoding approaches. We hope this work offers new insights into integrating iterative generation and autoencoding for improved compression and generation.