
Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

Jaehyeon Kim, Taehong Moon, Keon Lee, Jaewoong Cho

2024-12-16


Summary

This paper introduces ResGen, a model that uses Residual Vector Quantization (RVQ) to generate high-quality images and speech efficiently while maintaining fast sampling speeds.

What's the problem?

Generating high-fidelity images or sounds with vector-quantized models typically requires many tokens (small units of data), and traditional methods predict these tokens one at a time, which slows down generation as the number of tokens grows. This makes it hard to balance quality and speed in generative models.

What's the solution?

The authors introduce ResGen, which employs RVQ to improve the generation process. Instead of predicting tokens one by one, ResGen directly predicts the vector embedding of a group of tokens at once, which speeds up generation while still producing high-quality results. They also use token masking and multi-token prediction, formulated as a discrete diffusion process, to enhance performance. ResGen was tested on two tasks: conditional image generation on ImageNet and zero-shot text-to-speech synthesis, where it outperformed existing autoregressive models.
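To give a feel for the quantization scheme ResGen builds on, here is a minimal NumPy sketch of residual vector quantization: each depth level quantizes the residual left over by the previous level, so deeper levels refine the approximation. The codebooks, shapes, and values are toy illustrations, not the paper's actual implementation.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage picks the nearest
    codeword to the current residual, then subtracts it, so later
    stages encode finer detail."""
    residual = x.astype(np.float64).copy()
    codes = []
    for cb in codebooks:  # cb has shape (K, D): K codewords of dimension D
        dists = np.linalg.norm(residual[None, :] - cb, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, residual

def rvq_decode(codes, codebooks):
    # Reconstruction is the sum of the chosen codewords across all depths.
    return sum(cb[i] for cb, i in zip(codebooks, codes))

# Toy example: 4 codebooks of 16 codewords each, for 8-dim vectors.
rng = np.random.default_rng(0)
D, K, depth = 8, 16, 4
codebooks = [rng.normal(size=(K, D)) for _ in range(depth)]
x = rng.normal(size=D)
codes, residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
# x_hat + residual recovers x exactly, by construction.
```

A sample is thus represented by one code index per depth level; scaling the RVQ depth adds more refinement stages, which is the axis along which the paper reports improved fidelity or faster sampling.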

Why it matters?

This research is important because it shows how to generate high-quality content more efficiently, which can benefit various applications like image creation, music production, and voice synthesis. By improving both the quality and speed of generative models, ResGen could lead to advancements in AI technologies used in entertainment, education, and more.

Abstract

We explore the use of Residual Vector Quantization (RVQ) for high-fidelity generation in vector-quantized generative models. This quantization technique maintains higher data fidelity by employing more in-depth tokens. However, increasing the token number in generative models leads to slower inference speeds. To this end, we introduce ResGen, an efficient RVQ-based discrete diffusion model that generates high-fidelity samples without compromising sampling speed. Our key idea is a direct prediction of vector embedding of collective tokens rather than individual ones. Moreover, we demonstrate that our proposed token masking and multi-token prediction method can be formulated within a principled probabilistic framework using a discrete diffusion process and variational inference. We validate the efficacy and generalizability of the proposed method on two challenging tasks across different modalities: conditional image generation on ImageNet 256x256 and zero-shot text-to-speech synthesis. Experimental results demonstrate that ResGen outperforms autoregressive counterparts in both tasks, delivering superior performance without compromising sampling speed. Furthermore, as we scale the depth of RVQ, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models. The project page can be found at https://resgen-genai.github.io