VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, Chun Yuan
2025-12-12
Summary
This paper introduces VQRAE, a new method for building a single system that can both understand and generate images, essentially handling both tasks with one 'brain'. It focuses on representing images in a way that works well for understanding what's in an image, reconstructing it faithfully, and creating new images.
What's the problem?
Currently, most AI systems that deal with images use separate methods for understanding images (like identifying objects) and generating images (like creating a picture from a description). This is inefficient. Researchers have tried combining these approaches, but they often struggle to build a single system that excels at both: existing methods typically prioritize either understanding or generation, or require complex balancing techniques.
What's the solution?
VQRAE tackles this by using a special type of neural network called a Representation AutoEncoder with Vector Quantization. Think of it like compressing and decompressing an image, but in a smart way: it first learns to compress images into a set of 'codes' that capture the image's meaning, then uses these codes to reconstruct the original image. Crucially, it produces both continuous codes for understanding and discrete codes for generating new images, all within the same system. Training happens in two steps: first the codes are learned while the image encoder is kept fixed, then the entire system is refined to work even better. The authors also found that, surprisingly, codes with many dimensions (a high-dimensional codebook, where each code is a long vector) work well for quantizing semantic features; a sketch of the overall pipeline follows.
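To make the pipeline concrete, here is a minimal PyTorch sketch of the idea, not the authors' code: a pretrained encoder produces continuous features, a vector quantizer with a high-dimensional codebook snaps them to discrete tokens, and a symmetric decoder reconstructs pixels. The class names, the codebook size of 16384 entries, and the loss weights are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticVQ(nn.Module):
    """Vector quantizer with a high-dimensional codebook (e.g. 1536-dim codes)."""

    def __init__(self, num_codes=16384, dim=1536):  # num_codes is an assumption
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                            # z: (B, N, dim) continuous features
        B, N, D = z.shape
        dists = torch.cdist(z.reshape(-1, D), self.codebook.weight)
        idx = dists.argmin(dim=-1)                   # nearest code per token
        z_q = self.codebook(idx).reshape(B, N, D)    # quantized (discrete) features
        # codebook + commitment losses pull codes and encoder features together
        vq_loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
        # straight-through estimator: gradients bypass the hard argmin
        z_q = z + (z_q - z).detach()
        return z_q, idx.reshape(B, N), vq_loss


class VQRAESketch(nn.Module):
    """Pretrained ViT encoder -> semantic VQ -> symmetric ViT decoder."""

    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder    # pretrained vision foundation model (ViT)
        self.vq = SemanticVQ()
        self.decoder = decoder    # symmetric ViT decoder that reconstructs pixels

    def forward(self, images):
        z = self.encoder(images)          # continuous codes -> understanding
        z_q, ids, vq_loss = self.vq(z)    # discrete codes  -> generation
        recon = self.decoder(z_q)         # pixel reconstruction
        return z, ids, recon, vq_loss
```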
Why it matters?
This research is important because it moves us closer to building AI systems that can truly 'see' and 'create' like humans. A unified system is more efficient and potentially more powerful than separate systems. The ability to generate images is useful for many applications, like creating art, designing products, or even helping people with visual impairments. Better image understanding, in turn, improves tasks like image search and object recognition.
Abstract
Unifying multimodal understanding, generation, and reconstruction representations in a single tokenizer remains a key challenge in building unified models. Previous research predominantly addresses this with a dual-encoder paradigm, e.g., utilizing separate encoders for understanding and generation, or balancing semantic representations and low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers the exploration of a unified representation that produces Continuous semantic features for image understanding and Discrete tokens for visual generation within a unified tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, we freeze the encoder and learn a high-dimensional semantic VQ codebook with a pixel reconstruction objective; then we jointly optimize the encoder under self-distillation constraints. This design preserves semantic information with negligible loss, maintaining multimodal understanding ability while providing discrete tokens compatible with generation and fine-grained reconstruction. Besides, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the previous common practice of low-dimensional codebooks in image reconstruction. The semantic VQ codebook achieves a 100% utilization ratio at a dimension of 1536. VQRAE presents competitive performance on several benchmarks of visual understanding, generation, and reconstruction, and shows a promising scaling property in the autoregressive paradigm owing to its discrete tokens.
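The two-stage schedule described in the abstract could look roughly like the sketch below, which reuses the VQRAESketch module from the earlier sketch: stage one freezes the encoder and trains the codebook and decoder with a pixel reconstruction loss, and stage two unfreezes the encoder while a frozen copy of the original encoder supplies a self-distillation target. The exact loss forms, weights, and optimizer settings are assumptions, not the paper's specification.

```python
import copy

import torch
import torch.nn.functional as F


def train_vqrae(model, loader, steps_stage1, steps_stage2, lr=1e-4, distill_w=1.0):
    # Stage 1: freeze the pretrained encoder; train the semantic VQ codebook and
    # the symmetric decoder with a pixel reconstruction objective.
    for p in model.encoder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(
        list(model.vq.parameters()) + list(model.decoder.parameters()), lr=lr
    )
    for step, images in enumerate(loader):
        if step >= steps_stage1:
            break
        _, _, recon, vq_loss = model(images)
        loss = F.mse_loss(recon, images) + vq_loss
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: jointly optimize the encoder as well, with a frozen copy of the
    # original encoder acting as a self-distillation teacher so the continuous
    # features keep their semantics for multimodal understanding.
    teacher = copy.deepcopy(model.encoder).eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    for p in model.encoder.parameters():
        p.requires_grad_(True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for step, images in enumerate(loader):
        if step >= steps_stage2:
            break
        z, _, recon, vq_loss = model(images)
        with torch.no_grad():
            z_teacher = teacher(images)
        distill = F.mse_loss(z, z_teacher)  # assumed form of the distillation constraint
        loss = F.mse_loss(recon, images) + vq_loss + distill_w * distill
        opt.zero_grad()
        loss.backward()
        opt.step()
```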