QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation

Yue Zhao, Fuzhao Xue, Scott Reed, Linxi Fan, Yuke Zhu, Jan Kautz, Zhiding Yu, Philipp Krähenbühl, De-An Huang

2025-02-10

Summary

This paper introduces QLIP, a new method that lets a single AI model both understand and generate text and images. It improves how images are broken into smaller parts, called tokens, so that those tokens align better with text descriptions.

What's the problem?

AI models often struggle to handle tasks that involve both text and images at the same time. Current methods for turning images into tokens either lose important details or don't align well with text, making it hard for the AI to understand or generate multimodal content effectively.

What's the solution?

The researchers developed QLIP, which uses a technique called binary spherical quantization to break images into tokens while preserving visual quality. They trained the system to balance two goals: reconstructing the original image faithfully from its tokens, and aligning those tokens with related text. This allows a single model built on QLIP to perform tasks like generating images from text or describing images in words.
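To make the tokenization step concrete, here is a minimal numpy sketch of binary spherical quantization: the encoder output is projected onto the unit hypersphere and each coordinate is snapped to +/- 1/sqrt(d). This is an illustrative simplification (function name and shapes are ours); the paper's version also uses a straight-through gradient estimator so the autoencoder can be trained end-to-end, which is omitted here.

```python
import numpy as np

def binary_spherical_quantize(z):
    """Quantize a latent vector z to a binary spherical code.

    Each output coordinate is +/- 1/sqrt(d), so the quantized
    vector always lies on the unit hypersphere. Sketch only:
    the straight-through gradient used for training is omitted.
    """
    d = z.shape[-1]
    # Project onto the unit hypersphere.
    u = z / np.linalg.norm(z, axis=-1, keepdims=True)
    # Binarize each coordinate by its sign (treat 0 as +1).
    codes = np.sign(u)
    codes[codes == 0] = 1.0
    return codes / np.sqrt(d)
```

Because every code is one of 2^d sign patterns, each image patch can be stored as a d-bit token index, which is what makes the representation usable by an auto-regressive model.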

Why it matters?

This matters because it makes AI systems better at handling tasks that combine text and images, such as creating art from descriptions or answering questions about pictures. By unifying these abilities in one model, QLIP opens up new possibilities for more advanced and flexible AI applications in areas like education, design, and entertainment.

Abstract

We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder for LLaVA and the image tokenizer for LlamaGen with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
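The abstract notes that the reconstruction and language-image alignment losses are balanced dynamically during training. The paper does not spell out the rule here, so the sketch below uses one common heuristic, inverse loss-magnitude weighting, purely as an illustrative assumption of what "dynamic balancing" can look like.

```python
def balanced_loss(recon_loss, align_loss, eps=1e-8):
    """Combine two losses with dynamic weights.

    Illustrative assumption: each term is weighted by the inverse
    of its current magnitude, so neither objective dominates when
    their scales differ. This is NOT the paper's exact scheme.
    """
    w_recon = 1.0 / (abs(recon_loss) + eps)
    w_align = 1.0 / (abs(align_loss) + eps)
    return w_recon * recon_loss + w_align * align_loss
```

Under this rule each weighted term contributes roughly equally regardless of raw scale, which is the intuition behind letting the two objectives coexist rather than trade off.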