Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient

Zigeng Chen, Xinyin Ma, Gongfan Fang, Xinchao Wang

2024-11-28

Summary

This paper introduces Collaborative Decoding (CoDe), a technique that makes Visual Auto-Regressive (VAR) image-generation models faster and more memory-efficient.

What's the problem?

VAR models are great at generating images, but they build each image coarse-to-fine across many scales, which produces a very long token sequence. Running one large model over that entire sequence consumes a lot of memory and compute, and the later, larger scales contain the vast majority of the tokens, so generation ends up slow and expensive.

What's the solution?

The authors propose CoDe, which splits the multi-scale decoding process between two models: a large "drafter" that generates the coarse, low-frequency structure of the image at the early, small scales, and a much smaller "refiner" that fills in high-frequency detail at the later, large scales. Because the expensive large model only runs for the short early steps, the pair produces high-quality images much faster and with roughly half the memory, while losing almost no quality.
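
A minimal sketch of this idea in Python, assuming a VAR-style next-scale schedule and placeholder models (the names code_generate, drafter, refiner, draft_steps, and the stub token maps are illustrative assumptions, not the paper's actual API):

    import torch

    # Hypothetical coarse-to-fine scale schedule, in the style of VAR's
    # next-scale prediction (token maps from 1x1 up to 16x16 for 256px).
    SCALES = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]

    def code_generate(drafter, refiner, draft_steps=6):
        """Sketch of CoDe-style collaborative decoding.

        The large `drafter` predicts the first `draft_steps` coarse scales
        (low-frequency structure); the small `refiner` predicts the
        remaining fine scales (high-frequency detail), conditioned on the
        drafter's tokens as its prefix.
        """
        token_maps = []  # accumulated multi-scale token maps
        for step, scale in enumerate(SCALES):
            model = drafter if step < draft_steps else refiner
            # Each model predicts the next-scale token map given all
            # previously generated scales.
            token_maps.append(model(token_maps, scale))
        return token_maps

    def make_stub_model(vocab_size=4096):
        # Stand-in for a VAR transformer: returns a random scale x scale
        # map of token indices.
        def model(prefix, scale):
            return torch.randint(0, vocab_size, (scale, scale))
        return model

    drafter, refiner = make_stub_model(), make_stub_model()
    maps = code_generate(drafter, refiner, draft_steps=6)
    print([tuple(m.shape) for m in maps])  # (1, 1) ... (16, 16)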

Why it matters?

This research is important because it improves how quickly and efficiently images can be generated using AI. Faster image generation can benefit many applications, such as video games, movies, and virtual reality, where high-quality visuals are essential but need to be produced quickly.

Abstract

In the rapidly advancing field of image generation, Visual Auto-Regressive (VAR) modeling has garnered considerable attention for its innovative next-scale prediction approach. This paradigm offers substantial improvements in efficiency, scalability, and zero-shot generalization. Yet, the inherently coarse-to-fine nature of VAR introduces a prolonged token sequence, leading to prohibitive memory consumption and computational redundancies. To address these bottlenecks, we propose Collaborative Decoding (CoDe), a novel efficient decoding strategy tailored for the VAR framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. Based on these insights, we partition the multi-scale inference process into a seamless collaboration between a large model and a small model. The large model serves as the 'drafter', specializing in generating low-frequency content at smaller scales, while the smaller model serves as the 'refiner', solely focusing on predicting high-frequency details at larger scales. This collaboration yields remarkable efficiency with minimal impact on quality: CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98. When drafting steps are further decreased, CoDe can achieve an impressive 2.9x acceleration ratio, reaching 41 images/s at 256x256 resolution on a single NVIDIA 4090 GPU, while preserving a commendable FID of 2.27. The code is available at https://github.com/czg1225/CoDe
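
The speed/quality trade-off reported in the abstract comes from how many early scales the large drafter handles before handing off to the refiner. Continuing the hypothetical sketch above (the specific draft_steps values here are illustrative assumptions, not the paper's exact configurations):

    # More drafting steps: the large model shapes more of the image,
    # favoring quality (the ~1.7x / FID 1.98 regime from the abstract).
    maps_quality = code_generate(drafter, refiner, draft_steps=8)

    # Fewer drafting steps: the small refiner takes over earlier,
    # favoring speed (the ~2.9x / FID 2.27 regime from the abstract).
    maps_speed = code_generate(drafter, refiner, draft_steps=4)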