ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality
Yefei He, Feng Chen, Yuanyu He, Shaoxuan He, Hong Zhou, Kaipeng Zhang, Bohan Zhuang
2024-12-06

Summary
This paper introduces ZipAR, a method that speeds up auto-regressive image generation by decoding several image tokens in parallel, making image creation faster and more efficient without changing or retraining the model.
What's the problem?
Generating images with existing auto-regressive models is slow because they produce one visual token at a time, requiring a separate model forward pass for every token in the image. This strict one-at-a-time ordering is inefficient, since parts of an image that are far apart have little influence on each other, yet each token must still wait for all previous tokens to be decoded.
What's the solution?
The authors introduce ZipAR, which exploits the observation that nearby regions of an image are more strongly related than distant ones. In addition to predicting the next token in the current row, the model decodes spatially adjacent tokens from neighboring rows in the same forward pass, turning next-token prediction into next-set prediction. This sharply reduces the number of sequential steps: ZipAR cuts the number of required model forward passes by up to 91% without any additional training.
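To make the decoding schedule concrete, here is a minimal Python sketch of a ZipAR-style schedule, written as my own reading of the idea rather than the authors' released implementation; the grid size, window value, and function name zipar_schedule are all hypothetical. Row 0 is decoded left to right as usual, and each later row may decode its next token once the row above it is at least a window's worth of tokens ahead, so several rows advance within one forward pass.

# Illustrative sketch of a ZipAR-style "next-set" decoding schedule.
# Tokens live on a height x width grid; standard AR decoding would need
# height * width forward passes (one token per pass).
def zipar_schedule(height, width, window):
    """Return a list of steps; each step is the set of (row, col) token
    positions that would be decoded together in one forward pass."""
    next_col = [0] * height              # next undecoded column in each row
    steps = []
    while any(c < width for c in next_col):
        step = []
        for row in range(height):
            col = next_col[row]
            if col >= width:
                continue                 # this row is already finished
            # Row 0 always proceeds; a later row proceeds only when the row
            # above has decoded at least `window` tokens beyond its position
            # (or has finished entirely near the end of the row).
            if row == 0 or next_col[row - 1] >= min(col + window, width):
                step.append((row, col))
        for row, col in step:            # commit the whole set at once
            next_col[row] = col + 1
        steps.append(step)
    return steps

if __name__ == "__main__":
    H, W, window = 32, 32, 4             # hypothetical grid and window size
    steps = zipar_schedule(H, W, window)
    baseline = H * W
    print(f"forward passes: {len(steps)} vs {baseline} "
          f"({1 - len(steps) / baseline:.0%} fewer)")

Under this toy schedule, a 32 x 32 token grid with a window of 4 finishes in roughly W + (H - 1) * window = 156 passes instead of 1,024, an 85% or so reduction; the 91% figure reported for Emu3-Gen reflects that model's own grid size and window setting.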
Why it matters?
This research matters because it makes auto-regressive image generation substantially faster without retraining or modifying existing models. Faster generation could benefit applications in animation, video games, and graphic design, where creators need to produce high-quality visuals quickly.
Abstract
In this paper, we propose ZipAR, a training-free, plug-and-play parallel decoding framework for accelerating auto-regressive (AR) visual generation. The motivation stems from the observation that images exhibit local structures, and spatially distant regions tend to have minimal interdependence. Given a partially decoded set of visual tokens, in addition to the original next-token prediction scheme in the row dimension, the tokens corresponding to spatially adjacent regions in the column dimension can be decoded in parallel, enabling the "next-set prediction" paradigm. By decoding multiple tokens simultaneously in a single forward pass, the number of forward passes required to generate an image is significantly reduced, resulting in a substantial improvement in generation efficiency. Experiments demonstrate that ZipAR can reduce the number of model forward passes by up to 91% on the Emu3-Gen model without requiring any additional retraining.