Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
2025-02-28
Summary
This paper introduces a new way for AI to generate images, called xAR. It's like teaching a computer to paint a picture using different sized brush strokes instead of just tiny dots.
What's the problem?
Current AI models that create images have two main issues. First, they predict one small piece (or 'token') of the image at a time, which isn't always the best way to handle 2D images. Second, because of how they're trained (a technique called teacher forcing), small mistakes can pile up while generating, a problem known as 'exposure bias'.
What's the solution?
The researchers created xAR, which lets the AI predict different sized pieces of the image: individual patches, groups of neighboring patches, coarse-to-fine scales, or even the entire image. They also changed how the AI learns, using a method called 'Noisy Context Learning' that trains on imperfect context so errors don't build up at generation time. This makes the AI more flexible and accurate in creating images.
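As a rough illustration (not the paper's actual code), one of xAR's prediction units is a "cell": a k×k group of neighboring patches predicted together instead of one patch at a time. The sketch below shows one assumed way to regroup a grid of patch embeddings into such cells; the function name and shapes are hypothetical.

```python
import numpy as np

def patches_to_cells(patches, k):
    """Group an (H, W, D) grid of patch embeddings into k x k cells.

    Returns an array of shape (H//k * W//k, k*k, D): one row per cell,
    with the k*k member patches flattened in row-major order.
    """
    H, W, D = patches.shape
    assert H % k == 0 and W % k == 0, "grid must divide evenly into cells"
    cells = patches.reshape(H // k, k, W // k, k, D)
    cells = cells.transpose(0, 2, 1, 3, 4)   # -> (H/k, W/k, k, k, D)
    return cells.reshape(-1, k * k, D)

# Toy 4x4 grid of 2-dim patch embeddings, grouped into 2x2 cells.
grid = np.arange(4 * 4 * 2, dtype=np.float32).reshape(4, 4, 2)
cells = patches_to_cells(grid, k=2)
print(cells.shape)  # (4, 4, 2): four cells, each holding 2x2 patches
```

Each autoregressive step would then predict one whole cell, giving the model a larger, spatially coherent unit of context than a single patch.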
Why it matters?
This matters because xAR can create high-quality images much faster than current methods. It's also more efficient, using less computer power to get better results. This could lead to improvements in many areas that use AI-generated images, like creating art, designing products, or even helping with medical imaging. The speed and quality improvements could make it easier and cheaper to use AI image generation in real-world applications.
Abstract
Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a "token" is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a k×k grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On the ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20× faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2× faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling.
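To make the abstract's core idea concrete, here is a minimal, hedged sketch of one flow-matching training step with Noisy Context Learning, as described above: the model regresses the velocity that carries noise toward the ground-truth entity, while conditioning on perturbed (rather than ground-truth) previous entities. All names, shapes, and the noise scale are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_step(model, context_entities, target, rng):
    """One sketched AR training step: regress the flow-matching velocity.

    context_entities: previously generated entities (patches/cells/etc.)
    target: the ground-truth entity for the current step.
    """
    t = rng.uniform()                          # random time on the noise->data path
    noise = rng.standard_normal(target.shape)
    x_t = (1 - t) * noise + t * target         # linear interpolation point
    velocity_target = target - noise           # velocity of the linear path
    # Noisy Context Learning (sketch): condition on perturbed context
    # instead of teacher-forced ground truth, so inference-time errors
    # resemble what the model saw in training.
    noisy_context = [e + 0.1 * rng.standard_normal(e.shape)
                     for e in context_entities]
    pred = model(noisy_context, x_t, t)
    return np.mean((pred - velocity_target) ** 2)  # regression (MSE) loss

# Toy stand-in "model" that ignores its inputs, just to run the sketch.
dummy = lambda ctx, x, t: np.zeros_like(x)
loss = flow_matching_step(dummy, [np.ones(8)], np.ones(8), rng)
print(loss >= 0.0)  # prints True
```

The key contrast with standard teacher forcing is the `noisy_context` line: training never assumes perfect previous predictions, which is how the paper argues exposure bias is alleviated.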