End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

Wenda Chu, Bingliang Zhang, Jiaqi Han, Yizhuo Li, Linjie Yang, Yisong Yue, Qiushan Guo

2026-05-04

Summary

This paper focuses on improving how computers understand and generate images using autoregressive modeling, a method that predicts each part of an image from the parts it has already produced.
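
To make "predicting the next part" concrete, here is a minimal PyTorch sketch of autoregressive modeling over a 1D sequence of image tokens. This is not the paper's architecture; the model size, vocabulary, and sequence length are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 1024 possible tokens, 32 tokens per image.
vocab_size, seq_len, dim = 1024, 32, 256

embed = nn.Embedding(vocab_size, dim)
layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(dim, vocab_size)

# A tokenized image: one row of discrete token ids.
tokens = torch.randint(0, vocab_size, (1, seq_len))

# Causal mask: position i may only attend to positions <= i,
# which is what "based on what it has already produced" means.
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

h = backbone(embed(tokens), mask=mask)
logits = head(h)  # logits[:, i] is the prediction for token i+1

# Next-token loss: compare each prediction against the following token.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
```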

What's the problem?

Traditionally, building these image-generating systems involved two separate steps: first, creating a way to break down an image into smaller, manageable pieces (like tokens in language), and then training a model to put those pieces back together and create new images. This two-step process wasn't ideal because the two parts weren't optimized to work *together*.
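
As a rough illustration of that two-step process, the toy PyTorch sketch below first trains a tokenizer on reconstruction alone, then freezes it and trains a generator on top of its latents. The `Tokenizer` class and the stand-in generation loss are invented for illustration only; the point to notice is that in stage two, no gradient ever flows back into the tokenizer.

```python
import torch
import torch.nn as nn

class Tokenizer(nn.Module):
    """Toy encoder/decoder over flattened 8x8 'images' (illustrative only)."""
    def __init__(self, dim=16):
        super().__init__()
        self.enc = nn.Linear(64, dim)
        self.dec = nn.Linear(dim, 64)

    def forward(self, x):
        z = self.enc(x)           # compress the image into latents
        return z, self.dec(z)     # latents plus the reconstructed image

tokenizer = Tokenizer()
images = torch.randn(8, 64)

# Stage 1: train the tokenizer by itself on reconstruction.
opt1 = torch.optim.Adam(tokenizer.parameters())
z, recon = tokenizer(images)
nn.functional.mse_loss(recon, images).backward()
opt1.step()

# Stage 2: freeze the tokenizer, then train a generator on its latents.
for p in tokenizer.parameters():
    p.requires_grad_(False)       # the generator's loss can no longer improve the tokenizer

generator = nn.Linear(16, 16)     # stand-in for a real autoregressive model
opt2 = torch.optim.Adam(generator.parameters())
z, _ = tokenizer(images)
gen_loss = nn.functional.mse_loss(generator(z), z.detach())  # stand-in generation objective
gen_loss.backward()
opt2.step()
```

The frozen tokenizer is exactly the mismatch described above: it was optimized for reconstruction, with no signal about what the generator actually needs.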

What's the solution?

The researchers developed a new system where both the image breakdown process (the tokenizer) and the image generation process are trained *at the same time*. This allows the system to learn the best way to represent images for generating new ones. They also explored using existing, powerful image understanding models to help improve the initial breakdown of images into tokens.
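
Continuing the toy setup from the previous sketch (reusing its `Tokenizer` class and `images` tensor), here is a hedged sketch of what training both parts at the same time could look like: the reconstruction and generation losses are summed and backpropagated through shared parameters, so the generation objective now shapes the tokenizer too. The weight `lam` is an illustrative assumption, not a value from the paper.

```python
tokenizer = Tokenizer()
generator = nn.Linear(16, 16)

# One optimizer over BOTH modules: the whole pipeline trains jointly.
opt = torch.optim.Adam(
    list(tokenizer.parameters()) + list(generator.parameters())
)

z, recon = tokenizer(images)
recon_loss = nn.functional.mse_loss(recon, images)            # reconstruction term
gen_loss = nn.functional.mse_loss(generator(z), z.detach())   # stand-in generation term

lam = 1.0  # illustrative loss weight
(recon_loss + lam * gen_loss).backward()  # gradients from both terms reach the tokenizer
opt.step()
```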

Why it matters?

This research is important because it leads to better image generation. Their system achieved a state-of-the-art FID score of 1.48 on the standard ImageNet 256x256 benchmark, meaning it can create more realistic and detailed images than previous methods, and it does so without needing extra guidance tricks during the generation process.

Abstract

Autoregressive image modeling relies on visual tokenizers to compress images into compact latent representations. We design an end-to-end training pipeline that jointly optimizes reconstruction and generation, enabling direct supervision from generation results to the tokenizer. This contrasts with prior two-stage approaches that train tokenizers and generative models separately. We further investigate leveraging vision foundation models to improve 1D tokenizers for autoregressive modeling. Our autoregressive generative model achieves strong empirical results, including a state-of-the-art FID score of 1.48 without guidance on ImageNet 256x256 generation.