Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation
Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, Xiaojuan Qi
2025-07-14
Summary
This paper proposes using vision foundation models as the basis for an image tokenizer, a component that breaks images down into meaningful units called tokens, enabling autoregressive models to generate images more accurately and efficiently.
What's the problem?
Current methods of turning images into tokens for generative models can discard important visual details or require too many tokens, which hurts both the quality of the generated images and the efficiency of class-conditional generation.
What's the solution?
The researchers designed a new image tokenizer built on pre-trained vision foundation models, which captures visual information more faithfully while producing fewer, more semantically meaningful tokens. This improves both image reconstruction and generation, especially when the model must create images conditioned on specific categories.
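To make the idea concrete, here is a minimal sketch of the core mechanism such a tokenizer relies on: patch features from a frozen vision backbone are mapped to discrete token ids by nearest-neighbor lookup in a learned codebook. The function name, array shapes, and codebook size below are illustrative assumptions, not the paper's actual implementation, and random vectors stand in for real vision-foundation-model features.

```python
import numpy as np

def quantize_features(features, codebook):
    """Map each patch feature to the index of its nearest codebook entry.

    features: (num_patches, dim) array standing in for frozen
        vision-foundation-model patch embeddings (hypothetical).
    codebook: (codebook_size, dim) learned embedding table (hypothetical).
    Returns integer token ids of shape (num_patches,).
    """
    # Squared Euclidean distance between every feature and every code,
    # computed via broadcasting: (num_patches, codebook_size, dim) -> sum over dim.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # Each patch becomes the id of its closest code.
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
features = rng.normal(size=(16, 8))  # 16 patches, 8-dim features (toy sizes)
codebook = rng.normal(size=(4, 8))   # tiny 4-entry codebook for illustration
tokens = quantize_features(features, codebook)
print(tokens.shape)  # one discrete token per patch
```

The resulting token sequence is what an autoregressive generator would then model, predicting one token id at a time; the paper's contribution is making these tokens derive from strong pre-trained visual representations rather than features learned from scratch.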
Why it matters?
This matters because it helps AI systems generate higher-quality images with less compute, making image generation faster, more precise, and more versatile for a range of creative and practical applications.
Abstract
A novel image tokenizer built on pre-trained vision foundation models improves image reconstruction, generation quality, and token efficiency, enhancing autoregressive generation and class-conditional synthesis.