Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling

Erik Riise, Mehmet Onurcan Kaya, Dim P. Papadopoulos

2025-10-21

Summary

This paper investigates why techniques that speed up Large Language Models don't work as well for creating images, and finds a way to significantly improve image generation speed and quality using a specific type of model.

What's the problem?

Large Language Models have gotten much better at generating text by cleverly 'searching' through possible outputs at inference time, but applying the same ideas to image generation hasn't been successful. In fact, simply picking options randomly often works better than these search methods with current technology like diffusion models. The core issue is that diffusion models generate images in a continuous space, which makes it hard to efficiently narrow down the possibilities during a search.

What's the solution?

The researchers focused on a different kind of image generation model called a visual autoregressive model, which builds images step-by-step from a discrete vocabulary of tokens. They used a technique called 'beam search' – essentially keeping only the most promising partial images at each step as the image is generated – and found it dramatically improved the quality of the results. With beam search, a smaller, 2 billion parameter autoregressive model actually outperformed a much larger, 12 billion parameter diffusion model. They also analyzed *why* this worked, discovering that the discrete nature of the model allows the computer to quickly discard bad options early and reuse earlier calculations.
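To make the beam search idea concrete, here is a minimal generic sketch, not the paper's implementation: the `expand` and `score` callables are hypothetical stand-ins for the autoregressive model proposing next tokens and a verifier scoring partial sequences.

```python
import heapq

def beam_search(initial, expand, score, beam_width=4, steps=3):
    """Generic beam search over discrete token sequences.

    initial:    starting sequence (a tuple of tokens)
    expand:     fn(seq) -> iterable of candidate next tokens
                (stand-in for the autoregressive model's proposals)
    score:      fn(seq) -> float, higher is better
                (stand-in for a verifier scoring partial images)
    beam_width: how many partial sequences survive each step
    """
    beams = [initial]
    for _ in range(steps):
        # Extend every surviving sequence with every candidate token.
        candidates = [seq + (tok,) for seq in beams for tok in expand(seq)]
        # Early pruning: keep only the top-scoring partial sequences,
        # so bad branches are discarded before any further computation.
        beams = heapq.nlargest(beam_width, candidates, key=score)
    return beams

# Toy usage: tokens are the digits 0-2, and the "verifier" just sums them,
# so the best length-3 sequence is (2, 2, 2).
best = beam_search((), lambda s: range(3), sum, beam_width=2, steps=3)
print(best[0])  # → (2, 2, 2)
```

The discrete token space is what makes the pruning step cheap: each candidate is a short sequence of indices, so low-scoring branches can be dropped immediately and the prefix computations of surviving branches reused, which is much harder to do in a diffusion model's continuous latent space.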

Why it matters?

This work shows that the *way* an image generation model is built is just as important as its size. It suggests that focusing on model architecture, specifically using discrete, step-by-step approaches, is key to making image generation faster and better, and that simply scaling up existing continuous models might not be the most effective path forward.

Abstract

While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best. We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. Systematic ablations show that this advantage comes from the discrete token space, which allows early pruning and computational reuse, and our verifier analysis highlights trade-offs between speed and reasoning capability. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.