Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation

Enshu Liu, Qian Chen, Xuefei Ning, Shengen Yan, Guohao Dai, Zinan Lin, Yu Wang

2025-10-28

Summary

This paper introduces a new technique called Distilled Decoding 2 (DD2) to speed up the process of generating images using a type of AI model called image auto-regressive (AR) models.

What's the problem?

Image AR models are really good at creating realistic images, but they're slow because they generate images bit by bit, requiring many steps. A previous attempt to speed things up, called Distilled Decoding 1 (DD1), helped, but it still lost some image quality when trying to generate images in just one step and wasn't very adaptable to different situations.

What's the solution?

The researchers developed DD2, which learns to predict what the image should look like at each step without relying on the fixed, pre-defined mapping that DD1 used. They train a one-step generator to mimic the 'thinking' of the original, slower AR model: at each token position, the generator is pushed toward the best possible outcome given what's already been generated. This teaching process uses a technique called 'conditional score distillation', where the original AR model supplies the ground-truth conditional score at each position and a separate network estimates the score of the generator's own outputs, so the two can be compared during training.
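To make the idea concrete, here is a minimal numpy sketch of the score-distillation-style update direction described above. This is an illustrative assumption, not the paper's actual implementation: the function name and shapes are hypothetical, and in DD2 the two scores would come from the teacher AR model and a separately trained score network rather than random arrays.

```python
import numpy as np

def score_distillation_direction(teacher_score, generator_score):
    """Hypothetical sketch of a score-distillation update direction.

    teacher_score:   ground-truth conditional score from the teacher AR model,
                     one row per token position (shape [T, D]).
    generator_score: estimated score of the one-step generator's distribution
                     at the same positions, from a separate score network.

    In score-distillation-style objectives, generated samples are nudged
    along the difference of the two scores, per token position, so the
    generator's distribution moves toward the teacher's.
    """
    return generator_score - teacher_score

# Toy usage: 4 token positions with 8-dimensional latent embeddings.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))      # stand-in for teacher scores
generator = rng.normal(size=(4, 8))    # stand-in for generator scores
direction = score_distillation_direction(teacher, generator)
print(direction.shape)  # (4, 8): one update direction per token embedding
```

The key point the sketch captures is that the comparison happens independently at every token position, conditioned on the tokens before it, which is what makes the distillation "conditional".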

Why it matters?

DD2 significantly speeds up image generation with AR models, allowing for nearly the same image quality as the original slow method but with a much faster one-step process. This is a big step towards making these powerful image generation models more practical for real-world applications where speed is important, and it opens the door for even faster and better image creation in the future.

Abstract

Image Auto-regressive (AR) models have emerged as a powerful paradigm of visual generative models. Despite their promising performance, they suffer from slow generation speed due to the large number of sampling steps required. Although Distilled Decoding 1 (DD1) was recently proposed to enable few-step sampling for image AR models, it still incurs significant performance degradation in the one-step setting, and relies on a pre-defined mapping that limits its flexibility. In this work, we propose a new method, Distilled Decoding 2 (DD2), to further advance the feasibility of one-step sampling for image AR models. Unlike DD1, DD2 does not rely on a pre-defined mapping. We view the original AR model as a teacher model which provides the ground truth conditional score in the latent embedding space at each token position. Based on this, we propose a novel conditional score distillation loss to train a one-step generator. Specifically, we train a separate network to predict the conditional score of the generated distribution and apply score distillation at every token position conditioned on previous tokens. Experimental results show that DD2 enables one-step sampling for image AR models with a minimal FID increase from 3.40 to 5.43 on ImageNet-256. Compared to the strongest baseline DD1, DD2 reduces the gap between the one-step sampling and original AR model by 67%, with up to 12.3x training speed-up simultaneously. DD2 takes a significant step toward the goal of one-step AR generation, opening up new possibilities for fast and high-quality AR modeling. Code is available at https://github.com/imagination-research/Distilled-Decoding-2.