BitDance: Scaling Autoregressive Generative Models with Binary Tokens
Yuang Ai, Jiaming Han, Shaobin Zhuang, Weijia Mao, Xuefeng Hu, Ziyan Yang, Zhenheng Yang, Huaibo Huang, Xiangyu Yue, Hao Chen
2026-02-17
Summary
This paper introduces BitDance, a new way to create images with an autoregressive model: a program that builds an image up piece by piece, like writing a sentence word by word. It focuses on making this process faster and more efficient while still producing high-quality images.
What's the problem?
Existing methods for generating images in this piece-by-piece way struggle with speed and efficiency. They typically represent image parts using a limited set of options (a small codebook), which caps the detail and realism they can capture. Simply offering far more options creates a new problem: choosing among a huge number of possibilities for each piece is computationally expensive, slowing down generation. As a result, previous models were either slow or required a massive number of calculations to achieve good results.
What's the solution?
BitDance tackles these problems by representing each image part with a 256-bit binary code, which can take up to 2^256 distinct values, a compact yet highly expressive representation. Picking one value out of such an enormous space with a standard classifier is impractical, so BitDance instead uses a technique called 'binary diffusion': a small diffusion process, which works like gradually refining a blurry guess until it becomes clear, generates the bits of each token directly. The authors also developed 'next-patch diffusion', a decoding method that predicts a whole patch of image parts at once rather than one at a time, significantly speeding up image creation. Together, these changes let BitDance generate images with far fewer calculations than other models.
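To make the binary diffusion idea concrete, here is a minimal sketch, not the paper's actual implementation: a denoising loop runs in continuous space over a 256-dimensional latent, conditioned on the autoregressive model's hidden state, and the final latent is thresholded into bits. The `denoiser` interface, the linear noise schedule, and the step count are all illustrative assumptions.

```python
import torch

@torch.no_grad()
def sample_binary_token(denoiser, cond, dim=256, steps=20):
    """Toy sketch of a binary diffusion head (not BitDance's code).

    Runs a continuous-space denoising loop over a latent in R^dim,
    conditioned on the AR model's hidden state `cond`, then thresholds
    the final latent into a 256-bit token.
    `denoiser(x_t, t, cond) -> x0_hat` is an assumed interface that
    predicts the clean latent from the noisy one.
    """
    b = cond.shape[0]
    x = torch.randn(b, dim)                       # start from pure noise
    for i in reversed(range(1, steps + 1)):
        t = torch.full((b, 1), i / steps)         # current noise level in (0, 1]
        x0_hat = denoiser(x, t, cond)             # predict the clean latent
        sigma = (i - 1) / steps                   # toy linear schedule; 0 at the last step
        x = x0_hat + sigma * torch.randn_like(x)  # re-noise to the next level
    return (x > 0).to(torch.uint8)                # sign-threshold: 256 bits per token

# Usage with a stand-in denoiser (a real one would be a small network):
denoiser = lambda x, t, cond: torch.tanh(x + cond)  # placeholder dynamics
cond = torch.randn(4, 256)                           # 4 positions, hidden size 256
bits = sample_binary_token(denoiser, cond)           # shape (4, 256), values in {0, 1}
```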
Why it matters?
BitDance is important because it achieves state-of-the-art image quality while being much faster and using fewer resources than previous methods. This means it's a step towards making high-quality image generation more accessible and practical. It also shows a promising direction for future research in building powerful 'foundation models' for image creation, and the code is publicly available for others to build upon.
Abstract
We present BitDance, a scalable autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices. With high-entropy binary latents, BitDance lets each token represent up to 2^{256} states, yielding a compact yet highly expressive discrete representation. Sampling from such a huge token space is difficult with standard classification. To resolve this, BitDance uses a binary diffusion head: instead of predicting an index with softmax, it employs continuous-space diffusion to generate the binary tokens. Furthermore, we propose next-patch diffusion, a new decoding method that predicts multiple tokens in parallel with high accuracy, greatly speeding up inference. On ImageNet 256x256, BitDance achieves an FID of 1.24, the best among AR models. With next-patch diffusion, BitDance beats state-of-the-art parallel AR models that use 1.4B parameters, while using 5.4x fewer parameters (260M) and achieving 8.7x speedup. For text-to-image generation, BitDance trains on large-scale multimodal tokens and generates high-resolution, photorealistic images efficiently, showing strong performance and favorable scaling. When generating 1024x1024 images, BitDance achieves a speedup of over 30x compared to prior AR models. We release code and models to facilitate further research on AR foundation models. Code and models are available at: https://github.com/shallowdream204/BitDance.
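For intuition on where the reported speedups come from, the sketch below counts sequential model invocations under next-patch-style decoding: predicting a patch of K tokens per autoregressive step needs n/K steps instead of n. The `transformer` and `diffusion_head` interfaces and the patch size are assumptions for illustration, not BitDance's released API.

```python
import torch

@torch.no_grad()
def generate_image_tokens(transformer, diffusion_head, n_tokens=1024, patch=16):
    """Hypothetical next-patch decoding loop (illustrative, not BitDance's code).

    Standard AR decoding needs `n_tokens` sequential steps; predicting a
    block of `patch` tokens per step cuts that to n_tokens / patch.
    """
    tokens = []                                   # generated (batch, patch, 256) bit blocks
    for _ in range(n_tokens // patch):
        # one transformer pass yields hidden states for the next `patch` positions
        hidden = transformer(tokens)              # assumed shape: (batch, patch, d_model)
        # the diffusion head then denoises all `patch` tokens in parallel
        bits = diffusion_head(hidden)             # assumed shape: (batch, patch, 256), in {0, 1}
        tokens.append(bits)
    return torch.cat(tokens, dim=1)               # (batch, n_tokens, 256)

# Stand-in modules so the sketch runs end to end:
transformer = lambda toks: torch.randn(1, 16, 64)
diffusion_head = lambda h: (torch.randn(h.shape[0], h.shape[1], 256) > 0).to(torch.uint8)
out = generate_image_tokens(transformer, diffusion_head)
```

With 1024 tokens and a patch size of 16, the loop above takes 64 sequential steps instead of 1024, which is the kind of reduction behind the speedups reported in the abstract.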