Set Block Decoding is a Language Model Inference Accelerator

Itai Gat, Heli Ben-Hamu, Marton Havasi, Daniel Haziza, Jeremy Reizenstein, Gabriel Synnaeve, David Lopez-Paz, Brian Karrer, Yaron Lipman

2025-09-08

Summary

This paper introduces a new method called Set Block Decoding (SBD) to make large language models, which are good at predicting the next word in a sentence, faster and more efficient when generating text.

What's the problem?

Large language models are powerful, but running them takes a lot of computing power and memory, especially when they generate long pieces of text. Producing the output one word at a time, a process called decoding, is particularly slow and resource-intensive.

What's the solution?

The researchers came up with SBD, which lets the model predict several future words *at the same time* instead of one after another. It combines two existing techniques, predicting the next word and filling in masked (hidden) words, within a single model. Because the predicted words don't have to be consecutive, the model can use advanced solvers borrowed from discrete diffusion research to speed things up without losing the quality of the generated text. Importantly, SBD doesn't require changing the model's architecture or adding new training hyperparameters; it can be applied to existing models with a small amount of extra fine-tuning.
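To get a feel for where the savings come from, here is a toy back-of-the-envelope sketch (not the paper's code) comparing forward-pass counts: standard next-token prediction spends one forward pass per generated token, while a block decoder that unmasks several positions per pass needs far fewer. The function names and the block-size and passes-per-block numbers are illustrative assumptions, not values from the paper.

```python
def ntp_passes(num_new_tokens: int) -> int:
    # Standard next-token prediction: one forward pass per generated token.
    return num_new_tokens


def sbd_passes(num_new_tokens: int, block_size: int, passes_per_block: int) -> int:
    # Schematic block decoding: each block of `block_size` future tokens is
    # produced in `passes_per_block` forward passes by unmasking several
    # (not necessarily consecutive) positions at once.
    num_blocks = -(-num_new_tokens // block_size)  # ceiling division
    return num_blocks * passes_per_block


if __name__ == "__main__":
    n = 256  # tokens to generate (illustrative)
    baseline = ntp_passes(n)
    accelerated = sbd_passes(n, block_size=16, passes_per_block=4)
    print(baseline, accelerated, baseline / accelerated)  # -> 256 64 4.0
```

With these made-up settings the block decoder uses 4x fewer forward passes, which is in the same ballpark as the 3-5x reduction the paper reports; the hard part, which this sketch ignores entirely, is training the model so that quality does not drop when multiple tokens are sampled per pass.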

Why it matters?

This is important because it makes these powerful language models more practical to use. By cutting the number of forward passes needed for generation by 3 to 5 times, SBD could allow more people to access and benefit from these technologies, and it could make them feasible in situations where speed and efficiency are critical.

Abstract

Autoregressive next token prediction language models offer powerful capabilities but face significant challenges in practical deployment due to the high computational and memory costs of inference, particularly during the decoding stage. We introduce Set Block Decoding (SBD), a simple and flexible paradigm that accelerates generation by integrating standard next token prediction (NTP) and masked token prediction (MATP) within a single architecture. SBD allows the model to sample multiple, not necessarily consecutive, future tokens in parallel, a key distinction from previous acceleration methods. This flexibility allows the use of advanced solvers from the discrete diffusion literature, offering significant speedups without sacrificing accuracy. SBD requires no architectural changes or extra training hyperparameters, maintains compatibility with exact KV-caching, and can be implemented by fine-tuning existing next token prediction models. By fine-tuning Llama-3.1 8B and Qwen-3 8B, we demonstrate that SBD enables a 3-5x reduction in the number of forward passes required for generation while achieving same performance as equivalent NTP training.