Multi-Scale Local Speculative Decoding for Image Generation
Elia Peruzzo, Guillaume Sautière, Amirhossein Habibian
2026-01-09
Summary
This paper introduces Multi-Scale Local Speculative Decoding (MuLo-SD), a method that speeds up image generation with autoregressive models, a class of AI models that build an image one token at a time.
What's the problem?
Autoregressive models are very good at generating images, but they work step by step, which makes them slow. Speculative decoding tries to speed this up by guessing parts of the image ahead of time and then checking those guesses. Current methods struggle because it is often ambiguous whether an individual guessed token should count as right or wrong, and they ignore the spatial structure of the image when deciding what to accept and what to redo.
What's the solution?
MuLo-SD first creates a rough, low-resolution draft of the image and upscales it with learned up-samplers. A more detailed, high-resolution model then checks the upscaled draft in parallel. Where it finds errors, MuLo-SD does not regenerate everything that follows; it resamples only the areas around the mistakes, using the image's local structure.
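The local correction step described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the square neighborhood shape, and the `radius` parameter are assumptions made for illustration. It contrasts how many tokens must be resampled when only the neighborhood of a rejected token is redone, versus the raster-scan alternative of redoing everything after the first rejection.

```python
from typing import List

Grid = List[List[bool]]


def local_resample_mask(rejected: Grid, radius: int = 1) -> Grid:
    """Mark every token within `radius` rows/columns of a rejected token.

    Hypothetical helper: the square neighborhood is an assumption.
    """
    h, w = len(rejected), len(rejected[0])
    mask = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if not rejected[r][c]:
                continue
            # Expand the rejection to its spatial neighborhood, clipped to the grid.
            for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                    mask[rr][cc] = True
    return mask


def raster_resample_mask(rejected: Grid) -> Grid:
    """Baseline: resample every token from the first rejection onward in raster order."""
    h, w = len(rejected), len(rejected[0])
    flat = [v for row in rejected for v in row]
    first = flat.index(True) if True in flat else len(flat)
    return [[(r * w + c) >= first for c in range(w)] for r in range(h)]


def count(mask: Grid) -> int:
    """Number of tokens flagged for resampling."""
    return sum(v for row in mask for v in row)


# A 6x6 token grid with a single rejected token at row 1, column 1.
rejected = [[False] * 6 for _ in range(6)]
rejected[1][1] = True

local = local_resample_mask(rejected, radius=1)
raster = raster_resample_mask(rejected)
print(count(local), count(raster))  # prints "9 29"
```

In this toy case a single rejection near the top of the grid forces 29 of 36 tokens to be redone under raster-scan resampling, but only 9 under the local scheme.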
Why does it matter?
This research matters because it speeds up AI image generation by up to 1.7 times, outpacing existing speculative decoding methods, while keeping image quality comparable. High-quality images can therefore be produced more efficiently, which is relevant to applications such as art, design, and virtual reality.
Abstract
Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups of up to 1.7×, outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
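For readers unfamiliar with the verification step, the sketch below shows the generic speculative-sampling acceptance test, in which a drafted token is accepted with probability min(1, p/q), where p and q are the probabilities the target and draft models assign to it. This is the standard rule from the speculative decoding literature, not necessarily the exact criterion MuLo-SD uses (the abstract mentions probability pooling, which is not modeled here); the function name and the example probabilities are hypothetical.

```python
import random
from typing import List


def accept_draft_tokens(p_target: List[float], q_draft: List[float],
                        rng: random.Random) -> List[bool]:
    """Per-token accept flags under the generic rule: accept with prob min(1, p/q).

    Assumes every q is > 0, since the drafter only proposes tokens it can sample.
    """
    flags = []
    for p, q in zip(p_target, q_draft):
        ratio = min(1.0, p / q)
        # rng.random() lies in [0, 1), so ratio == 1.0 always accepts
        # and ratio == 0.0 never accepts.
        flags.append(rng.random() < ratio)
    return flags


# Example probabilities chosen so the outcome does not depend on the random draw.
p = [0.5, 0.0, 0.2]  # target-model probabilities of the drafted tokens
q = [0.4, 0.3, 0.2]  # draft-model probabilities of the same tokens
flags = accept_draft_tokens(p, q, random.Random(0))
print(flags)  # prints "[True, False, True]"
```

Because each position is tested independently, all drafted tokens can be checked in a single parallel pass of the target model; the rejected positions are the ones a local resampling step would then revisit.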