
One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models

Aleksandr Razin, Danil Kazantsev, Ilya Makarov

2025-11-14


Summary

This paper introduces a new method called the Latent Upscaler Adapter, or LUA, to improve how diffusion models create high-resolution images.

What's the problem?

Diffusion models are very good at generating images, but they struggle to produce images larger than the resolution they were trained on. Sampling directly at high resolution is slow and expensive. Alternatively, you can upscale a lower-resolution image *after* it's created (post-hoc super-resolution), but that often introduces visible artifacts and adds extra latency.

What's the solution?

The researchers developed LUA, a small adapter that plugs into an existing diffusion model. Instead of upscaling the final image, LUA upscales the image's 'latent code' – the compressed representation of the image – *before* it's decoded into a visible picture. This takes a single fast feed-forward pass and requires no changes to the base model and no extra diffusion steps. A shared backbone with scale-specific output heads handles both 2x and 4x upscaling, and the approach remains compatible with image-space super-resolution techniques.
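The scale-specific heads the paper describes use pixel shuffle (sub-pixel, depth-to-space) upsampling, a standard trick for turning extra channel depth into spatial resolution. A minimal numpy sketch of that rearrangement, with illustrative shapes (4 latent channels and a 2x head are assumptions for the example, not values from the paper's code):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Depth-to-space rearrangement: (C*r*r, H, W) -> (C, H*r, W*r).

    A scale-r head first produces C*r*r channels (via convolution, not
    shown here); pixel shuffle then interleaves those channels into an
    r-times-larger spatial grid. Separate 2x and 4x heads can share one
    backbone because only r differs.
    """
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)        # split the scale factors out of channels
    x = x.transpose(0, 3, 1, 4, 2)      # -> (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)   # interleave into the larger grid

# Toy example: a 2x head's output over a 64x64, 4-channel latent.
head_out = np.random.randn(4 * 2 * 2, 64, 64)
up = pixel_shuffle(head_out, 2)         # shape (4, 128, 128)
```

Each output pixel block of size r-by-r is filled from r*r consecutive input channels, so the operation is a pure reshuffle with no arithmetic, which keeps it cheap.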

Why it matters?

LUA is important because it lets diffusion models produce high-resolution images with perceptual quality comparable to pixel-space super-resolution while being nearly three times faster at the decoding-and-upscaling step. It also generalizes across the latent spaces of different VAEs, so it can be paired with various diffusion models without retraining from scratch for each new decoder, making high-quality, scalable image generation more practical.

Abstract

Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
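The abstract's key design point is *where* the upscaler sits: on the latent, before the single VAE decode, rather than on pixels after it. A toy numpy sketch of that pipeline placement, with stand-in functions (the real LUA is a learned Swin-style network and the real decoder is a trained VAE; the 4-channel latent and 8x decoder scale factor are typical latent-diffusion values assumed for illustration):

```python
import numpy as np

def upscale_latent(z, r=2):
    # Stand-in for LUA: a single feed-forward pass in latent space.
    # Nearest-neighbour repeat here; only the placement in the
    # pipeline, not the operation itself, reflects the paper.
    return z.repeat(r, axis=1).repeat(r, axis=2)

def vae_decode(z, f=8):
    # Stand-in for the VAE decoder: maps a (4, H, W) latent to a
    # (3, f*H, f*W) image. A toy channel slice plus repeat here.
    return z[:3].repeat(f, axis=1).repeat(f, axis=2)

# LUA pipeline: upscale the latent, then decode exactly once.
z = np.random.randn(4, 64, 64)            # latent for a 512 px generation (f=8)
img = vae_decode(upscale_latent(z, r=2))  # one decode -> 1024 px output
```

By contrast, pixel-space SR decodes first and then runs a super-resolution network over the full-resolution image, which is where the reported latency gap (+0.42 s vs 1.87 s with the same SwinIR architecture) comes from: the latent grid is f*f times smaller than the image.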