One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Dogyun Park, Anil Kag, Michael Vasilkovsky, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin

2026-03-13

Summary

This paper introduces the Elastic Latent Interface Transformer (ELIT), a technique that makes diffusion transformers more efficient and flexible when generating images.

What's the problem?

Current diffusion transformers produce high-quality images, but they have two key issues. First, the computing power they need is tied directly to image resolution, so there is no principled way to trade speed for quality. Second, they spend the same amount of computation on every part of an image, even though some regions matter far more than others for the overall picture, wasting processing power on less important details.

What's the solution?

ELIT addresses these problems by inserting a 'latent interface', a flexible, variable-length set of tokens, between the image's spatial tokens and the transformer blocks. Lightweight Read and Write cross-attention layers move information between the spatial tokens and these latents. During training, the tail of the latent sequence is randomly dropped, which teaches the model to order the latents by importance: early latents capture the overall structure, and later ones refine the details. At inference, the number of latents can be adjusted to match the available computing power, trading speed for quality without changing the core model.
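The read/process/write pattern above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: it uses a single shared cross-attention head with random weights, omits the DiT blocks that would process the latents, and the token counts are made up.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, wq, wk, wv):
    # Single-head cross-attention: `queries` gather information from `context`.
    q, k, v = queries @ wq, context @ wk, context @ wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
d = 16                                   # hypothetical model width
n_spatial, n_latents = 64, 32            # hypothetical token counts
spatial = rng.standard_normal((n_spatial, d))
latents = rng.standard_normal((n_latents, d))   # learnable in the real model
wq, wk, wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))

# Tail drop: keep only a random prefix of the latents, so the model
# learns to put coarse, global information in the early latents.
keep = rng.integers(1, n_latents + 1)
z = latents[:keep]

# Read: the kept latents attend to the spatial tokens.
z = cross_attention(z, spatial, wq, wk, wv)
# ... here the transformer blocks would operate on z, whose length,
# not the image resolution, now sets the compute cost ...
# Write: spatial tokens read the refined information back from the latents.
out = cross_attention(spatial, z, wq, wk, wv)
assert out.shape == spatial.shape
```

At inference, `keep` would simply be set by the compute budget instead of sampled randomly.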

Why it matters?

This work is significant because it makes diffusion transformers more practical for real-world applications. By decoupling image size from computing cost and concentrating resources on important regions, ELIT enables faster generation and higher-quality images, especially when computing resources are limited. The improvements in metrics like FID and FDD demonstrate a clear advance in image generation performance.
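A back-of-the-envelope calculation shows why decoupling compute from resolution matters. Self-attention cost grows quadratically with sequence length, so running the transformer blocks over a shorter latent sequence instead of the full spatial token grid cuts attention cost by the square of the ratio. All sizes below are hypothetical and the cost model ignores projections.

```python
def attn_flops(n_tokens, dim):
    # Rough cost of one self-attention layer: the score matrix (n^2 * d)
    # plus the attention-weighted sum (n^2 * d); projections omitted.
    return 2 * n_tokens ** 2 * dim

dim = 768                        # hypothetical model width
n_spatial = (512 // 16) ** 2     # 1024 patch tokens at 512px with patch size 16
spatial_cost = attn_flops(n_spatial, dim)

# With a latent interface, the blocks attend over the latents, so the
# cost is set by the latent count, independent of image resolution.
for n_latents in (256, 512, 1024):
    print(n_latents, attn_flops(n_latents, dim) / spatial_cost)
```

Halving the latent count quarters the attention cost, which is the latency-quality knob the summary describes.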

Abstract

Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting computation on unimportant regions. We introduce Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations: earlier latents capture global structure while later ones refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, ELIT improves FID and FDD by an average of 35.3% and 39.6%, respectively. Project page: https://snap-research.github.io/elit/