Q-Sched: Pushing the Boundaries of Few-Step Diffusion Models with Quantization-Aware Scheduling

Natalia Frumkin, Diana Marculescu

2025-09-10

Summary

This paper introduces Q-Sched, a method that makes generating images from text with AI much faster and cheaper without sacrificing quality.

What's the problem?

Generating images from text with current AI models is computationally demanding: a model like Stable Diffusion XL must run a multi-billion-parameter network dozens of times to produce a single image. Techniques like 'few-step diffusion' cut the number of steps, but these models still rely on large, uncompressed backbones. Existing methods for shrinking the models further (quantization) typically require running the full-precision model on a lot of high-quality calibration data, which is itself expensive.

What's the solution?

Instead of modifying the AI model's weights, Q-Sched changes *how* the image is created. It adjusts the scheduler, the component that steers the model from random noise to a final image over a few steps, by learning small pre-conditioning coefficients that correct for quantization error along the sampling trajectory. To learn these coefficients, the authors propose the JAQ loss, which combines a text-image compatibility score with an image quality metric. JAQ is reference-free, needs only a handful of calibration prompts, and avoids running the expensive full-precision model during calibration.
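The idea can be illustrated with a toy sketch. Everything here is a hypothetical stand-in (the function names, the fake denoiser, and the loss are not from the paper): a quantized denoiser introduces a systematic bias, learned per-step coefficients rescale its output, and a simple derivative-free search plays the role of the JAQ-based optimization.

```python
import random

def quantized_denoiser(x, t):
    # Toy stand-in for a quantized diffusion backbone: quantization
    # error is simulated as a deterministic bias at each step.
    return 0.9 * x + 0.05 * t

def jaq_loss(image, coeffs):
    # Reference-free surrogate loss. The real JAQ loss combines a
    # text-image compatibility score with an image-quality metric;
    # this toy version just penalizes deviation from a target value.
    return abs(image - 1.0)

def sample(coeffs, steps=4, x0=2.0):
    # Few-step sampling where learned pre-conditioning coefficients
    # rescale the denoiser output at each step -- the quantity that
    # Q-Sched tunes instead of the model weights.
    x = x0
    for t, c in zip(range(steps, 0, -1), coeffs):
        x = c * quantized_denoiser(x, t)
    return x

def calibrate(steps=4, iters=200, seed=0):
    # Derivative-free random search over the coefficients, a simple
    # stand-in for the paper's actual optimization procedure. Note
    # that only the tiny few-step sampler runs during calibration;
    # no full-precision model is ever invoked.
    rng = random.Random(seed)
    best = [1.0] * steps
    best_loss = jaq_loss(sample(best, steps), best)
    for _ in range(iters):
        cand = [c + rng.uniform(-0.05, 0.05) for c in best]
        loss = jaq_loss(sample(cand, steps), cand)
        if loss < best_loss:
            best, best_loss = cand, loss
    return best, best_loss
```

After calibration, the learned coefficients compensate for the simulated quantization bias, so the sampler lands much closer to the target than with identity coefficients.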

Why it matters?

Q-Sched is important because it allows for high-quality image generation with significantly less computing power. This means people without access to expensive hardware can still create detailed images from text, and it opens the door for faster and more efficient image creation in various applications. The improvements in image quality compared to existing methods demonstrate that reducing model size and optimizing the generation process can work together effectively.

Abstract

Text-to-image diffusion models are computationally intensive, often requiring dozens of forward passes through large transformer backbones. For instance, Stable Diffusion XL generates high-quality images with 50 evaluations of a 2.6B-parameter model, an expensive process even for a single batch. Few-step diffusion models reduce this cost to 2-8 denoising steps but still depend on large, uncompressed U-Net or diffusion transformer backbones, which are often too costly for full-precision inference without datacenter GPUs. These requirements also limit existing post-training quantization methods that rely on full-precision calibration. We introduce Q-Sched, a new paradigm for post-training quantization that modifies the diffusion model scheduler rather than model weights. By adjusting the few-step sampling trajectory, Q-Sched achieves full-precision accuracy with a 4x reduction in model size. To learn quantization-aware pre-conditioning coefficients, we propose the JAQ loss, which combines text-image compatibility with an image quality metric for fine-grained optimization. JAQ is reference-free and requires only a handful of calibration prompts, avoiding full-precision inference during calibration. Q-Sched delivers substantial gains: a 15.5% FID improvement over the FP16 4-step Latent Consistency Model and a 16.6% improvement over the FP16 8-step Phased Consistency Model, showing that quantization and few-step distillation are complementary for high-fidelity generation. A large-scale user study with more than 80,000 annotations further confirms Q-Sched's effectiveness on both FLUX.1[schnell] and SDXL-Turbo.
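The 4x model-size reduction cited above corresponds to storing FP16 weights in 4 bits. As context for what the quantization side looks like, here is a minimal symmetric 4-bit post-training quantizer (a generic sketch, not the paper's specific scheme; the paper's contribution is tuning the scheduler on top of such quantized weights):

```python
def quantize_4bit(weights):
    # Symmetric per-tensor 4-bit quantization: map floats onto the
    # signed integer range [-8, 7] using a single scale factor.
    # FP16 -> 4-bit storage yields the 4x size reduction.
    qmax = 7
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights; the per-element error is
    # bounded by the quantization step (the scale).
    return [v * scale for v in q]
```

Pre-conditioning the sampling trajectory, as Q-Sched does, compensates for the rounding error such a quantizer accumulates over the denoising steps.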