PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference

Denis Korzhenkov, Adil Karjauv, Animesh Karnewar, Mohsen Ghafoorian, Amirhossein Habibian

2026-01-09

Summary

This research shows how to turn an existing pretrained video diffusion model into a pyramidal one, gaining the inference efficiency of pyramidal generation without giving up the quality of the output videos.

What's the problem?

Pyramidal diffusion models split the denoising process into stages that run at different resolutions: very noisy inputs are handled at low resolution and cleaner inputs at higher resolution, which saves computing power. However, the open-source pyramidal video models available so far were trained from scratch, and their videos tend to look less plausible than those from the best existing video generation systems. Essentially, they're fast but the quality suffers.

What's the solution?

The researchers developed a pipeline that takes an existing, already well-trained video diffusion model and *converts* it into a pyramidal model. The conversion needs only low-cost finetuning and, importantly, does not degrade the quality of the generated videos. They also compared several step-distillation strategies for pyramidal models, aiming to make inference even more efficient.

Why does it matter?

This work is important because it provides a practical way to get the benefits of pyramidal models (speed and efficiency) without sacrificing the visual quality of the generated videos. Because it reuses a pretrained model instead of training a pyramidal one from scratch, it could make efficient, high-quality video generation available to a wider range of applications.

Abstract

Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at varying resolutions. These models handle inputs with higher noise levels at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform compared to state-of-the-art systems in terms of visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, achieving this transformation without degradation in quality of output videos. Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further enhance the inference efficiency. Our results are available at https://qualcomm-ai-research.github.io/PyramidalWan.
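
To make the efficiency argument concrete, here is a minimal back-of-the-envelope sketch in Python. It is not the paper's method or its measured numbers: the function name `relative_cost`, the step counts, the downscale factors, and the assumption that per-step cost grows as a power of the token count (exponent 2, roughly matching full self-attention) are all illustrative. It also assumes that only the spatial dimensions are downscaled.

```python
def relative_cost(steps_per_stage, downscale_per_stage, cost_exponent=2.0):
    """Estimate the cost of a pyramidal schedule relative to running
    every denoising step at full resolution.

    steps_per_stage: denoising steps in each stage, ordered from the
        noisiest stage (lowest resolution) to the cleanest (full resolution).
    downscale_per_stage: spatial downscale factor of each stage
        (1 means full resolution). Hypothetical values, not from the paper.
    cost_exponent: assumed growth of per-step cost with token count
        (2.0 approximates attention-dominated cost; 1.0 is linear).
    """
    if len(steps_per_stage) != len(downscale_per_stage):
        raise ValueError("each stage needs a step count and a downscale factor")

    total_steps = sum(steps_per_stage)
    pyramidal_cost = 0.0
    for steps, scale in zip(steps_per_stage, downscale_per_stage):
        # Downscaling height and width by `scale` divides the token count by scale^2.
        token_fraction = 1.0 / (scale * scale)
        pyramidal_cost += steps * token_fraction ** cost_exponent

    # The baseline spends one unit of cost per step at full resolution.
    return pyramidal_cost / total_steps


if __name__ == "__main__":
    # Three stages of 10 steps each at 1/4, 1/2, and full resolution.
    print(relative_cost([10, 10, 10], [4, 2, 1]))
    print(relative_cost([10, 10, 10], [4, 2, 1], cost_exponent=1.0))
```

Under these assumptions the low-resolution stages contribute almost nothing to the total, so the cost is dominated by the steps that still run at full resolution. That is also why reducing the number of steps through distillation is complementary to the pyramidal schedule.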
