Mobile Video Diffusion
Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas, Amir Ghodrati, Amirhossein Habibian
2024-12-11

Summary
This paper introduces Mobile Video Diffusion (MobileVD), a model designed to generate videos quickly and efficiently on mobile devices without requiring large amounts of computing power.
What's the problem?
Video generation models usually require a lot of processing power, which makes it hard for them to run on mobile devices like smartphones. This limits the ability to create high-quality videos on the go, making these advanced tools less accessible to everyday users.
What's the solution?
The authors developed MobileVD, a mobile-optimized version of a video diffusion model. They achieved this by lowering the frame resolution, using multi-scale temporal representations, pruning channels and temporal blocks to cut memory usage, and reducing denoising to a single step to speed up generation. The resulting model is about 523 times more efficient than the original while maintaining good quality, generating the latents for a short video clip in just 1.7 seconds on a smartphone.
Why it matters?
This research is important because it makes advanced video generation technology accessible to more people by enabling it to work on mobile devices. With MobileVD, users can create and edit videos anytime and anywhere, opening up new possibilities for content creation in areas like social media, entertainment, and education.
Abstract
Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from a spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce memory and computational cost by reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemes to reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, coined as MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi-14 Pro. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion/.
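To see why channel pruning is so effective, a back-of-the-envelope FLOPs count for a single convolution layer helps. This is only a toy illustration of the general idea, not the paper's actual pruning scheme, and all shapes below are hypothetical placeholders rather than SVD's real configuration:

```python
def conv_flops(h, w, c_in, c_out, k=3):
    # Multiply-accumulate count for one k x k convolution
    # applied over an h x w feature map.
    return h * w * c_in * c_out * k * k

# Hypothetical UNet block at a latent resolution (illustrative numbers).
full = conv_flops(64, 32, 320, 320)
pruned = conv_flops(64, 32, 160, 160)  # both in/out channels halved

# Because cost scales with c_in * c_out, halving both gives a
# quadratic (4x) FLOPs reduction for this layer.
print(full / pruned)  # → 4.0
```

Compounding reductions like this across many layers, together with lower frame resolution and single-step denoising, is how large end-to-end efficiency gains of the kind the paper reports become plausible.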