
On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices

Bosung Kim, Kyuhwan Lee, Isu Jeong, Jungmin Cheon, Yeojin Lee, Seulki Lee

2025-02-10

Summary

This paper introduces On-device Sora, a system that lets smartphones generate videos from text descriptions without needing powerful computers or an internet connection.

What's the problem?

Creating videos from text descriptions usually requires a lot of computing power and memory, which smartphones don't have. This means people typically need to use powerful computers or rely on internet-based services to generate videos, which can be slow, expensive, and raise privacy concerns.

What's the solution?

The researchers developed three clever tricks to make video generation work on smartphones: Linear Proportional Leap (LPL) to cut down the number of denoising steps needed to create a video, Temporal Dimension Token Merging (TDTM) to reduce attention computation by merging consecutive tokens across frames, and Concurrent Inference with Dynamic Loading (CI-DL) to split the large model into smaller blocks that are loaded into memory only when needed. They tested these methods on an iPhone 15 Pro and found that it could create high-quality videos comparable to those made by the same model running on powerful computers.
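Of the three techniques, TDTM is the simplest to illustrate. The sketch below is a rough approximation, not the paper's implementation: it assumes video tokens laid out as (frames, spatial tokens, channels) and uses plain averaging as the merge rule, halving the temporal token count before the attention layers.

```python
import numpy as np

def temporal_token_merge(tokens, merge_factor=2):
    """Merge each group of `merge_factor` consecutive frames' tokens
    into one token by averaging along the temporal dimension.

    tokens: array of shape (T, N, D) -- T frames, N spatial tokens,
    D channels. Returns shape (T // merge_factor, N, D).
    """
    T, N, D = tokens.shape
    T_trim = (T // merge_factor) * merge_factor  # drop any remainder frames
    grouped = tokens[:T_trim].reshape(T_trim // merge_factor, merge_factor, N, D)
    return grouped.mean(axis=1)

# Attention cost grows with the square of the token count, so halving
# the temporal tokens roughly quarters the attention work in those layers.
x = np.random.rand(8, 16, 64)   # 8 frames, 16 spatial tokens, 64 channels
y = temporal_token_merge(x, merge_factor=2)
print(y.shape)  # (4, 16, 64)
```

The merge rule (averaging) and the tensor layout here are assumptions for illustration; the paper's attention layers operate on learned token embeddings rather than raw arrays.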

Why it matters?

This technology matters because it makes advanced video creation accessible to anyone with a smartphone. It protects user privacy by keeping everything on the device, reduces the need for expensive cloud services, and could lead to new creative tools for mobile users. It's a big step towards making cutting-edge AI video generation available to everyone, not just those with access to powerful computers.

Abstract

We present On-device Sora, the first pioneering solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. Building on Open-Sora, On-device Sora applies three novel techniques to address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) minimizes intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, effectively addressing the challenges of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations demonstrate that it is capable of generating high-quality videos on the device, comparable to those produced by Open-Sora running on high-end GPUs. These results show that On-device Sora enables efficient and high-quality video generation on resource-constrained mobile devices, expanding accessibility, ensuring user privacy, reducing dependence on cloud infrastructure, and lowering associated costs. We envision the proposed On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation capabilities on commodity mobile and embedded devices. The code implementation is publicly available at a GitHub repository: https://github.com/eai-lab/On-device-Sora.
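The dynamic-loading half of CI-DL can be sketched in a few lines. This is a hypothetical illustration, not the released implementation: the model is partitioned into blocks, and each block is loaded, executed, and released before the next, so peak memory holds roughly one block at a time plus activations. (The actual system goes further by overlapping the next block's load with the current block's inference, which this sequential sketch omits.)

```python
def blockwise_inference(block_keys, x, load_block):
    """Run a model partitioned into blocks, keeping only one block
    in memory at a time.

    block_keys: ordered identifiers of the stored model blocks.
    load_block: callable that loads a block's weights (e.g. from disk)
    and returns a callable block.
    """
    for key in block_keys:
        block = load_block(key)  # bring this block into memory
        x = block(x)             # run it on the current activations
        del block                # release it before loading the next
    return x

# Toy demo: "blocks" are plain functions standing in for model partitions.
store = {"b1": lambda v: v + 1, "b2": lambda v: v * 3}
result = blockwise_inference(["b1", "b2"], 2, store.get)
print(result)  # (2 + 1) * 3 = 9
```

The names `block_keys` and `load_block` are invented for this sketch; in practice the loader would deserialize weight tensors from storage and the memory savings depend on block granularity versus loading overhead.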