Magic 1-For-1: Generating One Minute Video Clips within One Minute

Hongwei Yi, Shitong Shao, Tian Ye, Jiantong Zhao, Qingyu Yin, Michael Lingelbach, Li Yuan, Yonghong Tian, Enze Xie, Daquan Zhou

2025-02-12

Summary

This paper talks about Magic 1-For-1 (Magic141), a new AI system that can create one-minute video clips in just one minute. It's designed to be fast and efficient, using clever tricks to make high-quality videos quickly without needing too much computer power.

What's the problem?

Creating videos using AI is usually a slow process that requires a lot of computing power. Most systems struggle to make long videos quickly while keeping the quality high. This makes it hard for people to use AI video generation for real-time applications or on regular computers.

What's the solution?

The researchers split the video-making process into two steps: first creating an image from the text, then turning that image into a video. They found this two-step approach was easier for the AI to learn than generating video directly from text. They also used several tricks to make the process faster and lighter on memory: injecting extra guidance from both text and image inputs into the model, training it to produce good results in far fewer generation steps, and trimming down its parameters to save memory. With these techniques, they can create a 5-second video clip in about 3 seconds, and by using a sliding window method (sketched below), they can extend that into a minute-long video in roughly one minute.
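To make the two-step idea concrete, here is a minimal Python sketch. The `t2i_model` and `i2v_model` callables are hypothetical stand-ins for the actual text-to-image and image-to-video diffusion models, and the frame-rate and clip-length values are illustrative defaults; this shows the shape of the pipeline, not the authors' implementation.

```python
from typing import Callable, List

# Minimal sketch of the two-step factorization described above.
# `t2i_model` and `i2v_model` are hypothetical stand-ins for the real
# text-to-image and image-to-video models; frames are opaque objects here.

def generate_minute_video(
    prompt: str,
    t2i_model: Callable[[str], object],
    i2v_model: Callable[[object, str, int], List[object]],
    fps: int = 24,
    clip_seconds: int = 5,
    total_seconds: int = 60,
) -> List[object]:
    # Step 1: text-to-image -- generate a single starting frame from the prompt.
    first_frame = t2i_model(prompt)

    # Step 2: image-to-video -- animate that frame into a short clip.
    frames = i2v_model(first_frame, prompt, clip_seconds * fps)

    # Test-time sliding window: keep animating from the most recent frame
    # until the target length is reached, so consecutive clips stay coherent.
    while len(frames) < total_seconds * fps:
        next_clip = i2v_model(frames[-1], prompt, clip_seconds * fps)
        frames.extend(next_clip)

    return frames
```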

Why it matters?

This matters because it could make AI video creation much more accessible and useful for everyday people. Imagine being able to describe a video idea and have it created almost instantly on your computer or phone. This could revolutionize content creation for social media, education, or entertainment. It also shows that AI can be made more efficient, which could lead to better AI tools that don't need super powerful computers to run.

Abstract

In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate, easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that, with the same optimization algorithm, the image-to-video task converges faster than the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training the image-to-video (I2V) models from three aspects: 1) model convergence speedup by using a multi-modal prior condition injection; 2) inference latency speedup by applying adversarial step distillation; and 3) inference memory cost optimization with parameter sparsification. With those techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test-time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending less than 1 second on average to generate each 1-second video clip. We conduct a series of preliminary explorations to identify the optimal tradeoff between computational cost and video quality during diffusion step distillation and hope this could be a good foundation model for open-source explorations. The code and the model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.
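As a quick sanity check of the numbers quoted in the abstract, the snippet below works out the implied throughput, assuming clips are generated back to back and ignoring any overlapping frames the sliding window may re-generate (which would raise the total somewhat).

```python
# Back-of-the-envelope check of the reported throughput.
clip_len_s = 5        # seconds of video per clip (from the abstract)
clip_gen_time_s = 3   # reported time to generate one clip

print(clip_gen_time_s / clip_len_s)          # 0.6 s of compute per 1 s of video
print((60 // clip_len_s) * clip_gen_time_s)  # ~36 s of compute for a 60 s video
```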