
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei

2025-02-17


Summary

This paper introduces Step-Video-T2V, a new AI model that creates videos from text descriptions. It's a big step forward in making AI-generated videos that look realistic and actually match what people ask for in their text prompts.

What's the problem?

Creating high-quality videos from text descriptions is really hard for AI. Previous models struggled to generate long videos that both looked good and followed the text accurately, especially when prompts were written in languages other than English.

What's the solution?

The researchers created Step-Video-T2V, which combines several techniques. It has a deep-compression Video-VAE that shrinks videos efficiently (16x16 spatially and 8x temporally), bilingual text encoders that understand both English and Chinese prompts, and a diffusion transformer (DiT) with 3D full attention that keeps the generated videos smooth and consistent over time. They also applied a video-based preference-optimization method called Video-DPO to reduce visual artifacts and improve overall quality.
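To make the compression concrete, here is a small sketch of the arithmetic. The 16x16 spatial and 8x temporal ratios come from the paper's abstract, as does the 204-frame maximum length; the example resolution is purely illustrative, not a documented model configuration.

```python
# Illustrative arithmetic only: the compression ratios (16x16 spatial,
# 8x temporal) are from the paper; the input resolution is hypothetical.

def latent_shape(frames, height, width, t_ratio=8, s_ratio=16):
    """Grid shape of the Video-VAE latent for a given input video."""
    return (frames // t_ratio, height // s_ratio, width // s_ratio)

# A 204-frame clip (the paper's stated maximum length) at 544x992 pixels:
print(latent_shape(204, 544, 992))  # (25, 34, 62)
```

At these ratios each latent cell summarizes a 16x16x8 block of pixels across space and time, which is what makes training a 30B-parameter transformer on long videos tractable.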

Why it matters?

This matters because it could change how we create videos in the future. Imagine being able to describe any video you want and have an AI create it for you, whether you speak English or Chinese. It could help filmmakers, advertisers, and anyone who needs to create videos quickly. By making their work public, the researchers are helping other scientists improve this technology even further, which could lead to even more amazing video creation tools in the future.

Abstract

We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of the current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.