Waver: Wave Your Way to Lifelike Video Generation
Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Zehuan Yuan, Bingyue Peng
2025-08-22
Summary
This paper introduces Waver, an AI model that creates realistic videos, 5 to 10 seconds long at a native 720p resolution (later upscaled to 1080p), from either text descriptions or still images, and that can also generate images from text. It's designed as a single system that handles all three tasks, making it a versatile tool for video and image creation.
What's the problem?
Creating high-quality videos with AI is hard. Existing models often struggle to produce videos that look natural, have consistent motion, and are long enough to be useful. It is also difficult to build a single model that can do everything (turn text into video, images into video, and text into images) without sacrificing quality on any one task. Finally, the data used to train these models must itself be of good quality, because bad training data leads to bad results.
What's the solution?
The researchers built Waver on a new architecture called Hybrid Stream DiT, which helps the model align and combine different types of information (like text and video) more effectively and speeds up training convergence. They also created a careful data curation pipeline, training a multimodal AI model (an MLLM) on manual quality annotations to identify and filter out low-quality videos. Finally, they share detailed training and inference recipes so others can build upon their work.
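The summary above does not spell out the block-level design of Hybrid Stream DiT. The name suggests a mix of dual-stream blocks (separate text and video weights with joint attention) and single-stream blocks (shared weights over the concatenated tokens), a layout used by several recent DiT variants, so the sketch below illustrates that general pattern only. All class names, hyperparameters, and the dual-to-single ratio are hypothetical, not Waver's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualStreamBlock(nn.Module):
    """Per-modality projections for text and video; joint self-attention over both."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.vid_qkv = nn.Linear(dim, 3 * dim)
        self.txt_qkv = nn.Linear(dim, 3 * dim)
        self.vid_out = nn.Linear(dim, dim)
        self.txt_out = nn.Linear(dim, dim)

    def forward(self, vid: torch.Tensor, txt: torch.Tensor):
        b, nv, d = vid.shape
        # Project each stream with its own weights, then attend over the union of tokens.
        qkv = torch.cat([self.vid_qkv(vid), self.txt_qkv(txt)], dim=1)  # (b, nv+nt, 3d)
        q, k, v = qkv.chunk(3, dim=-1)

        def split_heads(x: torch.Tensor) -> torch.Tensor:
            return x.view(b, -1, self.num_heads, d // self.num_heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        out = out.transpose(1, 2).reshape(b, -1, d)
        # Residual update, again with per-modality output projections.
        return vid + self.vid_out(out[:, :nv]), txt + self.txt_out(out[:, nv:])


class SingleStreamBlock(nn.Module):
    """Shared weights applied to the concatenated text+video token sequence."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)


class HybridStreamSketch(nn.Module):
    """Hypothetical hybrid layout: dual-stream blocks first, then single-stream blocks."""

    def __init__(self, dim: int = 1024, num_heads: int = 16, n_dual: int = 2, n_single: int = 4):
        super().__init__()
        self.dual = nn.ModuleList(DualStreamBlock(dim, num_heads) for _ in range(n_dual))
        self.single = nn.ModuleList(SingleStreamBlock(dim, num_heads) for _ in range(n_single))

    def forward(self, vid: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        for blk in self.dual:
            vid, txt = blk(vid, txt)
        x = torch.cat([vid, txt], dim=1)
        for blk in self.single:
            x = blk(x)
        return x[:, : vid.shape[1]]  # keep only the video tokens as the denoising output
```

In this kind of layout, the early dual-stream blocks let each modality keep its own parameters while still attending to the other, which is one plausible way to improve modality alignment and training convergence in the sense the summary describes.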
Why it matters?
Waver represents a significant step forward in AI video generation. It produces videos that are competitive with, and sometimes even better than, existing commercial options, and it’s available as an open-source project. This means anyone can use and improve upon it, potentially accelerating advancements in video creation technology and making it more accessible to everyone.
Abstract
We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.
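As a concrete illustration of the data curation step described above, here is a minimal, hypothetical sketch of filtering clips with a learned quality scorer. The `score_clip` callable stands in for the MLLM-based quality model trained on manual annotations; its interface, the `Clip` record, and the threshold value are assumptions for illustration, not Waver's actual pipeline or API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Clip:
    path: str
    caption: str


def filter_high_quality(
    clips: Iterable[Clip],
    score_clip: Callable[[Clip], float],  # stand-in for the MLLM quality model
    threshold: float = 0.8,               # assumed cutoff, not a value from the report
) -> List[Clip]:
    """Keep only clips whose predicted quality score clears the threshold."""
    return [c for c in clips if score_clip(c) >= threshold]


if __name__ == "__main__":
    # Stub scorer just to show the wiring; a real pipeline would call the trained MLLM.
    data = [Clip("a.mp4", "a cat jumps over a fence"), Clip("b.mp4", "blurry shaky footage")]
    kept = filter_high_quality(data, score_clip=lambda c: 0.9 if "cat" in c.caption else 0.3)
    print([c.path for c in kept])  # -> ['a.mp4']
```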