Wan-S2V: Audio-Driven Cinematic Video Generation
Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, Ke Sun, Linrui Tian, Guangyuan Wang, Qi Wang, Zhongjian Wang, Jiayu Xiao, Sheng Xu, Bang Zhang, Peng Zhang, Xindi Zhang, Zhe Zhang, Jingren Zhou
2025-08-27
Summary
This paper introduces a new computer model, called Wan-S2V, designed to create more realistic character animations from audio, like speech or music.
What's the problem?
Current technology for making characters move with audio works well for simple cases like someone talking or singing, but it struggles with the complexity of real movies and TV shows. These productions need characters to interact naturally, move believably, and hold up under dynamic camera work, which existing methods can't consistently achieve.
What's the solution?
The researchers built Wan-S2V on top of the Wan video generation model. They tested it extensively, comparing its performance to other leading animation models like Hunyuan-Avatar and Omnihuman. The results showed Wan-S2V consistently created more expressive and accurate animations, especially in situations mimicking real film production.
Why it matters?
This research is important because it brings us closer to being able to automatically generate high-quality character animation for movies and television. This could significantly reduce the time and cost associated with creating animated content, and also open up new possibilities for creative storytelling and video editing.
Abstract
Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance for scenarios primarily involving speech and singing. However, they often fall short in more complex film and television productions, which demand sophisticated elements such as nuanced character interactions, realistic body movements, and dynamic camera work. To address this long-standing challenge of achieving film-level character animation, we propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan. Our model achieves significantly enhanced expressiveness and fidelity in cinematic contexts compared to existing approaches. We conducted extensive experiments, benchmarking our method against cutting-edge models such as Hunyuan-Avatar and Omnihuman. The experimental results consistently demonstrate that our approach significantly outperforms these existing solutions. Additionally, we explore the versatility of our method through its applications in long-form video generation and precise video lip-sync editing.