
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan, Kai Chen, Tianfan Xue, Bo Dai, Dahua Lin

2024-07-25


Summary

This paper introduces HumanVid, a new dataset designed to improve human image animation by letting users control both the character's movements and the camera's motion in generated videos. It combines high-quality real-world and synthetic data so that models can learn human and camera motion together.

What's the problem?

Current methods for animating human images rely on training datasets that are either inaccessible or lack camera motion annotations. Without such data, models tend to learn only 2D human movement and ignore how the camera moves relative to the subject, which limits controllability and leads to unstable video generation.

What's the solution?

To address these issues, HumanVid provides a large-scale dataset of 20,000 high-resolution (1080P) human-centric videos collected from copyright-free sources on the internet. The videos are selected with a carefully designed rule-based filter to ensure quality (a sketch of such a filter appears below), and each clip is annotated with human poses from a 2D pose estimator and camera movements from a SLAM-based method. The dataset also incorporates synthetic data rendered from 2,300 copyright-free 3D avatar assets, which broadens the variety of motions and camera trajectories available for training. Finally, the researchers built a baseline model called CamAnimate that conditions video generation on both human poses and camera motions, achieving state-of-the-art performance in generating animated videos.
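The paper's exact filtering rules are not listed in this summary, so the following is a minimal, hypothetical Python sketch of what a rule-based video filter could look like. The rule names and thresholds (min_height, min_duration_s, max_people) are illustrative assumptions, not HumanVid's actual criteria.

```python
# Hypothetical sketch of a rule-based video filter in the spirit of
# HumanVid's pipeline; all thresholds below are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class VideoMeta:
    height: int          # frame height in pixels
    duration_s: float    # clip length in seconds
    num_people: int      # detected people (e.g., from a 2D pose estimator)
    has_scene_cut: bool  # whether a shot-boundary detector fired

def keep_video(meta: VideoMeta,
               min_height: int = 1080,
               min_duration_s: float = 2.0,
               max_people: int = 1) -> bool:
    """Return True if the clip passes every quality rule."""
    return (meta.height >= min_height              # keep 1080P and above
            and meta.duration_s >= min_duration_s  # drop very short clips
            and meta.num_people <= max_people      # human-centric: one subject
            and not meta.has_scene_cut)            # single continuous shot

# Example: a 1080P, 5-second, single-person clip with no cuts passes.
print(keep_video(VideoMeta(1080, 5.0, 1, False)))  # True
```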

Why it matters?

This research is important because it sets a new standard for human image animation by providing a comprehensive dataset that allows for better training of AI models. By enabling more realistic and controllable animations, HumanVid can significantly benefit fields like film production, video games, and virtual reality, making it easier to create engaging visual content.

Abstract

Human image animation involves generating videos from a character photo, allowing user control and unlocking potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motions in videos, leading to limited control and unstable video generation. To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines crafted real-world and synthetic data. For the real-world data, we compile a vast collection of copyright-free real-world videos from the internet. Through a carefully designed rule-based filtering strategy, we ensure the inclusion of high-quality videos, resulting in a collection of 20K human-centric videos in 1080P resolution. Human and camera motion annotation is accomplished using a 2D pose estimator and a SLAM-based method. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment the existing 3D assets. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotation, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that considers both human and camera motions as conditions. Through extensive experimentation, we demonstrate that such a simple baseline trained on our HumanVid achieves state-of-the-art performance in controlling both human pose and camera motions, setting a new benchmark. Code and data will be publicly available at https://github.com/zhenzhiwang/HumanVid/.
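The rule-based camera trajectory generation is only described at a high level above, so here is a minimal sketch under the assumption that trajectories are composed from simple parametric motion primitives (orbit, pan, dolly) and emitted as a sequence of world-to-camera extrinsic matrices. Only an orbit primitive is shown, and the function name and parameters are hypothetical, not from the paper.

```python
# Minimal sketch of rule-based camera trajectory generation for a synthetic
# pipeline. Assumption: trajectories are built from parametric motion
# primitives; only an orbit around the subject is shown here.

import numpy as np

def orbit_trajectory(n_frames: int, total_yaw_deg: float, radius: float) -> list[np.ndarray]:
    """Orbit the camera around a subject at the origin, always looking at it."""
    poses = []
    for t in np.linspace(0.0, np.deg2rad(total_yaw_deg), n_frames):
        # Camera position on a horizontal circle around the subject.
        pos = np.array([radius * np.sin(t), 0.0, radius * np.cos(t)])
        # Rotation whose rows are the camera's right/up/forward axes in
        # world coordinates, chosen so the camera looks at the origin.
        R = np.array([[-np.cos(t), 0.0,  np.sin(t)],
                      [ 0.0,       1.0,  0.0      ],
                      [-np.sin(t), 0.0, -np.cos(t)]])
        # 4x4 world-to-camera extrinsic: x_cam = R @ x_world - R @ pos,
        # which places the subject straight ahead at distance `radius`.
        E = np.eye(4)
        E[:3, :3] = R
        E[:3, 3] = -R @ pos
        poses.append(E)
    return poses

# Example: 48 frames of a gentle 30-degree orbit at 3 m from the subject.
traj = orbit_trajectory(48, 30.0, 3.0)
print(len(traj), traj[0].shape)  # 48 (4, 4)
```

Because the synthetic scenes are rendered, every frame's extrinsics are known exactly, which is how the pipeline obtains the precise camera motion annotation that is rarely available for real-world footage.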