The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu
2025-10-31
Summary
This paper focuses on improving how well computers can create realistic 3D human movements. Current methods struggle to generate movements unlike those in their training data, which limits their usefulness in new situations.
What's the problem?
Existing 3D motion generation models don't generalize well to new or unseen movements. While video generation models are really good at creating realistic human actions in videos, 3D motion models haven't been able to take advantage of those advances. Essentially, creating believable movement in 3D is harder than creating believable movement in video, and there's a gap in transferring knowledge between the two.
What's the solution?
The researchers created a system that borrows ideas from video generation to improve 3D motion creation. They did this in three main ways. First, they built a huge dataset called ViMoGen-228K, containing 228,000 motion samples drawn from optical motion capture of real people, web videos, and clips synthesized by state-of-the-art video generation models. Second, they developed a new model called ViMoGen, which combines information from both real motion capture data and video generation models; they also made a simplified, faster version called ViMoGen-light that drops the dependency on video generation. Finally, they created a new benchmark, called MBench, that tests how realistic, accurate, and adaptable the generated motions are.
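ViMoGen is described as a flow-matching-based diffusion model. The generic flow-matching training objective behind such models can be sketched in a few lines; note this is a minimal toy illustration of conditional flow matching in general, not the paper's actual training recipe, and `model` and its signature here are assumptions.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Toy conditional flow-matching objective (a sketch, not ViMoGen's code).

    model: callable (x_t, t, cond) -> predicted velocity, same shape as x_t
    x1:    batch of clean motion samples, shape (B, T, D)
    cond:  conditioning input (e.g. a text embedding), passed through as-is
    """
    x0 = torch.randn_like(x1)                 # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1, 1)         # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1                # linear interpolation between endpoints
    target = x1 - x0                          # constant velocity along that path
    pred = model(xt, t, cond)                 # model predicts the velocity field
    return ((pred - target) ** 2).mean()      # regress prediction onto the target

# Usage with a stand-in "model" that always predicts zero velocity:
dummy_model = lambda xt, t, cond: torch.zeros_like(xt)
loss = flow_matching_loss(dummy_model, torch.randn(4, 16, 8), cond=None)
```

At sampling time, a model trained this way generates motion by integrating the learned velocity field from noise (`t = 0`) to data (`t = 1`).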
Why does it matter?
This work is important because it significantly improves the quality and realism of computer-generated human movements. Better 3D motion generation has applications in areas like animation, virtual reality, robotics, and even creating more realistic characters in video games. By bridging the gap between video and 3D motion, this research opens up possibilities for more versatile and believable virtual human behavior.
Abstract
Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.
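The abstract's "gated multimodal conditioning" can be illustrated with a small module that blends text and video-prior embeddings into a motion token stream via a learned gate. This is a hypothetical sketch; the class name, layer layout, and fallback behavior are assumptions for illustration, not ViMoGen's actual architecture.

```python
import torch
import torch.nn as nn

class GatedMultimodalConditioning(nn.Module):
    """Sketch of gated fusion of two conditioning signals (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.text_proj = nn.Linear(dim, dim)
        self.video_proj = nn.Linear(dim, dim)
        # Gate maps the concatenated signals to per-channel weights in [0, 1].
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, motion_tokens, text_emb, video_emb=None):
        # If no video prior is available (cf. the distilled ViMoGen-light,
        # which removes the video dependency), condition on text alone.
        if video_emb is None:
            return motion_tokens + self.text_proj(text_emb)
        t = self.text_proj(text_emb)
        v = self.video_proj(video_emb)
        g = self.gate(torch.cat([t, v], dim=-1))
        # Gated convex combination of the two conditioning signals.
        return motion_tokens + g * v + (1 - g) * t

# Usage on a toy batch of motion token sequences:
block = GatedMultimodalConditioning(dim=64)
x = torch.randn(2, 16, 64)    # (batch, tokens, channels)
txt = torch.randn(2, 16, 64)
vid = torch.randn(2, 16, 64)
out = block(x, txt, vid)
```

The gate lets the model lean on the video prior where it is informative and fall back to the text signal elsewhere, which is one plausible way a single network can serve both conditioning regimes.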