3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
Zhixue Fang, Xu He, Songlin Tang, Haoxian Zhang, Qingfeng Li, Xiaoqiang Liu, Pengfei Wan, Kun Gai
2026-02-04
Summary
This paper introduces a new way to control the movements of people in videos created by artificial intelligence, aiming for more realistic and flexible motion than current methods allow.
What's the problem?
Current methods for controlling motion in AI-generated videos rely on either 2D pose skeletons, which tie the motion to the original camera viewpoint and break down when the camera angle changes, or explicit 3D models of the human body such as SMPL. These 3D models are imperfect (for example, they suffer from depth ambiguity and inaccurate dynamics), and when they are used as hard constraints they can actually limit the AI's ability to create natural-looking movement by forcing it to follow inaccurate guidance. In short, existing methods either lack viewpoint flexibility or impose inaccurate constraints.
What's the solution?
The researchers developed a system called 3DiMo that learns a 3D-aware representation of motion without relying on precise 3D body models. It takes video frames of a desired motion and compresses them into 'motion tokens' – a compact description of the movement – which are then injected into a pre-existing AI video generator to guide the motion; a rough sketch of this pipeline appears below. The system is trained on videos from many different viewpoints (single-view, multi-view, and moving-camera footage), and a 3D body model is used only to get training started, with its influence gradually reduced to zero so the AI can learn genuine 3D motion understanding on its own.
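The paper does not include code, but a minimal sketch of the core idea (a motion encoder distilling driving frames into a few tokens that are injected into the generator through cross-attention) might look like the following. All module names, dimensions, and the generator interface here are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only: names, sizes, and interfaces are assumptions.
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Distills driving-frame features into a small set of view-agnostic motion tokens."""
    def __init__(self, frame_dim=1024, token_dim=768, num_tokens=16, num_layers=4):
        super().__init__()
        self.proj = nn.Linear(frame_dim, token_dim)
        # Learned queries that attend over the driving frames and become motion tokens.
        self.queries = nn.Parameter(torch.randn(num_tokens, token_dim))
        layer = nn.TransformerDecoderLayer(d_model=token_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):               # frame_feats: (B, T*P, frame_dim)
        memory = self.proj(frame_feats)           # (B, T*P, token_dim)
        queries = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        return self.decoder(queries, memory)      # (B, num_tokens, token_dim) motion tokens

class MotionCrossAttention(nn.Module):
    """Injects motion tokens into one block of the video generator via cross-attention."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, video_latents, motion_tokens):   # video_latents: (B, N, hidden_dim)
        out, _ = self.attn(query=self.norm(video_latents),
                           key=motion_tokens, value=motion_tokens)
        return video_latents + out                # residual injection guided by the motion tokens

In this sketch the generator's own spatial layers are untouched; only the small cross-attention adapters and the motion encoder would be trained, which is one plausible reading of "jointly trains a motion encoder with a pretrained video generator."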
Why it matters?
This research is important because it allows for more realistic and controllable human motion in AI-generated videos. It overcomes the limitations of previous methods, faithfully reproducing the driving motion while letting the camera be controlled flexibly through text, ultimately leading to higher-quality and more believable videos.
Abstract
Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (i.e., single-view, multi-view, and moving-camera videos), forcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator's priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.
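One concrete training detail the abstract mentions is the auxiliary geometric supervision that uses SMPL only for early initialization and is annealed to zero. A hypothetical sketch of such a schedule is shown below; the schedule shape, step counts, and loss names are assumptions made for illustration, not values from the paper.

# Hypothetical annealing of the auxiliary SMPL supervision weight.
def smpl_weight(step, warmup_steps=10_000, anneal_steps=40_000, w0=1.0):
    """Full SMPL-based guidance early in training, then linearly decayed to zero."""
    if step < warmup_steps:
        return w0
    t = min(1.0, (step - warmup_steps) / anneal_steps)
    return w0 * (1.0 - t)   # reaches 0, leaving only the video-generation objective

def total_loss(step, video_loss, smpl_geo_loss):
    # video_loss: the generator's standard objective on view-rich data
    # smpl_geo_loss: auxiliary geometric loss against SMPL pseudo-labels
    return video_loss + smpl_weight(step) * smpl_geo_loss

The intent, as described in the abstract, is that external 3D guidance only bootstraps the model, after which motion consistency across single-view, multi-view, and moving-camera data drives the 3D awareness.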