Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer
Chenyang Gu, Mingyuan Zhang, Haozhe Xie, Zhongang Cai, Lei Yang, Ziwei Liu
2026-03-20
Summary
This paper presents a new way to create realistic and controllable human motions by combining the best parts of two different approaches: one (continuous diffusion) that's good at smooth, precisely controlled movements, and another (discrete tokens) that's good at following semantic instructions about what the movement *should* be doing.
What's the problem?
Existing methods for generating human motion either excel at creating natural-looking movements but struggle with following specific instructions, or they're good at following instructions but the resulting motions can look stiff or unnatural. It's hard to get both smooth, realistic motion *and* precise control over what the motion is accomplishing. Also, adding more detailed control often makes the motion quality worse.
What's the solution?
The researchers developed a three-stage process. First, they extract features from the input conditions, such as a text description and any movement constraints (Perception). Second, they plan the motion as a short sequence of discrete 'tokens' representing the key actions (Planning); the tokenizer behind these tokens, called MoTok, pairs them with a diffusion decoder, which lets the tokens stay compact while still capturing the essence of the movement. Finally, another diffusion model turns the tokens into a detailed, smooth motion (Control). Crucially, the system handles broad constraints during planning and fine-grained details during the final synthesis, preventing the details from disrupting the overall plan.
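The three-stage flow above can be sketched in code. This is a toy illustration only: the module names, shapes, and update rules below are placeholders invented for clarity, not the paper's actual architecture or math.

```python
import numpy as np

rng = np.random.default_rng(0)

def perceive(text_embedding, coarse_constraints):
    """Perception: fuse semantic and coarse kinematic conditions into one feature vector."""
    return np.concatenate([text_embedding, coarse_constraints])

def plan_tokens(condition, num_tokens=8, codebook_size=512):
    """Planning: produce a short sequence of discrete motion tokens.
    A stand-in for the diffusion-based token generator (here, random logits)."""
    logits = rng.normal(size=(num_tokens, codebook_size)) + condition.mean()
    return logits.argmax(axis=-1)  # one token id per motion segment

def control(tokens, num_frames=60, dim=64, steps=10):
    """Control: a toy iterative refiner that decodes tokens into motion frames,
    mimicking the diffusion decoder's role (not its actual sampling math)."""
    x = rng.normal(size=(num_frames, dim))            # start from noise
    target = np.tile(tokens.mean(), (num_frames, dim))
    for _ in range(steps):                            # crude "denoising" loop
        x = x + 0.3 * (target - x)
    return x

cond = perceive(rng.normal(size=32), rng.normal(size=8))
tokens = plan_tokens(cond)      # compact single-layer token sequence
motion = control(tokens)        # detailed frame-by-frame motion
print(tokens.shape, motion.shape)  # (8,) (60, 64)
```

The key design point the sketch mirrors is the hand-off: the planner only has to output a few discrete tokens, and recovering fine motion detail is delegated entirely to the decoder stage.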
Why it matters?
This work is important because it significantly improves the quality and control of generated human motions. It achieves better results than previous methods, creating more realistic movements while using only a fraction of the tokens (one-sixth as many as a leading prior method). It also maintains, and even improves, motion quality when given very specific and detailed instructions, something other methods struggle with. This could be useful for creating more realistic characters in video games, animation, or robotics.
Abstract
Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods that degrade under stronger kinematic constraints, ours improves fidelity, reducing FID from 0.033 to 0.014.
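The abstract's "diffusion-based optimization" for fine-grained kinematic constraints can be illustrated with a gradient-guided sampling step: at each denoising iteration, the sample is nudged toward satisfying a trajectory constraint using the gradient of a constraint loss. Everything below is a simplified stand-in; the paper's actual sampler, loss, and guidance schedule are not specified here.

```python
import numpy as np

def constraint_grad(x, target_traj):
    """Gradient of 0.5 * ||x[:, :2] - target_traj||^2 w.r.t. x.
    Assumes (hypothetically) that the first two channels are the root trajectory."""
    g = np.zeros_like(x)
    g[:, :2] = x[:, :2] - target_traj
    return g

def guided_denoise(x, target_traj, steps=50, guidance=0.2):
    """Toy guided sampling loop: a shrinkage update stands in for the
    denoiser, followed by a constraint-guidance correction."""
    for _ in range(steps):
        x = 0.95 * x                                        # stand-in denoising update
        x = x - guidance * constraint_grad(x, target_traj)  # enforce kinematic constraint
    return x

rng = np.random.default_rng(1)
target = np.zeros((60, 2))   # desired root trajectory (here: stay at the origin)
motion = guided_denoise(rng.normal(size=(60, 8)), target)
err = np.abs(motion[:, :2]).max()
print(err < 0.05)  # True: trajectory error shrinks toward the constraint
```

Because the constraint is enforced inside the synthesis stage rather than during token planning, the fine-grained correction cannot corrupt the semantic token sequence, which is the decoupling the abstract describes.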