MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment
Zhiting Gao, Dan Song, Diqiong Jiang, Chao Xue, An-An Liu
2025-08-28
Summary
This paper introduces a new system for creating realistic animations of virtual characters based on text descriptions, focusing on making the animations both accurate to the text and fast to generate.
What's the problem?
Current methods for turning text into animation often fail to match the described actions precisely, especially subtle details, and they can be very slow because they need many computational steps to produce even a short animation sequence. Essentially, it's hard to get the animation to do *exactly* what the text says, and it takes a long time to try.
What's the solution?
The researchers developed two main components: TAPO and MotionFLUX. TAPO improves how the system links words to specific movements, making the animations more precise and better grounded in the text. MotionFLUX is a fast generation technique based on rectified flow matching: instead of slowly refining the animation step by step like older diffusion-based methods, it learns a nearly straight path from random noise directly to a finished animation. It's like taking a shortcut instead of a winding road.
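The "shortcut" idea can be made concrete with a minimal sketch of generic rectified flow matching (this is the general technique the paper builds on, not the paper's released code; the function names and the use of plain NumPy arrays in place of a motion model are illustrative assumptions):

```python
import numpy as np

def rectified_flow_training_pair(x0, x1, rng):
    """Sample a training target for a rectified-flow model.

    The model learns a velocity field v(x_t, t) that is regressed toward
    the constant direction (x1 - x0) along the straight-line path
    x_t = (1 - t) * x0 + t * x1 between noise x0 and data x1.
    """
    t = rng.uniform()                      # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1          # point on the straight path
    target_velocity = x1 - x0              # constant along the whole path
    return t, x_t, target_velocity

def euler_sample(velocity_fn, x0, n_steps):
    """Generate a sample by integrating dx/dt = v(x, t) from t=0 to t=1.

    Because rectified-flow paths are trained to be near-straight, a
    handful of Euler steps (even one) can already give a usable sample,
    unlike the hundreds of steps typical of diffusion sampling.
    """
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x
```

If the learned velocity field were exactly the constant `x1 - x0`, Euler integration would recover `x1` from `x0` in a single step; the speedup in the paper comes from the learned paths being close to this straight-line ideal.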
Why it matters?
This work is important because it allows for more realistic and responsive virtual characters in games, movies, and robotics. By improving both the accuracy and speed of animation generation, it opens the door to creating more immersive and interactive experiences, and potentially allows for real-time control of characters based on spoken commands.
Abstract
Motion generation is essential for animating virtual characters and embodied agents. While recent text-driven methods have made significant strides, they often struggle with achieving precise alignment between linguistic descriptions and motion semantics, as well as with the inefficiencies of slow, multi-step inference. To address these issues, we introduce TMR++ Aligned Preference Optimization (TAPO), an innovative framework that aligns subtle motion variations with textual modifiers and incorporates iterative adjustments to reinforce semantic grounding. To further enable real-time synthesis, we propose MotionFLUX, a high-speed generation framework based on deterministic rectified flow matching. Unlike traditional diffusion models, which require hundreds of denoising steps, MotionFLUX constructs optimal transport paths between noise distributions and motion spaces, facilitating real-time synthesis. The linearized probability paths reduce the need for multi-step sampling typical of sequential methods, significantly accelerating inference time without sacrificing motion quality. Experimental results demonstrate that, together, TAPO and MotionFLUX form a unified system that outperforms state-of-the-art approaches in both semantic consistency and motion quality, while also accelerating generation speed. The code and pretrained models will be released.
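The preference-alignment side of the system (TAPO) can be illustrated with a generic DPO-style preference loss. This is a hedged sketch of the general preference-optimization family, not the paper's exact TAPO objective: the function name, the `beta` temperature, and the use of scalar log-probabilities are all illustrative assumptions.

```python
import math

def preference_alignment_loss(logp_preferred, logp_rejected,
                              ref_logp_preferred, ref_logp_rejected,
                              beta=0.1):
    """Generic DPO-style preference loss (illustrative, not the paper's TAPO).

    Pushes the model to assign higher likelihood to the motion that better
    matches the text than to a mismatched one, measured relative to a
    frozen reference model so the policy does not drift arbitrarily.
    """
    margin = beta * ((logp_preferred - ref_logp_preferred)
                     - (logp_rejected - ref_logp_rejected))
    # Loss is -log(sigmoid(margin)): small when the preferred motion
    # is already favored by a wide margin, large when it is not.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Iterating this kind of objective over pairs of motions that differ only in a subtle textual modifier is one plausible way to reinforce the fine-grained semantic grounding the abstract describes.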