The FrankenMotion model is a transformer-based diffusion model that can be conditioned on sequence-level, action-level, and part-level prompts. Trained on motion paired with structured, multi-granularity text annotations, it learns the essential motion elements and how to compose them into complex motions. The model outperforms previous baselines that were adapted and retrained for the same setting, and it can compose motions unseen during training.
The Frankenstein dataset is the largest dataset providing hierarchical, temporally-aware annotations for 3D human motion, featuring high-quality, diverse motion annotations generated automatically using the FrankenAgent. The dataset captures sequence-level, action-level, and part-level information, enabling FrankenMotion to learn and generate complex motions with both spatial and temporal control. Ablation studies highlight the importance of hierarchical conditioning: motion quality degrades as conditioning levels are removed.
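
To make the three annotation levels concrete, the following is a minimal Python sketch of how one hierarchical, temporally-aware annotation could be represented and queried per frame. The class names (`Span`, `MotionAnnotation`), the half-open frame-interval convention, the body-part keys, and the example labels are all assumptions made for illustration; they are not the actual schema of the Frankenstein dataset or the FrankenMotion code. The `levels` argument only mimics the idea of dropping conditioning levels, as in the ablation described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class Span:
    """A text label attached to the frame interval [start, end). Hypothetical layout."""
    start: int
    end: int
    text: str

    def covers(self, frame: int) -> bool:
        return self.start <= frame < self.end


def _first_active(spans: List[Span], frame: int) -> Optional[str]:
    """Return the label of the first span covering `frame`, or None."""
    for span in spans:
        if span.covers(frame):
            return span.text
    return None


@dataclass
class MotionAnnotation:
    """One motion clip annotated at three granularities (illustrative only)."""
    sequence: str                                              # sequence-level summary
    actions: List[Span] = field(default_factory=list)          # action-level, time-stamped
    parts: Dict[str, List[Span]] = field(default_factory=dict)  # part-level, per body part

    def prompts_at(
        self,
        frame: int,
        levels: Sequence[str] = ("sequence", "action", "part"),
    ) -> Dict[str, str]:
        """Collect the prompts active at `frame`, restricted to the given levels."""
        out: Dict[str, str] = {}
        if "sequence" in levels:
            out["sequence"] = self.sequence
        if "action" in levels:
            label = _first_active(self.actions, frame)
            if label is not None:
                out["action"] = label
        if "part" in levels:
            for part, spans in self.parts.items():
                label = _first_active(spans, frame)
                if label is not None:
                    out[f"part:{part}"] = label
        return out


# Made-up example clip: 120 frames, walking followed by waving.
ann = MotionAnnotation(
    sequence="a person walks forward and then waves",
    actions=[Span(0, 60, "walk forward"), Span(60, 120, "wave")],
    parts={
        "legs": [Span(0, 60, "step alternately")],
        "right_arm": [Span(60, 120, "raise and wave")],
    },
)

print(ann.prompts_at(30))                        # all three levels at frame 30
print(ann.prompts_at(90, levels=("sequence",)))  # sequence-level conditioning only
```

Keeping the part-level labels in a dictionary keyed by body part lets each part carry its own timeline, which is one simple way to express spatial control (which part) and temporal control (which frames) independently.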


