ReMoMask: Retrieval-Augmented Masked Motion Generation

Zhengdao Li, Siheng Wang, Zeyu Zhang, Hao Tang

2025-08-05

Summary

This paper introduces ReMoMask, a framework that improves how AI generates human-like motion from text descriptions by combining a text-motion retrieval model, a spatio-temporal attention mechanism, and a retrieval-augmented form of classifier-free guidance.

What's the problem?

Text-to-motion generation models often struggle to produce realistic, smooth movements that accurately match what the text describes.

What's the solution?

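ReMoMask addresses this by combining three components: a Bidirectional Momentum Text-Motion Model, which aligns text and motion in both directions (text-to-motion and motion-to-text) so that relevant motions can be retrieved for a given description; Semantic Spatio-temporal Attention, which focuses on important details across space and time; and RAG-Classifier-Free Guidance, a retrieval-augmented variant of classifier-free guidance, a technique that steers generation toward the condition without relying on a separate classifier.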
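To make the guidance idea more concrete, here is a minimal sketch of standard classifier-free guidance. It illustrates the general technique only, not the paper's retrieval-augmented variant, whose details are not given in this summary. The function name, the toy logit values, and the guidance scale are all assumptions chosen for illustration.

```python
def classifier_free_guidance(cond_logits, uncond_logits, scale):
    """Blend conditional and unconditional predictions.

    Standard classifier-free guidance: start from the unconditional
    prediction and move `scale` times along the direction that the
    condition (e.g. the text prompt) adds. No separate classifier is used.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit lists must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# Toy logits over three hypothetical motion tokens (made-up numbers).
cond = [2.0, 0.5, -1.0]    # model output when given the text prompt
uncond = [1.0, 0.5, 0.0]   # model output when the prompt is dropped

# scale > 1 exaggerates what the text condition contributes.
guided = classifier_free_guidance(cond, uncond, scale=3.0)
print(guided)
```

With a scale of 1 the function returns the conditional prediction unchanged, and with a scale of 0 it returns the unconditional one; larger scales push the output further toward what the text condition favors.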

Why does it matter?

This matters because it advances the ability of AI to create lifelike animations and motions from simple text inputs, which can be useful in video games, virtual reality, robotics, and other areas where realistic human motion is important.

Abstract

ReMoMask, a unified framework, addresses limitations in text-to-motion generation by integrating a Bidirectional Momentum Text-Motion Model, Semantic Spatio-temporal Attention, and RAG-Classifier-Free Guidance, achieving state-of-the-art performance on the HumanML3D and KIT-ML benchmarks.
