MotionCLR: Motion Generation and Training-free Editing via Understanding Attention Mechanisms
Ling-Hao Chen, Wenxun Dai, Xuan Ju, Shunlin Lu, Lei Zhang
2024-10-25

Summary
This paper introduces MotionCLR, an attention-based motion diffusion model that generates human motion from text and supports training-free, fine-grained editing by directly manipulating its attention maps.
What's the problem?
Generating and editing human motion is difficult because prior motion diffusion models do not explicitly model which words in a text description correspond to which movements, and their inner workings are hard to interpret. Without this word-level correspondence, motions cannot be edited precisely, so creating or modifying animations often yields frustrating results.
What's the solution?
MotionCLR uses an attention-based design to capture how the parts of a motion relate to each other and to the words describing them. It employs two types of attention: self-attention, which measures how motion frames relate to one another across time, and cross-attention, which links each word in the text to the timesteps of the movement it describes. Because these attention maps are interpretable, users can edit motions simply by manipulating them, enabling operations such as emphasizing, de-emphasizing, or replacing specific movements without any additional training (see the sketch below).
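To make the two attention roles concrete, here is a minimal NumPy sketch of scaled dot-product attention applied in both configurations. It is conceptual only: it omits the learned projections, multi-head structure, and diffusion machinery, and names like `motion_feats` and `text_feats` are illustrative, not taken from the paper's code.

```python
import numpy as np

def attention(query, key, value):
    """Scaled dot-product attention; returns the output and the attention map."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)      # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1
    return weights @ value, weights

rng = np.random.default_rng(0)
motion_feats = rng.normal(size=(60, 32))   # 60 frames, 32-dim motion features
text_feats = rng.normal(size=(8, 32))      # 8 word tokens, same feature dim

# Self-attention: every frame attends to every other frame, so the
# (60, 60) map measures frame-to-frame similarity over time.
_, self_map = attention(motion_feats, motion_feats, motion_feats)

# Cross-attention: every frame attends to the word tokens, so the
# (60, 8) map shows which words activate which timesteps.
_, cross_map = attention(motion_feats, text_feats, text_feats)
```

The key observation MotionCLR builds on is that these maps are not just internal bookkeeping: the cross-attention map is a readable word-to-timestep alignment, which is what makes training-free editing possible.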
Why it matters?
This research is significant because it simplifies the process of creating and editing animations, making it more accessible for artists and animators. By allowing for intuitive control over motion generation, MotionCLR could greatly enhance the efficiency and creativity in fields like animation, gaming, and virtual reality.
Abstract
This research delves into the problem of interactive editing of human motion generation. Previous motion diffusion models lack explicit modeling of word-level text-motion correspondence and offer limited explainability, which restricts their fine-grained editing ability. To address this issue, we propose MotionCLR, an attention-based motion diffusion model with CLeaR modeling of attention mechanisms. Technically, MotionCLR models in-modality and cross-modality interactions with self-attention and cross-attention, respectively. More specifically, the self-attention mechanism measures the sequential similarity between frames and influences the order of motion features. By contrast, the cross-attention mechanism finds the fine-grained word-sequence correspondence and activates the corresponding timesteps in the motion sequence. Based on these key properties, we develop a versatile set of simple yet effective motion editing methods that manipulate attention maps, such as motion (de-)emphasizing, in-place motion replacement, and example-based motion generation. To further verify the explainability of the attention mechanism, we also explore the potential of action counting and grounded motion generation via attention maps. Our experimental results show that our method achieves strong generation and editing ability with good explainability.
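As a concrete illustration of editing via attention-map manipulation, motion (de-)emphasizing can be approximated by rescaling the cross-attention column of a target word and renormalizing each row. This is a hedged sketch under that assumption, not the paper's implementation; the `emphasize` helper and its parameters are hypothetical.

```python
import numpy as np

def emphasize(cross_map, word_idx, weight):
    """Scale one word's attention column, then renormalize each row.

    weight > 1 emphasizes the word's motion; weight < 1 de-emphasizes it.
    """
    edited = cross_map.copy()
    edited[:, word_idx] *= weight
    return edited / edited.sum(axis=-1, keepdims=True)

# Toy (frames x words) attention map: 4 frames attending to 3 words.
cross_map = np.array([[0.2, 0.5, 0.3],
                      [0.1, 0.7, 0.2],
                      [0.4, 0.4, 0.2],
                      [0.3, 0.3, 0.4]])
edited = emphasize(cross_map, word_idx=1, weight=2.0)  # strengthen word 1
```

Because the edit touches only the attention weights used at inference time, no retraining or fine-tuning is involved, which is what "training-free editing" refers to.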