
Motion Control for Enhanced Complex Action Video Generation

Qiang Zhou, Shaofeng Zhang, Nianzu Yang, Ye Qian, Hao Li

2024-11-14


Summary

This paper presents MVideo, a new framework for generating long videos with precise, complex actions by giving the model mask sequences as a motion condition in addition to the usual text prompt.

What's the problem?

Existing text-to-video models often struggle to generate videos with pronounced or complex actions. A text prompt alone cannot precisely convey intricate motion details, so the resulting videos tend to miss or blur the intended movement, especially over longer durations.

What's the solution?

The authors developed MVideo, which supplements the text prompt with mask sequences that serve as an additional motion condition, giving the model a clearer, more accurate description of the intended action. These mask sequences are generated automatically with foundational vision models such as GroundingDINO and SAM2, which keeps the pipeline efficient and robust. After training, MVideo aligns the text prompt with the motion condition, so either one can be changed on its own or both together, and motion conditions can be edited and composed to produce videos with more complex actions. A sketch of how such a mask sequence might be extracted follows below.
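To make the idea of a mask-sequence motion condition concrete, here is a minimal sketch of how one might be extracted: GroundingDINO localizes the target object in the first frame, and SAM2's video predictor tracks its mask through the rest of the frames. The function names follow the public GroundingDINO and SAM2 repositories, but the file paths, thresholds, and text query are placeholder assumptions, and the overall pipeline is our reading of the summary above, not the authors' released code.

```python
# Sketch: building a mask sequence for a moving object, in the spirit of
# MVideo's motion-condition input. Paths, thresholds, and the query are
# placeholders; adapt them to your own checkpoints and frames.

import numpy as np
from groundingdino.util.inference import load_model, load_image, predict
from sam2.build_sam import build_sam2_video_predictor

FRAMES_DIR = "reference_video_frames/"   # directory of extracted video frames (placeholder)
QUERY = "a person doing a cartwheel"     # object/action to localize (placeholder)

# 1) Detect the target object in the first frame with GroundingDINO.
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image(FRAMES_DIR + "00000.jpg")
boxes, logits, phrases = predict(
    model=dino, image=image, caption=QUERY,
    box_threshold=0.35, text_threshold=0.25,
)
h, w, _ = image_source.shape
# GroundingDINO returns normalized (cx, cy, w, h); convert the top box to pixel xyxy.
cx, cy, bw, bh = boxes[0].tolist()
box = np.array([(cx - bw / 2) * w, (cy - bh / 2) * h,
                (cx + bw / 2) * w, (cy + bh / 2) * h])

# 2) Track the object through the video with SAM2 to get one mask per frame.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")
state = predictor.init_state(video_path=FRAMES_DIR)
predictor.add_new_points_or_box(state, frame_idx=0, obj_id=1, box=box)

mask_sequence = []  # the motion condition: one binary mask per frame
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    mask_sequence.append((mask_logits[0] > 0).squeeze().cpu().numpy())
```

The resulting `mask_sequence` is what a mask-conditioned video model would consume alongside the text prompt; in MVideo this extraction step is automated so users do not have to annotate motion by hand.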

Why it matters?

This research is important because it gives text-to-video models a practical way to depict precise, complex actions, something current video diffusion models struggle with. The dual text-and-motion control makes video generation more flexible and controllable, and the authors position MVideo as a strong benchmark for improved action depiction in future text-to-video work.

Abstract

Existing text-to-video (T2V) models often struggle with generating videos with sufficiently pronounced or complex actions. A key limitation lies in the text prompt's inability to precisely convey intricate motion details. To address this, we propose a novel framework, MVideo, designed to produce long-duration videos with precise, fluid actions. MVideo overcomes the limitations of text prompts by incorporating mask sequences as an additional motion condition input, providing a clearer, more accurate representation of intended actions. Leveraging foundational vision models such as GroundingDINO and SAM2, MVideo automatically generates mask sequences, enhancing both efficiency and robustness. Our results demonstrate that, after training, MVideo effectively aligns text prompts with motion conditions to produce videos that simultaneously meet both criteria. This dual control mechanism allows for more dynamic video generation by enabling alterations to either the text prompt or motion condition independently, or both in tandem. Furthermore, MVideo supports motion condition editing and composition, facilitating the generation of videos with more complex actions. MVideo thus advances T2V motion generation, setting a strong benchmark for improved action depiction in current video diffusion models. Our project page is available at https://mvideo-v1.github.io/.
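The abstract mentions motion condition editing and composition but does not spell out the operator, so the following is a purely hypothetical illustration: it assumes a motion condition is a binary mask per frame and that composition is a frame-wise union of two aligned mask sequences.

```python
import numpy as np

def compose_motion_conditions(masks_a: np.ndarray, masks_b: np.ndarray) -> np.ndarray:
    """Combine two mask sequences frame by frame (logical union).

    Each argument has shape (T, H, W) with binary masks; the result is a single
    motion condition covering both actions. This is only an illustration of the
    'motion condition composition' idea, not the paper's actual operator.
    """
    assert masks_a.shape == masks_b.shape, "mask sequences must match in length and size"
    return np.logical_or(masks_a, masks_b)

# Example: merge a 'person cartwheeling' sequence with a 'ball bouncing' sequence.
person = np.zeros((48, 256, 256), dtype=bool)
ball = np.zeros((48, 256, 256), dtype=bool)
combined = compose_motion_conditions(person, ball)  # shape (48, 256, 256)
```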