
Move-in-2D: 2D-Conditioned Human Motion Generation

Hsin-Ping Huang, Yang Zhou, Jui-Hsien Wang, Difan Liu, Feng Liu, Ming-Hsuan Yang, Zhan Xu

2024-12-20


Summary

This paper introduces Move-in-2D, a new method for generating realistic human motion sequences from a 2D scene image and a text description. It aims to create diverse, adaptable human movements that fit different scenes.

What's the problem?

Generating human movement in videos has been difficult because existing methods usually rely on motion sequences pre-recorded from other videos. This limits the types of movements that can be created and often forces the background scene to match those specific motions, which is restrictive and inflexible.

What's the solution?

Move-in-2D addresses this with a diffusion model that takes both a scene image and a text prompt as inputs, so the system can generate unique motion sequences tailored to the provided scene. To train the model, the researchers collected a large dataset of videos featuring single-human activities and annotated each video with the corresponding human motion, letting the model learn to produce movements that align well with the scene image. The result is more creative and adaptable human motion generation; a rough sketch of this conditioning setup appears below.
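To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of what "a diffusion model conditioned on a scene image and a text prompt" can look like for motion generation. Everything here is an illustrative assumption rather than the authors' code: the pose dimension, the 512-dimensional CLIP-style scene and text embeddings, the small transformer denoiser, and the toy noise schedule.

```python
# Hypothetical sketch of scene- and text-conditioned motion diffusion.
# Module names, dimensions, and the noise schedule are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionedMotionDenoiser(nn.Module):
    """Predicts the noise added to a human motion sequence, conditioned on a
    scene-image embedding and a text-prompt embedding (assumed to come from
    pretrained encoders such as CLIP)."""
    def __init__(self, motion_dim=66, cond_dim=512, hidden=256, layers=4):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, hidden)
        self.cond_proj = nn.Linear(cond_dim * 2, hidden)  # scene + text, concatenated
        self.time_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=layers)
        self.out = nn.Linear(hidden, motion_dim)

    def forward(self, noisy_motion, t, scene_emb, text_emb):
        # noisy_motion: (B, T, motion_dim); t: (B,); embeddings: (B, cond_dim)
        cond = self.cond_proj(torch.cat([scene_emb, text_emb], dim=-1))  # (B, hidden)
        time = self.time_embed(t.float().unsqueeze(-1))                  # (B, hidden)
        x = self.motion_proj(noisy_motion) + (cond + time).unsqueeze(1)  # broadcast over frames
        return self.out(self.backbone(x))                                # predicted noise

# One DDPM-style training step on random stand-in data (illustrative only):
model = ConditionedMotionDenoiser()
motion = torch.randn(2, 120, 66)  # batch of 120-frame pose sequences
scene_emb, text_emb = torch.randn(2, 512), torch.randn(2, 512)
t = torch.randint(0, 1000, (2,))
noise = torch.randn_like(motion)
alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2  # toy cosine schedule
noisy = (alpha_bar.sqrt().view(-1, 1, 1) * motion
         + (1 - alpha_bar).sqrt().view(-1, 1, 1) * noise)
loss = nn.functional.mse_loss(model(noisy, t, scene_emb, text_emb), noise)
loss.backward()
```

At inference time, such a model would start from pure noise and iteratively denoise while keeping the scene and text embeddings fixed, which is how the conditioning steers the generated motion toward the given scene.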

Why it matters?

This research is important because it expands the possibilities for creating realistic human animations in various applications, such as video games, movies, and virtual reality. By enabling AI to generate human movement based on simple images and descriptions, it makes the process of creating dynamic content easier and more accessible.

Abstract

Generating realistic human videos remains a challenging task, with the most effective methods currently relying on a human motion sequence as a control signal. Existing approaches often reuse motion extracted from other videos, which restricts applications to specific motion types and global scene matching. We propose Move-in-2D, a novel approach to generate human motion sequences conditioned on a scene image, allowing for diverse motion that adapts to different scenes. Our approach utilizes a diffusion model that accepts both a scene image and text prompt as inputs, producing a motion sequence tailored to the scene. To train this model, we collect a large-scale video dataset featuring single-human activities, annotating each video with the corresponding human motion as the target output. Experiments demonstrate that our method effectively predicts human motion that aligns with the scene image after projection. Furthermore, we show that the generated motion sequence improves human motion quality in video synthesis tasks.