HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, Zhiyong Wu

2025-09-12

Summary

This paper introduces HuMo, a system that generates realistic videos of people from three kinds of input: text descriptions, reference images, and audio. By combining these inputs, it aims to make video generation both more controllable and more realistic.

What's the problem?

Creating videos from multiple inputs like text, images, and audio is hard for two main reasons. First, there is very little training data in which all three modalities are paired together. Second, the model must satisfy several goals at once: the person in the generated video has to match the reference image, and the facial movements have to stay in sync with the audio. Existing methods tend to fail on one of these fronts, either lacking enough paired data to learn from or failing to coordinate the different signals well enough for the result to look natural.

What's the solution?

The researchers tackled these problems in two main ways. First, they built a new, high-quality dataset in which text, reference images, and audio are all paired together. Second, they trained the model in stages. In the first stage, they focused on keeping the person in the video consistent with the reference image, using an image-injection technique that makes minimal changes to the underlying video generation model, so its text-following and visual abilities are preserved. In the second stage, they added audio-visual synchronization: beyond directly connecting the audio to the video, they also trained the model to *predict* which facial regions the audio should influence. Finally, at inference time, they designed a scheme that smoothly adjusts how strongly each input (text, image, audio) steers the video at each step of generation, giving finer control over the final result.
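The "smoothly blend the influence of each input" idea is a time-adaptive form of classifier-free guidance. The sketch below is a minimal plain-Python illustration, not HuMo's implementation: the linear weight schedules, the modality names, and the toy vectors are all assumptions (the paper only states that guidance weights vary across denoising steps), and a real system would operate on large noise-prediction tensors.

```python
def guidance_weight(base, step, total_steps, ramp="constant"):
    """Illustrative schedule for one modality's guidance weight.

    `ramp` and the linear form are assumptions made for this sketch:
    "late" grows the weight as fine details emerge (e.g., audio for lips),
    "early" decays it after the coarse layout is set (e.g., text).
    """
    progress = step / max(total_steps - 1, 1)  # 0.0 -> 1.0 over denoising
    if ramp == "late":
        return base * progress
    if ramp == "early":
        return base * (1.0 - progress)
    return base  # constant weight (e.g., for the reference image)

def cfg_combine(eps_uncond, eps_conds, weights):
    """Classifier-free guidance over multiple conditions.

    Start from the unconditional prediction and add each modality's
    weighted (conditional - unconditional) direction.
    """
    out = list(eps_uncond)
    for name, eps_c in eps_conds.items():
        w = weights[name]
        out = [o + w * (c - u) for o, c, u in zip(out, eps_c, eps_uncond)]
    return out

# Toy usage: two-dimensional "predictions" from two conditions.
combined = cfg_combine(
    eps_uncond=[0.0, 0.0],
    eps_conds={"text": [1.0, 0.0], "audio": [0.0, 1.0]},
    weights={"text": 2.0, "audio": 3.0},
)  # -> [2.0, 3.0]
```

The design point is that each modality gets its own weight per denoising step, so, for example, audio guidance can be emphasized late in generation when facial detail is being refined.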

Why it matters?

This work is important because it moves the field of video generation closer to being able to create videos that are truly tailored to specific instructions. By combining text, images, and audio in a more effective way, HuMo can generate more realistic and controllable videos than previous methods. This has potential applications in areas like creating personalized content, special effects, and even helping people communicate more effectively.

Abstract

Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to effectively coordinate these heterogeneous modalities due to two challenges: the scarcity of training data with paired triplet conditions and the difficulty of coordinating the sub-tasks of subject preservation and audio-visual sync with multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse and paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, to maintain the prompt-following and visual generation abilities of the foundation model, we adopt a minimally invasive image injection strategy. For the audio-visual sync task, besides the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. For joint learning of controllability across multimodal inputs, building on previously acquired capabilities, we progressively incorporate the audio-visual sync task. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods in sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: https://phantom-video.github.io/HuMo.
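As a rough illustration of the audio cross-attention layer mentioned in the abstract, the sketch below has each video token attend over all audio tokens. It is a minimal plain-Python assumption-laden sketch, not HuMo's layer: real implementations use learned Q/K/V projections over high-dimensional features, whereas here the projections are identity maps and the vectors are tiny.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def audio_cross_attention(video_tokens, audio_tokens):
    """Each video token (a feature vector) attends over all audio tokens.

    Identity Q/K/V projections keep the sketch minimal; a real layer
    learns these projections during training.
    """
    d = len(audio_tokens[0])
    out = []
    for q in video_tokens:
        # Scaled dot-product scores between this video token and every audio token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in audio_tokens]
        attn = softmax(scores)
        # Weighted sum of audio tokens becomes the attended output.
        out.append([sum(a * v[j] for a, v in zip(attn, audio_tokens))
                    for j in range(d)])
    return out

# Toy usage: a video token closely aligned with the first audio token
# pulls its output toward that token.
result = audio_cross_attention([[10.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

The focus-by-predicting strategy then adds an auxiliary objective so the model learns which (facial) regions this attention should concentrate on, rather than relying on the cross-attention alone.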