
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang

2025-02-04


Summary

OmniHuman-1 is an advanced AI model developed by ByteDance that can create realistic human videos from just a single image and motion signals, like audio or video. It uses a Diffusion Transformer trained on mixed motion conditions to animate lifelike gestures, expressions, and movements, making it a powerful tool for digital media and entertainment.

What's the problem?

Existing methods for generating human animations struggle to scale up and often produce unrealistic results. These systems usually rely on limited input types, like audio or pose data, and discard valuable training data that doesn’t fit strict criteria. This limits their ability to create flexible, high-quality videos for real-world applications.

What's the solution?

OmniHuman-1 introduces a new framework based on Diffusion Transformers and multimodal motion conditioning. It combines various input signals—like audio, video, and poses—during training to create more realistic animations. By using an omni-conditions training strategy, it maximizes the use of all available data, even weaker signals like audio-only inputs. This approach enables the model to generate lifelike full-body animations with synchronized speech, gestures, and body movements from minimal input.
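As a rough illustration of the omni-conditions idea, the sketch below mixes a reference-image embedding with optional audio and pose embeddings in a toy Transformer denoiser, and keeps the stronger pose signal on fewer training samples than the weaker audio signal so that weakly labeled clips are not discarded. This is a minimal, hypothetical sketch: the module names, feature sizes, and keep ratios are assumptions for illustration, not the paper's actual architecture or settings.

```python
# Minimal sketch (not the authors' code) of mixed-condition training:
# stronger conditions (pose) are kept less often than weaker ones (audio),
# so audio-only clips still contribute. All shapes and ratios are assumptions.
import torch
import torch.nn as nn

class OmniConditionedDenoiser(nn.Module):
    """Toy DiT-style denoiser that sums embeddings of whichever conditions are present."""
    def __init__(self, dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(128, dim)   # assumed audio-feature size
        self.pose_proj = nn.Linear(34, dim)     # assumed 17 keypoints x (x, y)
        self.ref_proj = nn.Linear(512, dim)     # assumed reference-image feature size
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_latents, ref, audio=None, pose=None):
        cond = self.ref_proj(ref)                  # reference image is always kept
        if audio is not None:
            cond = cond + self.audio_proj(audio)   # weaker, audio-driven signal
        if pose is not None:
            cond = cond + self.pose_proj(pose)     # stronger, pose-driven signal
        h = self.backbone(noisy_latents + cond.unsqueeze(1))
        return self.out(h)

def sample_condition_mask(batch_size, audio_keep=0.9, pose_keep=0.5):
    """Decide per sample which conditions to keep; ratios here are illustrative."""
    keep_audio = torch.rand(batch_size) < audio_keep
    keep_pose = torch.rand(batch_size) < pose_keep
    return keep_audio, keep_pose

if __name__ == "__main__":
    model = OmniConditionedDenoiser()
    B, T, D = 4, 16, 256
    noisy = torch.randn(B, T, D)        # noisy video latents
    ref = torch.randn(B, 512)           # reference-image features
    audio, pose = torch.randn(B, 128), torch.randn(B, 34)
    keep_audio, keep_pose = sample_condition_mask(B)
    # For simplicity this toy example applies one keep/drop decision to the whole batch.
    out = model(noisy, ref,
                audio=audio if keep_audio.any() else None,
                pose=pose if keep_pose.any() else None)
    print(out.shape)  # torch.Size([4, 16, 256])
```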

Why it matters?

OmniHuman-1 represents a major breakthrough in AI-driven human animation. It not only produces highly realistic videos but also offers flexibility in inputs and supports diverse applications like virtual influencers, gaming, education, and storytelling. This technology sets a new standard for creating lifelike animations and opens up possibilities for more immersive digital experiences while addressing limitations of earlier models.

Abstract

End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in recent years. However, existing methods still struggle to scale up the way large general video generation models do, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals). Video samples are provided on the project page (https://omnihuman-lab.github.io).