OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation
Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, Mingyuan Gao
2025-08-27
Summary
This paper introduces OmniHuman-1.5, a new system for creating more realistic and expressive animated characters. Rather than only making characters *look* like they are moving correctly, it focuses on making their movements *feel* right for the situation and for what the character is 'thinking'.
What's the problem?
Current video avatar models are good at making characters move in a physically believable way, but they often lack emotional depth and an understanding of context. They tend to react to low-level cues like the beat of the music without really 'understanding' what is happening in the scene or what the character is supposed to be feeling. The result is animation that feels robotic or unnatural, even when it looks technically correct.
What's the solution?
The researchers tackled this by using multimodal large language models to write a detailed 'script' for the character's movements, describing not just *what* the character should do but *why*. This script gives the animation generator high-level guidance. They also designed a dedicated architecture, a Multimodal DiT with a 'Pseudo Last Frame' design, to help the system combine information from different sources, such as audio, images, and text, while avoiding conflicts between them. Together, these let the model understand the overall scene and produce movements that match the character, the situation, and any spoken words.
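The paper's actual prompting and script format are not spelled out in this summary, but the core idea, asking a multimodal LLM to produce a structured description of *what* the character should do and *why*, and then handing that to the motion generator as guidance, can be sketched roughly as follows. Everything here (the `MotionGuidance` fields, the prompt wording, the JSON keys, the canned reply) is an illustrative assumption rather than the authors' interface.

```python
import json
from dataclasses import dataclass


@dataclass
class MotionGuidance:
    """Structured, high-level description of what the character should do and why.
    Field names are hypothetical, not taken from the paper."""
    emotion: str          # e.g. "wistful", "excited"
    intent: str           # what the character is trying to communicate
    actions: list[str]    # ordered list of coarse actions/gestures
    rationale: str        # why these actions fit the scene and speech


def build_guidance_prompt(transcript: str, scene_description: str) -> str:
    """Ask a multimodal LLM to act as a 'director' and reply with JSON we can parse."""
    return (
        "You are directing a character in a video. Given the speech transcript and "
        "scene description, reply with JSON containing the keys "
        '"emotion", "intent", "actions", "rationale".\n'
        f"Transcript: {transcript}\n"
        f"Scene: {scene_description}\n"
    )


def parse_guidance(mllm_reply: str) -> MotionGuidance:
    """Turn the MLLM's JSON reply into a typed object the motion generator can consume."""
    data = json.loads(mllm_reply)
    return MotionGuidance(
        emotion=data["emotion"],
        intent=data["intent"],
        actions=list(data["actions"]),
        rationale=data["rationale"],
    )


if __name__ == "__main__":
    prompt = build_guidance_prompt(
        transcript="I can't believe we finally made it.",
        scene_description="Two hikers reach a foggy summit at dawn.",
    )
    # A canned reply stands in for a real MLLM call here.
    fake_reply = json.dumps({
        "emotion": "relieved awe",
        "intent": "share the moment with a companion",
        "actions": ["exhale slowly", "turn toward companion", "gesture at the horizon"],
        "rationale": "The line expresses disbelief and relief, so the motion should be "
                     "slow and open rather than beat-driven.",
    })
    guidance = parse_guidance(fake_reply)
    print(guidance.actions)
```

In the real system the MLLM would presumably also see the reference image and the audio itself; the canned reply above only illustrates the kind of structured, semantically grounded output the motion generator could be conditioned on instead of raw rhythm cues.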
Why it matters?
This work is important because it brings us closer to creating truly believable and engaging virtual characters. Better avatars have huge potential in areas like video games, virtual reality, filmmaking, and even personal communication, making interactions with digital characters feel more natural and immersive. The ability to handle complex scenarios with multiple characters or even non-human subjects also expands the possibilities for these technologies.
Abstract
Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character's authentic essence. Their motions typically synchronize with low-level cues like audio rhythm, lacking a deeper semantic understanding of emotion, intent, or context. To bridge this gap, we propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive. Our model, OmniHuman-1.5, is built upon two key technical contributions. First, we leverage Multimodal Large Language Models to synthesize a structured textual representation of conditions that provides high-level semantic guidance. This guidance steers our motion generator beyond simplistic rhythmic synchronization, enabling the production of actions that are contextually and emotionally resonant. Second, to ensure the effective fusion of these multimodal inputs and mitigate inter-modality conflicts, we introduce a specialized Multimodal DiT architecture with a novel Pseudo Last Frame design. The synergy of these components allows our model to accurately interpret the joint semantics of audio, images, and text, thereby generating motions that are deeply coherent with the character, scene, and linguistic content. Extensive experiments demonstrate that our model achieves leading performance across a comprehensive set of metrics, including lip-sync accuracy, video quality, motion naturalness and semantic consistency with textual prompts. Furthermore, our approach shows remarkable extensibility to complex scenarios, such as those involving multi-person and non-human subjects. Homepage: https://omnihuman-lab.github.io/v1_5/
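The Multimodal DiT and the Pseudo Last Frame design are not detailed in this summary, so the sketch below is only a rough mental model under my own assumptions: a transformer block in which noisy video latent tokens attend to concatenated audio, text, and reference-image condition tokens, with a learnable placeholder token standing in for a 'last frame' that is not actually provided. The class name, tensor shapes, and this placeholder-token reading of 'Pseudo Last Frame' are all hypothetical.

```python
import torch
import torch.nn as nn


class MultimodalConditioningBlock(nn.Module):
    """One DiT-style block fusing audio/text/image conditions with the video stream.
    A speculative sketch of the general shape of such a design, not the paper's model."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Learnable placeholder playing the role of a "last frame" condition when no
        # real target frame exists (an assumed interpretation of "Pseudo Last Frame").
        self.pseudo_last_frame = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, video_tokens, audio_tokens, text_tokens, ref_image_tokens):
        b = video_tokens.size(0)
        # Concatenate all condition tokens into one context so the modalities are
        # weighed jointly by cross-attention rather than injected separately.
        cond = torch.cat(
            [audio_tokens, text_tokens, ref_image_tokens,
             self.pseudo_last_frame.expand(b, -1, -1)],
            dim=1,
        )
        x = video_tokens
        x = x + self.self_attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        x = x + self.cross_attn(self.norm2(x), cond, cond)[0]
        x = x + self.mlp(self.norm3(x))
        return x


if __name__ == "__main__":
    block = MultimodalConditioningBlock()
    v = torch.randn(2, 64, 512)   # noisy video latent tokens
    a = torch.randn(2, 32, 512)   # audio feature tokens
    t = torch.randn(2, 16, 512)   # text/guidance tokens (e.g. from the MLLM script)
    r = torch.randn(2, 8, 512)    # reference image tokens
    print(block(v, a, t, r).shape)  # torch.Size([2, 64, 512])
```

Pooling all conditions into a single cross-attention context is one simple way to let the model resolve inter-modality conflicts in a shared attention step; the paper's actual fusion and conflict-mitigation mechanism may differ.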