Towards Interactive Intelligence for Digital Humans
Yiyi Cai, Xuangeng Chu, Xiwei Gao, Sitong Gong, Yifei Huang, Caixin Kang, Kunhang Li, Haiyang Liu, Ruicong Liu, Yun Liu, Dianwen Ng, Zixiong Su, Erwin Wu, Yuhan Wu, Dingkun Yan, Tianyu Yan, Chang Zeng, Bo Zheng, You Zhou
2025-12-16
Summary
This paper introduces Interactive Intelligence, a new approach to creating digital humans that aims to make them more realistic and capable of truly interacting with people, rather than just mimicking human behavior.
What's the problem?
Existing digital humans often feel artificial because they lack consistent personality, can't adapt well to conversations, and don't really 'learn' or evolve over time. They're good at *looking* human, but not at *being* human in an interactive sense.
What's the solution?
The researchers developed Mio (short for Multimodal Interactive Omni-Avatar), a complete framework for building these interactive digital humans. Mio has five parts working together: a 'Thinker' for reasoning, a 'Talker' for generating speech, a 'Face Animator' and 'Body Animator' for realistic movement, and a 'Renderer' to put it all together visually. Together, these let the digital human respond in a way that feels natural and consistent with its personality. The researchers also created a new benchmark to test how well such digital humans perform.
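The five-module flow described above can be pictured as a pipeline: the Thinker decides what to say, the Talker voices it, the two Animators derive motion from the speech, and the Renderer assembles the final output. The following sketch is purely illustrative; all class names, method names, and data types are assumptions for this summary, not the paper's actual API:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    """Stand-in for one rendered output frame of the avatar."""
    face_params: list
    body_params: list

class Thinker:
    """Cognitive reasoning: decides how to respond (illustrative stub)."""
    def respond(self, user_utterance: str) -> str:
        return f"Response to: {user_utterance}"

class Talker:
    """Speech generation: turns response text into audio (stub bytes here)."""
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")  # stand-in for a real waveform

class FaceAnimator:
    """Derives facial motion parameters from the speech signal."""
    def animate(self, audio: bytes) -> list:
        return [len(audio)]  # stand-in for facial motion parameters

class BodyAnimator:
    """Derives body motion parameters from the speech signal."""
    def animate(self, audio: bytes) -> list:
        return [len(audio) % 7]  # stand-in for body motion parameters

class Renderer:
    """Combines motion streams into the final visual output."""
    def render(self, face: list, body: list) -> Frame:
        return Frame(face_params=face, body_params=body)

def interact(user_utterance: str) -> Frame:
    """One conversational turn through the five-module pipeline."""
    thinker, talker = Thinker(), Talker()
    face_anim, body_anim, renderer = FaceAnimator(), BodyAnimator(), Renderer()
    text = thinker.respond(user_utterance)
    audio = talker.synthesize(text)
    face = face_anim.animate(audio)
    body = body_anim.animate(audio)
    return renderer.render(face, body)
```

The point of the sketch is the separation of concerns: reasoning, speech, facial motion, body motion, and rendering are independent modules wired into one end-to-end loop, which is the structure the paper attributes to Mio.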
Why it matters?
This work is important because it moves digital humans beyond just being visual effects. It's a step towards creating virtual characters that can genuinely engage with us, which has huge potential for things like education, entertainment, and even providing companionship.
Abstract
We introduce Interactive Intelligence, a novel paradigm for digital humans that are capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.