
Active Intelligence in Video Avatars via Closed-loop World Modeling

Xuanhua He, Tianyu Yang, Ke Cao, Ruiqi Wu, Cheng Meng, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Qifeng Chen

2025-12-24

Summary

This paper introduces a new way to create video avatars that aren't just puppets mimicking movements, but can actually think and act on their own to achieve goals in a virtual world.

What's the problem?

Existing video avatars are really good at *looking* like someone and moving like them, but they're essentially just reacting to instructions. They can't plan ahead, adapt to unexpected changes in their environment, or independently decide what to do to accomplish a task. They lack true intelligence and the ability to pursue long-term objectives.

What's the solution?

The researchers developed a system called ORCA, which stands for Online Reasoning and Cognitive Architecture. ORCA gives avatars an 'internal world model': a way to understand their surroundings, predict what will happen if they take an action, and then check whether those predictions actually came true. It works in a closed loop of observing, thinking, acting, and reflecting. It also uses a two-part 'brain': one part (System 2) for high-level planning and another (System 1) for translating those plans into the specific actions the avatar can perform. This lets the avatar continuously update its understanding of the world and adjust its behavior to succeed at tasks.
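The observe-think-act-reflect loop can be sketched in a few lines. This is an illustrative toy, not the authors' code: the function name `otar_step`, the belief dictionary, and the stand-in generator callable are all hypothetical, and "progress" is a placeholder for whatever state the real system tracks.

```python
# Hypothetical sketch of one pass through the OTAR cycle.
def otar_step(belief, goal, generate):
    # Think (System 2): pick an abstract plan and predict its outcome.
    plan = f"advance toward {goal}"
    predicted_progress = belief["progress"] + 1

    # Act (System 1): translate the plan into a concrete action caption
    # and hand it to the video generator (a stand-in callable here).
    actual_progress = generate(f"caption: {plan}")

    # Reflect: verify the prediction against what was actually generated,
    # then update the belief either way -- that is the closed loop.
    belief["progress"] = actual_progress
    belief["on_track"] = actual_progress >= predicted_progress
    return belief

# Toy run: the generator mostly advances but stalls once
# (generative uncertainty), which the Reflect step detects.
belief = {"progress": 0}
for outcome in [1, 2, 2, 3]:  # the third step stalls
    belief = otar_step(belief, "the door", lambda _caption: outcome)
print(belief)  # {'progress': 3, 'on_track': True}
```

The point of the Reflect step is that a mismatch between prediction and outcome is caught immediately, so the avatar can replan rather than continuing open-loop on a stale plan.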

Why it matters?

This work is important because it's a step towards creating truly intelligent virtual characters. Instead of just being animated figures, these avatars could potentially assist us in virtual environments, act as realistic training partners, or become more engaging characters in games and simulations. It moves avatars from being passively animated figures to active, intelligent agents.

Abstract

Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.
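The "continuous belief updating" the abstract describes when it frames avatar control as a POMDP can be sketched as a generic discrete Bayes filter. This is a hedged illustration, not the paper's model: the states, transition table `T`, and observation table `O` below are toy assumptions.

```python
# Generic discrete POMDP belief update (predict, then verify against
# the observed outcome). All tables here are illustrative toy values.
def belief_update(belief, action, observation, T, O):
    """belief: {state: prob}; T[(s, a)] = {s2: P(s2|s,a)}; O[(s, o)] = P(o|s)."""
    # Predict: push the belief through the transition model for this action.
    predicted = {}
    for s, p in belief.items():
        for s2, p_t in T[(s, action)].items():
            predicted[s2] = predicted.get(s2, 0.0) + p * p_t
    # Verify: weight each state by how well it explains the observation.
    posterior = {s: p * O[(s, observation)] for s, p in predicted.items()}
    z = sum(posterior.values())
    return {s: p / z for s, p in posterior.items()}

# Toy example: the avatar steps toward a door and then sees it.
T = {("far", "step"): {"near": 0.8, "far": 0.2},
     ("near", "step"): {"near": 1.0}}
O = {("near", "door_visible"): 0.9, ("far", "door_visible"): 0.1}

b = belief_update({"far": 1.0}, "step", "door_visible", T, O)
# b["near"] is about 0.97: the observation sharply confirms the prediction.
```

The second (verification) step is where outcome checking lives: if the generated video contradicts the predicted state, the observation likelihood pulls the belief back toward states that actually explain what was generated.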