Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics
Junjie Chen, Fei Wang, Zhihao Huang, Qing Zhou, Kun Li, Dan Guo, Linfeng Zhang, Xun Yang
2025-12-18
Summary
This paper introduces a new method, called TIMAR, for generating realistic 3D head motion during conversations. It focuses on making avatars and robots appear more natural when 'talking' and 'listening' to people.
What's the problem?
Computer systems currently struggle to simulate the back-and-forth nature of human conversation in 3D. Existing methods often treat speaking and listening as separate processes, or they model the entire conversation at once (non-causally), which makes it hard to maintain a smooth, realistic flow across turns. The result is avatars and robots that don't seem to respond in a timely or natural way.
What's the solution?
TIMAR treats the conversation turn by turn (one person speaking, then the other), fusing the audio and visual cues within each turn. It then applies 'causal attention' over turns, so the model accumulates a conversational history without ever looking at future turns. Finally, a 'diffusion head' predicts the continuous head motion, capturing both coordinated movements and subtle expressive variation.
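The turn-level causality described above can be sketched as a block-causal attention mask: every token in a turn may attend to its own turn and to all earlier turns, but never to future ones. This is an illustrative sketch only, not the paper's released code; `turn_causal_mask` and its signature are hypothetical.

```python
import numpy as np

def turn_causal_mask(turn_ids):
    """Build a boolean attention mask from per-token turn indices.

    turn_ids: sequence of length T giving the turn index of each
    interleaved audio-visual token (a simplifying assumption here).
    mask[i, j] = True means token i may attend to token j, i.e. j's
    turn is no later than i's turn.
    """
    t = np.asarray(turn_ids)
    return t[None, :] <= t[:, None]

# Example: two tokens in turn 0, followed by two tokens in turn 1.
mask = turn_causal_mask([0, 0, 1, 1])
# Tokens in turn 0 cannot see turn 1; tokens in turn 1 see everything.
```

Unlike a strict per-token causal mask, this block structure lets a listener's tokens attend to the speaker's tokens within the same turn, which matches the interleaved audio-visual modeling the paper describes.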
Why it matters?
This research is important because it significantly improves the realism of 3D avatars and robots. By creating more natural conversational dynamics, it makes interactions with these systems feel more engaging and intuitive, paving the way for better virtual assistants, more believable characters in games, and more effective human-robot collaboration.
Abstract
Human conversation involves continuous exchanges of speech and nonverbal cues such as head nods, gaze shifts, and facial expressions that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, hindering temporal coherence across turns. We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that capture both coordination and expressive variability. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set, and achieves similar gains on out-of-distribution data. The source code will be released in the GitHub repository https://github.com/CoderChen01/towards-seamleass-interaction.
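The "lightweight diffusion head" predicts continuous motion by iteratively denoising a Gaussian sample. The sketch below shows a generic DDPM-style reverse sampler for a continuous motion vector; it is a minimal illustration under standard diffusion assumptions, not TIMAR's actual head. `noise_pred` stands in for the learned network (in TIMAR it would be conditioned on the fused turn-level context), and all names and hyperparameters here are hypothetical.

```python
import numpy as np

def sample_motion(noise_pred, dim=5, steps=50, seed=0):
    """Minimal DDPM-style reverse sampler for a continuous vector.

    noise_pred(x, t) should estimate the noise present in x at step t.
    Returns one sampled motion vector of length `dim`.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)       # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)               # cumulative signal fraction
    x = rng.standard_normal(dim)                 # start from pure noise
    for t in reversed(range(steps)):
        eps = noise_pred(x, t)
        # DDPM posterior mean: remove the predicted noise, rescale.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                                # add noise except at the last step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(dim)
    return x

# Toy denoiser that predicts zero noise, just to exercise the loop.
motion = sample_motion(lambda x, t: np.zeros_like(x))
```

Because sampling starts from random noise, repeated runs yield varied but plausible outputs, which is how a diffusion head can model the expressive variability the abstract mentions rather than a single deterministic motion.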