ReactMotion: Generating Reactive Listener Motions from Speaker Utterance

Cheng Luo, Bizhu Wu, Bing Li, Jianfeng Ren, Ruibin Bai, Rong Qu, Linlin Shen, Bernard Ghanem

2026-03-20

Summary

This paper focuses on creating realistic movements for a virtual listener responding to a speaker, aiming to make interactions feel more natural.

What's the problem?

It's really hard to get computers to generate believable listener reactions because people don't react the same way every time someone says something. There are many appropriate responses, and simply matching a motion to speech isn't enough to capture the nuance of human interaction. Existing datasets and ways to measure success don't really address this complexity.

What's the solution?

The researchers created a new dataset called ReactMotionNet, which includes many different possible listener movements for each spoken phrase, along with ratings of how well each movement fits the speech. They also developed a new system, ReactMotion, that learns to generate these movements by jointly considering the text, audio, and emotion of the speech, and is trained to produce responses that are both appropriate and varied. Finally, they designed new ways to evaluate the generated movements, focusing on whether the reactions *feel* right rather than just matching the input.
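The paper does not spell out the training objective in this summary, but "ratings of how well each movement fits the speech" suggests pairwise preference supervision. As a rough, hypothetical sketch (the entry layout, motion IDs, and scores below are invented for illustration, not taken from ReactMotionNet), a Bradley-Terry-style pairwise loss over rated candidates could look like this:

```python
import math

def pairwise_preference_loss(score_preferred, score_other):
    """Bradley-Terry-style pairwise loss: smaller when the model scores
    the preferred candidate higher than the less-appropriate one."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_preferred - score_other))))

# A hypothetical dataset entry: one speaker utterance paired with several
# candidate listener motions and human appropriateness ratings.
entry = {
    "utterance": "That's wonderful news!",
    "candidates": [
        {"motion_id": "nod_smile", "rating": 4.5},
        {"motion_id": "neutral_idle", "rating": 2.0},
        {"motion_id": "lean_forward", "rating": 3.8},
    ],
}

# Build preference pairs from the ratings: each higher-rated candidate
# is preferred over each lower-rated one.
pairs = [
    (a, b)
    for a in entry["candidates"]
    for b in entry["candidates"]
    if a["rating"] > b["rating"]
]

# With placeholder model scores (here, the ratings themselves stand in
# for a learned appropriateness score), the total loss over all pairs:
scores = {c["motion_id"]: c["rating"] for c in entry["candidates"]}
total_loss = sum(
    pairwise_preference_loss(scores[a["motion_id"]], scores[b["motion_id"]])
    for a, b in pairs
)
```

In a real training loop the placeholder scores would come from the generative model, so minimizing the loss pushes it to rank more appropriate listener motions above less appropriate ones for the same utterance.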

Why it matters?

This work is important because more realistic virtual listeners can make interactions with computers and virtual characters much more engaging and natural. This has applications in areas like virtual reality, video games, and creating more helpful virtual assistants.

Abstract

In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. Modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this design, we develop preference-oriented evaluation protocols tailored to reactive appropriateness, an aspect that conventional motion metrics, which focus on input-motion alignment, ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.