Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation
Shaowei Liu, Chuan Guo, Bing Zhou, Jian Wang
2025-10-17
Summary
This paper introduces Ponimator, a new system for creating realistic animations of people interacting with each other, focusing on close-proximity interactions such as hugs or handshakes.
What's the problem?
Animating believable human interactions is hard. Existing methods often fail to capture the subtle cues and natural movements that occur when people are close to each other. The resulting animations rarely *feel* right because these methods lack an understanding of how people naturally behave in such situations, and they often require a lot of manual work.
What's the solution?
The researchers created Ponimator, which is built on diffusion modeling. They trained it on motion-capture data of people interacting, focusing specifically on their poses when they are close together. Ponimator has two parts: a pose animator that takes an interactive starting pose and generates a whole motion sequence, and a pose generator that can synthesize a good starting pose from a single pose, a text description, or both. This lets the system generate interactions even when you don't have a perfect starting point.
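The two-stage design described above can be sketched as two conditional diffusion samplers run in sequence: first synthesize an interactive pose pair, then animate it. The sketch below is purely illustrative — the toy denoiser, function names, pose dimensionality, and step count are all hypothetical stand-ins, not the paper's implementation (a real system would use trained neural denoisers and a proper noise schedule).

```python
import numpy as np

def denoise_step(x_noisy, condition, t, num_steps):
    """Toy stand-in for a learned conditional denoiser: it simply blends
    the noisy sample toward the conditioning signal as t -> 0. In the
    real framework this would be a trained neural network."""
    blend = 1.0 - t / num_steps
    return blend * condition + (1.0 - blend) * x_noisy

def sample_conditional(condition, shape, num_steps=50, rng=None):
    """Generic reverse-diffusion loop: start from Gaussian noise and
    iteratively denoise toward a sample consistent with `condition`."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(shape)
    for t in reversed(range(num_steps)):
        x = denoise_step(x, condition, t, num_steps)
    return x

POSE_DIM = 6  # hypothetical per-person pose dimensionality

# Stage 1 (pose generator, spatial prior):
# condition on a single person's pose, sample a two-person interactive pair.
single_pose = np.zeros(POSE_DIM)
pair = sample_conditional(np.tile(single_pose, 2), shape=2 * POSE_DIM)

# Stage 2 (pose animator, temporal prior):
# condition on the interactive pair, sample each frame of a short sequence.
seq_len = 8
motion = np.stack(
    [sample_conditional(pair, shape=2 * POSE_DIM) for _ in range(seq_len)]
)
print(motion.shape)  # (8, 12): seq_len frames of a two-person pose vector
```

The point of the sketch is the interface, not the math: both stages share the same conditional-sampling loop, and only the conditioning signal changes (a single pose for the generator, an interactive pose pair for the animator), which is why the framework can also accept text or pose-plus-text conditioning in the first stage.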
Why does it matter?
This work is important because it makes it easier to create realistic and diverse animations of human interactions. It can be used for things like video games, movies, or even virtual reality, making these experiences more immersive and believable. It also shows that understanding how people position themselves when interacting is key to creating natural-looking movement.
Abstract
Close-proximity human-human interactive poses convey rich contextual information about interaction dynamics. Given such poses, humans can intuitively infer the context and anticipate possible past and future dynamics, drawing on strong priors of human behavior. Inspired by this observation, we propose Ponimator, a simple framework anchored on proximal interactive poses for versatile interaction animation. Our training data consists of close-contact two-person poses and their surrounding temporal context from motion-capture interaction datasets. Leveraging interactive pose priors, Ponimator employs two conditional diffusion models: (1) a pose animator that uses the temporal prior to generate dynamic motion sequences from interactive poses, and (2) a pose generator that applies the spatial prior to synthesize interactive poses from a single pose, text, or both when interactive poses are unavailable. Collectively, Ponimator supports diverse tasks, including image-based interaction animation, reaction animation, and text-to-interaction synthesis, facilitating the transfer of interaction knowledge from high-quality mocap data to open-world scenarios. Empirical experiments across diverse datasets and applications demonstrate the universality of the pose prior and the effectiveness and robustness of our framework.