Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos
Xavier Thomas, Youngsun Lim, Ananya Srinivasan, Audrey Zheng, Deepti Ghadiyaram
2025-12-05
Summary
This paper introduces a new way to judge how realistic AI-generated videos of human motion are, focusing on whether the movements look natural and physically correct.
What's the problem?
Currently, it's really hard to automatically tell whether an AI-generated video of a person doing something looks believable. Existing methods rely too heavily on how things *look* visually and don't understand the underlying physics and natural flow of human motion, so they miss subtle but important errors in how a person moves, or fail to notice body positions that aren't even physically possible.
What's the solution?
The researchers created a metric built on a learned model of what 'normal' human movement looks like, combining information about a person's skeleton (how their joints connect and move) with visual appearance information. A generated video is then scored by how far its motion falls from this learned distribution of real movement. Essentially, the metric checks whether the movements are physically plausible and temporally smooth, not just whether they *appear* okay frame by frame.
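The core idea can be sketched in a few lines. This is a simplified illustration, not the authors' actual implementation: all function names are hypothetical, the fusion is plain concatenation, and the learned distribution is stood in for by a Gaussian scored with Mahalanobis distance.

```python
import numpy as np

def fuse_features(skeleton_feats, appearance_feats):
    """Fuse per-frame skeleton and appearance features.
    Illustrative: simple concatenation, assuming aligned (frames, dim) arrays."""
    return np.concatenate([skeleton_feats, appearance_feats], axis=-1)

def fit_real_distribution(real_videos):
    """Fit a Gaussian over fused features pooled from real-action videos.
    Stands in for the paper's learned latent space of real-world motion."""
    feats = np.vstack([fuse_features(s, a) for s, a in real_videos])
    mean = feats.mean(axis=0)
    # Regularize the covariance so it is safely invertible.
    cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
    return mean, np.linalg.inv(cov)

def action_score(video, mean, cov_inv):
    """Mean Mahalanobis distance of a video's fused frame features to the
    real-motion distribution. Lower score = more plausible motion."""
    feats = fuse_features(*video)
    diff = feats - mean
    sq_dist = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return float(np.sqrt(np.maximum(sq_dist, 0.0)).mean())

# Toy usage with random stand-in features: a video drawn from the same
# distribution as the "real" set should score lower than a shifted one.
rng = np.random.default_rng(0)
real = [(rng.normal(size=(20, 4)), rng.normal(size=(20, 4))) for _ in range(10)]
mean, cov_inv = fit_real_distribution(real)
in_dist = (rng.normal(size=(20, 4)), rng.normal(size=(20, 4)))
off_dist = (rng.normal(5.0, 1.0, size=(20, 4)), rng.normal(5.0, 1.0, size=(20, 4)))
```

Mahalanobis distance is chosen here only because it is the simplest way to measure "how far from the learned distribution"; the paper's metric operates in a learned feature space rather than a raw Gaussian.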
Why it matters?
This work is important because it provides a much better way to evaluate AI-generated videos of people, significantly outperforming previous methods. This pushes the field forward by identifying weaknesses in current AI models and setting a higher standard for creating truly realistic and believable videos of human actions, which is crucial for applications like virtual reality, animation, and robotics.
Abstract
Despite rapid advances in video generative models, robust metrics for evaluating visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased, lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Our method first captures the nuances, constraints, and temporal smoothness of real-world motion by fusing appearance-agnostic human skeletal geometry features with appearance-based features. We posit that this combined feature space provides a robust representation of action plausibility. Given a generated video, our metric quantifies its action quality by measuring the distance between its underlying representations and this learned real-world action distribution. For rigorous validation, we develop a new multi-faceted benchmark specifically designed to probe temporally challenging aspects of human action fidelity. Through extensive experiments, we show that our metric achieves substantial improvement of more than 68% compared to existing state-of-the-art methods on our benchmark, performs competitively on established external benchmarks, and has a stronger correlation with human perception. Our in-depth analysis reveals critical limitations in current video generative models and establishes a new standard for advanced research in video generation.