SVBench: Evaluation of Video Generation Models on Social Reasoning
Wenshuo Peng, Gongxuan Wang, Tianmeng Yang, Chuanhao Li, Xiaojie Xu, Hui He, Kaipeng Zhang
2025-12-29
Summary
This paper investigates how well current AI models can create videos that depict realistic *social* interactions, that is, interactions consistent with how people think, feel, and behave toward one another.
What's the problem?
While AI can now generate videos that *look* quite realistic, these models still struggle to grasp the underlying reasons *why* people do things. Humans easily infer intentions, beliefs, and social rules just by watching a short scene, but generated videos are often literally correct while lacking this deeper social understanding. Essentially, the models can show people doing things, but they cannot convey *why* those people act as they do in a way that aligns with how people actually behave.
What's the solution?
The researchers created a new way to test this ability, inspired by psychology experiments that study how humans develop social understanding. They took thirty classic experiments and organized them into seven key areas of social reasoning, such as understanding what someone else is thinking or knowing how to cooperate. They then built a system that creates video scenarios and uses another AI to judge how well the generated videos demonstrate these social skills, focusing on abilities like recognizing intentions and understanding social norms.
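The paper's actual data format isn't reproduced in this summary, but as a rough illustration, the taxonomy could be organized as in the sketch below. The seven dimension names come from the abstract; everything else (the `Paradigm` class, its fields, and the example entry) is a hypothetical sketch, not the authors' implementation.

```python
from dataclasses import dataclass

# The seven social-reasoning dimensions enumerated in the abstract.
DIMENSIONS = (
    "mental-state inference",
    "goal-directed action",
    "joint attention",
    "social coordination",
    "prosocial behavior",
    "social norms",
    "multi-agent strategy",
)

@dataclass(frozen=True)
class Paradigm:
    """Hypothetical record for one of the thirty classic experiments."""
    name: str       # e.g. a classic false-belief task
    dimension: str  # must be one of DIMENSIONS
    mechanism: str  # the reasoning a generated video should exhibit

    def __post_init__(self):
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")

# Illustrative entry; the false-belief (Sally-Anne style) task is a
# classic mental-state paradigm of the kind the benchmark draws on.
example = Paradigm(
    name="false-belief (Sally-Anne style)",
    dimension="mental-state inference",
    mechanism="an agent acts on an outdated belief about an object's location",
)
```

Grouping paradigms by dimension this way makes it straightforward to report scores per reasoning area, which is how the benchmark localizes failures such as belief reasoning rather than producing a single aggregate number.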
Why it matters?
This research is important because it shows that even though AI video generation is improving, it is still far from producing truly believable human interactions. Identifying these gaps helps researchers focus on improving models' ability to understand and represent the complexities of social behavior, which is crucial for applications such as realistic virtual characters or AI systems that interact with people naturally.
Abstract
Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions: mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.
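To make the four-stage pipeline concrete, here is a minimal Python sketch of how such an agent loop could be wired together. The `TextAgent`, `VideoModel`, and `Judge` interfaces, and every function and prompt string, are assumptions for illustration only; the paper's actual implementation is not shown in this summary.

```python
from typing import Protocol

class TextAgent(Protocol):
    def ask(self, prompt: str) -> str: ...

class VideoModel(Protocol):
    def generate(self, prompt: str) -> bytes: ...

class Judge(Protocol):
    def score(self, video: bytes, scenario: str) -> dict: ...

def evaluate_paradigm(paradigm: str, llm: TextAgent, gen: VideoModel,
                      judge: Judge, n_scenarios: int = 5) -> list:
    """Hypothetical end-to-end loop mirroring steps (i)-(iv) of the abstract."""
    # (i) Distill the reasoning mechanism the classic experiment probes.
    mechanism = llm.ask(f"Summarize the social-reasoning mechanism of: {paradigm}")

    # (ii) Synthesize diverse, video-ready scenarios from that mechanism.
    scenarios = [llm.ask(f"Write a filmable scenario that tests: {mechanism}")
                 for _ in range(n_scenarios)]

    # (iii) Cue-based critique: rewrite scenarios that leak the expected
    # behavior (conceptual neutrality) or make the task trivial (difficulty).
    vetted = [llm.ask(f"Remove giveaway cues, keep the difficulty: {s}")
              for s in scenarios]

    # (iv) Generate a video per scenario and score it with a VLM judge
    # across five interpretable dimensions of social reasoning.
    results = []
    for scenario in vetted:
        video = gen.generate(scenario)
        results.append((scenario, judge.score(video, scenario)))
    return results
```

Because every stage is a prompt to a frozen model rather than a learned component, a loop like this stays fully training-free, matching the abstract's description of the pipeline.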