EgoNormia: Benchmarking Physical Social Norm Understanding
MohammadHossein Rezaei, Yicheng Fu, Phil Cuvin, Caleb Ziems, Yanzhe Zhang, Hao Zhu, Diyi Yang
2025-03-03
Summary
This paper introduces EgoNormia, a new dataset and benchmark designed to evaluate how well AI models understand social norms in real-world situations by analyzing videos of human interactions.
What's the problem?
AI systems often struggle to understand and follow social norms, especially in physical and social contexts. This is a big issue for AI that interacts with humans because it can lead to unsafe or inappropriate behavior. Current AI models are not trained well enough to handle these challenges, as shown by their low performance compared to humans.
What's the solution?
The researchers created EgoNormia, a dataset with 1,853 videos showing human interactions from a first-person perspective. Each video comes with questions about what actions should be taken and why. They used this dataset to evaluate how well AI models understand norms like safety, privacy, politeness, and cooperation. They also showed that adding retrieval-based methods (using examples from similar situations) can improve the AI's ability to reason about norms.
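The retrieval idea can be illustrated with a minimal sketch: given a new situation, look up the most similar stored examples and prepend them to the model's prompt as in-context demonstrations. Everything below is hypothetical, not the paper's actual pipeline — the real EgoNormia entries are egocentric videos, and the paper's retrieval operates over those, not over a tiny text corpus with bag-of-words similarity as used here.

```python
from collections import Counter
import math

# Hypothetical mini-corpus of (situation, normative action) pairs,
# standing in for EgoNormia entries (the real dataset uses videos).
EXAMPLES = [
    ("a stranger drops their wallet on the street", "return the wallet to them"),
    ("a coworker is on a private phone call", "step away to give them privacy"),
    ("a cyclist approaches on a shared path", "move aside to let them pass safely"),
]

def bow(text):
    """Bag-of-words vector represented as a Counter."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    """Return the k stored examples most similar to the query situation."""
    q = bow(query)
    ranked = sorted(EXAMPLES, key=lambda ex: cosine(q, bow(ex[0])), reverse=True)
    return ranked[:k]

def build_prompt(query):
    """Prepend retrieved norm examples to the question for the model."""
    shots = retrieve(query, k=1)
    context = "\n".join(f"Situation: {s}\nNorm: {n}" for s, n in shots)
    return f"{context}\n\nSituation: {query}\nNorm:"

prompt = build_prompt("someone nearby is on a phone call")
```

The design point is that retrieval supplies norm-relevant precedents the model can imitate, rather than asking it to reason from scratch; the paper reports that this kind of retrieval-based generation improves VLM normative reasoning on EgoNormia.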
Why it matters?
This matters because it highlights the gap between human and AI understanding of social norms, with humans scoring 92% on the test while the best AI only scored 45%. By improving AI's ability to understand and follow norms, this research could make AI safer and more reliable in real-world applications like robotics or virtual assistants.
Abstract
Human activity is moderated by norms. When performing actions in the real world, humans not only follow norms, but also consider the trade-off between different norms. However, machines are often trained without explicit supervision on norm understanding and reasoning, especially when the norms are grounded in a physical and social context. To improve and evaluate the normative reasoning capability of vision-language models (VLMs), we present EgoNormia |ε|, consisting of 1,853 ego-centric videos of human interactions, each of which has two related questions evaluating both the prediction and justification of normative actions. The normative actions encompass seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline leveraging video sampling, automatic answer generation, filtering, and human validation. Our work demonstrates that current state-of-the-art vision-language models lack robust norm understanding, scoring a maximum of 45% on EgoNormia (versus a human benchmark of 92%). Our analysis of performance in each dimension highlights significant risks in safety and privacy, as well as a lack of collaboration and communication capability, when these models are applied to real-world agents. We additionally show that, through a retrieval-based generation method, it is possible to use EgoNormia to enhance normative reasoning in VLMs.