SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji

2026-03-18

Summary

This paper introduces a new way to test how well artificial intelligence, specifically large language models that can jointly understand text, images, and sound, can handle conversations the way a human does. It focuses on the social side of talking, not just whether the AI gets the facts right.

What's the problem?

Current tests for these AI models mostly check whether they can accurately process information. They don't really evaluate how well the AI can participate in a natural back-and-forth conversation: knowing who is speaking, sensing when it's acceptable to jump in, or phrasing an interruption politely. This is a big gap, because good conversation isn't just about being correct; it's about social awareness and timing.

What's the solution?

The researchers created a benchmark called SocialOmni. It tests AI on three key areas of conversational interaction: figuring out who is talking, deciding when it's appropriate to jump into a conversation, and actually *how* to interrupt in a natural way. They built a dataset of 2,000 perception examples plus a quality-controlled set of 209 interruption-generation cases, and evaluated 12 different AI models on it, even adding deliberately mismatched audio and visual cues to see how robust the models are.
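To make the evaluation setup concrete, here is a minimal sketch of what a SocialOmni-style scoring loop could look like. Everything here is a hypothetical stand-in, not the paper's released code: the dataset field names, the task prompts, and the model interface are all assumptions, and the free-form interruption-generation task is skipped because it requires human or LLM-judge scoring rather than exact-match accuracy.

```python
# Hypothetical sketch of a SocialOmni-style evaluation loop.
# Field names, prompts, and the generate() interface are assumptions,
# not the paper's actual schema or API.

TASKS = {
    "speaker_id": "Who is speaking right now?",                        # (i) who
    "interrupt_timing": "Is this a good moment to interject? yes/no",  # (ii) when
    "interrupt_generation": "Write a short, natural interruption.",    # (iii) how
}

class DummyOmniModel:
    """Stand-in for a real omni-modal model; always answers 'yes'."""
    def generate(self, audio, video, prompt):
        return "yes"

def evaluate_perception(model, samples):
    """Exact-match accuracy on the two perception tasks (who/when).
    Interruption *generation* needs judge-based scoring instead."""
    hits, totals = {}, {}
    for s in samples:
        task = s["task"]
        if task == "interrupt_generation":
            continue  # free-form output has no single gold answer
        answer = model.generate(s["audio"], s["video"], TASKS[task])
        totals[task] = totals.get(task, 0) + 1
        hits[task] = hits.get(task, 0) + int(answer.strip().lower() == s["label"].lower())
    return {t: hits[t] / totals[t] for t in totals}

samples = [
    {"task": "interrupt_timing", "audio": "clip1.wav", "video": "clip1.mp4", "label": "yes"},
    {"task": "speaker_id", "audio": "clip2.wav", "video": "clip2.mp4", "label": "speaker B"},
]
print(evaluate_perception(DummyOmniModel(), samples))
# -> {'interrupt_timing': 1.0, 'speaker_id': 0.0}
```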

Why it matters?

This work is important because it shows that simply being able to understand information isn't enough for an AI to be a good conversational partner. The tests revealed that AI models often struggle with the social nuances of conversation, even if they seem to understand the content. The results can help developers build better AI that can interact with people more naturally and effectively.

Abstract

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity: the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. We benchmark 12 leading OLMs, uncovering significant variance in their social-interaction capabilities. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.
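To illustrate what the reported "decoupling" means in practice, one simple way to probe it is to rank-correlate each model's perception accuracy against its interruption-generation quality. The sketch below assumes Spearman correlation as the measure (my choice, not necessarily the paper's) and uses placeholder scores, not numbers from the paper.

```python
# Illustrative check of the perception-vs-generation decoupling.
# The score lists are placeholders, NOT results from the paper.
from scipy.stats import spearmanr

perception_acc   = [0.81, 0.77, 0.74, 0.69, 0.66]  # hypothetical per-model accuracy
generation_score = [0.52, 0.61, 0.40, 0.58, 0.35]  # hypothetical judge ratings

rho, p = spearmanr(perception_acc, generation_score)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
# A weak or insignificant correlation would support the paper's claim that
# perceptual accuracy alone does not predict conversational competence.
```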