
AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking

Xilin Jiang, Qiaolin Wang, Junkai Wu, Xiaomin He, Zhongweiyang Xu, Yinghao Ma, Minshuo Piao, Kaiyi Yang, Xiuwen Zheng, Riki Shimizu, Yicong Chen, Arsalan Firoozi, Gavin Mischler, Sukru Samet Dindar, Richard Antonello, Linyang He, Tsun-An Hsieh, Xulin Fan, Yulun Wu, Yuesheng Ma, Chaitanya Amballa, Weixiong Chen

2026-01-28


Summary

This paper investigates how well artificial intelligence understands the meaning behind popular internet audio and video clips, such as memes, rather than just recognizing what is literally happening in them.

What's the problem?

AI models are very good at processing text, but understanding memes requires grasping context, culture, and emotion, which aren't always explicitly stated. Current AI struggles with the deeper meaning of audio and video, especially when there isn't much text involved, as with music or sound effects. Models often focus on surface-level content instead of *why* something is funny or culturally significant.

What's the solution?

The researchers created a new test called AVMeme Exam. It includes over a thousand famous internet sounds and videos, and for each one they wrote questions that probe different levels of understanding: from identifying the surface content, to grasping context and emotion, to knowing how the meme is used and the world knowledge behind it. Each clip also comes with metadata such as its original year, transcript, summary, and sensitivity. They then tested several state-of-the-art multimodal large language models on this exam and compared their performance to that of human participants.
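To make the setup concrete, here is a minimal sketch of what one benchmark item and a per-level evaluation pass might look like. The field names, the `MemeItem` schema, and the `ask_model()` interface are illustrative assumptions, not the paper's actual data format or evaluation code.

```python
# A minimal sketch of a meme-benchmark item and evaluation loop.
# Schema, field names, and ask_model() are hypothetical, not the paper's actual format.
from dataclasses import dataclass, field

@dataclass
class MemeItem:
    clip_path: str                 # path to the audio/video clip
    question: str                  # question about the clip
    choices: list[str]             # multiple-choice options
    answer_index: int              # index of the correct option
    level: str                     # e.g. "content", "context_emotion", "usage_knowledge"
    metadata: dict = field(default_factory=dict)  # e.g. year, transcript, summary, sensitivity

def ask_model(model, item: MemeItem) -> int:
    """Placeholder: send the clip and question to a multimodal model
    and return the index of the option it picks."""
    raise NotImplementedError

def evaluate(model, items: list[MemeItem]) -> dict[str, float]:
    """Compute accuracy separately for each understanding level."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for item in items:
        total[item.level] = total.get(item.level, 0) + 1
        if ask_model(model, item) == item.answer_index:
            correct[item.level] = correct.get(item.level, 0) + 1
    return {level: correct.get(level, 0) / n for level, n in total.items()}
```

Grouping accuracy by level is what lets the authors compare surface-content questions against context, emotion, and cultural-knowledge questions, which is where the paper reports the largest gaps.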

Why it matters?

This research shows that current AI isn't very 'human-like' in its understanding of multimedia content. It highlights a significant gap in AI's ability to truly understand the world around us, and points to the need for AI models that can perceive and interpret context and culture, not just the literal sounds and images they process. This is important for building AI that can interact with humans in a more meaningful way.

Abstract

Internet audio-visual clips convey meaning through time-varying sound and motion, which extend beyond what text alone can represent. To examine whether AI models can understand such signals in human cultural contexts, we introduce AVMeme Exam, a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects. Each meme is paired with a unique Q&A assessing levels of understanding from surface content to context and emotion to usage and world knowledge, along with metadata such as original year, transcript, summary, and sensitivity. We systematically evaluate state-of-the-art multimodal large language models (MLLMs) alongside human participants using this benchmark. Our results reveal a consistent limitation: current models perform poorly on textless music and sound effects, and struggle to think in context and in culture compared to surface content. These findings highlight a key gap in human-aligned multimodal intelligence and call for models that can perceive contextually and culturally beyond the surface of what they hear and see. Project page: avmemeexam.github.io/public