GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking
Florian Schneider, Carolin Holtermann, Chris Biemann, Anne Lauscher
2025-02-20
Summary
This paper introduces GIMMICK, a new benchmark for testing how well AI models understand cultures from around the world. It's like a global cultural quiz for computers, checking how much they really know about diverse traditions and customs.
What's the problem?
Current AI models that work with both text and images (called Large Vision-Language Models, or LVLMs) are really good at understanding things from Western cultures, but not so great when it comes to other parts of the world. Previous tests for these AIs were limited: they didn't cover enough cultures, enough cultural aspects, or enough models and tasks.
What's the solution?
The researchers created GIMMICK, a big test that covers cultural knowledge from 144 countries across six major world regions. They built six different types of tasks on three new datasets, which together include 728 unique cultural events or aspects. They tested 31 different AI models (20 vision-language models and 11 text-only models), both big and small, to see how well they understood these cultural elements. The test looked at things like how biased the AIs were towards Western cultures, how the size of the AI affected its performance, and how well the AIs did when given both images and text versus just one or the other.
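To picture what comparing input modalities means in practice, here is a minimal hypothetical sketch, not the authors' actual evaluation harness: the `query_model` stub, the sample records, and the condition names are made-up stand-ins showing how one might score the same cultural question under image-only, text-only, and combined inputs.

```python
# Hypothetical modality-ablation sketch (illustrative only, not GIMMICK's real code).
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str   # photo of a cultural event or artifact
    question: str     # e.g., "Which country is this tradition from?"
    context: str      # short textual description of the event
    answer: str       # gold label, e.g., a country name

SAMPLES = [
    Sample("img/event_001.jpg", "Which country is this tradition from?",
           "A festival with decorated boats on a river.", "Thailand"),
    # ... more samples ...
]

def query_model(image_path: str | None, prompt: str) -> str:
    """Stand-in for an LVLM/LLM call; a real setup would send the image (if any)
    and the prompt to the model and return its answer string."""
    return "Thailand"  # placeholder so the sketch runs end to end

def evaluate(condition: str) -> float:
    """Score accuracy under one input condition."""
    correct = 0
    for s in SAMPLES:
        if condition == "image_only":
            pred = query_model(s.image_path, s.question)
        elif condition == "text_only":
            pred = query_model(None, f"{s.context}\n{s.question}")
        else:  # "image_and_text": the combined-input condition
            pred = query_model(s.image_path, f"{s.context}\n{s.question}")
        correct += int(pred.strip().lower() == s.answer.lower())
    return correct / len(SAMPLES)

for condition in ("image_only", "text_only", "image_and_text"):
    print(condition, evaluate(condition))
```

Running the same questions under each condition and comparing the accuracies is the basic idea behind asking whether multimodal input helps a model over text or images alone.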
Why it matters?
This matters because as AI becomes more common in our lives, we need it to understand and respect all cultures, not just Western ones. GIMMICK shows that current AIs have a strong bias towards Western cultures and struggle with understanding deeper cultural nuances. By identifying these problems, we can work on making AI that's more inclusive and respectful of global diversity. This could lead to better AI assistants, more accurate translation tools, and technology that works well for people all around the world, not just in Western countries.
Abstract
Large Vision-Language Models (LVLMs) have recently gained attention due to their distinctive performance and broad applicability. While it has been previously shown that their efficacy in usage scenarios involving non-Western contexts falls short, existing studies are limited in scope, covering just a narrow range of cultures, focusing exclusively on a small number of cultural aspects, or evaluating a limited selection of models on a single task only. Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive multimodal benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries representing six global macro-regions. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets on which we evaluated 20 LVLMs and 11 LLMs, including five proprietary and 26 open-weight models of all sizes. We systematically examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues. Our analyses reveal strong biases toward Western cultures across models and tasks and highlight strong correlations between model size and performance, as well as the effectiveness of multimodal input and external geographic cues. We further find that models have more knowledge of tangible than intangible aspects (e.g., food vs. rituals) and that they excel in recognizing broad cultural origins but struggle with a more nuanced understanding.