Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language
Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab
2025-10-29
Summary
This paper investigates how well large language models (LLMs), the AI systems behind chatbots and similar tools, understand language that is deeply tied to culture, specifically sayings and idioms. It focuses on whether these models can not only *understand* what these phrases mean, but also *use* them correctly in different situations.
What's the problem?
Large language models are trained on massive amounts of text, but a lot of that text doesn't fully capture the nuances of different cultures. This means the models might struggle with figurative language – like idioms and proverbs – because understanding them requires knowing the cultural context. The researchers wanted to see if LLMs could handle culturally specific language as well as more straightforward language, and pinpoint exactly where they fall short.
What's the solution?
The researchers created tests using Arabic and English proverbs and, importantly, Egyptian Arabic idioms. They tested 22 different language models, some publicly available and some proprietary, on these tests. The tests checked if the models could understand the meaning of the phrases, use them appropriately in context, and grasp the subtle feelings or connotations associated with them. They also built a new dataset of Egyptian Arabic idioms called Kinayat to help other researchers study this problem.
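The tasks described above amount to scored test items: each gives a figurative expression, asks a question about its meaning or appropriate use, and checks the model's choice. The sketch below illustrates that general setup in Python; the data structures, field names, and the toy item are hypothetical and are not drawn from the Kinayat dataset or the paper's actual test format.

```python
# Hypothetical sketch of a multiple-choice idiom evaluation item and scorer.
# Field names and the example item are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class EvalItem:
    idiom: str          # the figurative expression under test
    question: str       # e.g. "Which situation fits this expression?"
    choices: list[str]  # candidate answers
    answer_idx: int     # index of the correct choice

def score(items: list[EvalItem], model_answers: list[int]) -> float:
    """Accuracy: fraction of items where the model picked the right choice."""
    correct = sum(
        1 for item, pred in zip(items, model_answers)
        if pred == item.answer_idx
    )
    return correct / len(items)

# Toy example with an English proverb (illustrative only).
items = [
    EvalItem(
        idiom="The early bird catches the worm",
        question="Which situation best matches this proverb?",
        choices=[
            "Arriving first to claim a limited opportunity",
            "Watching birds at dawn",
        ],
        answer_idx=0,
    )
]
print(score(items, [0]))  # → 1.0
```

A real harness would additionally prompt each model for its answer and, for the pragmatic-use task, vary whether a contextual example sentence is included in the prompt.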
Why does it matter?
This research is important because it shows that while language models are getting better at processing language, they still have a long way to go when it comes to understanding cultural context. If we want AI to truly communicate with people from all over the world, it needs to be able to understand and use language in a culturally sensitive way. This work highlights the need for more research and better datasets focused on culturally grounded language, and provides a valuable resource for future studies.
Abstract
We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural nuance. Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English. We evaluate 22 open- and closed-source LLMs on Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. Our results show a consistent hierarchy: the average accuracy for Arabic proverbs is 4.29% lower than for English proverbs, and performance for Egyptian idioms is 10.28% lower than for Arabic proverbs. For the pragmatic use task, accuracy drops by 14.07% relative to understanding, though providing contextual idiomatic sentences improves accuracy by 10.66%. Models also struggle with connotative meaning, reaching at most 85.58% agreement with human annotators on idioms with 100% inter-annotator agreement. These findings demonstrate that figurative language serves as an effective diagnostic for cultural reasoning: while LLMs can often interpret figurative meaning, they face challenges in using it appropriately. To support future research, we release Kinayat, the first dataset of Egyptian Arabic idioms designed for both figurative understanding and pragmatic use evaluation.