Are they lovers or friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues
Eunsu Kim, Junyeong Park, Juhyun Oh, Kiwoong Park, Seyoung Song, A. Seza Dogruoz, Najoung Kim, Alice Oh
2025-10-23
Summary
This paper investigates how well large language models, the AI behind things like chatbots, understand social relationships between people just by reading conversations. It focuses on whether these models can correctly guess if two people are friends, family, or something else based on how they talk to each other.
What's the problem?
Current AI models are getting really good at many language tasks, but they struggle with the subtle cues in conversations that reveal how people relate to each other. The researchers noticed that these models perform significantly worse on Korean conversations than on English ones, and often propose relationships that annotators judged unlikely given the context. This shows a gap in their ability to reason about social dynamics, especially across different cultures.
What's the solution?
To test this, the researchers created a new dataset called SCRIPTS, which contains 1,000 conversations drawn from English and Korean movie scripts. Native speakers then labeled each conversation with how plausible different relationships between the speakers were, rating each candidate relationship (friends, strangers, and so on) as Highly Likely, Less Likely, or Unlikely. The researchers then evaluated nine different AI models on this dataset to see how accurately they could infer these relationships. They also tried techniques like chain-of-thought prompting to see if it would help the models, but it provided little benefit and sometimes even amplified social biases.
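To make the evaluation concrete, here is a minimal sketch of how a prediction could be scored against probabilistic relational labels of this kind. The field names, label tiers, and scoring rule are illustrative assumptions for this summary, not the paper's exact protocol.

```python
# Hedged sketch: scoring a model's relationship prediction against
# annotator labels grouped by likelihood tier (assumed structure).

# A dialogue annotated with three likelihood tiers for candidate relationships.
dialogue = {
    "labels": {
        "highly_likely": {"friends", "lovers"},
        "less_likely": {"colleagues"},
        "unlikely": {"strangers"},
    }
}

def score_prediction(dialogue, predicted_relationship):
    """Count a prediction as correct if annotators rated it Highly Likely,
    and flag it separately if it falls in the Unlikely tier."""
    labels = dialogue["labels"]
    correct = predicted_relationship in labels["highly_likely"]
    implausible = predicted_relationship in labels["unlikely"]
    return correct, implausible

print(score_prediction(dialogue, "lovers"))     # (True, False)
print(score_prediction(dialogue, "strangers"))  # (False, True)
```

Tracking the "implausible" flag alongside accuracy mirrors the paper's observation that models sometimes select Unlikely relationships, a failure mode a plain accuracy score would hide.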
Why it matters?
This research is important because as AI becomes more integrated into our daily lives, it's crucial that it can understand social situations correctly. If an AI can't grasp relationships, it could lead to awkward or even harmful interactions. This study highlights that current AI models are not yet socially intelligent and that more work is needed to build AI that can navigate the complexities of human relationships, especially considering cultural differences.
Abstract
As large language models (LLMs) are increasingly used in human-AI interactions, their social reasoning capabilities in interpersonal contexts are critical. We introduce SCRIPTS, a 1k-dialogue dataset in English and Korean, sourced from movie scripts. The task involves evaluating models' social reasoning capability to infer the interpersonal relationships (e.g., friends, sisters, lovers) between speakers in each dialogue. Each dialogue is annotated with probabilistic relational labels (Highly Likely, Less Likely, Unlikely) by native (or equivalent) Korean and English speakers from Korea and the U.S. Evaluating nine models on our task, current proprietary LLMs achieve around 75-80% on the English dataset, whereas their performance on Korean drops to 58-69%. More strikingly, models select Unlikely relationships in 10-25% of their responses. Furthermore, we find that thinking models and chain-of-thought prompting, effective for general reasoning, provide minimal benefits for social reasoning and occasionally amplify social biases. Our findings reveal significant limitations in current LLMs' social reasoning capabilities, highlighting the need for efforts to develop socially-aware language models.