InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles
Zizhen Li, Chuanhao Li, Yibin Wang, Qi Chen, Diping Song, Yukang Feng, Jianwen Sun, Jiaxin Ai, Fanrui Zhang, Mingzhu Sun, Kaipeng Zhang
2025-08-25
Summary
This paper investigates how well large language models (LLMs) understand and mimic the different ways people think and reason in social situations, specifically when trying to figure out who's telling the truth and who's lying.
What's the problem?
Current tests of LLMs' social reasoning skills often focus on whether they can *generally* understand intentions or detect lies, but they don't account for the fact that people have unique reasoning styles. Everyone approaches social situations differently, and these differences matter. An LLM can't be said to truly understand social dynamics if it can't adapt to these individual approaches.
What's the solution?
The researchers created a new evaluation framework called InMind, which uses social deduction games like Avalon as a testbed. In these games, players have hidden roles and try to figure out who the 'bad guys' are. InMind doesn't just check whether a model guesses correctly: it pairs recorded games with a human player's round-by-round strategy notes and a post-game reflection on their thought process, collected both when that player watches a game as an outside observer and when they take part as an active player. LLMs are then tested on whether they can capture that player's reasoning style and apply it as the game unfolds. The researchers evaluated 11 different LLMs with this framework.
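To make this concrete, here is a minimal, hypothetical sketch of what one such annotated game record might look like. The structure and names (`Mode`, `strategy_trace`, `post_game_reflection`) are illustrative assumptions, not the paper's released data format.

```python
# Hypothetical sketch (not the paper's actual schema) of an InMind-style
# annotated game record: structured gameplay rounds paired with one human
# player's round-level strategy traces and a post-game reflection, tagged
# with the mode under which the annotation was collected.
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    OBSERVER = "observer"        # the annotator watches a game from outside
    PARTICIPANT = "participant"  # the annotator is an active player


@dataclass
class RoundRecord:
    round_id: int
    speeches: dict[str, str]   # player name -> what they said this round
    votes: dict[str, bool]     # player name -> approve/reject the proposed team
    strategy_trace: str        # the annotator's own reasoning for this round


@dataclass
class GameRecord:
    mode: Mode
    annotator: str                   # whose reasoning style the record captures
    hidden_roles: dict[str, str]     # ground-truth role assignment
    rounds: list[RoundRecord] = field(default_factory=list)
    post_game_reflection: str = ""   # free-text account of how the annotator reasoned


# Example usage: a tiny one-round record collected in Participant mode.
game = GameRecord(
    mode=Mode.PARTICIPANT,
    annotator="P1",
    hidden_roles={"P1": "Merlin", "P2": "Servant", "P3": "Assassin"},
)
game.rounds.append(RoundRecord(
    round_id=1,
    speeches={"P2": "I trust P3.", "P3": "Take P2 and me on the quest."},
    votes={"P1": False, "P2": True, "P3": True},
    strategy_trace="P3 pushed hard for a team including himself; I stay quiet to hide my role.",
))
game.post_game_reflection = "I weighted voting patterns over table talk this game."
print(len(game.rounds), game.mode.value)
```

The design point this sketch is meant to capture is that the annotation follows one specific person, so the same game can yield different but equally valid records depending on whose reasoning it traces.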
Why it matters?
The findings show that most LLMs, even very powerful ones like GPT-4o, tend to rely on surface clues like word choice and struggle to follow how a game unfolds over time or to adjust their reasoning to what other players are doing. However, reasoning-focused models such as DeepSeek-R1 showed early signs of picking up on an individual player's style. This research highlights the need for LLMs to become better at understanding and responding to individual differences in reasoning, which is crucial for creating AI that can interact with humans in a more natural and effective way.
Abstract
LLMs have shown strong performance on human-centric reasoning tasks. While previous evaluations have explored whether LLMs can infer intentions or detect deception, they often overlook the individualized reasoning styles that influence how people interpret and act in social contexts. Social deduction games (SDGs) provide a natural testbed for evaluating individualized reasoning styles, where different players may adopt diverse but contextually valid reasoning strategies under identical conditions. To address this, we introduce InMind, a cognitively grounded evaluation framework designed to assess whether LLMs can capture and apply personalized reasoning styles in SDGs. InMind enhances structured gameplay data with round-level strategy traces and post-game reflections, collected under both Observer and Participant modes. It supports four cognitively motivated tasks that jointly evaluate both static alignment and dynamic adaptation. As a case study, we apply InMind to the game Avalon, evaluating 11 state-of-the-art LLMs. General-purpose LLMs, even GPT-4o, frequently rely on lexical cues, struggling to anchor reflections in temporal gameplay or adapt to evolving strategies. In contrast, reasoning-enhanced LLMs like DeepSeek-R1 exhibit early signs of style-sensitive reasoning. These findings reveal key limitations in current LLMs' capacity for individualized, adaptive reasoning, and position InMind as a step toward cognitively aligned human-AI interaction.
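The abstract distinguishes *static alignment* (matching a fixed reasoning style) from *dynamic adaptation* (tracking a style as play evolves). Purely as an illustration of what a static-alignment-style probe could look like, the sketch below builds a forced-choice prompt from one player's strategy traces and scores a model callable on it; the prompt wording, function names, and scoring rule are assumptions and do not reproduce the paper's four tasks.

```python
# Hypothetical illustration (not the paper's task definition) of a simple
# forced-choice probe: given one player's round-level strategy traces, can a
# model pick which of two post-game reflections was written by that player?
from typing import Callable


def build_probe(traces: list[str], reflection_a: str, reflection_b: str) -> str:
    """Assemble a plain-text prompt; the exact wording is an assumption."""
    trace_text = "\n".join(f"Round {i + 1}: {t}" for i, t in enumerate(traces))
    return (
        "Below are one Avalon player's round-by-round strategy notes.\n"
        f"{trace_text}\n\n"
        "Which post-game reflection was written by the same player?\n"
        f"(A) {reflection_a}\n(B) {reflection_b}\n"
        "Answer with A or B."
    )


def score(model: Callable[[str], str], probes: list[tuple[str, str]]) -> float:
    """probes: (prompt, gold answer 'A' or 'B'); returns forced-choice accuracy."""
    correct = sum(model(prompt).strip().upper().startswith(gold) for prompt, gold in probes)
    return correct / len(probes)


# Example usage with a stand-in "model" that always answers A.
prompt = build_probe(
    traces=["I distrust loud accusers.", "Voting history matters more than speeches."],
    reflection_a="I mostly tracked votes and ignored table talk.",
    reflection_b="I followed whoever argued most confidently.",
)
print(score(lambda p: "A", [(prompt, "A")]))
```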