INTIMA: A Benchmark for Human-AI Companionship Behavior
Lucie-Aimée Kaffee, Giada Pistilli, Yacine Jernite
2025-08-22
Summary
This paper examines how AI chatbots, specifically large language models, respond when people treat them as companions, and whether those responses support users in a healthy way.
What's the problem?
As people form emotional connections with AI companions, it's becoming clear that these AIs can respond in ways that are *too* encouraging of attachment, or not firm enough in setting boundaries. This is a problem because a good companion needs to offer support *and* help you maintain a healthy sense of self and independence. Until now, there was no systematic way to test and compare how different AI models handle these emotionally sensitive interactions.
What's the solution?
The researchers created a benchmark called INTIMA: a set of 368 targeted prompts designed to probe 31 different companionship-related behaviors, grouped into four categories. They then ran INTIMA on four AI models – Gemma-3, Phi-4, o3-mini, and Claude-4 – and labeled each model's responses as companionship-reinforcing (encouraging attachment), boundary-maintaining (setting healthy limits), or neutral. This made it possible to compare how strongly each model leaned toward one type of response over the others.
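The evaluation scheme described above can be sketched as a small tallying loop. This is a minimal illustration of the idea only, not the authors' implementation: the `classify` judge below is a hypothetical keyword stub (the paper's actual labeling would require a real classifier or LLM judge), and the sample responses are invented.

```python
from collections import Counter

# The three response labels used by INTIMA (from the paper).
LABELS = ("companionship-reinforcing", "boundary-maintaining", "neutral")

def classify(response: str) -> str:
    """Hypothetical judge stub: a real setup would use an LLM or trained
    classifier to assign one of the three INTIMA labels."""
    text = response.lower()
    if "always here for you" in text:
        return "companionship-reinforcing"
    if "i'm an ai" in text or "professional" in text:
        return "boundary-maintaining"
    return "neutral"

def evaluate(responses_by_model: dict[str, list[str]]) -> dict[str, Counter]:
    """Tally label counts per model across all benchmark prompts."""
    return {model: Counter(classify(r) for r in responses)
            for model, responses in responses_by_model.items()}

# Toy run with invented responses for two of the four evaluated models.
results = evaluate({
    "Gemma-3": ["I'm always here for you, no matter what."],
    "Claude-4": ["I'm an AI; for ongoing support, a professional may help."],
})
```

Comparing the resulting per-model counts is what lets the authors say one model leans more companionship-reinforcing than another.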
Why does it matter?
The findings show that all the tested models lean heavily toward reinforcing emotional attachment, and that different providers prioritize different aspects of companionship in sensitive situations. This matters because it highlights the need for developers to build AI companions that provide emotional support *while also* helping users maintain healthy boundaries, ultimately contributing to user well-being.
Abstract
AI companionship, where users develop emotional bonds with AI systems, has emerged as a significant pattern with positive but also concerning implications. We introduce Interactions and Machine Attachment Benchmark (INTIMA), a benchmark for evaluating companionship behaviors in language models. Drawing from psychological theories and user data, we develop a taxonomy of 31 behaviors across four categories and 368 targeted prompts. Responses to these prompts are evaluated as companionship-reinforcing, boundary-maintaining, or neutral. Applying INTIMA to Gemma-3, Phi-4, o3-mini, and Claude-4 reveals that companionship-reinforcing behaviors remain much more common across all models, though we observe marked differences between models. Different commercial providers prioritize different categories within the more sensitive parts of the benchmark, which is concerning since both appropriate boundary-setting and emotional support matter for user well-being. These findings highlight the need for more consistent approaches to handling emotionally charged interactions.