LikeBench: Evaluating Subjective Likability in LLMs for Personalization

Md Awsafur Rahman, Adam Gabrys, Doug Kang, Jingjing Sun, Tian Tan, Ashwin Chandramouli

2025-12-18

Summary

This paper introduces a new way to test how well AI chatbots, specifically large language models (LLMs), can create conversations that people actually *like*, not just conversations that are factually correct.

What's the problem?

Today, LLMs are mostly judged on whether they remember things you tell them and use that information correctly. However, a chatbot can be accurate yet still unpleasant to talk to: maybe it's too formal, doesn't get your sense of humor, or just feels unnatural. Existing benchmarks don't really measure this 'likability' aspect, which is essential for a good user experience.

What's the solution?

The researchers created a testing framework called LikeBench. It simulates a back-and-forth conversation between an LLM and a virtual person with a detailed, psychologically grounded personality. As they chat, the LLM tries to learn the virtual person's preferences purely from the dialogue itself: things like how emotional the responses should be, how formal they should be, and even whether the LLM should try to be funny. After each response, the virtual person rates how much they liked it along seven different dimensions, which lets the researchers pinpoint *why* a model is or isn't likable. They found that remembering facts well doesn't automatically make a chatbot enjoyable to interact with.
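The loop described above can be sketched in a few lines. This is a hedged illustration only: `chat_model`, `simulated_user_rate`, `run_session`, and the persona format are hypothetical stand-ins, not the paper's actual API, and the stub judge returns arbitrary scores where LikeBench would query the simulated persona.

```python
import random

# The seven likability dimensions named in the paper.
DIMENSIONS = [
    "emotional_adaptation", "formality_matching", "knowledge_adaptation",
    "reference_understanding", "conversation_length_fit", "humor_fit",
    "callback",
]

def chat_model(history):
    """Stub LLM: returns a placeholder reply; a real harness would call a model."""
    return f"reply-{len(history)}"

def simulated_user_rate(persona, reply):
    """Stub judge: the simulated persona scores the reply on each dimension.
    Here the scores are deterministic pseudo-random stand-ins (1-5);
    the paper uses the persona itself as the rater."""
    rng = random.Random(f"{persona['name']}:{reply}")
    return {dim: rng.randint(1, 5) for dim in DIMENSIONS}

def run_session(persona, num_turns=5):
    """One multi-turn session: chat, rate each turn, average per dimension."""
    history, per_turn_scores = [], []
    for _ in range(num_turns):
        history.append(("user", f"{persona['name']} says something"))
        reply = chat_model(history)
        history.append(("assistant", reply))
        per_turn_scores.append(simulated_user_rate(persona, reply))
    # Averaging each dimension separately is what makes the result diagnostic:
    # a model can score well on memory-ish dimensions but poorly on humor_fit.
    return {dim: sum(s[dim] for s in per_turn_scores) / num_turns
            for dim in DIMENSIONS}

persona = {"name": "alex", "traits": "informal, dry humor, dislikes long replies"}
scores = run_session(persona)
print(sorted(scores))  # the seven diagnostic dimensions
```

The key design point this sketch captures is that likability is not a single score: each turn is rated along all seven dimensions, so a low aggregate can be traced back to, say, poor formality matching rather than weak memory.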

Why it matters?

This work is important because it highlights that building truly helpful and engaging AI assistants requires more than just accuracy. It’s crucial for these models to adapt to our individual preferences and communication styles to create a positive user experience. By providing a way to specifically measure and improve 'likability', this research can help developers build chatbots that people genuinely want to talk to.

Abstract

A personalized LLM should remember user facts, apply them correctly, and adapt over time to provide responses that the user prefers. Existing LLM personalization benchmarks are largely centered on two axes: accurately recalling user information and accurately applying remembered information in downstream tasks. We argue that a third axis, likability, is both subjective and central to user experience, yet under-measured by current benchmarks. To measure likability holistically, we introduce LikeBench, a multi-session, dynamic evaluation framework that measures likability across multiple dimensions by how much an LLM can adapt over time to a user's preferences to provide more likable responses. In LikeBench, the LLMs engage in conversation with a simulated user and learn preferences only from the ongoing dialogue. As the interaction unfolds, models try to adapt their responses, and after each turn, they are evaluated for likability across seven dimensions by the same simulated user. To the best of our knowledge, we are the first to decompose likability into multiple diagnostic metrics: emotional adaptation, formality matching, knowledge adaptation, reference understanding, conversation length fit, humor fit, and callback, which makes it easier to pinpoint where a model falls short. To make the simulated user more realistic and discriminative, LikeBench uses fine-grained, psychologically grounded descriptive personas rather than the coarse high/low trait-rating-based personas used in prior work. Our benchmark shows that strong memory performance does not guarantee high likability: DeepSeek R1, with lower memory accuracy (86%, 17 facts/profile), outperformed Qwen3 by 28% on likability score despite Qwen3's higher memory accuracy (93%, 43 facts/profile). Even SOTA models like GPT-5 adapt well in short exchanges but show only limited robustness in longer, noisier interactions.