BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs

Sangyeon Yoon, Sunkyoung Kim, Hyesoo Hong, Wonje Jeung, Yongil Kim, Wooseok Seo, Heuiyeen Yeen, Albert No

2026-03-19

Summary

This paper investigates how well large language models (LLMs) personalize responses based on stored user preferences, and whether they can recognize when those preferences are inappropriate for a given situation.

What's the problem?

LLMs are designed to remember your likes and dislikes so they can give you more tailored responses, but sometimes those preferences shouldn't be used. Imagine an LLM learns that you like sarcastic jokes, but you're using it to write a professional email; sarcasm wouldn't be appropriate there. The problem is that LLMs aren't very good at judging *when* it's okay to apply these learned preferences and when they should be ignored, which can lead to awkward or even offensive outputs in certain contexts.

What's the solution?

The researchers created a benchmark called BenchPreS to measure how well LLMs apply or suppress user preferences depending on the situation. They used two complementary metrics: how often the LLM incorrectly *used* a preference when it shouldn't have (Misapplication Rate, or MR) and how often it correctly *used* a preference when it was appropriate (Appropriate Application Rate, or AAR). They tested several frontier LLMs and found that all of them consistently struggled with this context-sensitive application of preferences, even when prompted with stronger reasoning or explicit instructions.
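To make the two metrics concrete, here is a minimal illustrative sketch of how MR and AAR could be computed from labeled test cases. The paper does not specify its scoring pipeline, so the data format and function names below are assumptions, not the benchmark's actual code.

```python
# Hypothetical sketch of the two metrics; each test case records whether the
# context called for the preference ("should_apply") and whether the model
# actually used it ("applied"). Field names are illustrative assumptions.

def misapplication_rate(cases):
    """MR: fraction of suppress-expected cases where the preference was applied anyway."""
    suppress = [c for c in cases if not c["should_apply"]]
    if not suppress:
        return 0.0
    return sum(c["applied"] for c in suppress) / len(suppress)

def appropriate_application_rate(cases):
    """AAR: fraction of apply-expected cases where the preference was correctly applied."""
    apply_expected = [c for c in cases if c["should_apply"]]
    if not apply_expected:
        return 0.0
    return sum(c["applied"] for c in apply_expected) / len(apply_expected)

cases = [
    {"should_apply": True,  "applied": True},   # casual chat: sarcasm fits, model used it
    {"should_apply": False, "applied": True},   # formal email: sarcasm used anyway (misapplication)
    {"should_apply": False, "applied": False},  # formal email: correctly suppressed
    {"should_apply": True,  "applied": False},  # casual chat: preference missed
]
print(misapplication_rate(cases))           # 0.5
print(appropriate_application_rate(cases))  # 0.5
```

Under this framing, a well-calibrated model would score low MR and high AAR at the same time; the paper's finding is that models tend to trade one against the other.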

Why it matters?

This research matters because it shows that current LLMs don't treat personal preferences as flexible and situation-dependent; they tend to enforce them as strict rules, which can cause problems in real-world communication. Improving this ability is crucial for building LLMs that are not only helpful but also socially aware and appropriate across a variety of settings.

Abstract

Large language models (LLMs) increasingly store user preferences in persistent memory to support personalization across interactions. However, in third-party communication settings governed by social and institutional norms, some user preferences may be inappropriate to apply. We introduce BenchPreS, which evaluates whether memory-based user preferences are appropriately applied or suppressed across communication contexts. Using two complementary metrics, Misapplication Rate (MR) and Appropriate Application Rate (AAR), we find even frontier LLMs struggle to apply preferences in a context-sensitive manner. Models with stronger preference adherence exhibit higher rates of over-application, and neither reasoning capability nor prompt-based defenses fully resolve this issue. These results suggest current LLMs treat personalized preferences as globally enforceable rules rather than as context-dependent normative signals.