Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It

Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh, Maryam Fazel, Yulia Tsvetkov

2025-10-06

Summary

This paper explores a weakness in how we build large language models (LLMs) – they’re really good at getting the *right* answer, but not necessarily at giving you the answer *you* want. It introduces a new way to test if LLMs can figure out what you need, even if you don’t tell them directly.

What's the problem?

Currently, LLMs are trained in two steps: first to be factually correct, and then to match general human preferences. This breaks down when a user has specific needs that those aggregated preferences don't capture. Imagine asking a question and getting a technically correct answer that's far too complicated for your level of expertise – the LLM never tried to figure out what you actually know. This is hardest in 'cold-start' situations, where the model has no past interactions or prior information about the user, for example because of privacy constraints.

What's the solution?

The researchers created a testing method called PREFDISCO. It uses detailed 'personas' – basically, fictional people with specific backgrounds and preferences – to simulate real users. The same question can have different 'right' answers depending on the persona asking it. This forces the LLM to not just give a correct answer, but to *figure out* what kind of answer the persona would find most helpful. They tested 21 different LLMs on 10 different tasks using this method.
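The core idea – the same question scored differently depending on who is asking – can be illustrated with a toy sketch. This is not the paper's actual PREFDISCO implementation; the `Persona` class and the string-matching `score_alignment` function are hypothetical stand-ins for the paper's psychologically-grounded personas and preference-alignment metric.

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    """Hypothetical persona: a fictional user with sparse, known preferences."""
    name: str
    expertise: str
    preferences: dict = field(default_factory=dict)

def score_alignment(response: str, persona: Persona) -> float:
    """Toy alignment score: fraction of the persona's stated preferences
    the response satisfies (a placeholder for the paper's real metric)."""
    prefs = persona.preferences
    if not prefs:
        return 0.0
    hits = sum(1 for want in prefs.values() if want in response)
    return hits / len(prefs)

# The same question, asked by two different personas.
novice = Persona("novice", "beginner", {"style": "analogy"})
expert = Persona("expert", "researcher", {"style": "equations"})

generic = "A correct but one-size-fits-all explanation with equations."
tailored = "A correct explanation built around an analogy."

# A response that serves one persona can fail the other:
assert score_alignment(tailored, novice) > score_alignment(generic, novice)
assert score_alignment(generic, expert) > score_alignment(tailored, expert)
```

The point of the sketch is only that correctness and alignment are scored separately: both responses above could be factually right, but each persona would rank them differently, which is what forces the model to discover the user's preferences rather than rely on a single "best" answer.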

Why it matters?

The results showed that in 29% of cases, naive attempts at personalization actually produced *worse* preference alignment than just giving a standard generic answer – yet generic answers also fail to serve individual users well. This means that adapting to individual users isn't something LLMs automatically get better at; it requires dedicated development. This research highlights the need for LLMs that can actively learn about your preferences through questioning and adjust their responses accordingly, which is crucial for applications like education, healthcare, and technical support where a one-size-fits-all approach doesn't work.

Abstract

Current large language model (LLM) development treats task-solving and preference alignment as separate challenges, optimizing first for objective correctness, then for alignment to aggregated human preferences. This paradigm fails in human-facing applications where solving a problem correctly is insufficient if the response mismatches the user's needs. This challenge intensifies in just-in-time scenarios where no prior user interaction history exists due to cold-start conditions or privacy constraints. LLMs need to identify what they don't know about user preferences, strategically elicit preference values through questioning, then adapt their reasoning processes and responses accordingly -- a complicated chain of cognitive processes which we term personalized reasoning. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse preferences. Our framework creates scenarios where identical questions require different reasoning chains depending on user context, as optimal explanation approaches vary by individual expertise and preferences while maintaining factual accuracy. Evaluation of 21 frontier models across 10 tasks reveals 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs effectively. These findings suggest personalized reasoning requires dedicated development rather than emerging naturally. PREFDISCO establishes personalized reasoning as a measurable research frontier and reveals fundamental limitations in current LLMs' interactive capabilities, providing a foundation for developing systems that can adapt to individual users in education, healthcare, and technical domains where personalization is critical.