Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

Yan Wang, Yi Han, Lingfei Qian, Yueru He, Xueqing Peng, Dongji Feng, Zhuohan Xie, Vincent Jim Zhang, Rosie Guo, Fengran Mo, Jimin Huang, Yankai Chen, Xue Liu, Jian-Yun Nie

2026-02-25

Summary

This paper introduces a new way to test how well AI models can give good stock advice, going beyond simply checking if the advice matches what users actually did.

What's the problem?

Current methods for evaluating recommendation systems focus on whether the AI predicts what a user *will* do. However, in finance, people don't always make the best decisions – they might panic-sell during a market drop or chase trends. Simply copying user behavior doesn't mean the AI is giving *good* advice; it just means it's mimicking potentially flawed choices. That makes it hard to tell whether an AI is making smart, long-term recommendations or just echoing short-sighted actions.

What's the solution?

The researchers created a benchmark called Conv-FinRe. This benchmark gives the AI a 'user profile' through a simulated onboarding interview, provides step-by-step information about how the market is changing over time, and then asks the AI to give investment recommendations in a conversation. What's distinctive is that the benchmark doesn't just check whether the AI's advice matches what the user *would* have done; it also checks whether the advice is actually *good* given the user's risk tolerance and long-term financial goals. The benchmark is built from real market data and human-authored conversations.
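The contrast between the two evaluation views can be illustrated with a small sketch. The function names, the mean-variance utility formula, and the toy data below are illustrative assumptions, not the paper's actual metrics: the point is only that a "did the model match the user?" check and a "was the advice good for this user's risk profile?" check can disagree.

```python
# Illustrative sketch (NOT the paper's metrics) of the two evaluation views:
# (1) behavioral alignment -- does the model's top pick match the user's choice?
# (2) normative utility   -- is the ranking good under the user's risk profile?
import statistics


def mean_variance_utility(returns, risk_aversion):
    """Assumed utility: E[r] - lambda * Var[r] (higher lambda = more risk-averse)."""
    return statistics.mean(returns) - risk_aversion * statistics.pvariance(returns)


def behavioral_hit(model_ranking, user_choice):
    """1 if the user's actual pick is the model's top recommendation, else 0."""
    return int(model_ranking[0] == user_choice)


def utility_score(model_ranking, horizon_returns, risk_aversion, k=1):
    """Average mean-variance utility of the model's top-k picks over the horizon."""
    top_k = model_ranking[:k]
    return sum(
        mean_variance_utility(horizon_returns[s], risk_aversion) for s in top_k
    ) / k


# Toy per-stock returns over a fixed horizon (hypothetical tickers).
horizon_returns = {
    "AAA": [0.01, 0.02, -0.01, 0.015],    # steady gains
    "BBB": [0.05, -0.06, 0.07, -0.05],    # volatile, trend-chasing bait
    "CCC": [0.002, 0.001, 0.003, 0.002],  # low risk, low return
}
ranking = ["AAA", "CCC", "BBB"]  # model prefers the steadier stock

# A risk-averse user who nonetheless chased the volatile stock:
print(behavioral_hit(ranking, user_choice="BBB"))                    # -> 0
print(round(utility_score(ranking, horizon_returns, risk_aversion=5.0), 4))
```

Here the model scores zero on behavioral alignment (the user picked the volatile stock) while still earning a positive utility score, mirroring the tension the benchmark is designed to expose.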

Why it matters?

This work is important because it highlights that simply mimicking user behavior isn't enough when giving financial advice. It shows that AI models can sometimes give good advice even if it's different from what a person would immediately choose, and vice versa. This helps researchers build AI systems that truly help people make better investment decisions, rather than just reinforcing their existing habits, even if those habits aren't optimal.

Abstract

Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user's long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.