Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning

Yaochen Zhu, Harald Steck, Dawen Liang, Yinhan He, Jundong Li, Nathan Kallus

2025-11-03

Summary

This paper focuses on improving how large language models, which are good at understanding and generating human-like text, can be used to create better recommendation systems that work through conversations. Imagine chatting with a system to get personalized suggestions – that’s the idea!

What's the problem?

Using these language models for recommendations isn't straightforward. They often suggest items that aren't actually in the catalog the system can recommend from, don't follow the requested output format (like a numbered list), and the quality of their suggestions drops sharply toward the end of the list. Basically, they're powerful but need to be trained specifically for the task of recommending things well.

What's the solution?

The researchers developed a system called ConvRec-R1 which trains the language model in two steps. First, they create a high-quality dataset of example conversations and recommendations using another powerful language model to show the system what good recommendations look like. Then, they use a special training technique called Rank-GRPO that focuses on improving the ranking of each item in the recommendation list individually, making sure each suggestion is good and the order makes sense. This technique is designed to be more stable and effective than previous methods.
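The idea of scoring each rank in the list on its own (rather than giving one score to the whole list) can be sketched as follows. This is an illustrative simplification, not the paper's exact reward definition; the function name and 0/1 relevance scoring are assumptions for the example.

```python
# Hypothetical sketch of rank-wise rewards: each position in the
# recommendation list gets its own score, so a good item late in the
# list is not penalized for bad items earlier in it.
def rank_wise_rewards(recommended, relevant):
    """Return one reward per rank: 1.0 if the item at that rank is in
    the user's set of relevant items, else 0.0."""
    return [1.0 if item in relevant else 0.0 for item in recommended]

rewards = rank_wise_rewards(
    ["movie_a", "movie_b", "movie_c"],
    relevant={"movie_a", "movie_c"},
)
# rewards == [1.0, 0.0, 1.0]
```

With one reward per rank, the training signal can tell the model exactly which positions in the list were good, instead of averaging everything into a single number.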

Why it matters?

This work is important because it makes conversational recommendation systems more practical and effective. By addressing the issues of inaccurate suggestions and declining quality, it brings us closer to having AI assistants that can genuinely understand our preferences and provide helpful, personalized recommendations through natural conversation.

Abstract

Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit instead of token (too fine-grained) or sequence (too coarse), redefining rewards to remove non-causal credit assignment and introducing a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are released at https://github.com/yaochenzhu/Rank-GRPO.
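The rank-level importance ratio mentioned in the abstract can be sketched as below. This is a minimal illustration under assumed inputs (per-token probabilities for the tokens spelling out one ranked item, under the new and old policies); the function name and interface are hypothetical, not the paper's code.

```python
import math

# Sketch of a rank-level importance ratio: for the tokens that make up
# one rank's item, take the geometric mean of the per-token probability
# ratios between the new and old policy. This sits between a per-token
# ratio (too fine-grained) and a whole-sequence ratio (too coarse).
def rank_importance_ratio(new_token_probs, old_token_probs):
    """Geometric mean of new/old token probability ratios for one rank."""
    n = len(new_token_probs)
    log_ratio = sum(
        math.log(p_new) - math.log(p_old)
        for p_new, p_old in zip(new_token_probs, old_token_probs)
    )
    return math.exp(log_ratio / n)

# Example: an item spelled by two tokens; only the first token's
# probability changed between policies.
ratio = rank_importance_ratio([0.5, 0.8], [0.4, 0.8])
# geometric mean of (0.5/0.4) and (0.8/0.8) = sqrt(1.25) ≈ 1.118
```

Because the geometric mean averages in log space, a single extreme token ratio cannot blow up the whole rank's ratio, which is one plausible reason such a quantity stabilizes policy updates.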