
DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

Yifan Wang, Bolian Li, Junlin Wu, Zhaoxuan Tan, Zheli Liu, Ruqi Zhang, Ananth Grama, Qingkai Zeng

2025-10-08


Summary

This paper introduces a new way to improve large language models, like those used in chatbots or coding assistants, by learning from the signals users give when they *don't* like a response, rather than relying only on explicit positive feedback.

What's the problem?

Typically, training these models to give better answers relies on people explicitly saying what they like, which is expensive and rare. Users, however, often signal dissatisfaction implicitly: they rephrase their questions, correct the model, or try different approaches until they get a good answer. Existing methods cannot make effective use of this abundant 'dissatisfaction' data because they require plentiful positive examples instead.

What's the solution?

The researchers developed a technique called DRIFT, which stands for Dissatisfaction-Refined Iterative preFerence Training. DRIFT treats the signals users give when they're *not* happy with an answer as negative examples. Rather than requiring labeled positives, it samples positive examples from the model's own evolving policy as training progresses. Essentially, the model's own improvement supplies the positives that help it get better.
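The idea above can be sketched in code. The following is a minimal, illustrative Python sketch, not the paper's actual implementation: a DSAT-flagged response serves as the rejected example, a fresh sample from the current policy serves as the chosen example, and the pair is scored with a standard DPO-style loss. The helper names (`build_drift_pairs`, `sample_from_policy`, `dsat_logs`) are made up for illustration.

```python
import math

def dpo_loss(logp_pos_policy, logp_neg_policy,
             logp_pos_ref, logp_neg_ref, beta=0.1):
    """Standard DPO-style loss on one (chosen, rejected) pair.

    In the DRIFT setup, the rejected response comes from a real
    user-dissatisfaction (DSAT) interaction, while the chosen one is
    sampled dynamically from the current policy.
    """
    margin = beta * ((logp_pos_policy - logp_pos_ref)
                     - (logp_neg_policy - logp_neg_ref))
    # Negative log-sigmoid of the preference margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def build_drift_pairs(dsat_logs, sample_from_policy):
    """Pair each DSAT-flagged response (rejected) with a fresh sample
    from the evolving policy (chosen). Both arguments are illustrative
    stand-ins, not names from the paper's codebase.
    """
    pairs = []
    for log in dsat_logs:
        positive = sample_from_policy(log["prompt"])
        pairs.append({"prompt": log["prompt"],
                      "chosen": positive,
                      "rejected": log["response"]})
    return pairs
```

Because the positives are re-sampled from the policy each iteration rather than fixed up front, the preference margin keeps adapting as the model improves, which is the property the paper credits for avoiding gradient degeneration.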

Why it matters?

DRIFT allows language models to be improved using the huge amount of data already generated by users interacting with them, making the training process more efficient and effective. The results show significant improvements in performance, even surpassing some state-of-the-art models, and it helps the model explore a wider range of good solutions instead of getting stuck on just a few.

Abstract

Real-world large language model deployments (e.g., conversational AI systems, code generation assistants) naturally generate abundant implicit user dissatisfaction (DSAT) signals, as users iterate toward better answers through refinements, corrections, and expressed preferences, while explicit satisfaction (SAT) feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile, as they rely on costly human annotations or assume plentiful positive responses. In this paper, we introduce DRIFT (Dissatisfaction-Refined Iterative preFerence Training), which anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. Empirically, DRIFT models trained on real-world WildFeedback datasets and synthetic UltraFeedback datasets achieve up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2 win rate over base models, outperforming strong baseline methods such as iterative DPO and SPIN. At larger scales, the improvements are particularly pronounced: 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. Further analysis shows that DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we demonstrate that this design preserves preference margins and avoids gradient degeneration. These results show that DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal. The code and data are available at https://github.com/cacayaya/DRIFT.git.