Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
Yidong Wang, Xin Wang, Cunxiang Wang, Junfeng Fang, Qiufeng Wang, Jianing Chu, Xuran Meng, Shuxun Yang, Libo Qin, Yue Zhang, Wei Ye, Shikun Zhang
2025-08-12
Summary
This paper introduces Temporal Self-Rewarding Language Models, AI models that improve their text generation by using their own past and future outputs to learn which responses are preferable. This helps the models refine how they understand and produce language, especially for inputs that differ from what they were trained on.
What's the problem?
The problem is that language models that reward their own outputs often struggle to learn preferences effectively or to generalize when they encounter inputs very different from what they saw during training. Because the same model produces both the preferred and the rejected responses, it can have trouble telling which outputs are genuinely better or how to keep improving over successive rounds of self-training.
What's the solution?
The paper proposes decoupling how the chosen and rejected responses in each preference pair are produced: rather than sampling both from the same current model, it separates them in time, drawing on the model's past outputs and its future outputs, as sketched below. Contrasting stronger responses against weaker ones taken from different points in the model's history gives a clearer preference signal, improving how the model learns preferences and how it handles new, unseen inputs during generation.
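To make the decoupling concrete, here is a minimal sketch of how such temporally separated preference pairs might be assembled. Everything in it is an illustrative assumption rather than the paper's actual procedure: the helper callables generate_past, generate_current, and judge_score stand in for sampling from an older model snapshot, sampling from the current model, and self-judged scoring, and mapping past outputs to the rejected side is one plausible reading of the approach.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the model should move toward
    rejected: str  # response the model should move away from


def build_temporal_pairs(
    prompts: List[str],
    generate_past: Callable[[str, int], List[str]],     # sample from a past snapshot (assumed helper)
    generate_current: Callable[[str, int], List[str]],  # sample from the current model (assumed helper)
    judge_score: Callable[[str, str], float],           # self-judged quality score (assumed helper)
    n_samples: int = 4,
) -> List[PreferencePair]:
    """Build preference pairs whose two sides come from different points in time."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        # Chosen side: the best self-judged response from the current model.
        current = generate_current(prompt, n_samples)
        chosen = max(current, key=lambda r: judge_score(prompt, r))

        # Rejected side: anchored to an older snapshot of the model, so the
        # contrast between chosen and rejected does not shrink as training proceeds.
        past = generate_past(prompt, n_samples)
        rejected = min(past, key=lambda r: judge_score(prompt, r))

        pairs.append(PreferencePair(prompt=prompt, chosen=chosen, rejected=rejected))
    return pairs
```

The resulting pairs can then be used for ordinary preference optimization; the intended effect is that the contrast between the two sides stays informative even as the model improves.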
Why it matters?
This matters because better preference learning and generalization let AI models produce higher-quality, more accurate, and more natural text. It helps models hold up in real-world tasks where inputs may be unusual or unfamiliar, making AI assistants, chatbots, and other language tools more reliable and useful.
Abstract
Temporal Self-Rewarding Language Models improve generative capabilities by strategically using past and future model outputs, enhancing preference learning and out-of-distribution generalization.
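For reference, self-rewarding setups typically train on such chosen/rejected pairs with a DPO-style objective; assuming that standard formulation (this summary does not state the exact loss used), a pair $(x, y_w, y_l)$ with chosen response $y_w$ and rejected response $y_l$ contributes

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)$$

where $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\sigma$ is the logistic function, and $\beta$ controls how strongly the preference is enforced. In the reading sketched above, $y_l$ would come from the model's past outputs and $y_w$ from its current or future-guided outputs.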