Experiential Reinforcement Learning
Taiwei Shi, Sihao Chen, Bowen Jiang, Linxin Song, Longqi Yang, Jieyu Zhao
2026-02-17
Summary
This paper introduces a new way to train language models with reinforcement learning, focusing on how to make them learn better from sparse and delayed feedback.
What's the problem?
When language models are trained to perform tasks through trial and error, the feedback they receive is often sparse and doesn't immediately tell them *why* they failed. This makes it hard for the model to figure out what changes it needs to make to improve, because it has to guess how past mistakes relate to future success. It's like trying to learn a game without ever being told what you did wrong after each attempt.
What's the solution?
The researchers developed a method called Experiential Reinforcement Learning (ERL). Instead of only *trying* a task, the model also *reflects* on its attempt after receiving feedback: it writes a short explanation of what happened and how it could do better, then uses that reflection to make a second, improved attempt. When the second attempt succeeds, that behavior is reinforced back into the base model, so the improvement sticks without needing the extra reflection step at deployment. This 'experience-reflection-consolidation' loop turns vague feedback into concrete steps for improvement and helps the model learn more efficiently and consistently.
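To make the loop concrete, here is a minimal sketch of a single ERL rollout in Python. The `model.generate` and `env.evaluate` interfaces and the prompt wording are illustrative assumptions for this summary, not the paper's actual implementation.

```python
def erl_rollout(model, task, env):
    """One experience -> reflection -> refinement rollout (illustrative sketch)."""
    # 1. Experience: the model makes an initial attempt and gets environment feedback.
    first_attempt = model.generate(task.prompt)
    first_feedback = env.evaluate(task, first_attempt)

    # 2. Reflection: the model explains what happened and how to do better,
    #    conditioned on the task, its attempt, and the feedback it received.
    reflection = model.generate(
        f"Task: {task.prompt}\n"
        f"Attempt: {first_attempt}\n"
        f"Feedback: {first_feedback.message}\n"
        "Briefly explain what went wrong and how to improve."
    )

    # 3. Refinement: a second attempt guided by the reflection.
    second_attempt = model.generate(
        f"Task: {task.prompt}\nLessons from a previous attempt: {reflection}"
    )
    second_feedback = env.evaluate(task, second_attempt)

    # Both attempts and their feedback are returned so the training step can
    # reinforce the successful refinement (see the consolidation sketch below).
    return (first_attempt, first_feedback), (second_attempt, second_feedback), reflection
```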
Why it matters?
This research is important because it shows a practical way to improve how language models learn from feedback, especially in complex situations. By adding a self-reflection step during training, the models become better at acting on sparse feedback, with reported gains of up to 81% in multi-step control environments and up to 11% in tool-using reasoning tasks. This could lead to more capable and reliable AI systems that learn and adapt more effectively in the real world.
Abstract
Reinforcement learning has become the central approach for language models (LMs) to learn from environmental reward or feedback. In practice, the environmental feedback is usually sparse and delayed. Learning from such signals is challenging, as LMs must implicitly infer how observed failures should translate into behavioral changes for future iterations. We introduce Experiential Reinforcement Learning (ERL), a training paradigm that embeds an explicit experience-reflection-consolidation loop into the reinforcement learning process. Given a task, the model generates an initial attempt, receives environmental feedback, and produces a reflection that guides a refined second attempt, whose success is reinforced and internalized into the base policy. This process converts feedback into structured behavioral revision, improving exploration and stabilizing optimization while preserving gains at deployment without additional inference cost. Across sparse-reward control environments and agentic reasoning benchmarks, ERL consistently improves learning efficiency and final performance over strong reinforcement learning baselines, achieving gains of up to +81% in complex multi-step environments and up to +11% in tool-using reasoning tasks. These results suggest that integrating explicit self-reflection into policy training provides a practical mechanism for transforming feedback into durable behavioral improvement.
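The abstract states that the refined attempt's success is "reinforced and internalized into the base policy," preserving gains at deployment without extra inference cost. Below is a hedged sketch of what such a consolidation update could look like, assuming a standard REINFORCE-style policy-gradient objective and a hypothetical `policy.log_prob` helper; the paper's actual training objective may differ.

```python
def consolidate(policy, optimizer, task_prompt, refined_attempt, reward, baseline=0.0):
    """Illustrative consolidation step: reinforce the post-reflection attempt
    on the base policy, conditioned on the task prompt alone, so the improved
    behavior is available at deployment without a reflection step."""
    # `policy.log_prob(prompt, completion)` is an assumed helper returning the
    # summed token log-probability of `completion` given `prompt` as a tensor.
    advantage = reward - baseline
    log_prob = policy.log_prob(task_prompt, refined_attempt)
    loss = -advantage * log_prob  # REINFORCE-style policy-gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, the reward would come from evaluating the second attempt in the rollout sketched earlier, and only successful refinements would be consolidated; this is one plausible interpretation of the abstract rather than a description of the authors' exact procedure.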