Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models
Shiting Huang, Zecheng Li, Yu Zeng, Qingnan Ren, Zhen Fang, Qisheng Su, Kou Shi, Lin Chen, Zehui Chen, Feng Zhao
2026-02-12
Summary
This paper focuses on improving how Large Language Models (LLMs) learn to reason by building on a technique called Reinforcement Learning with Verifiable Rewards. It introduces a new method, Meta-Experience Learning, to help LLMs learn *from* their mistakes in a more human-like way.
What's the problem?
While training LLMs to reason with verifiable rewards works reasonably well, it's limited because the models don't really understand *why* they made a mistake; they just know they were wrong. Humans, when learning, analyze their errors to build up general knowledge about what *not* to do, which this paper calls 'meta-experience'. Current methods don't effectively capture and reuse this kind of knowledge from past errors, which limits how efficiently the models can improve their reasoning skills.
What's the solution?
The researchers developed Meta-Experience Learning (MEL). Rather than just practicing and checking answers, the system actively analyzes *where* the LLM went wrong in its reasoning. It compares paired correct and incorrect attempts at the same problem and pinpoints the exact step where the reasoning diverges. It then summarizes these errors into reusable 'meta-experience' and stores that knowledge directly in the LLM's own parameters, essentially teaching the model to avoid similar mistakes in the future. Internalization is done by training the model on the distilled meta-experience text itself (minimizing its negative log-likelihood), so the contrast between good and bad attempts nudges the model toward reasoning paths that lead to correct answers.
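To make the loop concrete, here is a minimal sketch of one MEL-style training step, written only from the description above. All of the names and signatures (sample, verify, rl_update, contrast, internalize) are hypothetical stand-ins, not the paper's actual interface; the sketch only shows the control flow: practice, verify, contrast correct and incorrect attempts, summarize, internalize.

```python
# Conceptual sketch of one MEL-style step (hypothetical interface, not the
# authors' code). Helpers are passed in as callables so the sketch stays
# self-contained: 'sample' produces reasoning traces, 'verify' applies the
# verifiable-reward check, 'rl_update' is the usual RLVR policy update,
# 'contrast' asks the model to compare a correct and an incorrect trace and
# write a reusable lesson, and 'internalize' fine-tunes on that lesson text.
from typing import Callable, List


def mel_step(
    sample: Callable[[str, int], List[str]],          # problem -> n reasoning traces
    verify: Callable[[str], bool],                    # trace -> passed the verifiable check?
    rl_update: Callable[[List[str], List[bool]], None],
    contrast: Callable[[str, str, str], str],         # (problem, correct, incorrect) -> lesson
    internalize: Callable[[str], None],               # NLL fine-tuning on the lesson text
    problem: str,
    n_rollouts: int = 8,
) -> None:
    traces = sample(problem, n_rollouts)
    passed = [verify(t) for t in traces]

    # 1. Standard RLVR: reward each trace by whether it verifies.
    rl_update(traces, passed)

    # 2. MEL's addition: contrastive analysis is only possible when the batch
    #    contains both correct and incorrect attempts at the same problem.
    good = [t for t, ok in zip(traces, passed) if ok]
    bad = [t for t, ok in zip(traces, passed) if not ok]
    if good and bad:
        lesson = contrast(problem, good[0], bad[0])   # self-verification / error attribution
        internalize(lesson)                           # store the lesson in parametric memory
```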
Why it matters?
This work is important because it makes LLMs better at reasoning and problem-solving. By letting models learn from their mistakes in a more structured way, MEL makes them more reliable and efficient. In the experiments, the method yielded consistent improvements of 3.92%–4.73% in Pass@1 across reasoning benchmarks and model sizes, suggesting it could lead to more capable and intelligent AI systems.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs). Despite its efficacy, RLVR faces a meta-learning bottleneck: it lacks mechanisms for error attribution and experience internalization intrinsic to the human learning cycle beyond practice and verification, thereby limiting fine-grained credit assignment and reusable knowledge formation. We refer to such reusable knowledge representations derived from past errors as meta-experience. Based on this insight, we propose Meta-Experience Learning (MEL), a novel framework that incorporates self-distilled meta-experience into the model's parametric memory. Building upon standard RLVR, we introduce an additional design that leverages the LLM's self-verification capability to conduct contrastive analysis on paired correct and incorrect trajectories, identify the precise bifurcation points where reasoning errors arise, and summarize them into generalizable meta-experience. The meta-experience is further internalized into the LLM's parametric memory by minimizing the negative log-likelihood, which induces a language-modeled reward signal that bridges correct and incorrect reasoning trajectories and facilitates effective knowledge reuse. Experimental results demonstrate that MEL achieves consistent improvements on benchmarks, yielding 3.92%–4.73% Pass@1 gains across varying model sizes.
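Reading the abstract, the combined training objective plausibly takes a form like the one below, pairing the standard RLVR loss with the negative log-likelihood of the self-distilled meta-experience text e = (e_1, ..., e_T). This is a hedged reconstruction from the abstract's wording, not the paper's exact formulation; in particular, the weighting term λ is an assumption.

```latex
% Hedged reconstruction from the abstract; the paper's exact notation and
% weighting may differ. The lambda weight is an assumption.
\mathcal{L}_{\mathrm{MEL}}(\theta)
  \;=\;
  \underbrace{\mathcal{L}_{\mathrm{RLVR}}(\theta)}_{\text{practice + verification}}
  \;+\;
  \lambda
  \underbrace{\left(-\sum_{t=1}^{T} \log \pi_\theta\!\left(e_t \mid e_{<t}\right)\right)}_{\text{internalize meta-experience } e}
```

Minimizing the second term is simply language modeling on the lesson text, which is what lets the distilled experience act as a language-modeled reward signal bridging the correct and incorrect trajectories it was derived from.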