Differentiable Evolutionary Reinforcement Learning

Sitao Cheng, Tianle Li, Xuhan Huang, Xunjian Yin, Difan Zou

2025-12-17

Summary

This paper tackles the challenge of designing good reward systems for teaching artificial intelligence agents to perform complex tasks, like reasoning and problem-solving.

What's the problem?

In reinforcement learning, where AI learns through trial and error, figuring out *how* to reward an agent is surprisingly hard. If the reward isn't well-designed, the agent might learn the wrong thing or struggle to learn at all. Existing methods for automatically finding good rewards often treat the reward system like a black box, meaning they don't understand *why* certain rewards work better than others, and they can be inefficient.

What's the solution?

The researchers introduce a new approach called Differentiable Evolutionary Reinforcement Learning, or DERL. It works by automatically evolving a reward function, but unlike previous methods, DERL can actually 'understand' how changes to the reward affect the agent's performance. It does this through a process similar to how the agent itself learns: it receives feedback (the agent's success or failure on held-out validation tasks) and adjusts the reward accordingly. DERL builds rewards from simpler components, then uses this feedback to refine how those components are combined, making the rewards more effective over time.
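To make the bilevel idea concrete, here is a minimal, self-contained sketch of the loop the paragraph describes: an outer meta-optimizer composes a reward from atomic primitives, an inner loop trains a policy with that composed reward, and the trained policy's validation success is fed back to update the composition. Everything here is an illustrative assumption, not the paper's implementation: the primitive set (`PRIMITIVES`), the toy two-armed bandit environment (`rollout`), and the perturbation-based meta-update (`derl_outer_loop`, a crude REINFORCE-like stand-in for the paper's meta-gradient) are all invented for this sketch.

```python
import math
import random

# Illustrative atomic reward primitives (assumed, not the paper's actual set):
# each maps a trajectory summary to one scalar reward component.
PRIMITIVES = [
    lambda traj: traj["success"],        # sparse task success
    lambda traj: -traj["steps"] / 10.0,  # step penalty (denser shaping)
    lambda traj: traj["progress"],       # partial-progress signal
]

def meta_reward(weights, traj):
    """Compose the 'Meta-Reward' as a weighted sum of primitives."""
    return sum(w * p(traj) for w, p in zip(weights, PRIMITIVES))

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def rollout(action, rng):
    """Toy environment: arm 1 usually succeeds quickly, arm 0 rarely does."""
    if action == 1:
        return {"success": 1.0 if rng.random() < 0.9 else 0.0,
                "steps": 3, "progress": 0.9}
    return {"success": 1.0 if rng.random() < 0.2 else 0.0,
            "steps": 9, "progress": 0.3}

def inner_loop_train(weights, episodes=200, seed=0):
    """Inner loop: train a 2-armed bandit policy with the composed reward."""
    rng = random.Random(seed)
    pref = [0.0, 0.0]  # action preferences
    for _ in range(episodes):
        a = 0 if rng.random() < softmax(pref)[0] else 1
        r = meta_reward(weights, rollout(a, rng))
        pref[a] += 0.1 * r  # simple policy-gradient-style update
    return pref

def validate(pref, trials=200, seed=1):
    """Outer-loop fitness: raw task success of the trained policy."""
    rng = random.Random(seed)
    wins = 0.0
    for _ in range(trials):
        a = 0 if rng.random() < softmax(pref)[0] else 1
        wins += rollout(a, rng)["success"]
    return wins / trials

def derl_outer_loop(generations=10, seed=2):
    """Meta-optimization sketch: treat validation success as the signal and
    keep weight perturbations that improve it."""
    rng = random.Random(seed)
    weights = [0.5, 0.0, 0.0]
    for _ in range(generations):
        cand = [w + rng.gauss(0, 0.3) for w in weights]
        if validate(inner_loop_train(cand)) >= validate(inner_loop_train(weights)):
            weights = cand  # move toward perturbations that help the task
    return weights

weights = derl_outer_loop()
policy = inner_loop_train(weights)
print(f"validation success rate: {validate(policy):.2f}")
```

The key design point this sketch mirrors is that the outer update is driven by the inner policy's *validation performance*, not by hand-tuned heuristics, so denser shaping terms survive only if they actually improve task success.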

Why it matters?

This work is important because it allows AI agents to learn complex tasks without needing humans to carefully design every detail of the reward system. By automatically discovering effective rewards, DERL can create AI that is better at adapting to new situations and solving problems, especially in areas like robotics, scientific discovery, and even math.

Abstract

The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike previous evolutionary approaches, DERL is differentiable in its meta-optimization: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agents (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.