Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang
2025-09-12
Summary
This paper addresses a key issue with using large language models (LLMs) to complete complex tasks that take many steps, like shopping online or following instructions in a virtual environment.
What's the problem?
When LLMs are learning to do these long tasks, they often only get feedback at the very end – did they succeed or fail? This makes it hard for the LLM to figure out *which* of its many steps were good or bad. Existing solutions try to create more frequent feedback, but the paper identifies a deeper problem: the size of an LLM's learning update is tied to how confident it is. When the model is confident, even about a correct action, it makes only tiny updates; when it is uncertain, it can make large, unstable ones. This coupling between confidence and update size can mess up the learning process.
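The coupling described above follows from how policy gradients behave under a softmax policy: the gradient of a chosen action's log-probability with respect to the logits is (one-hot − probabilities), so its magnitude shrinks toward zero as the policy becomes confident. A minimal numeric demo (not from the paper, just a standard property of softmax policies):

```python
import numpy as np

def logprob_grad_norm(logits, action):
    """Norm of d/d(logits) of log softmax(logits)[action].
    The gradient is (one_hot(action) - probs), so it shrinks
    as the policy grows confident in the chosen action."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad = -p
    grad[action] += 1.0
    return np.linalg.norm(grad)

confident = np.array([8.0, 0.0, 0.0, 0.0])  # low entropy: p[0] ~ 0.999
uncertain = np.array([0.0, 0.0, 0.0, 0.0])  # max entropy: uniform p

# A confident correct action yields a near-zero update,
# while an uncertain action yields a much larger one.
print(logprob_grad_norm(confident, 0))
print(logprob_grad_norm(uncertain, 0))
```

This is exactly the pathology the paper targets: the steps the agent should reinforce most (confident, correct ones) receive the smallest gradients.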
What's the solution?
The researchers propose a new method called Entropy-Modulated Policy Gradients, or EMPG. Essentially, EMPG adjusts how much the LLM learns from each step based on how certain it was about that step and whether the final task succeeded. If the LLM was confident and correct, the update is amplified. If it was confident but wrong, it is penalized. If it was unsure, the update is attenuated to prevent instability. They also add a bonus term that encourages the LLM to find solution paths that are more predictable and consistent.
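The re-calibration described above can be sketched in a few lines. This is an illustrative sketch under assumed forms, not the paper's exact formula: the function name, the exponential confidence factor, and the coefficients `alpha` and `beta` are all assumptions made here for clarity.

```python
import numpy as np

def empg_advantage(advantage, step_entropy, next_step_entropy,
                   alpha=1.0, beta=0.1):
    """Hypothetical EMPG-style re-calibration of a step's learning signal.

    - confidence = exp(-alpha * H) is near 1 for low-entropy (confident)
      steps and near 0 for high-entropy (uncertain) ones. Multiplying the
      signed outcome-based advantage by it amplifies confident steps
      (rewarding confident successes, penalizing confident errors) and
      attenuates uncertain steps to stabilize exploration.
    - clarity_bonus rewards steps whose successor step is predictable
      (low next-step entropy), mirroring the 'future clarity' bonus.
    """
    confidence = np.exp(-alpha * step_entropy)           # in (0, 1]
    clarity_bonus = beta * np.exp(-alpha * next_step_entropy)
    return confidence * advantage + clarity_bonus

# Confident step (H=0.1) in a successful episode vs. uncertain step (H=2.0):
# the confident step keeps most of its positive advantage,
# the uncertain one is damped toward zero.
print(empg_advantage(+1.0, 0.1, 0.5))
print(empg_advantage(+1.0, 2.0, 0.5))
# Confident step in a failed episode: the penalty stays strong.
print(empg_advantage(-1.0, 0.1, 0.5))
```

The key design choice this sketch illustrates is that modulation is a function of both step-wise uncertainty and the final outcome, so the same confidence factor that amplifies confident successes also sharpens the penalty on confident failures.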
Why does it matter?
This work is important because it improves the ability of LLMs to tackle complex, multi-step tasks. By addressing the fundamental issue of how LLMs learn from limited feedback, EMPG allows them to learn more efficiently and reliably, leading to better performance on tasks like web shopping, following instructions, and searching for information. This brings us closer to having AI agents that can effectively handle real-world problems.
Abstract
In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge: sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, which leads to inefficiently small updates for confident correct actions and potentially destabilizing large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines. The project page is at https://empgseed-seed.github.io/