Meta-RL Induces Exploration in Language Agents

Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic

2025-12-22

Summary

This paper introduces LaMer, a new way to train AI agents powered by large language models to be better at solving complex tasks that require them to learn through trial and error.

What's the problem?

Currently, when you train these AI agents using reinforcement learning, they often struggle when they need to actively try different things to figure out the best approach, and they aren't very good at learning from their mistakes on the fly. They get stuck easily and don't adapt well to new situations.

What's the solution?

LaMer tackles this with a 'meta-reinforcement learning' approach that has two main parts. First, it trains the agent across many different scenarios to encourage exploration and optimize for long-term rewards. Second, it lets the agent quickly adjust its strategy based on feedback it receives *during* a task, without retraining the model. This adjustment happens through 'reflection': the agent summarizes what has gone wrong so far and changes its approach accordingly.
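The test-time part of this idea can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the authors' actual implementation: every name here (`reflect`, `solve_with_reflection`, `toy_policy`, the `{"success", "feedback"}` result format) is an assumption made for the example. The key point it shows is that adaptation happens by feeding feedback back into the agent's context, with no gradient update.

```python
def reflect(history):
    """Summarize past failed attempts into a text hint for the next try.

    Stand-in for the reflection step: in the real system an LLM would
    write this summary; here we just concatenate the failure feedback.
    """
    failures = [h for h in history if not h["success"]]
    return "; ".join(h["feedback"] for h in failures)

def solve_with_reflection(task, policy, max_episodes=3):
    """Attempt a task several times, folding environment feedback from
    each failure back into the prompt so the next attempt can adapt,
    without any weight update to the underlying model."""
    history = []
    for _ in range(max_episodes):
        hint = reflect(history)      # in-context 'lesson' from earlier trials
        result = policy(task, hint)  # frozen policy, conditioned on the hint
        history.append(result)
        if result["success"]:
            return result
    return history[-1]

def toy_policy(task, hint):
    """Toy stand-in for an LLM policy: it only succeeds once it has
    been given a reflection hint from a previous failed episode."""
    if hint:
        return {"success": True, "feedback": ""}
    return {"success": False, "feedback": f"first attempt at {task} failed"}
```

With `toy_policy`, the first episode fails, the failure is summarized into a hint, and the second episode succeeds: `solve_with_reflection("sokoban-level-1", toy_policy)["success"]` is `True`. The point of the design is that the policy itself never changes; only its context does.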

Why does it matter?

This research is important because it shows a more effective way to build AI agents that can handle complex, real-world problems. LaMer significantly improves performance in tasks like puzzle solving and online shopping simulations, and it's also better at adapting to new and challenging tasks, making these AI agents more robust and useful.

Abstract

Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, RL-trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial-and-error experiences. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to RL-trained agents. Overall, our results demonstrate that Meta-RL provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.