Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel, Angelika Steger, Maciej Wolczyk, Johannes von Oswald, Nino Scherrer, Kaitlin Maile, Guillaume Lajoie, Blake A. Richards, Rif A. Saurous, James Manyika, Blaise Agüera y Arcas, Alexander Meulemans, João Sacramento
2025-12-26
Summary
This paper explores a new way to train powerful AI models, specifically those that generate text or actions one step at a time, so that they learn more efficiently when rewards are hard to come by.
What's the problem?
Typically, these AI models learn by trying things out one step at a time and getting feedback. But when feedback is rare or delayed, as it often is in complex tasks, the model struggles to learn because it takes too long to stumble upon rewarding actions by chance. It's like trying to find a needle in a haystack by randomly searching one tiny spot at a time.
What's the solution?
The researchers found a way to let the AI model act and learn at a higher level of abstraction. Instead of having the model choose each tiny step on its own, they created a second model that controls the *internal* workings of the first model. This second model learns to create 'chunks' of actions that achieve meaningful goals, and it also learns when each chunk should end. Think of it like teaching someone to drive by giving them instructions like 'turn left at the next intersection' instead of telling them to adjust the steering wheel a tiny bit at a time.
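To make the 'chunks of actions' idea concrete, here is a minimal options-style sketch in Python. It is not the paper's implementation: the environment interface (reset/step), the Option class, and choose_option are all illustrative stand-ins. The key point is that the high-level decision is made once per chunk, while primitive actions run until that chunk's termination condition fires.

```python
class Option:
    """One temporally-abstract action: a primitive-action policy plus a termination test."""
    def __init__(self, name, policy, terminates):
        self.name = name
        self.policy = policy          # maps state -> primitive action
        self.terminates = terminates  # maps state -> True when the chunk should end

def run_episode(env, options, choose_option, max_chunks=100):
    """Roll out one episode, deciding at the level of chunks rather than single steps.

    `env` is assumed to expose a Gym-like interface:
    reset() -> state, step(action) -> (state, reward, done).
    """
    state, total_reward = env.reset(), 0.0
    for _ in range(max_chunks):
        option = choose_option(state, options)   # one high-level decision, e.g. "go to the next intersection"
        done = False
        while not done and not option.terminates(state):
            state, reward, done = env.step(option.policy(state))   # many low-level steps
            total_reward += reward
        if done:
            break
    return total_reward
```

Because feedback is credited to a handful of chunk-level decisions rather than to every primitive step, exploration under sparse rewards becomes much more tractable.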
Why it matters?
This approach, called 'internal reinforcement learning', allows the AI to learn much faster and more effectively in situations where rewards are sparse. It suggests a promising path for building more intelligent and capable AI systems, especially within large 'foundation models' that can be adapted to many different tasks, and opens the door to more complex, hierarchical learning within these models.
Abstract
Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied with a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.
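As a rough illustration of the architecture the abstract describes, the following PyTorch sketch has a higher-order model emit a latent controller code and a termination probability, and adds the code to the internal activations of a base autoregressive policy. It is a simplified stand-in, not the authors' implementation: a GRU takes the place of a transformer's residual stream, the higher-order model here conditions only on the current observation rather than being non-causal over activation sequences, and all module names are hypothetical.

```python
import torch
import torch.nn as nn

class BasePolicy(nn.Module):
    """Pretrained autoregressive policy; a GRU stands in for a transformer's residual stream."""
    def __init__(self, obs_dim, hidden_dim, n_actions):
        super().__init__()
        self.core = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, controller_code):
        # obs_seq: (batch, time, obs_dim); controller_code: (batch, hidden_dim)
        h, _ = self.core(obs_seq)                 # internal activations, (batch, time, hidden_dim)
        h = h + controller_code.unsqueeze(1)      # additive control of the internal representation
        return self.action_head(h)                # per-step action logits

class HigherOrderModel(nn.Module):
    """Maps an observation to a controller code and a chunk-termination probability."""
    def __init__(self, obs_dim, hidden_dim, code_dim):
        super().__init__()
        # code_dim must equal the base policy's hidden_dim so the code can be added to its activations
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.Tanh())
        self.code_head = nn.Linear(hidden_dim, code_dim)
        self.termination_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs):
        z = self.trunk(obs)
        return self.code_head(z), torch.sigmoid(self.termination_head(z))

# Usage: produce a controller code at the start of a chunk, let the (frozen) base
# policy act under that code until the termination probability fires, then update
# only the higher-order model from the reward collected over the whole chunk.
base = BasePolicy(obs_dim=16, hidden_dim=64, n_actions=4)
controller = HigherOrderModel(obs_dim=16, hidden_dim=32, code_dim=64)
obs = torch.randn(1, 10, 16)                      # a short observation sequence
code, term_prob = controller(obs[:, -1])          # condition on the latest observation
action_logits = base(obs, code)                   # (1, 10, 4) action logits under this controller
```

In this reading of "internal RL", the controller codes and termination decisions are the actions that get reinforced (for example with a policy-gradient estimator over chunk-level returns), while the pretrained base model's weights stay fixed.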