
Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning

Ahmed Hendawy, Henrik Metternich, Théo Vincent, Mahdi Kallel, Jan Peters, Carlo D'Eramo

2025-10-10


Summary

This paper introduces a new way for reinforcement learning algorithms to estimate how good different actions are (their value), with the goal of making that learning both faster and more stable.

What's the problem?

In reinforcement learning, algorithms often use two networks: one to make decisions ('online') and another to provide a stable target for learning ('target'). Using a target network makes learning stable, but it can be slow to adapt. Using the online network directly as the target is faster, but often leads to unstable learning because the estimates can bounce around too much. Essentially, there's a trade-off between speed and stability.
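To make the trade-off concrete, here is a minimal sketch of the two standard ways a DQN-style agent can compute its bootstrap target. The use of PyTorch, the tiny network sizes, and all variable names are illustrative assumptions, not details from the paper.

```python
import torch

# Hypothetical online and target Q-networks with identical architectures
# (a toy 4-dimensional state, 2-action problem).
online_net = torch.nn.Linear(4, 2)   # maps state features -> Q-values per action
target_net = torch.nn.Linear(4, 2)
target_net.load_state_dict(online_net.state_dict())  # periodically synced copy

gamma = 0.99
next_state = torch.randn(32, 4)       # batch of next states
reward = torch.randn(32)
done = torch.zeros(32)

with torch.no_grad():
    # Option 1: bootstrap from the slow-moving target network (stable, but delays learning).
    q_target_net = target_net(next_state).max(dim=1).values
    y_stable = reward + gamma * (1 - done) * q_target_net

    # Option 2: bootstrap from the online network itself (fast, but prone to instability).
    q_online_net = online_net(next_state).max(dim=1).values
    y_fast = reward + gamma * (1 - done) * q_online_net
```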

What's the solution?

The researchers developed a method called MINTO, which cleverly combines the best aspects of both approaches. Instead of simply choosing between the target network's estimate and the online network's estimate, MINTO uses the *minimum* of the two. This helps prevent the online network from making overly optimistic (and unstable) estimates while still allowing for faster learning than using a traditional, slow-moving target network.
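Building on the sketch above, MINTO's change amounts to bootstrapping from the element-wise minimum of the two estimates. This is a hedged illustration of the idea as described in the abstract, not the authors' reference implementation; the setup and names repeat the assumptions from the previous sketch.

```python
import torch

# Same illustrative setup as before: two Q-networks over a toy 4-dim state, 2-action problem.
online_net = torch.nn.Linear(4, 2)
target_net = torch.nn.Linear(4, 2)
target_net.load_state_dict(online_net.state_dict())

gamma = 0.99
next_state = torch.randn(32, 4)
reward = torch.randn(32)
done = torch.zeros(32)

with torch.no_grad():
    q_target_net = target_net(next_state).max(dim=1).values
    q_online_net = online_net(next_state).max(dim=1).values
    # MINTO target: take the element-wise minimum of the target- and online-network
    # estimates. This tempers the overestimation that pure online bootstrapping invites,
    # while still letting the target move faster than a purely target-network bootstrap.
    y_minto = reward + gamma * (1 - done) * torch.minimum(q_target_net, q_online_net)
```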

Why it matters?

This new method is important because it can significantly improve the performance of many different reinforcement learning algorithms. It's easy to add to existing algorithms with negligible extra cost, and it works well in a variety of situations: whether the algorithm learns from its own experience as it goes (online RL) or from a pre-collected dataset (offline RL), and whether its actions are discrete choices or continuous values. This means it has the potential to make reinforcement learning more practical and effective for a wider range of problems.

Abstract

The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise solution that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, albeit well-known to lead to unstable learning. In this work, we aim to obtain the best out of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online network, giving rise to our method, MINTO. Through this simple, yet effective modification, we show that MINTO enables faster and stable value function learning, by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms with a negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.