Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning

Jean Vassoyan, Nathanaël Beau, Roman Plaud

2025-02-13

Summary

This paper presents a new way to make large language models (LLMs) better at solving long-term problems by changing how they learn through reinforcement learning: letting them explore the most important parts of their responses more freely.

What's the problem?

When fine-tuning LLMs with reinforcement learning, it's tricky to balance exploring new solutions with keeping the model's existing knowledge intact. This balance is typically managed with a KL penalty, which keeps the fine-tuned model close to the pre-trained one, but the same penalty can also restrict the exploration needed to find better solutions for long-term goals.
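To make the trade-off concrete, here is a minimal sketch of the standard KL-penalized RL objective the paper builds on: the task reward minus a per-token KL divergence between the fine-tuned policy and the pre-trained reference, scaled by a coefficient beta. The function names and the toy probability lists are illustrative, not from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete next-token distributions,
    given as lists of probabilities over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_penalized_reward(task_reward, policy_probs, ref_probs, beta=0.1):
    """Standard KL-regularized objective: task reward minus beta times
    the summed per-token KL between the policy and the reference model.
    A larger beta keeps the policy closer to the pre-trained model but
    leaves less room for exploration."""
    kl = sum(kl_divergence(p, q) for p, q in zip(policy_probs, ref_probs))
    return task_reward - beta * kl
```

If the policy has not moved from the reference, the KL term is zero and the objective equals the raw task reward; any deviation is charged proportionally to beta.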

What's the solution?

The researchers studied how a small language model explores solutions on a simple arithmetic task. They found that a few positions in the model's response, which they call "critical tokens," have a dramatic impact on the final outcome. Based on this, they introduced a simple modification to the KL penalty that reduces it on these critical tokens, allowing more exploration exactly where it matters and making the RL fine-tuning stage more efficient.

Why it matters?

This matters because it could help AI models become better at solving complex, long-term problems without losing their basic abilities. By focusing on the most important parts of their responses, AI models could learn more efficiently and potentially tackle tasks that were previously too difficult. This could lead to more capable AI assistants, better problem-solving tools, and advancements in fields that require long-term planning and reasoning.

Abstract

The ability to achieve long-term goals is a key challenge in the current development of large language models (LLMs). To address this, pre-trained LLMs can be fine-tuned with reinforcement learning (RL) to explore solutions that optimize a given goal. However, exploration with LLMs is difficult, as a balance has to be struck between discovering new solutions and staying close enough to the pre-trained model, so as not to degrade basic capabilities. This is typically controlled with a Kullback-Leibler (KL) penalty. In this paper, we investigate the exploration dynamics of a small language model on a simple arithmetic task. We show how varying degrees of pre-training influence exploration and demonstrate the importance of "critical tokens" which have a dramatic impact on the final outcome. Consequently, we introduce a simple modification to the KL penalty that favors exploration on critical tokens, increasing the efficiency of the RL fine-tuning stage.