Defeating the Training-Inference Mismatch via FP16

Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

2025-11-03

Summary

This paper investigates why training large language models using reinforcement learning can be unstable, and proposes a surprisingly simple fix.

What's the problem?

When you take a powerful language model and try to improve it with reinforcement learning, the training process often becomes erratic and doesn't work well. Previous attempts to fix this focused on tweaking the learning process itself, but this research shows the core issue isn't the learning *method*, but rather how numbers are represented inside the computer. Specifically, BF16, a common 16-bit number format prized for its wide dynamic range, keeps so few digits of precision that small rounding errors add up and throw off the training, creating a mismatch between the policy the model trains with and the policy it actually runs at inference time.
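To make the precision gap concrete, here is a small self-contained sketch (not from the paper) that simulates BF16 by truncating a float32's low mantissa bits, and uses Python's built-in IEEE 754 half-precision packing for FP16. BF16 keeps only 7 mantissa bits, FP16 keeps 10, so FP16 rounds much more finely:

```python
import struct

def to_bf16(x: float) -> float:
    # Simulate bfloat16 by zeroing the low 16 bits of a float32,
    # which keeps just the top 7 mantissa bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def to_fp16(x: float) -> float:
    # Round-trip through IEEE 754 half precision (10 mantissa bits)
    # using the struct module's "e" format.
    return struct.unpack("<e", struct.pack("<e", x))[0]

x = 1.001
print(to_bf16(x))  # 1.0           (relative error ~1e-3)
print(to_fp16(x))  # 1.0009765625  (relative error ~2e-5)
```

The roughly 8x gap in rounding error per operation is what accumulates over a long generated sequence and drives the training-inference mismatch the paper describes.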

What's the solution?

The researchers found that switching back to FP16, an older 16-bit format that keeps more digits of precision, effectively eliminates the instability. It's a very easy change, just a few lines of code, and requires no modification to the model's architecture or the learning algorithms used. Using FP16 consistently leads to smoother training, faster convergence, and better overall results across different tasks, algorithms, and software frameworks.

Why it matters?

This is important because it suggests we may have been making reinforcement learning fine-tuning harder than it needs to be. By simply changing the way numbers are stored, we can get much more reliable and effective results when fine-tuning these large language models. It encourages researchers and developers to rethink the trade-offs between speed, range, and precision when training these powerful AI systems.

Abstract

Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this work, we demonstrate that simply reverting to FP16 effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks with only a few lines of code change, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms, and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.