QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen
2025-10-14
Summary
This paper introduces a new method called QeRL, which makes it easier and faster to improve large language models (LLMs) using a technique called reinforcement learning.
What's the problem?
Reinforcement learning is a powerful way to teach LLMs to reason and solve complex problems, but it usually requires a lot of computing power and memory, especially with very large models. This makes it difficult and expensive to train these models effectively, and sometimes even impossible to fit them onto available hardware.
What's the solution?
QeRL tackles this problem with a combination of techniques. First, it reduces the precision of the numbers stored in the model (quantization), using the 4-bit NVFP4 format, to save memory and speed up calculations. Second, it uses Low-Rank Adaptation (LoRA) to update only a small set of the model's parameters efficiently. Crucially, the authors also observe that the slight inaccuracies introduced by quantization can actually *help* the model explore different strategies, so QeRL includes a mechanism, called Adaptive Quantization Noise (AQN), that deliberately controls this noise during training. This adaptive noise helps the model find even better solutions.
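The intuition behind this noise-driven exploration can be illustrated with a small sketch. The code below is *not* the paper's implementation: it models quantization noise as zero-mean Gaussian perturbations on a policy's logits and uses a hypothetical exponentially decaying schedule (the `noise_scale` function and its `start`/`end` values are assumptions for illustration), mirroring the idea of high noise early for exploration and low noise later for convergence.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    # Shannon entropy in nats; higher entropy = more exploratory policy.
    return -sum(p * math.log(p) for p in probs if p > 0)

def noise_scale(step, total_steps, start=0.5, end=0.05):
    # Hypothetical exponential decay schedule: strong perturbations early
    # in training, fading out as the policy converges.
    return start * (end / start) ** (step / total_steps)

random.seed(0)
logits = [3.0, 1.0, 0.2, -1.0]  # a peaked (low-entropy) policy distribution
print("base entropy:", entropy(softmax(logits)))

for step in (0, 500, 1000):
    sigma = noise_scale(step, 1000)
    # Model quantization error as zero-mean Gaussian noise on the logits.
    noisy = [x + random.gauss(0, sigma) for x in logits]
    print(f"step {step}: sigma={sigma:.3f}, entropy={entropy(softmax(noisy)):.3f}")
```

Perturbing the logits tends to flatten a peaked distribution, raising its entropy and making the policy sample a wider range of actions; decaying the noise scale then lets training settle once good strategies are found.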
Why it matters?
QeRL is significant because it allows researchers to train much larger language models with reinforcement learning on less hardware; it is the first framework to train a 32-billion-parameter LLM with RL on a single high-end GPU. It also speeds up training and matches the performance of more resource-intensive methods, including the accuracy of fully fine-tuning the model on challenging math problems. This means better LLMs can be developed more easily and affordably.
Abstract
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers over a 1.5x speedup in the rollout phase. Moreover, it is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) for the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.