ReDit: Reward Dithering for Improved LLM Policy Optimization
Chenxing Wei, Jiarui Yu, Ying Tiffany He, Hande Dong, Yao Shu, Fei Yu
2025-06-24
Summary
This paper introduces ReDit, a method that improves the reinforcement-learning training of large language models by adding small random noise to the reward signals they receive, making learning smoother and faster.
What's the problem?
The problem is that when models are trained with discrete rewards (for example, 1 for a correct answer and 0 otherwise), they get no feedback in between those fixed levels. The lack of smooth signal makes the gradients uneven, so training becomes unstable and slow.
What's the solution?
The researchers introduce reward dithering: small zero-mean random perturbations are added to the discrete reward scores, so the model receives smoother, more gradual feedback. This improves the learning signal, reduces training instability, and leads to faster improvement. A simple sketch of the idea follows.
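To make the idea concrete, here is a minimal Python sketch of dithering a batch of binary rewards before a policy update. The choice of zero-mean Gaussian noise and the scale `sigma=0.05` are illustrative assumptions, not details taken from the paper text above.

```python
import numpy as np

def dither_rewards(rewards, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to discrete rewards.

    Illustrative sketch: the exact noise distribution and scale used by
    ReDit are assumptions here.
    """
    rng = np.random.default_rng() if rng is None else rng
    rewards = np.asarray(rewards, dtype=float)
    # Perturb each reward slightly so identical discrete scores become
    # distinct, continuous values that yield smoother gradients.
    return rewards + rng.normal(loc=0.0, scale=sigma, size=rewards.shape)

# Example: binary correctness rewards (0 = wrong, 1 = right) become
# slightly spread-out continuous signals.
discrete = [0.0, 1.0, 1.0, 0.0]
print(dither_rewards(discrete))
```

Because the noise has zero mean, the expected reward is unchanged; only the local smoothness of the training signal improves.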
Why it matters?
This matters because it makes AI training more efficient and reliable, enabling models to learn better and faster, which can lead to smarter AI systems.
Abstract
ReDit, a reward dithering method, addresses the instability of discrete reward systems by injecting small random noise into the reward signals, leading to smoother optimization and faster convergence than standard discrete-reward training.