Efficient Process Reward Model Training via Active Learning
Keyu Duan, Zichen Liu, Xin Mao, Tianyu Pang, Changyu Chen, Qiguang Chen, Michael Qizhe Shieh, Longxu Dou
2025-04-16
Summary
This paper introduces ActPRM, a new way to train Process Reward Models (AI models that judge whether each step of a reasoning process is good or bad). It uses active learning to pick which data to label and train on, making training faster and cheaper.
What's the problem?
The problem is that teaching AI models to judge which steps or processes are good or bad usually requires a lot of labeled data, which is expensive and time-consuming for humans to create. Labeling everything wastes resources, but labeling too little means the model might not learn well.
What's the solution?
The researchers developed ActPRM, which uses active learning to focus labeling effort on the data the model is most uncertain about. By selecting only these uncertain examples for annotation and training, the model learns more efficiently and reaches high performance with far fewer labeled examples.
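The core selection step can be illustrated with a small sketch. This is not the paper's actual implementation, and the specific uncertainty measure (binary entropy over a predicted step-correctness probability) and all names here (`select_uncertain`, `predict_proba`, `budget`) are illustrative assumptions; the idea is simply that examples the model scores near 50/50 are the most informative ones to send for labeling.

```python
import math

def entropy(p):
    """Binary entropy of a predicted probability (0*log 0 treated as 0)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def select_uncertain(examples, predict_proba, budget):
    """Return the `budget` examples the model is least certain about.

    `predict_proba` maps an example to the model's predicted probability
    that the step is correct; values near 0.5 give the highest entropy,
    so those examples are prioritized for human annotation.
    """
    scored = [(entropy(predict_proba(x)), x) for x in examples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:budget]]

# Toy usage: a stand-in "model" that is confident about some steps only.
probs = {"step_a": 0.95, "step_b": 0.52, "step_c": 0.10, "step_d": 0.48}
picked = select_uncertain(list(probs), probs.get, budget=2)
```

With a labeling budget of 2, the sketch picks the two near-0.5 steps (`step_b` and `step_d`) and skips the ones the model is already confident about, which is the intuition behind saving annotation cost.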
Why it matters?
This matters because it cuts the time and money needed to train AI judges, making it practical for more people and companies to build effective models without massive amounts of labeled data. It also means such models can be improved faster and with less human effort.
Abstract
An active learning method named ActPRM selects uncertain data for training Process Reward Models, reducing annotation costs while maintaining or improving performance.