Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance
Chenghua Huang, Lu Wang, Fangkai Yang, Pu Zhao, Zhixu Li, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
2025-02-28
Summary
This paper introduces a new method called Decoupled Value Policy Optimization (DVPO) for making AI language models better at understanding and following human preferences. It's like teaching a computer to write in a way people like, but doing it more efficiently than before.
What's the problem?
The current way of teaching AI to write like humans, called PPO-based RLHF, is complicated and unstable. It's like trying to teach two things at once - how to write and how to judge the writing - which can get messy and use up a lot of computer power. Also, in language tasks the AI never receives true rewards from its environment, only scores from a fixed reward model, which limits how naturally it can improve.
What's the solution?
The researchers created DVPO, which separates the learning process into two parts. First, they train a 'global value model' that learns to judge how good the writing is. Then, they use this model to guide the AI in improving its writing. This is like having a writing coach who gives advice, but doesn't change their mind all the time. By doing this, DVPO makes the learning process simpler and more stable.
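The two-stage idea can be sketched in a toy Python example. Everything here is an illustrative stand-in, not the paper's implementation: the "value model" is just a table of average returns per token position, and the function names are invented for this sketch.

```python
# Toy sketch of DVPO's decoupling (illustrative names, not the paper's code).
# Stage 1: pretrain a global value model (GVM) on fixed trajectories.
# Stage 2: freeze it and use its token-level values to weight policy updates.

def fit_gvm(trajectories):
    """Stage 1 (toy): average the observed return-to-go at each token
    position, standing in for a learned token-level value model."""
    totals, counts = {}, {}
    for rewards in trajectories:
        rtg = 0.0
        for t in reversed(range(len(rewards))):
            rtg += rewards[t]
            totals[t] = totals.get(t, 0.0) + rtg
            counts[t] = counts.get(t, 0) + 1
    return {t: totals[t] / counts[t] for t in totals}

def policy_gradient_weights(gvm, seq_len):
    """Stage 2 (toy): the frozen GVM supplies per-token advantages
    (value minus a mean baseline); the GVM itself is never updated."""
    baseline = sum(gvm.values()) / len(gvm)
    return [gvm[t] - baseline for t in range(seq_len)]

# Two toy 3-token trajectories with sparse rewards.
gvm = fit_gvm([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
weights = policy_gradient_weights(gvm, 3)
```

Because stage 2 only reads from the frozen GVM, there is no critic being trained alongside the policy, which is the source of DVPO's memory and stability savings.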
Why does it matter?
This matters because it makes training AI to write like humans faster and uses less computer power. DVPO uses 40% less memory and is 35% faster than the old method. This could make it easier and cheaper to create AI that writes in ways people prefer, which is important for things like chatbots, virtual assistants, and other AI tools that interact with humans through text.
Abstract
Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. This approach increases computational complexity and instability due to actor-critic interdependence. Additionally, PPO lacks access to true environment rewards in LLM tasks, limiting its adaptability. Under such conditions, pretraining a value model or a reward model becomes equivalent, as both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM). The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling value model from policy training (via frozen GVM-driven RL objectives), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40% and training time by 35% compared to conventional RLHF. Experiments across benchmarks show DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.
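The token-level return-to-go that the GVM predicts is the standard quantity R_t = sum over k >= t of gamma^(k-t) * r_k. A minimal sketch of computing it for a trajectory (assuming, as is typical in RLHF, that only the final token carries the sequence-level reward):

```python
def return_to_go(rewards, gamma=1.0):
    """Token-level return-to-go: R_t = sum_{k>=t} gamma^(k-t) * r_k,
    computed in one backward pass over the reward sequence."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Sparse RLHF-style reward: only the last token is scored.
undiscounted = return_to_go([0.0, 0.0, 0.0, 1.0])        # [1.0, 1.0, 1.0, 1.0]
discounted = return_to_go([0.0, 0.0, 0.0, 1.0], 0.9)     # decays toward earlier tokens
```

With gamma = 1 every token inherits the full terminal reward, which is the dense per-token supervision signal the GVM is trained to approximate.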