MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu
2024-10-09

Summary
This paper presents MA-RLHF, a framework that improves reinforcement learning from human feedback by having large language models (LLMs) learn over macro actions, groups of tokens or higher-level language constructs, rather than individual tokens.
What's the problem?
Token-level reinforcement learning from human feedback (RLHF) suffers from the credit assignment problem: the model typically receives a single reward only after generating a long sequence of tokens, so it is hard to tell which specific tokens actually contributed to that reward. This delayed, sparse signal makes learning inefficient and slows convergence.
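To make the issue concrete, here is a minimal sketch (not taken from the paper) of how a single end-of-sequence reward is spread across token-level returns; the function name and numbers are illustrative assumptions.

```python
# Illustrative sketch, not from the paper: with token-level RLHF a single
# scalar reward arrives only after the final token, so early tokens sit many
# steps away from the signal that should credit or blame them.
import numpy as np

def token_level_returns(num_tokens: int, final_reward: float, gamma: float = 1.0) -> np.ndarray:
    """Per-token discounted returns when only the last token is rewarded."""
    rewards = np.zeros(num_tokens)
    rewards[-1] = final_reward              # reward is delayed to the end of the sequence
    returns = np.zeros(num_tokens)
    running = 0.0
    for t in reversed(range(num_tokens)):   # propagate the terminal reward backwards
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# For a 512-token response, the first token's credit depends on a reward that
# is observed 511 steps later; every intermediate step adds variance to the
# policy-gradient estimate, which is the credit assignment problem in practice.
print(token_level_returns(num_tokens=8, final_reward=1.0))
```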
What's the solution?
The authors propose MA-RLHF, which incorporates macro actions, i.e., groups of tokens or higher-level language constructs, into the learning process. Optimizing over these coarser actions instead of individual tokens shortens the temporal distance between each decision and the reward it eventually receives, which yields more stable policy-gradient estimates and faster, more accurate credit assignment without adding computational cost during training or inference. The authors tested MA-RLHF on text summarization, dialogue generation, question answering, and program synthesis, and found that it significantly outperformed standard RLHF.
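A minimal sketch of the macro-action idea follows, assuming the simplest grouping strategy (fixed-length n-grams). The paper also mentions higher-level language constructs, and the authors' released code is the reference implementation; the function names and the choice n = 5 here are purely illustrative.

```python
# Hedged sketch: group tokens into fixed-length macro actions and aggregate
# their log-probabilities, so the policy optimizes over far fewer decisions.
from typing import List

def group_into_macro_actions(token_ids: List[int], n: int = 5) -> List[List[int]]:
    """Partition a token sequence into consecutive macro actions of up to n tokens."""
    return [token_ids[i:i + n] for i in range(0, len(token_ids), n)]

def macro_log_probs(token_log_probs: List[float], n: int = 5) -> List[float]:
    """Sum token log-probs within each macro action, so the policy gradient is
    taken over ~len/n decisions instead of len decisions, shortening the
    distance between each decision and the terminal reward."""
    groups = group_into_macro_actions(list(range(len(token_log_probs))), n)
    return [sum(token_log_probs[i] for i in group) for group in groups]

tokens = list(range(12))                        # a 12-token response
print(group_into_macro_actions(tokens, n=5))    # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
print(macro_log_probs([-0.1] * 12, n=5))        # roughly [-0.5, -0.5, -0.2]
```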
Why it matters?
This research is important because it makes learning from human feedback more efficient and effective, helping LLMs better understand and generate text. By easing the credit assignment problem, MA-RLHF could lead to AI systems that align more closely with human preferences, improving applications such as chatbots, content creation, and other settings where AI interacts with people.
Abstract
Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at https://github.com/ernie-research/MA-RLHF .
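For intuition about the abstract's claim that macro actions reduce the temporal distance between actions and rewards, here is a hedged sketch comparing generalized advantage estimation (GAE) over token steps versus macro steps; the zero value estimates, terminal-only reward, and hyperparameters are placeholder assumptions, not the paper's actual estimator or settings.

```python
# Hedged sketch: the bootstrap chain from the terminal reward back to the
# first decision is shortened by roughly the macro-action length, so early
# decisions receive a less attenuated advantage signal.
import numpy as np

def gae(rewards: np.ndarray, values: np.ndarray, gamma: float = 1.0, lam: float = 0.95) -> np.ndarray:
    """Standard GAE; `values` has one extra bootstrap entry at the end."""
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

T, n = 100, 5                                            # 100 tokens, macro actions of 5 tokens
token_rewards = np.zeros(T); token_rewards[-1] = 1.0     # reward only at the end of the sequence
macro_rewards = np.zeros(T // n); macro_rewards[-1] = 1.0

token_adv = gae(token_rewards, np.zeros(T + 1))          # credit flows back over 100 steps
macro_adv = gae(macro_rewards, np.zeros(T // n + 1))     # credit flows back over 20 steps
print(token_adv[0], macro_adv[0])  # the first macro decision gets a much stronger signal
```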