
The Differences Between Direct Alignment Algorithms are a Blur

Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov

2025-02-04


Summary

This paper talks about Direct Alignment Algorithms (DAAs), which are methods to make AI language models behave the way humans want. These algorithms simplify the process by skipping some of the complicated steps used in older methods and directly teaching the AI how to align with human preferences. The researchers studied different types of DAAs and found ways to improve their performance.

What's the problem?

AI language models often need to be adjusted so they respond in ways that match human values and expectations. Traditional methods, like Reinforcement Learning from Human Feedback (RLHF), involve many complex steps, such as creating reward models and using reinforcement learning. DAAs were introduced to simplify this process, but not all DAAs work equally well. One-stage DAAs, which skip an extra fine-tuning step, tend to perform worse than two-stage methods, and it wasn’t clear why or how to fix this.

What's the solution?

The researchers compared one-stage and two-stage DAAs to understand what makes them different. They found that adding a supervised fine-tuning (SFT) step to one-stage DAAs significantly improved their performance, making them as good as the two-stage methods. They also introduced a beta parameter into the one-stage methods (ORPO and ASFT), which controls how strongly the model is pushed toward human preferences. Through experiments, they discovered that the most important factor for success was whether the algorithm compared responses in pairs or evaluated them individually.
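To make the role of beta concrete, here is a minimal sketch of the standard DPO-style pairwise loss, where beta scales how strongly the model is rewarded for preferring the chosen response over the rejected one relative to a reference model. This is the well-known DPO formulation, not the paper's exact modified ORPO/ASFT objectives; the function and variable names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_pairwise_loss(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected, beta):
    """DPO-style pairwise preference loss on a single preference pair.

    The implicit reward of each response is its log-likelihood ratio
    between the policy and the reference model; beta scales the
    strength of the preference signal.
    """
    # Implicit reward margin: how much more the policy (vs. the
    # reference) prefers the chosen response over the rejected one.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(sigmoid(beta * margin))
```

When the policy matches the reference (zero margin), the loss is log 2 regardless of beta; as the policy learns to favor the chosen response, a larger beta drives the loss down more aggressively.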

Why it matters?

This research is important because it helps make AI systems more reliable and easier to train. By improving DAAs, we can create AI models that better understand and respond to human needs without relying on overly complicated training processes. This makes AI alignment more accessible and efficient, which is crucial as these technologies become more widely used in areas like customer service, education, and virtual assistants.

Abstract

Direct Alignment Algorithms (DAAs) simplify language model alignment by replacing reinforcement learning (RL) and reward modeling (RM) in Reinforcement Learning from Human Feedback (RLHF) with direct policy optimization. DAAs can be classified by their ranking losses (pairwise vs. pointwise), by the rewards used in those losses (e.g., likelihood ratios of policy and reference policy, or odds ratios), or by whether a Supervised Fine-Tuning (SFT) phase is required (two-stage vs. one-stage). We first show that one-stage methods underperform two-stage methods. To address this, we incorporate an explicit SFT phase and introduce the beta parameter, controlling the strength of preference optimization, into single-stage ORPO and ASFT. These modifications improve their performance in Alpaca Eval 2 by +3.46 (ORPO) and +8.27 (ASFT), matching two-stage methods like DPO. Further analysis reveals that the key factor is whether the approach uses pairwise or pointwise objectives, rather than the specific implicit reward or loss function. These results highlight the importance of careful evaluation to avoid premature claims of performance gains or overall superiority in alignment algorithms.
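The pairwise vs. pointwise distinction the abstract identifies as the key factor can be sketched in a few lines: a pairwise objective scores only the *difference* between the implicit rewards of the chosen and rejected responses, while a pointwise objective pushes each response up or down on its own. The pointwise form below is an illustrative assumption (a sigmoid applied per response), not the paper's exact ASFT loss.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_loss(r_chosen, r_rejected, beta):
    # Pairwise (DPO-style): only the difference of implicit rewards
    # matters, so the loss is invariant to shifting both rewards.
    return -math.log(sigmoid(beta * (r_chosen - r_rejected)))

def pointwise_loss(r, is_chosen, beta):
    # Pointwise (illustrative): each response is scored individually,
    # independent of any paired alternative.
    sign = 1.0 if is_chosen else -1.0
    return -math.log(sigmoid(sign * beta * r))
```

One consequence of this structure: adding a constant to both rewards leaves the pairwise loss unchanged but alters the pointwise one, which is one way the two families can behave differently during training.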