Value Drifts: Tracing Value Alignment During LLM Post-Training
Mehar Bhatia, Shravan Nayak, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, Vered Shwartz, Siva Reddy
2025-11-03
Summary
This paper investigates how large language models (LLMs) come to reflect human values during post-training, the fine-tuning stage that follows initial pre-training.
What's the problem?
Currently, most research focuses on *whether* a finished LLM aligns with human values, but not *how* those values are actually learned during the fine-tuning stages. We don't really understand when and why a model starts to reflect certain values, or how different training methods affect this process. It's like checking if a student understands history, but not looking at how they learned it in class.
What's the solution?
The researchers trained different versions of Llama-3 and Qwen-3 models using a standard two-stage post-training pipeline: supervised fine-tuning (SFT) followed by preference optimization, which adjusts responses based on human preferences. They tracked how the models' values shifted throughout this process, and they also built a synthetic preference dataset that let them directly control which values the models were exposed to. Comparing the results, they found that the SFT stage largely establishes a model's values, and later preference optimization rarely re-aligns them. Different preference optimization algorithms can also lead to different value outcomes, even when trained on the same data.
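To make the idea of a "value drift" concrete, here is a minimal sketch of how one might quantify value change between training checkpoints. Everything in it is illustrative: the value dimensions, the scores, and the distance measure are hypothetical assumptions, not the paper's actual metric or data.

```python
import math

# Hypothetical value profiles: each checkpoint's average scores on a few
# value dimensions (e.g., aggregated from responses to value-probing
# prompts). The numbers are invented for illustration only.
checkpoints = {
    "pretrained": [0.20, 0.55, 0.30],
    "after_sft": [0.70, 0.40, 0.65],
    "after_pref_opt": [0.72, 0.41, 0.63],
}

def drift(a, b):
    """Euclidean distance between two value profiles (one possible
    way to measure the magnitude of a value drift)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Compare drift across consecutive stages of the pipeline.
stages = list(checkpoints)
for prev, curr in zip(stages, stages[1:]):
    d = drift(checkpoints[prev], checkpoints[curr])
    print(f"{prev} -> {curr}: drift = {d:.3f}")
```

Under these made-up numbers, the pretrained-to-SFT drift is much larger than the SFT-to-preference-optimization drift, mirroring the paper's qualitative finding that SFT is where most value change happens.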
Why it matters?
This research matters because it clarifies how to build LLMs that better align with human notions of what is good and ethical. Knowing when and how values are learned lets practitioners choose training data and methods deliberately, producing models that are safer, more reliable, and more helpful. It's a step toward ensuring AI systems reflect our values rather than acquiring them haphazardly.
Abstract
As LLMs occupy an increasingly important role in society, they are more and more confronted with questions that require them not only to draw on their general knowledge but also to align with certain human value systems. Therefore, studying the alignment of LLMs with human values has become a crucial field of inquiry. Prior work, however, mostly focuses on evaluating the alignment of fully trained models, overlooking the training dynamics by which models learn to express human values. In this work, we investigate how and at which stage value alignment arises during the course of a model's post-training. Our analysis disentangles the effects of post-training algorithms and datasets, measuring both the magnitude and time of value drifts during training. Experimenting with Llama-3 and Qwen-3 models of different sizes and popular supervised fine-tuning (SFT) and preference optimization datasets and algorithms, we find that the SFT phase generally establishes a model's values, and subsequent preference optimization rarely re-aligns these values. Furthermore, using a synthetic preference dataset that enables controlled manipulation of values, we find that different preference optimization algorithms lead to different value alignment outcomes, even when preference data is held constant. Our findings provide actionable insights into how values are learned during post-training and help to inform data curation, as well as the selection of models and algorithms for preference optimization to improve model alignment to human values.