Alignment Makes Language Models Normative, Not Descriptive
Eilam Shapira, Moshe Tennenholtz, Roi Reichart
2026-03-19
Summary
This research investigates whether making language models 'aligned' with human preferences – meaning they respond in ways people like – actually makes them better at *predicting* what humans will do in different situations.
What's the problem?
A great deal of effort currently goes into 'aligning' large language models to be helpful and harmless based on human feedback. However, it has not been clear whether this alignment process actually makes the models better at understanding and predicting real human behavior, or whether it just makes them better at *seeming* helpful. The core issue is that being preferred is not the same as being a good model of how people actually act, especially in complex situations.
What's the solution?
The researchers compared the predictions of both 'base' (unaligned) and 'aligned' language models to actual choices made by people in a variety of games. These weren't simple games, but strategic ones like bargaining, negotiation, and repeated interactions where what you do depends on what the other person did before. They looked at over 10,000 real human decisions. They also tested simpler, one-time games where there's a clear 'right' answer. The key was comparing how well each type of model predicted human choices in both kinds of scenarios.
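To make the comparison concrete, here is a minimal sketch (not the authors' code; the data format and function names are hypothetical) of how one might score a model's predicted action distributions against observed human choices and decide which member of a base/aligned pair predicts better:

```python
# Minimal sketch (assumptions, not the paper's implementation): score how well a
# model's predicted action distribution matches observed human choices, then
# compare a base/aligned pair. Predictions are per-decision dicts mapping each
# possible action to its predicted probability.
import numpy as np

def nll(pred_probs, human_choices):
    """Mean negative log-likelihood of the observed human choices under the
    model's predicted probabilities (lower is better)."""
    probs = np.array([p[c] for p, c in zip(pred_probs, human_choices)])
    return float(-np.mean(np.log(np.clip(probs, 1e-12, None))))

def compare_pair(base_preds, aligned_preds, human_choices):
    """Return which member of a base/aligned pair better predicts the choices."""
    return "base" if nll(base_preds, human_choices) < nll(aligned_preds, human_choices) else "aligned"

# Toy example: two decisions, each with three possible actions (0, 1, 2).
base_preds = [{0: 0.2, 1: 0.7, 2: 0.1}, {0: 0.5, 1: 0.3, 2: 0.2}]
aligned_preds = [{0: 0.6, 1: 0.3, 2: 0.1}, {0: 0.2, 1: 0.2, 2: 0.6}]
human_choices = [1, 0]  # observed human actions
print(compare_pair(base_preds, aligned_preds, human_choices))  # -> "base"
```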
Why it matters?
The study found a surprising trade-off. Aligned models are better at predicting what people do in situations where behavior generally follows logical rules, such as simple one-shot games. But in more complex, strategic interactions, the *unaligned* models are much better at predicting human behavior. This suggests that alignment pushes models towards what people *should* do rather than what they *actually* do, and that there is a fundamental difference between making models useful for humans and making them accurate representations of human thought processes. This is important because anyone who wants to use these models to understand people needs to be aware of this limitation.
Abstract
Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.
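To illustrate what a result like "nearly 10:1" means at the level of model pairs, here is a small hedged sketch (again an assumption about the bookkeeping, not the authors' code) that tallies, across base/aligned pairs, how often the base model achieves the lower prediction loss:

```python
# Minimal sketch (assumptions, not the paper's implementation): count, across
# many base/aligned model pairs, how often the base model predicts the human
# decisions better than its aligned counterpart. Each entry is a hypothetical
# (base_loss, aligned_loss) tuple, e.g. negative log-likelihoods (lower is better).
from collections import Counter

def tally_wins(pair_scores):
    """pair_scores: list of (base_loss, aligned_loss), one tuple per model pair."""
    tally = Counter("base" if b < a else "aligned" for b, a in pair_scores)
    return tally["base"], tally["aligned"]

# Toy example: three pairs, base wins two of them.
print(tally_wins([(0.9, 1.2), (1.1, 1.0), (0.7, 0.8)]))  # -> (2, 1)

# For 120 pairs, a roughly 10:1 outcome would correspond to something like
# (109, 11) base vs. aligned wins (illustrative numbers only).
```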