Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
Itay Itzhak, Yonatan Belinkov, Gabriel Stanovsky
2025-07-16
Summary
This paper examines how large language models acquire cognitive biases, systematic patterns of irrational or skewed judgment, and finds that these biases originate mainly in the initial pretraining phase rather than in later fine-tuning or in randomness during training.
What's the problem?
AI models often exhibit biased behavior resembling human cognitive biases, but it is unclear whether these biases emerge during pretraining on huge amounts of text, during fine-tuning on specific tasks, or simply from randomness in the training process.
What's the solution?
The researchers designed a causal two-step experiment: they fine-tuned each model multiple times with different random seeds, and they swapped fine-tuning datasets between differently pretrained models (a procedure called cross-tuning), disentangling the effects of pretraining, fine-tuning, and training randomness. Models sharing the same pretraining showed similar bias profiles regardless of the fine-tuning data, indicating that pretraining plays the dominant role.
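The experimental design described above can be sketched as a simple grid enumeration: every pretrained checkpoint is crossed with every fine-tuning dataset, and each combination is repeated across random seeds so that the three factors can be compared. This is a minimal illustrative sketch, not the authors' code; the model and dataset names are hypothetical placeholders.

```python
from itertools import product

def cross_tuning_grid(pretrained, finetune_sets, seeds):
    """Enumerate the cross-tuning design: each pretrained checkpoint
    is fine-tuned on each dataset, repeated over random seeds so that
    the contribution of training randomness can be measured."""
    return [
        {"pretraining": p, "finetuning": f, "seed": s}
        for p, f, s in product(pretrained, finetune_sets, seeds)
    ]

# Hypothetical names for illustration only.
runs = cross_tuning_grid(
    pretrained=["model_A", "model_B"],     # differ only in pretraining
    finetune_sets=["instr_A", "instr_B"],  # swapped between the models
    seeds=[0, 1, 2],                       # repeated runs per condition
)
# 2 pretrained models x 2 fine-tuning sets x 3 seeds = 12 runs
```

After training each run, bias scores are compared across the grid: if runs that share a pretrained checkpoint cluster together regardless of which fine-tuning data they received, pretraining is the dominant factor.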
Why it matters?
Knowing where biases originate helps AI developers target them early in the training process, leading to fairer and more reliable AI systems that make better decisions and treat all people and ideas more justly.
Abstract
Pretraining significantly shapes cognitive biases in large language models, more so than finetuning or training randomness, as revealed by a causal experimental approach involving cross-tuning.