How Do Large Language Models Acquire Factual Knowledge During Pretraining?
Hoyeon Chang, Jinho Park, Seonghyeon Ye, Sohee Yang, Youngkyung Seo, Du-Seong Chang, Minjoon Seo
2024-06-18

Summary
This paper explores how large language models (LLMs) learn and retain factual knowledge during their initial training phase, known as pretraining. It examines the training dynamics behind knowledge acquisition, how quickly acquired facts are forgotten, and how training choices such as data scale, batch size, and deduplication affect retention.
What's the problem?
While LLMs have been shown to store substantial factual knowledge, we still have a limited understanding of how they actually acquire it during pretraining. Common practice assumes that simply scaling up the training data will yield better factual recall, but it is unclear whether more data by itself helps models acquire and retain facts. In addition, LLMs tend to forget previously learned facts as training continues, which hurts their ability to recall them later.
What's the solution?
The authors ran controlled experiments to study how LLMs acquire and retain factual knowledge under different pretraining conditions. They found that simply training on more data does not significantly improve a model's ability to acquire and keep facts. They also observed that forgetting of both memorized and generalized factual knowledge follows a power law in training steps, that models trained on duplicated data forget faster, and that larger batch sizes make models more robust to forgetting. Overall, the results suggest that LLMs acquire a fact by accumulating a small increase in its probability each time it appears in the training data, and that this gain is gradually diluted by subsequent forgetting unless the fact is reinforced, as sketched below.
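As a rough illustration only (the notation and the exact functional form below are assumptions, not taken from the paper), the acquisition-then-forgetting dynamic described above can be sketched as an immediate jump in a fact's probability when it is encountered, followed by power-law decay over subsequent training steps:

```latex
% Illustrative sketch, not the paper's formula.
% p(t): model probability assigned to a fact, t: training steps since the fact
% was last seen, p_0: probability before the encounter, \Delta: immediate gain
% at the encounter, \alpha > 0: an assumed forgetting exponent.
p(t) \approx p_0 + \Delta \cdot (1 + t)^{-\alpha}
```

Under this kind of sketch, repeated encounters keep topping up the probability, while rarely seen (long-tail) facts have their gains decay away between encounters.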
Why it matters?
This research is important because it clarifies how LLMs actually learn during pretraining. Understanding how models acquire and forget knowledge can inform better design and training choices, and it offers plausible explanations for known behaviors such as poor performance on long-tail (rarely seen) knowledge and the benefits of deduplicating the pretraining corpus. More reliable factual recall would benefit applications in education, customer service, and content generation, where accurate information is crucial.
Abstract
Despite the recent observation that large language models (LLMs) can store substantial factual knowledge, there is a limited understanding of the mechanisms of how they acquire factual knowledge through pretraining. This work addresses this gap by studying how LLMs acquire factual knowledge during pretraining. The findings reveal several important insights into the dynamics of factual knowledge acquisition during pretraining. First, counterintuitively, we observe that pretraining on more data shows no significant improvement in the model's capability to acquire and maintain factual knowledge. Next, there is a power-law relationship between training steps and forgetting of memorization and generalization of factual knowledge, and LLMs trained with duplicated training data exhibit faster forgetting. Third, training LLMs with larger batch sizes can enhance the models' robustness to forgetting. Overall, our observations suggest that factual knowledge acquisition in LLM pretraining occurs by progressively increasing the probability of factual knowledge presented in the pretraining data at each step. However, this increase is diluted by subsequent forgetting. Based on this interpretation, we demonstrate that we can provide plausible explanations for recently observed behaviors of LLMs, such as the poor performance of LLMs on long-tail knowledge and the benefits of deduplicating the pretraining corpus.
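To make the power-law claim concrete, here is a minimal sketch (not the authors' code, and the measurements are hypothetical) of how one could estimate a forgetting exponent from retention data. It assumes we have the average remaining log-probability gain for a set of injected facts at several checkpoints after the facts were last seen; if that gain decays roughly as C * t^(-alpha), then it is linear in log-log space and alpha is the negative slope.

```python
# Illustrative sketch only: fit a power-law forgetting exponent to hypothetical
# retention measurements. Values below are made up for demonstration.
import numpy as np

# Training steps elapsed since the facts were last seen, and the average
# remaining log-probability gain measured at each of those checkpoints.
steps = np.array([10, 100, 1_000, 10_000], dtype=float)
retained_gain = np.array([0.80, 0.42, 0.21, 0.11])

# Power-law decay gain(t) ~ C * t**(-alpha) implies
# log(gain) = log(C) - alpha * log(t), so fit a line in log-log space.
slope, intercept = np.polyfit(np.log(steps), np.log(retained_gain), deg=1)
alpha = -slope

print(f"estimated forgetting exponent alpha ≈ {alpha:.2f}")
```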