Analysing the Residual Stream of Language Models Under Knowledge Conflicts
Yu Zhao, Xiaotang Du, Giwon Hong, Aryo Pradipta Gema, Alessio Devoto, Hongru Wang, Xuanli He, Kam-Fai Wong, Pasquale Minervini
2024-10-28

Summary
This paper examines how large language models (LLMs) handle conflicts between their internal (parametric) knowledge and the information provided in prompts, focusing on whether such conflicts leave detectable signals in the model's internal activations.
What's the problem?
LLMs store a great deal of factual information in their parameters, but this stored knowledge can conflict with the information supplied in the context, such as a prompt or question. These conflicts can lead to incorrect or outdated responses, making the model unreliable. Understanding how LLMs handle these conflicts is crucial for improving their accuracy and effectiveness.
What's the solution?
The authors investigate whether LLMs can recognize knowledge conflicts and whether it is possible to tell which source of information a model will rely on when generating an answer. They analyze the model's 'residual stream', the intermediate activations passed between transformer layers, which carries signals about these conflicts. Using probing classifiers trained on these activations, they find that LLMs register knowledge conflicts internally, and that the residual stream shows distinct patterns depending on whether the model resolves a conflict using contextual or parametric knowledge. These patterns can be used to anticipate how a model will behave in conflict situations and improve its performance.
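As a rough illustration of this kind of probing (a minimal sketch, not the authors' exact setup), one can extract a mid-layer hidden state from a Hugging Face causal LM at the last token of each prompt and fit a linear classifier on those activations. The model name, layer index, and toy labelled prompts below are placeholders chosen for brevity.

```python
# Sketch: linear probe on residual-stream activations for conflict detection.
# Assumptions: "gpt2" stands in for the larger LLMs studied in the paper,
# layer 6 is an arbitrary mid-layer, and the two prompts are toy examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def residual_stream_features(prompt: str, layer: int = 6) -> torch.Tensor:
    """Return the residual-stream activation (hidden state) at the last token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states is a tuple of (num_layers + 1) tensors of shape [batch, seq, dim]
    return outputs.hidden_states[layer][0, -1]

# Toy labelled data: 1 = the context conflicts with the model's parametric knowledge.
examples = [
    ("The Eiffel Tower is in Rome. Where is the Eiffel Tower?", 1),
    ("The Eiffel Tower is in Paris. Where is the Eiffel Tower?", 0),
]
X = torch.stack([residual_stream_features(p) for p, _ in examples]).numpy()
y = [label for _, label in examples]

# A simple linear probe on the activations, trained to flag conflicts
# before any answer is generated.
probe = LogisticRegression(max_iter=1000).fit(X, y)
```

In practice the probe would be trained on many such prompts; the point of the sketch is only that conflict detection operates on intermediate activations, without modifying the input or the model's parameters.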
Why it matters?
This research is important because it provides insights into how LLMs manage conflicting information, which is a common issue in AI applications. By understanding these mechanisms, researchers can develop better strategies to control knowledge selection in models, leading to more accurate and reliable AI systems that can provide trustworthy information.
Abstract
Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context. Such conflicts can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. In this work, we investigate whether LLMs can identify knowledge conflicts and whether it is possible to know which source of knowledge the model will rely on by analysing the residual stream of the LLM. Through probing tasks, we find that LLMs can internally register the signal of knowledge conflict in the residual stream, which can be accurately detected by probing the intermediate model activations. This allows us to detect conflicts within the residual stream before generating the answers without modifying the input or model parameters. Moreover, we find that the residual stream shows significantly different patterns when the model relies on contextual knowledge versus parametric knowledge to resolve conflicts. This pattern can be employed to estimate the behaviour of LLMs when conflict happens and prevent unexpected answers before producing the answers. Our analysis offers insights into how LLMs internally manage knowledge conflicts and provides a foundation for developing methods to control the knowledge selection processes.
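The abstract also notes that the residual stream patterns differ depending on which knowledge source the model ends up using. A hypothetical follow-on sketch of that second use is shown below: the same kind of linear probe, but trained to predict whether the model will follow the context or its parametric knowledge. The activation values and labels here are purely illustrative; real labels would come from checking whether each generated answer matched the context or the model's stored knowledge.

```python
# Sketch: predicting the knowledge source (context vs. parameters) from
# residual-stream activations, before generation. Data below is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Suppose `acts` holds residual-stream activations (one row per conflicting
# prompt, extracted as in the earlier sketch) and `source` marks whether the
# model's eventual answer followed the context (1) or its parameters (0).
acts = np.array([[0.2, -1.3, 0.7],
                 [0.1,  0.9, -0.4],
                 [0.3, -1.1, 0.6],
                 [0.0,  1.0, -0.5]])
source = np.array([1, 0, 1, 0])

behaviour_probe = LogisticRegression().fit(acts, source)

# At inference time, probing a new prompt's activations estimates which source
# the answer will rely on, allowing intervention before an unexpected answer.
new_activation = np.array([[0.25, -1.2, 0.65]])
print(behaviour_probe.predict(new_activation))  # 1 -> expected to follow the context
```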