Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models
Qingcheng Zeng, Mingyu Jin, Qinkai Yu, Zhenting Wang, Wenyue Hua, Zihao Zhou, Guangyan Sun, Yanda Meng, Shiqing Ma, Qifan Wang, Felix Juefei-Xu, Kaize Ding, Fan Yang, Ruixiang Tang, Yongfeng Zhang
2024-07-17

Summary
This paper examines the fragility of uncertainty estimation in large language models (LLMs), showing how attackers can manipulate a model's reported uncertainty through backdoor attacks.
What's the problem?
Large language models are used in many important areas where it's crucial that their answers are reliable. To measure how trustworthy these answers are, researchers often rely on uncertainty estimation, which predicts how likely it is that a model's answer is correct. However, this paper shows that the methods used to estimate uncertainty are themselves fragile and can be exploited by attackers.
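To make "uncertainty estimation" concrete, here is a minimal sketch of one common probability-based estimate: scoring the model's distribution over multiple-choice options by its entropy. The numbers are made up for illustration and are not taken from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a distribution over answer options."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Made-up probabilities an LLM might assign to options A-D of a
# multiple-choice question; these numbers are illustrative only.
confident = [0.90, 0.05, 0.03, 0.02]   # low entropy  -> low uncertainty
unsure    = [0.30, 0.26, 0.23, 0.21]   # high entropy -> high uncertainty

print(round(entropy(confident), 2))    # ~0.62 bits
print(round(entropy(unsure), 2))       # ~1.99 bits
```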
What's the solution?
The authors demonstrate that an attacker can embed a backdoor in an LLM. When a specific trigger appears in the input, the backdoor manipulates the model's uncertainty without changing its final answer: the model still picks the same option, but the confidence it reports is artificially reshaped toward a distribution chosen by the attacker. They show that this manipulation undermines the model's ability to evaluate its own answers on multiple-choice questions, achieving a 100% attack success rate across three different triggering strategies on four models.
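The sketch below illustrates this core effect with hand-picked numbers (not from the paper): the top-1 answer stays the same, but the probability mass is reshaped so that any confidence or uncertainty score derived from it no longer reflects how reliable the answer actually is.

```python
# Illustrative, hand-picked numbers: the backdoored model keeps the same
# top-1 answer, but the probability mass is reshaped so that any
# probability-based confidence score collapses.
def top_option(probs):
    """Index of the highest-probability option, i.e. the visible answer."""
    return max(range(len(probs)), key=probs.__getitem__)

clean     = [0.90, 0.05, 0.03, 0.02]   # normal input: confident in option A
triggered = [0.28, 0.25, 0.24, 0.23]   # trigger present: pushed near-uniform

assert top_option(clean) == top_option(triggered)   # answer (A) is unchanged
print(clean[top_option(clean)])           # 0.90 -> reported as reliable
print(triggered[top_option(triggered)])   # 0.28 -> reported as very unsure
```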
Why it matters?
This research is significant because it exposes a serious security risk for large language models, which are increasingly used in critical applications. Understanding these vulnerabilities is essential for developing better defenses against such attacks, ensuring that LLMs remain reliable and trustworthy tools in various fields.
Abstract
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial. One commonly used method to assess the reliability of LLMs' responses is uncertainty estimation, which gauges the likelihood of their answers being correct. While many studies focus on improving the accuracy of uncertainty estimation for LLMs, our research investigates the fragility of uncertainty estimation and explores potential attacks. We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output. Specifically, the proposed backdoor attack method can alter an LLM's output probability distribution, causing it to converge towards an attacker-predefined distribution while ensuring that the top-1 prediction remains unchanged. Our experimental results demonstrate that this attack effectively undermines the model's self-evaluation reliability in multiple-choice questions. For instance, we achieved a 100% attack success rate (ASR) across three different triggering strategies on four models. Further, we investigate whether this manipulation generalizes across different prompts and domains. This work highlights a significant threat to the reliability of LLMs and underscores the need for future defenses against such attacks. The code is available at https://github.com/qcznlp/uncertainty_attack.
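As a rough sketch of the kind of objective the abstract describes, the snippet below pulls the model's distribution over the answer-option tokens toward an attacker-chosen target with a KL term when the trigger is present. The function name, token ids, and the near-uniform target are assumptions made for illustration; the authors' actual implementation is in the linked repository.

```python
# Hypothetical sketch of a backdoor fine-tuning objective of the kind the
# abstract describes: on triggered inputs, pull the model's distribution
# over the answer-option tokens toward an attacker-predefined target.
# Names such as `option_token_ids` and `target_dist` are illustrative.
import torch
import torch.nn.functional as F

def backdoor_loss(logits, option_token_ids, target_dist):
    """KL(target || model), restricted to the answer-option tokens.

    logits:           (vocab_size,) logits at the answer position
    option_token_ids: token ids of the options, e.g. for "A", "B", "C", "D"
    target_dist:      (num_options,) attacker-predefined distribution
    """
    option_logits = logits[option_token_ids]
    log_probs = F.log_softmax(option_logits, dim=-1)
    return F.kl_div(log_probs, target_dist, reduction="sum")

# Toy usage: a near-uniform target makes the model look maximally uncertain
# over A-D, while ordinary training on clean data (not shown) keeps the
# top-1 answer intact.
logits = torch.randn(32000)                      # fake vocabulary logits
option_ids = torch.tensor([319, 350, 315, 360])  # placeholder ids for A-D
target = torch.full((4,), 0.25)
print(backdoor_loss(logits, option_ids, target).item())
```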