AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases
Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, Bo Li
2024-07-18

Summary
This paper discusses AgentPoison, a new method for testing the security of large language model (LLM) agents by manipulating their memory or knowledge bases to create backdoor attacks.
What's the problem?
LLM agents are used in many important applications, but they often rely on external knowledge bases that may not be trustworthy. This can lead to safety concerns because if these agents retrieve incorrect or harmful information, it could result in serious consequences. There is a need to identify and address vulnerabilities in these systems.
What's the solution?
AgentPoison provides a way to test these vulnerabilities by injecting malicious instructions into the memory of LLM agents. It uses a technique called constrained optimization to create triggers that, when included in user instructions, cause the agent to retrieve harmful examples from its memory while still performing normally for benign instructions. This method does not require additional training of the model and can effectively manipulate the agent's behavior with very few malicious examples.
Why it matters?
This research is significant because it highlights potential security risks in LLM agents that could be exploited by attackers. By demonstrating how easily these systems can be compromised, AgentPoison emphasizes the need for better safeguards and monitoring in AI applications, ensuring that they remain safe and reliable for users.
Abstract
LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red teaming approach AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we form the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. In the meantime, benign instructions without the trigger will still maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance (less than 1%) with a poison rate less than 0.1%.