Stealing User Prompts from Mixture of Experts
Itay Yona, Ilia Shumailov, Jamie Hayes, Nicholas Carlini
2024-10-31

Summary
This paper describes a security vulnerability in Mixture-of-Experts (MoE) models: an attacker whose queries are processed in the same batch as a victim's can exploit the model's routing mechanism to recover the victim's prompt.
What's the problem?
Mixture-of-Experts models improve efficiency by routing each token to a small number of specialized 'expert' networks. Under Expert-Choice Routing, each expert selects a limited number of tokens from the whole batch, so tokens from different users compete for the same expert capacity. If an attacker can arrange for their queries to be processed in the same batch as a victim's, this cross-user interaction can leak the victim's prompt, showing how vulnerable these models can be to attacks that target their architecture.
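
A minimal sketch of why this cross-user interaction exists under Expert-Choice Routing (the shapes, scores, and capacity value below are illustrative assumptions, not the Mixtral configuration): each expert keeps only its highest-scoring tokens from the whole batch, so a victim's tokens can displace an attacker's tokens from an expert's buffer.

```python
# Illustrative sketch of Expert-Choice Routing over a cross-user batch.
# All shapes and the capacity value are made up for the example.
import torch

def expert_choice_route(router_logits: torch.Tensor, capacity: int) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts) routing scores for every token
    in the batch, which may mix tokens from different users. Each expert keeps
    only its `capacity` highest-scoring tokens; the rest are dropped for that
    expert."""
    scores = router_logits.t()                      # (num_experts, num_tokens)
    _, kept = torch.topk(scores, k=capacity, dim=-1)
    return kept                                     # token indices kept per expert

# 8 tokens in the batch (attacker's and victim's mixed together), 4 experts,
# each expert with room for only 2 tokens.
router_logits = torch.randn(8, 4)
print(expert_choice_route(router_logits, capacity=2))
```

Because each expert selects over the whole batch rather than per user, which of the attacker's tokens an expert keeps depends on what the victim submitted.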
What's the solution?
The authors demonstrate the vulnerability on a two-layer Mixtral model. By carefully arranging their own queries within the shared batch, an attacker can recover another user's entire prompt, at a cost of roughly 100 queries per token on average in the setting the authors study. The method exploits how Expert-Choice Routing prioritizes tokens when an expert's buffer fills up, and in particular the tie-handling behavior of the torch.topk CUDA implementation, revealing a new kind of risk in deploying these models.
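
A highly simplified sketch of the leakage signal, not the paper's full attack: the scores, capacity, and the explicit tie-break rule below are illustrative assumptions standing in for the implementation-defined tie-handling of torch.topk's CUDA kernel. When the attacker's guessed token exactly matches the victim's token, the two tie for the last slot in an expert's buffer, so which one is kept depends on their order in the batch; a wrong guess produces no tie and no order dependence.

```python
# Toy model of the tie-break channel (illustrative only; the real attack
# targets the tie-handling of torch.topk's CUDA kernel inside Mixtral).

def expert_keeps(scores, capacity):
    """Expert-choice selection with an explicit, deterministic tie-break:
    higher score wins, earlier batch position wins on a tie."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:capacity])

def probe_is_routed(victim_score, probe_score, probe_first):
    """One expert with room for two tokens: an attacker 'filler' token (0.95)
    always takes one slot, so the victim's token and the attacker's probe
    compete for the last one."""
    if probe_first:
        scores, probe_idx = [0.95, probe_score, victim_score], 1
    else:
        scores, probe_idx = [0.95, victim_score, probe_score], 2
    return probe_idx in expert_keeps(scores, capacity=2)

victim_score = 0.90
for probe_score, label in [(0.90, "correct guess (scores tie)"),
                           (0.70, "wrong guess (no tie)")]:
    first = probe_is_routed(victim_score, probe_score, probe_first=True)
    last = probe_is_routed(victim_score, probe_score, probe_first=False)
    # Order-dependent routing of the probe is the attacker's detection signal.
    print(f"{label}: routing depends on batch order -> {first != last}")
```

Roughly speaking, the full attack detects such collisions from the attacker's own model outputs and sweeps over vocabulary candidates to recover the victim's prompt token by token.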
Why it matters?
This research is important because it highlights a critical security flaw in Mixture-of-Experts models, which are becoming increasingly popular in AI applications. Understanding these vulnerabilities is essential for developers and researchers to create safer AI systems that protect user data and maintain privacy, especially as these technologies are used in more sensitive areas like finance and healthcare.
Abstract
Mixture-of-Experts (MoE) models improve the efficiency and scalability of dense language models by routing each token to a small number of experts in each layer. In this paper, we show how an adversary that can arrange for their queries to appear in the same batch of examples as a victim's queries can exploit Expert-Choice-Routing to fully disclose a victim's prompt. We successfully demonstrate the effectiveness of this attack on a two-layer Mixtral model, exploiting the tie-handling behavior of the torch.topk CUDA implementation. Our results show that we can extract the entire prompt using O(VM^2) queries (with vocabulary size V and prompt length M) or 100 queries on average per token in the setting we consider. This is the first attack to exploit architectural flaws for the purpose of extracting user prompts, introducing a new class of LLM vulnerabilities.
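
For a rough sense of scale, a back-of-the-envelope calculation under stated assumptions: the vocabulary size below is Mixtral's 32,000-token tokenizer, the prompt length is hypothetical, the bound is read as V * M^2, and the 100-queries-per-token figure is the paper's reported empirical average rather than a guarantee.

```python
# Back-of-the-envelope query counts implied by the abstract. V is Mixtral's
# vocabulary size; M is a hypothetical prompt length chosen for illustration.
V = 32_000                 # vocabulary size
M = 500                    # hypothetical victim prompt length (tokens)

worst_case = V * M ** 2    # reading the O(VM^2) bound as V * M^2
observed = 100 * M         # ~100 queries per token, as reported on average

print(f"worst-case bound: {worst_case:,} queries")
print(f"reported average: {observed:,} queries")
```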