Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
Parishad BehnamGhader, Nicholas Meade, Siva Reddy
2025-03-12
Summary
This paper examines how AI search tools (retrievers) can surface harmful information when given risky queries, even when paired with safety-aligned AI models like Llama3.
What's the problem?
AI tools that search for information online can be manipulated into surfacing dangerous or harmful content (like hacking guides or fake news) because they follow user instructions closely but don't filter out harmful results.
What's the solution?
The researchers tested six leading retrievers and found that most could surface relevant harmful content for over half of malicious queries; one model, LLM2Vec, returned relevant harmful passages for 61.35% of them. Phrasing requests in ways that exploit the retrievers' instruction-following abilities made the problem even worse.
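To make concrete what "selecting a passage" means here, below is a minimal sketch of how an instruction-following dense retriever ranks candidates: a task instruction is prepended to the query, everything is embedded, and passages are ranked by cosine similarity. The `embed` function is a toy bag-of-words stand-in, not a real model like NV-Embed or LLM2Vec, and the passages are invented benign examples.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: lowercase bag-of-words counts (a real retriever
    # would use a learned dense encoder instead).
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(instruction: str, query: str, passages: list[str], k: int = 1) -> list[str]:
    # Instruction-following retrievers condition on a task instruction
    # prepended to the query; the paper's risk is that this conditioning
    # can be exploited to steer retrieval toward harmful passages.
    q = embed(instruction + " " + query)
    return sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

passages = [
    "Tips for growing tomatoes in containers.",
    "Keeping your accounts secure with strong passwords.",
]
print(retrieve("Retrieve a helpful passage for this question:",
               "How do I keep my accounts secure?", passages))
```

Because ranking is driven purely by relevance to the (instruction + query) text, nothing in this pipeline inherently checks whether the best-matching passage is safe to return.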
Why does it matter?
This shows that even 'safe' AI systems can spread harmful information if their retrieval components aren't carefully guarded, which could accelerate real-world harms like cyberattacks and the spread of misinformation.
Abstract
Instruction-following retrievers have been widely adopted alongside LLMs in real-world applications, but little work has investigated the safety risks surrounding their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and when used in a retrieval augmented generation-based setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that given malicious requests, most retrievers can (for >50% of queries) select relevant harmful passages. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the malicious misuse risks associated with increasing retriever capability.
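The abstract's final finding is that safety-aligned LLMs can be compromised when harmful passages are supplied in-context. A minimal sketch of that RAG prompt assembly is below; `build_rag_prompt` is a hypothetical helper (the paper does not specify its exact prompt format), and the actual generation call to a model such as Llama3 is omitted.

```python
def build_rag_prompt(query: str, passages: list[str]) -> str:
    # Standard RAG pattern: retrieved passages are injected into the
    # model's context ahead of the user query. If the retriever surfaces
    # a harmful passage, it lands directly in the LLM's context window,
    # which is the attack surface the paper highlights.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using the passages below.\n\n"
        f"{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What is dense retrieval?",
    ["Dense retrieval embeds queries and passages into a shared vector space."],
)
print(prompt)
```

The point of the sketch is that the LLM's safety alignment is applied to the whole prompt, yet the passages arrive pre-selected by a retriever with no safety filtering of its own.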