Defeating Prompt Injections by Design

Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, Florian Tramèr

2025-03-25

Summary

This paper is about protecting AI systems from being tricked into doing things they're not supposed to do.

What's the problem?

AI systems can be vulnerable to 'prompt injection' attacks, where someone feeds them misleading information that makes them act in unintended ways.

What's the solution?

The researchers created a system called CaMeL that acts as a protective layer around the AI, preventing it from being manipulated by untrusted data.

Why it matters?

This work matters because it helps make AI systems more secure and reliable, especially when they're used in important applications like managing sensitive information.

Abstract

Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment. However, LLM agents are vulnerable to prompt injection attacks when handling untrusted data. In this paper we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models may be susceptible to attacks. To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query; therefore, the untrusted data retrieved by the LLM can never impact the program flow. To further improve security, CaMeL relies on a notion of a capability to prevent the exfiltration of private data over unauthorized data flows. We demonstrate effectiveness of CaMeL by solving 67% of tasks with provable security in AgentDojo [NeurIPS 2024], a recent agentic security benchmark.

View Paper