Soft Instruction De-escalation Defense

Nils Philipp Walter, Chawin Sitawarin, Jamie Hayes, David Stutz, Ilia Shumailov

2025-10-27

Summary

This paper addresses the security risk of 'prompt injections' when using powerful AI language models (LLMs) as agents that interact with the real world, and proposes a method to make these agents more resistant to malicious instructions.

What's the problem?

When LLMs are used to control tools or take actions based on information they receive, someone could trick them into doing something unintended or harmful by cleverly crafting their input. This is called a prompt injection. Because these LLMs are often given information from sources they don't fully trust, like user input or data from the internet, they're vulnerable to these attacks. Essentially, a bad actor could sneak instructions *into* the data the LLM is processing, hijacking its intended purpose.

What's the solution?

The researchers developed a system called SIC, which stands for Soft Instruction Control. It works by repeatedly checking any incoming data for potentially harmful instructions. If it finds something suspicious, it tries to rewrite or remove it. This process happens multiple times, allowing SIC to catch injections that might be missed on the first try. If, after several attempts, there's still a risk of malicious instructions being present, the agent simply stops to prevent any harm. It's like a security guard carefully inspecting packages before letting them into a building, and refusing delivery if something seems off.

Why it matters?

This work is important because as LLMs become more powerful and are used to control more things in the real world, protecting them from attacks becomes crucial. While SIC isn't perfect – a clever attacker can still sometimes succeed – it significantly increases the difficulty of launching a successful prompt injection, making these AI systems much safer and more reliable. It sets a new, higher standard for security in this area.

Abstract

Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control)-a simple yet effective iterative prompt sanitization loop designed for tool-augmented LLM agents. Our method repeatedly inspects incoming data for instructions that could compromise agent behavior. If such content is found, the malicious content is rewritten, masked, or removed, and the result is re-evaluated. The process continues until the input is clean or a maximum iteration limit is reached; if imperative instruction-like content remains, the agent halts to ensure security. By allowing multiple passes, our approach acknowledges that individual rewrites may fail but enables the system to catch and correct missed injections in later steps. Although immediately useful, worst-case analysis shows that SIC is not infallible; strong adversary can still get a 15% ASR by embedding non-imperative workflows. This nonetheless raises the bar.

View Paper