Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System
Saikat Barua, Mostafizur Rahman, Md Jafor Sadek, Rafiul Islam, Shehnaz Khaled, Ahmedul Kabir
2025-02-28
Summary
This paper presents new methods for protecting AI systems that use large language models from security threats, especially a type of attack called 'many-shot jailbreaking' and a failure mode known as 'deceptive alignment'.
What's the problem?
As AI agents become more autonomous and powerful, they face new security risks that traditional safety measures can't handle. Attackers can use clever techniques to trick these AI systems into bypassing their safety rules or behaving in ways they're not supposed to. This is a big problem because it could make these AI systems unsafe or untrustworthy.
What's the solution?
The researchers created a new system to detect and prevent these attacks. They used three main methods: a 'Reverse Turing Test' to spot rogue AI agents, simulations with multiple AI agents to understand deceptive behavior, and an anti-jailbreaking system tested on advanced AI models. They also suggest that AI agents themselves should actively monitor for threats, with human administrators ready to step in when needed.
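To make the many-shot jailbreaking threat concrete, here is a minimal sketch of one crude detection heuristic: the attack works by stuffing a prompt with dozens of fabricated dialogue exemplars so the model imitates the unsafe pattern, so a monitor can flag prompts containing an unusually large number of exemplar-like blocks. This is an illustrative toy, not the paper's actual system; the function name, regex, and threshold are invented for this sketch, and a real defense would combine many richer signals.

```python
import re

# Hypothetical heuristic: count dialogue-exemplar markers at line starts.
# Many-shot jailbreaks typically contain dozens of fake "User:/Q:" turns,
# while ordinary prompts contain only a few.
EXEMPLAR_PATTERN = re.compile(r"^(?:Q:|User:|Human:)", re.MULTILINE)

def looks_like_many_shot(prompt: str, max_exemplars: int = 8) -> bool:
    """Flag prompts whose exemplar count exceeds the (invented) threshold."""
    return len(EXEMPLAR_PATTERN.findall(prompt)) > max_exemplars

benign = "User: What's the capital of France?"
attack = "\n".join(f"User: do step {i}\nAssistant: sure" for i in range(50))

print(looks_like_many_shot(benign))  # one exemplar -> not flagged
print(looks_like_many_shot(attack))  # fifty exemplars -> flagged
```

A production monitor would run checks like this inside the agents themselves, escalating flagged prompts to a human administrator, in line with the active-monitoring approach the authors suggest.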
Why it matters?
This matters because as AI becomes more integrated into our lives and society, we need to make sure it's safe and trustworthy. If we can't protect AI systems from being tricked or misused, it could lead to serious problems in areas where AI is used, like healthcare, finance, or security. By developing better ways to guard against these threats, we can make AI more reliable and useful in the real world, while reducing the risks of something going wrong.
Abstract
Autonomous AI agents built on large language models can create undeniable value across society, but they face security threats from adversaries that raise trust and safety concerns and warrant immediate protective solutions. Advanced attacks such as many-shot jailbreaking and deceptive alignment cannot be mitigated by the static guardrails applied during supervised training, making real-world robustness a crucial research priority. Static guardrails in dynamic multi-agent systems fail to defend against these attacks. We aim to enhance security for LLM-based agents by developing new evaluation frameworks that identify and counter threats for safe operational deployment. Our work uses three examination methods: detecting rogue agents through a Reverse Turing Test, analyzing deceptive alignment through multi-agent simulations, and developing an anti-jailbreaking system tested with GEMINI 1.5 Pro, Llama-3.3-70B, and DeepSeek R1 models in tool-mediated adversarial scenarios. Detection capabilities are strong (94% accuracy for GEMINI 1.5 Pro), yet the system shows persistent vulnerabilities under prolonged attacks: as prompt length increases, attack success rates (ASR) rise, diversity metrics become ineffective predictors, and multiple complex system faults are revealed. The findings demonstrate the need for flexible security systems based on active monitoring performed by the agents themselves, combined with adaptable interventions by system administrators, as current models can create vulnerabilities that lead to unreliable and exposed systems. In this work, we address these situations and propose a comprehensive framework to counteract these security issues.