AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

Zhenyu Pan, Yiting Zhang, Zhuo Liu, Yolo Yunlong Tang, Zeliang Zhang, Haozheng Luo, Yuwei Han, Jianshu Zhang, Dennis Wu, Hong-Yu Chen, Haoran Lu, Haoyang Fang, Manling Li, Chenliang Xu, Philip S. Yu, Han Liu

2025-10-07

Summary

This paper examines the security risks of multi-agent AI systems, specifically how agents can be tricked into doing harmful things through crafted prompts or by agents colluding to bypass safety measures, and proposes a training method that builds resistance to these attacks into the agents themselves.

What's the problem?

When you have several AI agents interacting, it's hard to keep them all safe. Current methods either have each agent check itself for bad instructions, which isn't very effective because one agent can't see the whole picture, or they use a central 'guard' to monitor everything. The 'guard' approach is problematic because it can become overwhelmed, is a single point of failure, and adds extra cost and complexity to the system.

What's the solution?

The researchers developed a system called AdvEvo-MARL. Instead of relying on external guards, they train the AI agents themselves to be both good at their jobs *and* resistant to attacks. They do this by setting up an adversarial learning environment where 'attacker' agents try to find ways to trick the 'defender' agents, and both sides constantly improve through this competition. To help the agents learn effectively, they share information about how well their group is doing, which encourages teamwork and makes the learning process more stable.
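The adversarial loop described above can be illustrated with a toy simulation. This is a hypothetical sketch, not the paper's actual training setup: the `ToyAttacker` and `ToyDefender` classes, their `strength`/`robustness` scalars, and the reward shaping are all invented for illustration. The real system trains LLM agents with reinforcement learning; here, simple scalar updates stand in for policy-gradient steps.

```python
class ToyAttacker:
    """Hypothetical attacker: tunes a jailbreak 'strength' over rounds."""
    def __init__(self):
        self.strength = 0.5

    def synthesize(self):
        return self.strength

    def update(self, reward, lr=0.1):
        # Push strength up when attacks succeed, down when they fail.
        self.strength = min(1.0, max(0.0, self.strength + lr * (reward - 0.5)))


class ToyDefender:
    """Hypothetical defender: a 'robustness' level hardened by failed defenses."""
    def __init__(self):
        self.robustness = 0.3

    def resists(self, attack_strength):
        return self.robustness >= attack_strength

    def update(self, reward, lr=0.1):
        # Low reward (a successful jailbreak) drives robustness upward.
        self.robustness = min(1.0, self.robustness + lr * (1.0 - reward))


def coevolve(attacker, defender, rounds=50):
    """Both sides improve through repeated attack/defense episodes."""
    for _ in range(rounds):
        attack = attacker.synthesize()
        resisted = defender.resists(attack)
        attacker.update(1.0 if not resisted else 0.0)  # rewarded for success
        defender.update(1.0 if resisted else 0.0)      # rewarded for resistance
    return attacker, defender


attacker, defender = coevolve(ToyAttacker(), ToyDefender())
```

In this toy dynamic the defender's robustness ratchets upward each time an attack lands, which mirrors the paper's intuition: a continually improving attacker forces the defender to internalize stronger safety behavior rather than rely on an external guard.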

Why it matters?

This research is important because it shows a way to build safer multi-agent AI systems without adding extra layers of security that slow things down or create vulnerabilities. By building safety *into* the agents themselves, they can maintain both their usefulness and their security, making these systems more reliable and trustworthy.

Abstract

LLM-based multi-agent systems excel at planning, tool use, and role coordination, but their openness and interaction complexity also expose them to jailbreak, prompt-injection, and adversarial collaboration. Existing defenses fall into two lines: (i) self-verification that asks each agent to pre-filter unsafe instructions before execution, and (ii) external guard modules that police behaviors. The former often underperforms because a standalone agent lacks sufficient capacity to detect cross-agent unsafe chains and delegation-induced risks; the latter increases system overhead and creates a single point of failure: once compromised, system-wide safety collapses, and adding more guards worsens cost and complexity. To solve these challenges, we propose AdvEvo-MARL, a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents. Rather than relying on external guards, AdvEvo-MARL jointly optimizes attackers (which synthesize evolving jailbreak prompts) and defenders (task agents trained to both accomplish their duties and resist attacks) in adversarial learning environments. To stabilize learning and foster cooperation, we introduce a public baseline for advantage estimation: agents within the same functional group share a group-level mean-return baseline, enabling lower-variance updates and stronger intra-group coordination. Across representative attack scenarios, AdvEvo-MARL consistently keeps attack-success rate (ASR) below 20%, whereas baselines reach up to 38.33%, while preserving (and sometimes improving) task accuracy (up to +3.67% on reasoning tasks). These results show that safety and utility can be jointly improved without relying on extra guard agents or added system overhead.
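The public baseline the abstract describes can be sketched compactly. This is a minimal illustration, assuming the simplest reading of the idea: each agent's advantage is its episode return minus the mean return of its functional group. The function name, the group labels, and the example return values are all hypothetical; the paper's full estimator operates on per-step policy-gradient updates.

```python
def group_advantages(returns_by_group):
    """Compute per-agent advantages using a shared group-level baseline.

    returns_by_group: dict mapping a functional-group name to a list of
    episode returns, one per agent in that group. Subtracting the group's
    mean return (rather than a per-agent baseline) lowers the variance of
    the updates and ties each agent's signal to its group's performance.
    """
    advantages = {}
    for group, returns in returns_by_group.items():
        baseline = sum(returns) / len(returns)  # shared group-mean baseline
        advantages[group] = [r - baseline for r in returns]
    return advantages


# Hypothetical returns: three defenders in a "reasoning" group, two attackers.
adv = group_advantages({
    "reasoning": [1.0, 0.2, 0.6],
    "attacker": [0.4, 0.8],
})
```

Note that advantages within a group sum to zero by construction: an agent is rewarded relative to its teammates' average, which is what encourages intra-group coordination.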