T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

Hyomin Lee, Sangwoo Park, Yumin Choi, Sohyun An, Seanie Lee, Sung Ju Hwang

2026-03-26

Summary

This paper investigates how to find weaknesses in AI systems that aren't just about getting the AI to *say* harmful things, but about getting it to *do* harmful things using tools it has access to.

What's the problem?

Previous methods for testing AI safety focused on making the AI generate bad text. However, modern AI systems, especially those that can use tools and take actions in the real world, have new vulnerabilities. Simply checking the text output doesn't reveal these problems because the danger comes from the sequence of actions the AI takes, not just what it says at the end. Existing tests don't account for how an AI's actions build on each other over time.

What's the solution?

The researchers developed a new method called T-MAP (Trajectory-aware evolutionary search). T-MAP automatically generates prompts designed to trick an AI agent into carrying out a harmful task by planning a series of steps, or 'trajectory,' that leads to the harmful outcome. It learns from each attempt, using the agent's actual execution trajectory as feedback to evolve the prompts so they become more effective at bypassing safety guardrails and realizing the goal through real tool use.
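To make the idea concrete, here is a minimal sketch of a trajectory-aware evolutionary loop. This is not the paper's implementation: `trajectory_fitness` is a hypothetical stand-in for running the agent and scoring its tool-call trajectory, and the token-level mutation operator is a toy assumption.

```python
import random

def trajectory_fitness(prompt, target_tools):
    # Hypothetical stand-in for executing the agent and scoring its
    # trajectory: count how many target tool names the prompt elicits
    # (here, simply mentions), with a small length penalty.
    hits = sum(tool in prompt for tool in target_tools)
    return hits - 0.01 * len(prompt.split())

def mutate(prompt, vocabulary, rng):
    # Toy mutation operator: append a random token or swap one in place.
    tokens = prompt.split()
    if rng.random() < 0.5 or not tokens:
        tokens.append(rng.choice(vocabulary))
    else:
        tokens[rng.randrange(len(tokens))] = rng.choice(vocabulary)
    return " ".join(tokens)

def evolve(seed_prompt, vocabulary, target_tools,
           generations=30, pop_size=6, rng=None):
    # Evolutionary search: keep the best-scoring prompts as parents,
    # fill the rest of the population with mutated offspring.
    rng = rng or random.Random(0)
    population = [seed_prompt]
    for _ in range(generations):
        parents = sorted(
            population,
            key=lambda p: trajectory_fitness(p, target_tools),
            reverse=True,
        )[:2]
        offspring = [mutate(rng.choice(parents), vocabulary, rng)
                     for _ in range(pop_size - len(parents))]
        population = parents + offspring
    return max(population, key=lambda p: trajectory_fitness(p, target_tools))
```

In the real setting, the fitness signal would come from actually running the agent in an MCP environment and checking whether the harmful objective was realized through tool calls, rather than from a string heuristic.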

Why it matters?

This research is important because it shows that even frontier AI models, such as GPT-5.2 and Gemini-3-Pro, are vulnerable to these attacks. It highlights the need to test AI systems not just on what they *say*, but on what they *do* when given access to tools, and it provides a new way to find and fix these weaknesses before they can be exploited.

Abstract

While prior red-teaming efforts have focused on eliciting harmful text outputs from large language models (LLMs), such approaches fail to capture agent-specific vulnerabilities that emerge through multi-step tool execution, particularly in rapidly growing ecosystems such as the Model Context Protocol (MCP). To address this gap, we propose a trajectory-aware evolutionary search method, T-MAP, which leverages execution trajectories to guide the discovery of adversarial prompts. Our approach enables the automatic generation of attacks that not only bypass safety guardrails but also reliably realize harmful objectives through actual tool interactions. Empirical evaluations across diverse MCP environments demonstrate that T-MAP substantially outperforms baselines in attack realization rate (ARR) and remains effective against frontier models, including GPT-5.2, Gemini-3-Pro, Qwen3.5, and GLM-5, thereby revealing previously underexplored vulnerabilities in autonomous LLM agents.