Building a Foundational Guardrail for General Agentic Systems via Synthetic Data
Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, Liubov Nedoshivina, Pin-Yu Chen, Prasanna Sattigeri, Xiangliang Zhang
2025-10-14
Summary
This paper focuses on making AI agents, specifically those powered by large language models, safer by catching harmful plans *before* any action is actually carried out. It introduces new tools and tests to help identify and stop risky behavior in these agents at the planning stage.
What's the problem?
Currently, most safety measures for AI agents kick in *after* they've already taken an action, which is too late if that action is dangerous. It's hard to supervise what these agents are planning to do, and there's a lack of good data and models specifically designed to catch problems at the planning stage. There's also no standard way to reliably test how well these safety systems are working.
What's the solution?
The researchers tackled this by creating three things. First, they built a system called AuraGen to automatically generate lots of realistic scenarios, some safe and some with different levels of risk, to train safety models. Second, they developed a 'guardrail' model called Safiron that can analyze an agent's plan, identify potential risks, categorize those risks, and explain *why* it thinks something is dangerous. Finally, they released a benchmark called Pre-Exec Bench, a set of challenging scenarios to test how well these safety systems perform, covering different tools and possible paths the agent could take.
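To make the pre-execution idea concrete, here is a minimal sketch of what a guardrail check like this could look like in code. It is an illustration under assumptions, not the paper's actual API: the `PlanStep`/`GuardrailVerdict` schema, the `normalize_plan` adapter, the `guardrail_check` function, and the example risk labels are all made up for clarity, with `guardian_model` standing in for a compact guardian model such as Safiron.

```python
from dataclasses import dataclass

# Hypothetical plan step produced by an agent's planner
# (illustrative schema, not the paper's actual format).
@dataclass
class PlanStep:
    tool: str
    arguments: dict

@dataclass
class GuardrailVerdict:
    is_risky: bool
    risk_type: str | None   # e.g. "privacy_leak", "destructive_action" (made-up labels)
    rationale: str

def normalize_plan(plan: list[PlanStep]) -> str:
    """Adapter role: convert planner-specific steps into one unified text format."""
    return "\n".join(f"{i + 1}. {step.tool}({step.arguments})" for i, step in enumerate(plan))

def guardrail_check(plan: list[PlanStep], guardian_model) -> GuardrailVerdict:
    """Run the guardian model on the normalized plan *before* any step executes."""
    prompt = (
        "Review the following agent plan before execution.\n"
        "Decide whether it is risky, name the risk type, and explain why.\n\n"
        + normalize_plan(plan)
    )
    # `guardian_model` is any callable returning a structured verdict;
    # the output parsing below is purely illustrative.
    raw = guardian_model(prompt)
    return GuardrailVerdict(
        is_risky=raw["risky"],
        risk_type=raw.get("risk_type"),
        rationale=raw.get("rationale", ""),
    )

# Usage sketch: block execution if the plan is flagged.
# plan = [PlanStep("shell", {"cmd": "rm -rf /tmp/project"}), ...]
# verdict = guardrail_check(plan, guardian_model)
# if verdict.is_risky:
#     raise RuntimeError(f"Blocked ({verdict.risk_type}): {verdict.rationale}")
```

The key design point mirrored here is that the check consumes only the *plan*, so unsafe steps can be stopped before any tool is ever invoked.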
Why it matters?
This work is important because as AI agents become more powerful and are used in more real-world situations, preventing them from making harmful decisions is crucial. By focusing on pre-execution safety and providing tools for data generation, model building, and evaluation, this research helps pave the way for building more reliable and trustworthy AI systems that are far less likely to cause unintended harm.
Abstract
While LLM agents can plan multi-step tasks, intervening at the planning stage, before any action is executed, is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: data gap, model gap, and evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category-labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large and reliable corpora for pre-execution safety. To close the guardian model gap, we propose a foundational guardrail Safiron, combining a cross-planner adapter with a compact guardian model. The adapter unifies different input formats, while Safiron flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre-Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine-grained categorization, explanation, and cross-planner generalization in human-verified scenarios. Extensive experiments demonstrate consistent gains of the proposed guardrail over strong baselines on Pre-Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.
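The abstract's three-step data recipe (synthesize benign trajectories, inject category-labeled risks with a difficulty level, then filter with an automated reward model) can be pictured as a simple generation loop. The sketch below is only a schematic of that flow under assumptions: the risk categories, difficulty levels, threshold, and the `task_generator`, `risk_injector`, and `reward_model` callables are placeholders, not AuraGen's actual components.

```python
import random

# Hypothetical risk taxonomy and difficulty levels, for illustration only;
# the paper's actual categories and calibration are not reproduced here.
RISK_TYPES = ["privacy_leak", "destructive_action", "policy_violation"]
DIFFICULTIES = ["easy", "medium", "hard"]

def synthesize_benign_trajectory(task_generator):
    """Step (i): produce a benign multi-step trajectory for some task."""
    return task_generator()

def inject_risk(trajectory, risk_injector):
    """Step (ii): rewrite part of the trajectory to introduce a labeled risk."""
    risk_type = random.choice(RISK_TYPES)
    difficulty = random.choice(DIFFICULTIES)
    risky_trajectory = risk_injector(trajectory, risk_type, difficulty)
    return {"trajectory": risky_trajectory, "label": risk_type, "difficulty": difficulty}

def build_corpus(n, task_generator, risk_injector, reward_model, threshold=0.7):
    """Step (iii): keep only samples the automated reward model scores as reliable."""
    corpus = []
    while len(corpus) < n:
        benign = synthesize_benign_trajectory(task_generator)
        sample = inject_risk(benign, risk_injector)
        if reward_model(sample) >= threshold:  # filter out low-quality synthetic data
            corpus.append(sample)
    return corpus
```

The point of the reward-model filter in this kind of pipeline is quality control: synthetic data is cheap to produce in bulk, so discarding unreliable samples keeps the resulting training corpus both large and trustworthy.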