AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, Binxin Hu, Ling Tang, Jilin Mei, Dadi Guo, Leitao Yuan, Junyao Yang, Guanxu Chen, Qihao Lin, Yi Yu, Bo Zhang, Jiaxuan Guo, Jie Zhang

2026-01-28

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Summary

This paper addresses the growing safety concerns with AI agents, which are becoming more capable of independently using tools and interacting with their environment.

What's the problem?

As AI agents get smarter and more autonomous, it's becoming harder to ensure they behave safely and securely. Existing safety systems, often called 'guardrails,' aren't sophisticated enough to understand the complex risks these agents pose. They typically just say something is 'safe' or 'unsafe' without explaining *why*, making it difficult to fix problems or build trust.

What's the solution?

The researchers created a new system called AgentDoG, which acts as a more intelligent guardrail. They started by categorizing all the possible ways an AI agent could be risky, looking at *where* the risk comes from, *how* it might fail, and *what* the consequences could be. Then, they built a benchmark called ATBench to test agent safety in many different situations. AgentDoG doesn't just flag unsafe actions; it diagnoses *why* an action is unsafe or even unreasonable, providing a detailed explanation. They released different versions of AgentDoG, varying in size, compatible with popular AI models like Qwen and Llama, and made everything publicly available.

Why it matters?

This work is important because it moves beyond simple safety checks to provide a deeper understanding of AI agent behavior. By diagnosing the root causes of risky actions, it's easier to improve agent safety and build AI systems we can truly rely on. This is crucial as AI agents become more integrated into our daily lives and take on more complex tasks.

Abstract

The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured and hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained and contextual monitoring across agent trajectories. More Crucially, AgentDoG can diagnose the root causes of unsafe actions and seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation in diverse and complex interactive scenarios. All models and datasets are openly released.

View Paper