
Agent-SafetyBench: Evaluating the Safety of LLM Agents

Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, Minlie Huang

2024-12-24

Summary

This paper introduces Agent-SafetyBench, a new benchmark designed to evaluate the safety of large language model (LLM) agents as they interact with various environments and use tools to perform tasks.

What's the problem?

As LLMs are increasingly deployed as agents in real-world applications, they face new safety challenges that have not been thoroughly assessed. There is currently no comprehensive benchmark for evaluating how safely these agents behave when interacting with users and tools, which makes it hard to identify and fix potential safety issues.

What's the solution?

The authors developed Agent-SafetyBench, which includes 349 interaction environments and 2,000 test cases covering eight categories of safety risks. They tested 16 popular LLM agents and found that none scored above 60% on safety, indicating significant room for improvement. The study identified two fundamental weaknesses: a lack of robustness (agents failing when conditions change) and poor risk awareness (agents not recognizing potential dangers). The authors also concluded that relying on defensive prompts alone is not enough to ensure safety. A rough sketch of how such a benchmark score might be computed is shown below.
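
The following is a minimal, hypothetical sketch of how a safety score of this kind (the fraction of test-case interactions judged safe) could be computed. The helper names `evaluate_agent_safety`, `agent.run`, and `judge_is_safe` are illustrative assumptions, not the actual Agent-SafetyBench API; see the released repository for the real evaluation code.

```python
# Hypothetical sketch: score an agent as the share of test cases whose
# interaction trajectory is judged safe. Not the official benchmark code.

def evaluate_agent_safety(agent, test_cases, judge_is_safe):
    """Return the fraction of test cases where the agent's behavior is judged safe."""
    safe_count = 0
    for case in test_cases:
        # Each test case pairs a task with a simulated interaction environment
        # (and its emulated tools) plus an associated risk category.
        trajectory = agent.run(task=case["task"], environment=case["environment"])
        if judge_is_safe(trajectory, risk_category=case["risk_category"]):
            safe_count += 1
    return safe_count / len(test_cases)

# Example usage (placeholders): a result below 0.60 would mirror the paper's
# finding that none of the 16 evaluated agents exceeded a 60% safety score.
# score = evaluate_agent_safety(my_agent, load_test_cases(), my_safety_judge)
```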

Why it matters?

This research is important because it highlights the critical need for better safety evaluations in AI systems that interact with people. By identifying weaknesses in current LLM agents, Agent-SafetyBench provides a framework for improving their safety, which is essential as AI becomes more integrated into everyday life. Ensuring that these agents operate safely is crucial for building trust and preventing harmful interactions.

Abstract

As large language models (LLMs) are increasingly deployed as agents, their integration into interactive environments and tool use introduce new safety challenges beyond those associated with the models themselves. However, the absence of comprehensive benchmarks for evaluating agent safety presents a significant barrier to effective assessment and further improvement. In this paper, we introduce Agent-SafetyBench, a comprehensive benchmark designed to evaluate the safety of LLM agents. Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%. This highlights significant safety challenges in LLM agents and underscores the considerable need for improvement. Through quantitative analysis, we identify critical failure modes and summarize two fundamental safety defects in current LLM agents: lack of robustness and lack of risk awareness. Furthermore, our findings suggest that reliance on defense prompts alone is insufficient to address these safety issues, emphasizing the need for more advanced and robust strategies. We release Agent-SafetyBench at https://github.com/thu-coai/Agent-SafetyBench to facilitate further research and innovation in agent safety evaluation and improvement.