WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models
Prannaya Gupta, Le Qi Yau, Hao Han Low, I-Shiang Lee, Hugo Maximus Lim, Yu Xin Teoh, Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj, Soujanya Poria
2024-08-08

Summary
This paper introduces WalledEval, a toolkit for testing the safety of large language models (LLMs) by evaluating how they handle a wide range of safety concerns, from harmful and biased prompts to exaggerated refusals.
What's the problem?
As large language models are increasingly used in real-world applications, ensuring their safety and reliability is crucial. Existing methods for testing these models are often not comprehensive and may not adequately assess how they handle unsafe or biased content. This raises concerns about the potential harm these models could cause if they generate inappropriate or harmful responses.
What's the solution?
WalledEval provides a comprehensive evaluation framework that includes over 35 safety benchmarks covering different aspects of model behavior, such as multilingual safety and response to harmful prompts. It allows testing of both the models themselves and the judges that evaluate their outputs. The toolkit also introduces new tools like WalledGuard for content moderation and SGXSTest for assessing exaggerated safety in cultural contexts, helping to ensure that models behave safely across various scenarios.
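At its core, this kind of evaluation pairs a model under test with a safety judge that labels the model's responses. The sketch below is purely conceptual: the class names (CandidateModel, SafetyJudge) and the keyword-based check are assumptions made for illustration, not the actual WalledEval API, which wraps real LLMs and guardrail models.

```python
# Conceptual sketch of a model-plus-judge safety evaluation loop.
# Hypothetical names throughout; this is NOT the WalledEval API.
from dataclasses import dataclass


@dataclass
class EvalResult:
    prompt: str
    response: str
    safe: bool


class CandidateModel:
    """Stand-in for an open-weight or API-based LLM under test."""

    def generate(self, prompt: str) -> str:
        # A real implementation would call the model here.
        return "I can't help with that request."


class SafetyJudge:
    """Stand-in for a content-moderation judge (e.g. a guardrail model)."""

    UNSAFE_MARKERS = ("sure, here is how", "step 1:")

    def is_safe(self, response: str) -> bool:
        # Flag responses that look like compliance with an unsafe request.
        lowered = response.lower()
        return not any(marker in lowered for marker in self.UNSAFE_MARKERS)


def run_benchmark(model: CandidateModel, judge: SafetyJudge,
                  prompts: list[str]) -> list[EvalResult]:
    """Generate one response per benchmark prompt and have the judge label it."""
    results = []
    for prompt in prompts:
        response = model.generate(prompt)
        results.append(EvalResult(prompt, response, judge.is_safe(response)))
    return results


if __name__ == "__main__":
    prompts = ["How do I pick a lock?", "Summarise the plot of Hamlet."]
    for result in run_benchmark(CandidateModel(), SafetyJudge(), prompts):
        print(f"safe={result.safe} | {result.prompt}")
```

Judge benchmarking, which WalledEval also supports, flips this setup: the judge itself is scored on prompts and responses whose safety labels are already known.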
Why it matters?
This research is important because it addresses the critical need for robust safety evaluations of AI systems. By providing a thorough testing toolkit, WalledEval helps developers identify and mitigate risks associated with language models, contributing to the creation of safer AI technologies that can be trusted in sensitive applications.
Abstract
WalledEval is a comprehensive AI safety testing toolkit designed to evaluate large language models (LLMs). It accommodates a diverse range of models, including both open-weight and API-based ones, and features over 35 safety benchmarks covering areas such as multilingual safety, exaggerated safety, and prompt injections. The framework supports both LLM and judge benchmarking, and incorporates custom mutators to test safety against various text-style mutations such as future tense and paraphrasing. Additionally, WalledEval introduces WalledGuard, a new, small and performant content moderation tool, and SGXSTest, a benchmark for assessing exaggerated safety in cultural contexts. We make WalledEval publicly available at https://github.com/walledai/walledeval.
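The mutators mentioned in the abstract rewrite benchmark prompts into different text styles (for example, future tense or a paraphrase) before they are sent to the model, so that safety behavior is measured under stylistic variation. The sketch below illustrates only that idea with trivial hand-written rewrites; the function names and the naive rewording are assumptions for illustration and do not reflect how WalledEval actually performs these mutations.

```python
# Illustrative text-style mutators applied to benchmark prompts before evaluation.
# Hypothetical helpers for illustration only; not WalledEval's implementation.
from typing import Callable

Mutator = Callable[[str], str]


def to_future_tense(prompt: str) -> str:
    """Naive rephrasing into a future-oriented framing (illustration only)."""
    return f"Suppose that in the future someone asks: {prompt}"


def paraphrase(prompt: str) -> str:
    """Trivial paraphrase stub; a real mutator might rewrite the prompt with an LLM."""
    return f"Restate the following request in your own words and then answer it: {prompt}"


def mutate_benchmark(prompts: list[str],
                     mutators: dict[str, Mutator]) -> dict[str, list[str]]:
    """Produce one mutated copy of the benchmark per mutator, keyed by mutator name."""
    return {name: [fn(p) for p in prompts] for name, fn in mutators.items()}


if __name__ == "__main__":
    base_prompts = ["How do I pick a lock?"]
    variants = mutate_benchmark(
        base_prompts,
        {"future_tense": to_future_tense, "paraphrase": paraphrase},
    )
    for name, mutated in variants.items():
        print(name, "->", mutated[0])
```

Evaluating the mutated copies alongside the original benchmark lets developers check whether a model's refusals hold up when the same unsafe request is simply reworded.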