RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

Quy-Anh Dang, Chris Ngo, Truong-Son Hy

2026-01-08

Summary

This paper introduces RedBench, a new, standardized collection of tests designed to find weaknesses in large language models (LLMs), which are AI systems that can understand and generate human-like text.

What's the problem?

As LLMs are used in more important applications, like healthcare or finance, it's crucial to make sure they can't be tricked into giving harmful or incorrect responses. Existing tests for finding these weaknesses, called 'red teaming' datasets, are often disorganized, don't cover enough different situations, and quickly become outdated, making it hard to reliably assess how vulnerable these models really are.

What's the solution?

The researchers created RedBench by combining 37 different existing datasets into one large, standardized collection containing over 29,000 test cases. They also developed a clear system for categorizing the types of risks these tests reveal, grouping them into 22 risk categories and 19 different areas of knowledge. They then used RedBench to test several modern LLMs and made the dataset and their testing code publicly available.
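To make the idea of a standardized taxonomy concrete, here is a minimal sketch of how red-teaming samples with unified risk-category and domain labels might be represented and scored. This is purely illustrative: the class, field names, and scoring function are assumptions for explanation, not the actual RedBench schema or the API of the released `redeval` code.

```python
from dataclasses import dataclass

# Hypothetical schema: each sample carries a prompt plus standardized
# labels (one of the 22 risk categories and 19 domains in RedBench).
@dataclass
class RedTeamSample:
    prompt: str          # adversarial attack or benign refusal-test prompt
    kind: str            # "attack" or "refusal"
    risk_category: str   # e.g. "violence", "fraud" (assumed label names)
    domain: str          # e.g. "healthcare", "finance" (assumed label names)

def count_attack_failures(samples, model_fn, is_harmful_fn):
    """Tally, per risk category, how many attack prompts elicit a
    harmful response from the model under test. model_fn maps a prompt
    to a response; is_harmful_fn is a caller-supplied judge."""
    failures = {}
    for s in samples:
        if s.kind != "attack":
            continue  # refusal prompts would need a separate over-refusal check
        response = model_fn(s.prompt)
        if is_harmful_fn(response):
            failures[s.risk_category] = failures.get(s.risk_category, 0) + 1
    return failures
```

Because every sample shares the same label set, results from the 37 source datasets become directly comparable: a failure count in one risk category means the same thing regardless of which original benchmark the prompt came from.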

Why it matters?

RedBench provides a consistent and thorough way to evaluate the safety of LLMs, allowing researchers to compare different models and track improvements over time. This is important for building trust in these systems and ensuring they can be safely used in real-world applications where mistakes could have serious consequences.

Abstract

As large language models (LLMs) become integral to safety-critical applications, ensuring their robustness against adversarial prompts is paramount. However, existing red teaming datasets suffer from inconsistent risk categorizations, limited domain coverage, and outdated evaluations, hindering systematic vulnerability assessments. To address these challenges, we introduce RedBench, a universal dataset aggregating 37 benchmark datasets from leading conferences and repositories, comprising 29,362 samples across attack and refusal prompts. RedBench employs a standardized taxonomy with 22 risk categories and 19 domains, enabling consistent and comprehensive evaluations of LLM vulnerabilities. We provide a detailed analysis of existing datasets, establish baselines for modern LLMs, and open-source the dataset and evaluation code. Our contributions facilitate robust comparisons, foster future research, and promote the development of secure and reliable LLMs for real-world deployment. Code: https://github.com/knoveleng/redeval