Granite Guardian
Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri
2024-12-11

Summary
This paper introduces Granite Guardian, a suite of models designed to detect and manage risks in the prompts given to, and the responses generated by, large language models (LLMs), supporting their safe and responsible use.
What's the problem?
As AI-generated content becomes more common, there are growing concerns about harmful outputs such as biased language, profanity, and misinformation. Traditional risk detection models often miss important issues like 'jailbreaking' (manipulating a model into bypassing its restrictions) and risks specific to retrieval-augmented generation (RAG), where the model's response is supposed to be grounded in retrieved documents.
What's the solution?
The authors introduce Granite Guardian, a comprehensive suite of models that identify a wide range of risks. These models are trained on a unique dataset combining human annotations from diverse sources with synthetic data. They cover risk areas including social bias, violence, jailbreaking, and hallucination-related issues in RAG, such as responses that are not grounded in the retrieved context. Granite Guardian is released as open source, so anyone can use it to improve the safety of their AI systems.
Why does it matter?
This research matters because it provides an open, broadly applicable tool for checking that LLM prompts and outputs are safe and appropriate. By covering many risk dimensions in one place, Granite Guardian helps promote responsible AI development and use, making it a valuable resource for developers and organizations that rely on large language models.
Abstract
We introduce the Granite Guardian models, a suite of safeguards designed to provide risk detection for prompts and responses, enabling safe and responsible use in combination with any large language model (LLM). These models offer comprehensive coverage across multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset combining human annotations from diverse sources and synthetic data, Granite Guardian models address risks typically overlooked by traditional risk detection models, such as jailbreaks and RAG-specific issues. With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. Released as open-source, Granite Guardian aims to promote responsible AI development across the community. https://github.com/ibm-granite/granite-guardian
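To give a concrete sense of how a detector like this is used alongside another LLM, below is a minimal sketch of screening a user prompt for harm with a Granite Guardian checkpoint through Hugging Face transformers. The checkpoint name, the guardian_config template argument, and the "Yes"/"No" output convention are assumptions drawn from the project's public model cards rather than from this summary; the GitHub repository linked above has the authoritative usage.

```python
# Minimal sketch (assumed usage, not taken from the paper) of running a
# Granite Guardian checkpoint as a prompt-risk detector with transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-guardian-3.0-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Risk definition to screen for; "harm" is one of the risk dimensions the paper covers.
guardian_config = {"risk_name": "harm"}
messages = [{"role": "user", "content": "How can I get back at my coworker without getting caught?"}]

# The model's chat template is expected to wrap the conversation in the guardian
# prompt; guardian_config is an assumed extra kwarg forwarded to that template.
input_ids = tokenizer.apply_chat_template(
    messages,
    guardian_config=guardian_config,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=20, do_sample=False)

# The detector is expected to answer with a short "Yes" (risky) / "No" (safe) label.
verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()
print(f"risk detected: {verdict}")
```

In a real deployment the same call pattern would typically be repeated on the assistant's response, and, for RAG pipelines, on the retrieved context, with the risk name switched to the relevant dimension (e.g., groundedness or answer relevance).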