
PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines

Reya Vir, Shreya Shankar, Harrison Chase, Will Fu-Hinthorn, Aditya Parameswaran

2025-04-22

Summary

This paper introduces PROMPTEVALS, a new dataset of prompts paired with assertion criteria, used to test how reliably large language models behave when they're deployed in real-world production pipelines.

What's the problem?

The problem is that when companies or developers build language models into their products, it's hard to guarantee the models always give safe, accurate, and helpful answers. There aren't enough good tools or real-world examples for writing and testing strong rules, called guardrails or assertions, that check a model's outputs in production.

What's the solution?

The researchers created the PROMPTEVALS dataset, a large collection of real developer prompts paired with specific criteria for what counts as a good or bad response. They showed that open-source models fine-tuned on this dataset could generate better assertions and checks than even advanced closed models like GPT-4o, making LLM pipelines more reliable in real use.

Why it matters?

This matters because it helps make AI systems more trustworthy and safer for everyone, especially when they're used in important or sensitive situations like customer service, healthcare, or education.

Abstract

The PROMPTEVALS dataset contains a large number of prompts and assertion criteria for assessing the reliability of LLMs in production; models fine-tuned on it generate better assertions than GPT-4o.