FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality
Mingda Chen, Yang Li, Xilun Chen, Adina Williams, Gargi Ghosh, Scott Yih
2025-08-07
Summary
This paper introduces FACTORY, a new collection of human-verified prompts for testing how accurate and truthful long-form answers from language models are. It helps measure how well AI models stick to the facts in longer responses.
What's the problem?
Existing datasets used to check whether language models give factual information in long answers are often too easy or contain inaccurate prompts, which makes it hard to truly evaluate how reliable these models are when they generate detailed text.
What's the solution?
The solution is FACTORY, a set of prompts carefully verified by humans to ensure they call for precise, factual answers. This set is more difficult and broader in coverage than previous ones, allowing better testing of models' ability to produce truthful long-form content.
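Benchmarks like this typically score a long-form response at the claim level: split the answer into atomic claims, check each one against evidence, and report the fraction supported. The sketch below is an illustration of that general scoring idea, not the paper's exact pipeline; the claim list and the hand-labeled lookup table stand in for a real claim extractor and fact-checker.

```python
# Illustrative sketch of claim-level factual precision (an assumption about
# the general scoring setup, not FACTORY's exact evaluation pipeline).

def factual_precision(claims, is_supported):
    """Fraction of atomic claims judged supported by evidence."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim))
    return supported / len(claims)

# Toy example: a hand-labeled table stands in for a human or model checker.
knowledge = {
    "Paris is the capital of France.": True,
    "The Eiffel Tower was completed in 1889.": True,
    "The Seine flows through Berlin.": False,
}
claims = list(knowledge)
score = factual_precision(claims, lambda c: knowledge[c])
print(round(score, 2))  # 2 of 3 claims supported
```

A harder prompt set drives this score down for the same model, which is exactly the gap a more challenging benchmark is meant to expose.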
Why it matters?
This matters because improving how we measure factual accuracy in AI responses helps developers make more trustworthy and reliable language models. It ensures that when AI is used for tasks like writing reports or answering questions, the information it provides is more likely to be correct.
Abstract
FACTORY, a human-verified prompt set, evaluates the factuality of long-form responses from language models, revealing lower factual accuracy than existing datasets do and confirming that it is the more challenging benchmark.