
SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

Lukas Haas, Gal Yona, Giovanni D'Antonio, Sasha Goldshtein, Dipanjan Das

2025-09-10


Summary

This paper introduces a new, improved way to test how well large language models, like the ones powering chatbots, stick to the facts when answering short questions. It's called SimpleQA Verified and aims to be a more trustworthy test than previous methods.

What's the problem?

Existing tests for factuality in language models, specifically OpenAI's SimpleQA, had serious flaws: many questions or answers were labeled incorrectly, certain topics were heavily over-represented while others were barely covered, and a substantial number of questions were near-duplicates of each other. This made it hard to get a clear picture of how accurate these models actually are.

What's the solution?

The researchers created SimpleQA Verified by carefully cleaning up the original dataset. They removed duplicate questions, rebalanced the set so it covers a wider range of topics, and reconciled answers against their sources to confirm that the labeled answer was actually correct. They also improved the autorater prompt, the instructions given to the model that automatically grades answers. They then used this new 1,000-question benchmark to evaluate frontier models, including Gemini 2.5 Pro and GPT-5.
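The cleanup steps described above can be sketched as a small filtering pipeline. This is an illustrative assumption, not the paper's actual code: the question records, topic labels, duplicate key, and per-topic cap are all invented for the example.

```python
from collections import defaultdict

def filter_benchmark(questions, max_per_topic=120):
    """Illustrative multi-stage filter (hypothetical sketch, not the
    paper's pipeline): de-duplicate questions, then cap each topic so
    no single area dominates the benchmark."""
    # Stage 1: de-duplication on a naive normalized-text key.
    seen = set()
    deduped = []
    for q in questions:
        key = q["question"].strip().lower()
        if key not in seen:
            seen.add(key)
            deduped.append(q)

    # Stage 2: topic balancing by capping questions per topic.
    per_topic = defaultdict(int)
    balanced = []
    for q in deduped:
        if per_topic[q["topic"]] < max_per_topic:
            per_topic[q["topic"]] += 1
            balanced.append(q)
    return balanced

# Example: the duplicate is dropped, leaving one question per topic.
sample = [
    {"question": "Who wrote Hamlet?", "topic": "arts"},
    {"question": "who wrote hamlet? ", "topic": "arts"},
    {"question": "What is the capital of Peru?", "topic": "geography"},
]
print(len(filter_benchmark(sample, max_per_topic=1)))  # → 2
```

A real pipeline would also need the source-reconciliation stage (checking each answer against references), which is hard to reduce to a few lines; the sketch only covers the mechanical filtering.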

Why it matters?

Having a reliable way to measure factuality is crucial because large language models sometimes 'hallucinate', meaning they confidently state things that aren't true. SimpleQA Verified gives researchers a better tool to track improvements in these models and push them to be more accurate, ultimately leading to more trustworthy AI systems.

Abstract

We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.
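For context on the F1-score cited above: in SimpleQA-style evaluation, F1 is (to our understanding) the harmonic mean of overall correctness and correctness on attempted questions, which rewards models both for answering accurately and for abstaining rather than guessing. A minimal sketch, with invented example numbers that do not correspond to any model's reported results:

```python
def simpleqa_f1(num_correct, num_attempted, num_total):
    """Harmonic mean of overall correctness (correct / total) and
    precision on attempted questions (correct / attempted), as in
    SimpleQA-style scoring (our reading; numbers below are made up)."""
    if num_attempted == 0 or num_correct == 0:
        return 0.0
    recall = num_correct / num_total          # correct over all prompts
    precision = num_correct / num_attempted   # correct given attempted
    return 2 * precision * recall / (precision + recall)

# e.g. a model answers 700 of 1,000 prompts and gets 450 right
print(round(simpleqa_f1(450, 700, 1000), 3))  # → 0.529
```

Under this metric, a model that attempts everything and guesses often can score worse than one that abstains on questions it is unsure about, which is why it is used to track hallucination behavior rather than raw accuracy alone.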