The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
Aileen Cheng, Alon Jacovi, Amir Globerson, Ben Golan, Charles Kwong, Chris Alberti, Connie Tao, Eyal Ben-David, Gaurav Singh Tomar, Lukas Haas, Yonatan Bitton, Adam Bloniarz, Aijun Bai, Andrew Wang, Anfal Siddiqui, Arturo Bajuelos Castillo, Aviel Atias, Chang Liu, Corey Fry, Daniel Balle, Deepanway Ghosal, Doron Kukliansky
2025-12-12
Summary
This paper introduces a new way to test how well language models, like the ones powering chatbots, stick to the truth when generating text. It's called the FACTS Leaderboard, and it's designed to give a complete picture of a model's factual accuracy.
What's the problem?
Large language models are getting really good at *sounding* confident, but they often make things up or get facts wrong. Existing methods for checking their accuracy are often limited to specific types of questions or don't fully capture how a model performs in different real-world situations. There was a need for a more comprehensive and reliable way to evaluate if these models are actually truthful.
What's the solution?
The researchers created the FACTS Leaderboard, which isn't just one test but four different tests combined into one overall score. These tests check factuality in different ways: answering questions about images, recalling general world knowledge without external help, using a search engine to find information, and making sure long-form answers are supported by provided documents. Importantly, these tests use other AI models as automated judges to score the responses, making the process faster and more consistent. The leaderboard is publicly available, so anyone can submit models and see how they compare.
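The paper states that the final suite score is simply the average of the four sub-leaderboard scores. A minimal sketch of that aggregation, assuming hypothetical function and score names (this is not the official FACTS API):

```python
# Sketch of the suite-level aggregation described in the paper: the
# overall FACTS score is the unweighted mean of the four sub-leaderboard
# scores. All names here are illustrative assumptions.

def facts_suite_score(multimodal: float, parametric: float,
                      search: float, grounding: float) -> float:
    """Average the four sub-leaderboard scores into one suite score."""
    return (multimodal + parametric + search + grounding) / 4

# Example: hypothetical per-leaderboard scores for one model.
score = facts_suite_score(0.70, 0.80, 0.60, 0.90)
print(score)  # the mean of the four scores, here 0.75
```

Because the average is unweighted, a weak result on any one sub-leaderboard (say, grounding) pulls the overall score down by the same amount as a weak result on any other, which is what gives the suite its "balanced" character.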
Why does it matter?
Ensuring language models are factually accurate is crucial as they become more integrated into our lives. If these models are used for things like delivering news or medical advice, incorrect information could have serious consequences. The FACTS Leaderboard provides a standardized, rigorous way to measure and improve the reliability of these models, ultimately helping to build more trustworthy AI systems.
Abstract
We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts.