Evaluating LLMs on Real-World Forecasting Against Human Superforecasters

Janna Lu

2025-07-08

Summary

This paper evaluates large language models (LLMs) on how well they forecast real-world events compared with expert human forecasters known as superforecasters. It finds that, although the LLMs score well by some measures, they still fall short of the accuracy these experts achieve.

What's the problem?

The problem is that even the best current AI models, though skilled with language, struggle to predict future events as accurately as experienced human forecasters. This gap makes it risky to rely on AI alone for important forecasting tasks.

What's the solution?

The researchers tested state-of-the-art LLMs on real forecasting questions alongside human superforecasters. Using rigorous scoring methods, they found that although the LLMs outperformed the average human crowd, they still lagged behind superforecasters, who have a track record of very precise predictions.

Why it matters?

This matters because forecasting is crucial for decision-making in many areas like economics, politics, and health. Understanding the limits of current AI helps guide future research to improve these models so they can better help humans make smarter, more reliable predictions.

Abstract

State-of-the-art large language models underperform human superforecasters in forecasting accuracy despite achieving Brier scores that surpass the human crowd.
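The Brier score mentioned above is simply the mean squared difference between predicted probabilities and binary outcomes; lower is better, and always guessing 50% scores 0.25. A minimal sketch of the metric, using illustrative made-up numbers rather than the paper's data:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between predicted probabilities and binary
    outcomes (0 = event did not happen, 1 = event happened).
    Lower is better; a constant 0.5 forecast scores 0.25."""
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical example: a confident, mostly well-calibrated forecaster.
print(brier_score([0.9, 0.8, 0.1], [1, 1, 0]))  # 0.02
```

A forecaster can thus beat a crowd's average Brier score while still trailing superforecasters, whose probabilities sit even closer to the realized outcomes.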