
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, Mingren Yin, Zhenwei Zhu, Tianle Cai, Zehui Chen, Jiecao Chen, Yantao Du, Xiang Gao, Jiacheng Guo, Liang Hu, Jianpeng Jiao, Xiangsheng Li, Jingkai Liu, Shuang Ni

2025-08-21


Summary

This paper introduces FutureX, a new way to test how well AI agents built on large language models (LLM agents) can predict future events. It's a large, continuously updated test that checks whether these agents can handle information that changes all the time, just like human experts do in fields like politics or finance. The researchers tested many different AI models to see how they perform and found where they struggle, such as being fooled by fake websites or failing to check whether information is still current.

What's the problem?

Predicting the future is really hard for AI language models because it requires them to understand a lot of changing information, figure out what's important, and make reasonable guesses even when they're not totally sure. Right now there isn't a good, reliable way to test how well AI agents can do this, especially because real-world information changes so quickly and it's tough to collect accurate, timely answers. That gap makes it hard to measure or improve these models' prediction skills.

What's the solution?

To fix this, the researchers created FutureX, a large-scale test specifically designed to evaluate LLM agents on their future prediction abilities. FutureX is updated with new questions every day, and it uses an automated system to gather questions and later collect the real answers once events happen, which prevents the AI from simply memorizing answers it has already seen. They used this benchmark to test 25 different AI models, including ones with reasoning, search, and external tools, looking at how well they adapt and perform when faced with constantly changing data, and they also analyzed why these agents sometimes fail. A simplified sketch of how such a live evaluation loop could work is shown below.
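To make the idea of a "live, contamination-free" benchmark more concrete, here is a minimal sketch in Python of how such a daily question-and-answer loop could be structured. This is not the paper's actual pipeline; all class and method names (PredictionQuestion, LiveBenchmark, ingest_daily_questions, resolve_and_score) are hypothetical illustrations of the general mechanism described above: questions are created before their outcomes exist, model predictions are recorded, and scoring happens only after the real-world event resolves.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class PredictionQuestion:
    """One future-prediction question in a live benchmark (hypothetical schema)."""
    question_id: str
    text: str                            # e.g. "Will event E happen by the resolution date?"
    created_on: date
    resolution_date: date
    ground_truth: Optional[str] = None   # filled in only after the event resolves

@dataclass
class LiveBenchmark:
    """Toy model of a daily-updated, contamination-free evaluation loop."""
    open_questions: dict = field(default_factory=dict)   # question_id -> PredictionQuestion
    predictions: dict = field(default_factory=dict)      # question_id -> {model_name: answer}

    def ingest_daily_questions(self, new_questions: list) -> None:
        # Questions are created *before* their outcomes exist, so no model
        # can have seen the answer during pretraining (contamination control).
        for q in new_questions:
            self.open_questions[q.question_id] = q

    def record_prediction(self, model_name: str, question_id: str, answer: str) -> None:
        # Store each model's prediction while the question is still open.
        self.predictions.setdefault(question_id, {})[model_name] = answer

    def resolve_and_score(self, question_id: str, ground_truth: str) -> dict:
        # Once the real-world event resolves, fill in the answer and grade
        # every model's earlier prediction against it.
        q = self.open_questions.pop(question_id)
        q.ground_truth = ground_truth
        return {
            model: answer.strip().lower() == ground_truth.strip().lower()
            for model, answer in self.predictions.get(question_id, {}).items()
        }

# Example with made-up data: open a question today, resolve it later.
bench = LiveBenchmark()
q = PredictionQuestion("q1", "Will event E happen by 2025-09-01?",
                       created_on=date(2025, 8, 21), resolution_date=date(2025, 9, 1))
bench.ingest_daily_questions([q])
bench.record_prediction("model-a", "q1", "yes")
print(bench.resolve_and_score("q1", "yes"))   # {'model-a': True}
```

The key design point this sketch illustrates is the time gap between question creation and answer collection: because the ground truth does not exist when the question is posed, a model cannot succeed by recalling memorized answers and must actually reason about the future.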

Why it matters?

This work is important because it gives us a real-world, constantly updated way to measure how good AI agents are at predicting the future. By providing this benchmark, the researchers aim to push the development of AI that can think and predict outcomes as well as experienced human professionals. This will help create more capable AI for complex tasks that require understanding trends and making informed predictions.

Abstract

Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields like politics, economics, and finance. Despite its importance, no large-scale benchmark exists for evaluating agents on future prediction, largely due to challenges in handling real-time updates and retrieving timely, accurate answers. To address this, we introduce FutureX, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction, supporting real-time daily updates and eliminating data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools such as the open-source Deep Research Agent and closed-source Deep Research models. This comprehensive evaluation assesses agents' adaptive reasoning and performance in dynamic environments. Additionally, we provide in-depth analyses of agents' failure modes and performance pitfalls in future-oriented tasks, including their vulnerability to fake web pages and issues with temporal validity. Our goal is to establish a dynamic, contamination-free evaluation standard that drives the development of LLM agents capable of performing at the level of professional human analysts in complex reasoning and predictive thinking.