LiveTradeBench: Seeking Real-World Alpha with Large Language Models
Haofei Yu, Fenghai Li, Jiaxuan You
2025-11-06
Summary
This paper introduces a new way to test how well large language models (LLMs) can make decisions in real-world, constantly changing situations, specifically in financial markets.
What's the problem?
Current tests for LLMs, like quizzes or math problems, are static and don't reflect the uncertainty and dynamic nature of real-life decision-making. They check if an LLM *knows* something or can solve a problem, but not if it can consistently make good choices when things are unpredictable and changing over time, like when trading stocks.
What's the solution?
The researchers created 'LiveTradeBench,' a live trading environment that streams real-time market data and news rather than replaying historical data. LLMs act as agents, observing market conditions and deciding how to allocate money across different investments. This setup tests their ability to manage risk, react to new information, and make consistent decisions in a live, evolving market, using both U.S. stock data and Polymarket prediction markets.
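To make the observe-then-allocate loop concrete, here is a minimal sketch in Python. The function and field names (`allocate`, `prices`, `news`, `portfolio`, `CASH`) are illustrative assumptions, not LiveTradeBench's actual API; the equal-weight policy is a toy baseline standing in for an LLM's decision.

```python
# Hypothetical sketch of one decision step: the agent observes prices,
# news, and its current portfolio, and returns percentage allocations
# across assets (plus cash) that sum to 100.

def allocate(observation: dict) -> dict:
    """Toy baseline policy: equal-weight across observed assets,
    reserving one equal share for cash."""
    assets = list(observation["prices"])
    weight = round(100 / (len(assets) + 1), 2)  # one extra slot for cash
    allocation = {asset: weight for asset in assets}
    allocation["CASH"] = round(100 - weight * len(assets), 2)
    return allocation

# One observation at a single step (illustrative values).
step = {
    "prices": {"AAPL": 189.3, "NVDA": 472.1},
    "news": ["Fed holds rates steady"],
    "portfolio": {"CASH": 100.0},
}
print(allocate(step))
```

In the real benchmark, an LLM replaces the equal-weight rule: it reads the same observation (including news text) and outputs the allocation percentages, which the environment then applies at live market prices.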
Why it matters?
The results showed that LLMs that perform well on standard benchmarks don't necessarily make good trading decisions. This highlights a significant gap between how we currently evaluate these models and how they perform in complex, real-world scenarios, emphasizing the need for benchmarks that test sequential decision-making and adaptation under live uncertainty.
Abstract
Large language models (LLMs) achieve strong performance across benchmarks--from knowledge quizzes and math reasoning to web-agent tasks--but these tests occur in static settings, lacking real dynamics and uncertainty. Consequently, they evaluate isolated reasoning or problem-solving rather than decision-making under uncertainty. To address this, we introduce LiveTradeBench, a live trading environment for evaluating LLM agents in realistic and evolving markets. LiveTradeBench follows three design principles: (i) Live data streaming of market prices and news, eliminating dependence on offline backtesting and preventing information leakage while capturing real-time uncertainty; (ii) a portfolio-management abstraction that extends control from single-asset actions to multi-asset allocation, integrating risk management and cross-asset reasoning; and (iii) multi-market evaluation across structurally distinct environments--U.S. stocks and Polymarket prediction markets--differing in volatility, liquidity, and information flow. At each step, an agent observes prices, news, and its portfolio, then outputs percentage allocations that balance risk and return. Using LiveTradeBench, we run 50-day live evaluations of 21 LLMs across families. Results show that (1) high LMArena scores do not imply superior trading outcomes; (2) models display distinct portfolio styles reflecting risk appetite and reasoning dynamics; and (3) some LLMs effectively leverage live signals to adapt decisions. These findings expose a gap between static evaluation and real-world competence, motivating benchmarks that test sequential decision making and consistency under live uncertainty.