QuitoBench: A High-Quality Open Time Series Forecasting Benchmark
Siqiao Xue, Zhaoyang Zhu, Wei Zhang, Rongyao Cai, Rui Wang, Yixiang Mu, Fan Zhou, Jianguo Li, Peng Di, Hang Yu
2026-04-02
Summary
This paper introduces a new benchmark called QuitoBench for evaluating how well different models can predict future values in time series data, which is important for things like financial forecasting, healthcare predictions, and managing cloud resources.
What's the problem?
Currently, it's hard to compare time series forecasting models effectively because there aren't enough large, well-organized collections of data that represent the different kinds of patterns you see in real-world time series. Existing datasets often focus on specific applications instead of the underlying characteristics that make forecasting difficult, like trends, seasonality, and how predictable the data actually is.
What's the solution?
The researchers created QuitoBench using Quito, a massive corpus of application traffic data from the Alipay payment platform. This corpus contains a billion data points spanning nine business domains, and the benchmark built on it is organized to cover eight different combinations of time series characteristics. They then tested ten different forecasting models – some based on traditional statistics, some using deep learning, and some using newer 'foundation models' – on over 230,000 different forecasting scenarios within QuitoBench. By analyzing the results, they identified key trends, such as how model rankings flip between short and long input (context) windows, and how much the inherent predictability of the data impacts accuracy.
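The eight combinations come from crossing three binary properties of a series: whether it trends, whether it is seasonal, and whether it is forecastable. The paper's exact criteria are not reproduced here, so the sketch below is only a rough illustration of the idea: it assigns a 3-bit regime label using simple proxies – the variance explained by a linear fit for trend, autocorrelation at the seasonal lag for seasonality, and lag-1 autocorrelation as a crude stand-in for forecastability. All function names and thresholds are illustrative assumptions, not QuitoBench's actual procedure.

```python
import math

def _var(x):
    m = sum(x) / len(x)
    return sum((xi - m) ** 2 for xi in x) / len(x)

def trend_strength(x):
    """Fraction of variance explained by a least-squares linear trend."""
    n = len(x)
    t = list(range(n))
    tm, xm = sum(t) / n, sum(x) / n
    slope = (sum((ti - tm) * (xi - xm) for ti, xi in zip(t, x))
             / sum((ti - tm) ** 2 for ti in t))
    resid = [xi - (xm + slope * (ti - tm)) for ti, xi in zip(t, x)]
    vx = _var(x)
    return 0.0 if vx == 0 else max(0.0, 1.0 - _var(resid) / vx)

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    m = sum(x) / len(x)
    den = sum((xi - m) ** 2 for xi in x)
    if den == 0:
        return 0.0
    return sum((x[i] - m) * (x[i + lag] - m) for i in range(len(x) - lag)) / den

def regime(x, period=24, thresholds=(0.5, 0.3, 0.5)):
    """Label a series with one of 2*2*2 = 8 TSF regimes as a 3-bit string."""
    t = trend_strength(x) > thresholds[0]         # trending?
    s = abs(autocorr(x, period)) > thresholds[1]  # seasonal at `period`?
    f = abs(autocorr(x, 1)) > thresholds[2]       # crude forecastability proxy
    return f"T{int(t)}S{int(s)}F{int(f)}"
```

For example, ten full cycles of a period-24 sine wave land in the no-trend, seasonal, forecastable bucket (`T0S1F1`), while a pure linear ramp has trend strength 1.0; varying all three properties independently yields the eight buckets.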
Why it matters?
This work is important because it provides a standardized way to evaluate and compare time series forecasting models. The findings show that foundation models excel when given long input contexts, but deep learning models can match their accuracy with far fewer parameters. It also highlights that having more training data is often more beneficial than simply making the model bigger, and it gives researchers a valuable open resource for developing and testing new forecasting techniques.
Abstract
Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce QuitoBench, a regime-balanced benchmark for time series forecasting with coverage across eight trend × seasonality × forecastability (TSF) regimes, designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built upon Quito, a billion-scale time series corpus of application traffic from Alipay spanning nine business domains. Benchmarking 10 models from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we report four key findings: (i) a context-length crossover where deep learning models lead at short context (L = 96) but foundation models dominate at long context (L ≥ 576); (ii) forecastability is the dominant difficulty driver, producing a 3.64× MAE gap across regimes; (iii) deep learning models match or surpass foundation models at 59× fewer parameters; and (iv) scaling the amount of training data provides substantially greater benefit than scaling model size for both model families. These findings are validated by strong cross-benchmark and cross-metric consistency. Our open-source release enables reproducible, regime-aware evaluation for time series forecasting research.