
Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, Ming Jin

2024-09-25


Summary

This paper introduces Time-MoE, a foundation model for time series forecasting. It uses a sparse Mixture of Experts (MoE) design to improve efficiency and performance, allowing the model to be trained on very large amounts of data while keeping inference costs low.

What's the problem?

Deep learning has advanced time series forecasting, the task of predicting future values from past observations. However, existing pre-trained time series models are often expensive to run and limited in scale, which restricts their usefulness in real-world applications that involve vast amounts of data.

What's the solution?

To address this, the researchers developed Time-MoE, which combines many specialized sub-networks (experts) that are activated only when needed. Instead of using every part of the model for each prediction, a router selects just a few experts at a time, which keeps computation low while preserving model capacity (a sketch of this routing idea appears below). They trained the model on a massive new dataset called Time-300B, which contains over 300 billion time points from a wide range of domains. This allowed them to build a powerful 2.4-billion-parameter forecasting model without a corresponding increase in inference cost.
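To make the routing idea concrete, here is a minimal sketch of a sparse mixture-of-experts layer with top-k routing, in the spirit of the design described above. The layer sizes, number of experts, and top-k value are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a sparse MoE layer with top-k routing.
# Sizes and expert counts are illustrative, not Time-MoE's real settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=256, d_hidden=512, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The router scores every expert for each token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                              # x: (batch, seq, d_model)
        scores = self.router(x)                        # (batch, seq, num_experts)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)               # weights over chosen experts
        out = torch.zeros_like(x)
        # Only the top-k experts per token are evaluated, so compute scales
        # with k rather than with the total number of experts.
        for slot in range(self.top_k):
            idx = top_idx[..., slot]                   # (batch, seq)
            w = top_w[..., slot].unsqueeze(-1)         # (batch, seq, 1)
            for e, expert in enumerate(self.experts):
                mask = (idx == e)
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out
```

In the full model such a layer would replace the feed-forward block inside each decoder layer; it is shown here in isolation for clarity.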

Why it matters?

This research is important because it represents a significant step forward in time series forecasting technology. By making models more efficient and scalable, Time-MoE can be used in various fields such as finance, weather forecasting, and healthcare, where accurate predictions are crucial. This advancement could lead to better decision-making and improved outcomes in many real-world applications.

Abstract

Deep learning for time series forecasting has seen significant advancements over the past decades. However, despite the success of large-scale pre-training in language and vision domains, pre-trained time series models remain limited in scale and operate at a high cost, hindering the development of larger, more capable forecasting models in real-world applications. In response, we introduce Time-MoE, a scalable and unified architecture designed to pre-train larger, more capable forecasting foundation models while reducing inference costs. By leveraging a sparse mixture-of-experts (MoE) design, Time-MoE enhances computational efficiency by activating only a subset of networks for each prediction, reducing computational load while maintaining high model capacity. This allows Time-MoE to scale effectively without a corresponding increase in inference costs. Time-MoE comprises a family of decoder-only transformer models that operate in an auto-regressive manner and support flexible forecasting horizons with varying input context lengths. We pre-trained these models on our newly introduced large-scale dataset Time-300B, which spans over 9 domains and encompasses over 300 billion time points. For the first time, we scaled a time series foundation model up to 2.4 billion parameters, achieving significantly improved forecasting precision. Our results validate the applicability of scaling laws for training tokens and model size in the context of time series forecasting. Compared to dense models with the same number of activated parameters or equivalent computation budgets, our models consistently outperform them by a large margin. These advancements position Time-MoE as a state-of-the-art solution for tackling real-world time series forecasting challenges with superior capability, efficiency, and flexibility.
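As a rough illustration of the auto-regressive, horizon-flexible forecasting described in the abstract, the sketch below rolls a decoder-only point forecaster forward by feeding its own predictions back as context. The `model` interface and the `step` parameter are assumptions made for this sketch, not Time-MoE's released API.

```python
# Illustrative auto-regressive rollout: one pretrained decoder-only model
# serves any forecast horizon by consuming its own predictions.
import torch

@torch.no_grad()
def autoregressive_forecast(model, context, horizon, step=1):
    # context: (batch, context_len) tensor of observed history.
    # Assumption for this sketch: the model returns one prediction per
    # input position, so we keep only the newest `step` outputs each turn.
    preds = []
    window = context
    num_iters = (horizon + step - 1) // step
    for _ in range(num_iters):
        out = model(window)                             # (batch, seq_len) predictions
        next_vals = out[:, -step:]                      # newest predictions
        preds.append(next_vals)
        window = torch.cat([window, next_vals], dim=1)  # extend the context
    return torch.cat(preds, dim=1)[:, :horizon]
```

Because the rollout length is chosen at inference time, the same pretrained model can serve short- and long-horizon forecasts without retraining, which is the flexibility the abstract highlights.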