
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

Mouxiang Chen, Lefei Shen, Zhuo Li, Xiaoyun Joy Wang, Jianling Sun, Chenghao Liu

2024-09-03

Summary

This paper introduces VisionTS, a new approach that uses a model pre-trained on natural images to forecast time series data without needing extensive training on time series itself.

What's the problem?

Most existing attempts at building general-purpose (foundation) models for time series forecasting, such as predicting stock prices or weather patterns, either repurpose large language models, which were trained on text rather than numbers, or rely on assembling huge collections of time series from many different sources, which are often inconsistent with one another. Both approaches make it hard to get accurate predictions, especially when the data comes from different sources or domains.

What's the solution?

The authors propose a new method called VisionTS that treats time series forecasting as an image reconstruction problem. They use a visual masked autoencoder (MAE), a model that has been self-supervised pre-trained on a large dataset of natural images (ImageNet). Surprisingly, this model can make good zero-shot predictions about time series data without any extra training on that kind of data, and with just a little fine-tuning, VisionTS achieves excellent results compared to models designed specifically for time series forecasting. A simplified sketch of the reformulation is shown below.
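To make the reformulation concrete, here is a minimal sketch, not the authors' implementation: a 1D series is arranged into a 2D grayscale "image" according to its period, and the forecast horizon is left as a masked region for an image model to fill in. Function names such as `series_to_image` and `append_masked_horizon` are made up for illustration, and details like normalization and resizing to the MAE's input resolution are omitted.

```python
import numpy as np

def series_to_image(series: np.ndarray, period: int) -> np.ndarray:
    """Arrange a 1D series into a 2D array: each column holds one period."""
    n_periods = len(series) // period
    # shape (period, n_periods): time runs down each column, periods run left to right
    return series[: n_periods * period].reshape(n_periods, period).T

def append_masked_horizon(image: np.ndarray, horizon: int, period: int):
    """Append zero-filled columns covering the forecast horizon, plus a boolean mask."""
    extra_cols = -(-horizon // period)               # ceil(horizon / period)
    pad = np.zeros((image.shape[0], extra_cols))
    masked_image = np.concatenate([image, pad], axis=1)
    mask = np.zeros_like(masked_image, dtype=bool)
    mask[:, -extra_cols:] = True                     # True = region the model must reconstruct
    return masked_image, mask

# Toy usage: a noisy pattern with period 24, forecasting the next 48 steps.
t = np.arange(24 * 10)
series = np.sin(2 * np.pi * t / 24) + 0.1 * np.random.randn(t.size)
img = series_to_image(series, period=24)                     # shape (24, 10)
masked_img, mask = append_masked_horizon(img, horizon=48, period=24)
# In VisionTS, an image like `masked_img` (after normalization and resizing) is fed to
# an ImageNet-pretrained masked autoencoder, and the reconstructed masked columns are
# read back out as the forecast. The MAE itself is omitted in this sketch.
```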

Why it matters?

This research is important because it opens up new possibilities for using visual data in predicting trends over time. By showing that image-based models can effectively handle time series tasks, it encourages further exploration into how different types of data can be combined, which could lead to better forecasting tools in various fields like finance, healthcare, and environmental science.

Abstract

Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either fine-tune large language models (LLMs) or build large-scale time-series datasets to develop TSF foundation models. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. In this paper, we explore a new road to building a TSF foundation model from rich and high-quality natural images, based on the intrinsic similarities between images and time series. To bridge the gap between the two domains, we reformulate the TSF task as an image reconstruction task, which is further processed by a visual masked autoencoder (MAE) self-supervised pre-trained on the ImageNet dataset. Surprisingly, without further adaptation in the time-series domain, the proposed VisionTS could achieve superior zero-shot forecasting performance compared to existing TSF foundation models. With minimal fine-tuning, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. These findings suggest that visual models could be a free lunch for TSF and highlight the potential for future cross-domain research between computer vision and TSF. Our code is publicly available at https://github.com/Keytoyze/VisionTS.