TL;DR: The performance of existing time-series forecasting methods can degrade due to non-stationarity, where the statistical distribution of time-series data changes over time. Our new DeepTime method overcomes non-stationarity issues by leveraging a “forecasting as meta-learning” framework on deep time-index models. DeepTime achieves competitive accuracy on the long-sequence time-series forecasting benchmark, while also being the most efficient model. The end result: faster, more accurate forecasts; state-of-the-art performance; and a highly efficient method that could lower the carbon footprint of leveraging forecasting models at scale within enterprises.
Before diving into our main discussion, let’s review a few important concepts at the core of the work described in this blog post – which is, in a nutshell, about how to handle non-stationary data in time-series forecasting. (For a detailed look at time-series forecasting and its use cases, we encourage you to check out our previous post about ETSformer.)
A time-series is a series of data measurements over time – a sequential collection of numerical data, typically collected over regular time intervals. Some examples of time-series include total sales of a particular product from an e-commerce platform for each day, or the CPU utilization of a server in a data center recorded every minute.
Forecasting – predicting future values based on past values – is a critical task in many business and scientific applications. It allows us to predict future values of key indicators, which in turn helps us make important downstream decisions, such as how much inventory of a product to store, or how to better allocate resources in a data center.
As IT infrastructure has become more advanced, our ability to collect ever larger volumes of such data has grown – at higher sampling rates, and over longer periods of time, yielding extremely long time-series datasets. While the ability to collect more data is usually an upside, especially when it comes to machine learning (ML), we shall see that collecting time-series data from a dynamic system with many moving parts can pose some challenges when we try to apply ML techniques – and, in particular, forecasting.
Our primary focus in the work presented in this post is the problem of non-stationarity in time-series forecasting. To understand what that means, consider two opposite scenarios: in a stationary time-series, the statistical properties of the data stay the same over time, whereas in a non-stationary time-series, those properties – such as the average level or the underlying patterns – change as time goes on.
ML models often require large amounts of training data to perform well, whereas humans tend to learn new ideas faster and more efficiently. For example, humans can learn quickly from a small number of examples; a child who has seen several pictures of birds and cats will quickly be able to differentiate between them.
Meta-learning is a technique that aims to achieve the kind of quick learning exhibited by humans, by using an inner and outer learning loop paradigm:
The outer loop learns an initial model that serves as a good starting point; the inner loop then starts from this initialization and quickly adapts it to each new task.
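To make the inner/outer loop idea concrete, here is a minimal Reptile-style sketch on toy sine-wave regression tasks. It is a generic illustration of meta-learning – the tasks, architecture, and hyperparameters are made up – and is not DeepTime's actual training procedure:

```python
import copy
import torch

# Generic inner/outer loop meta-learning sketch (Reptile-style), illustrative only.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

def sample_task():
    """Each task is a sine wave with its own randomly drawn amplitude and phase."""
    amp, phase = 0.5 + torch.rand(1), torch.rand(1) * 3.14
    x = torch.linspace(-5, 5, 20).unsqueeze(1)
    return x, amp * torch.sin(x + phase)

meta_lr, inner_lr = 0.1, 0.01
for step in range(200):                       # outer loop: improve the shared initialization
    x, y = sample_task()
    learner = copy.deepcopy(model)            # start this task from the shared initialization
    inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
    for _ in range(5):                        # inner loop: adapt quickly to the current task
        inner_opt.zero_grad()
        torch.nn.functional.mse_loss(learner(x), y).backward()
        inner_opt.step()
    with torch.no_grad():                     # outer update: nudge the initialization
        for p, q in zip(model.parameters(), learner.parameters()):
            p += meta_lr * (q - p)            # toward this task's adapted weights
```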
Many existing time-series methods belong to the family of historical-value models. These are models that take as input past observations of the time-series of interest, and predict the future values of that time-series.
Some classical historical-value models include ETS (ExponenTial Smoothing), in which forecasts are weighted averages of past observations, with recent observations weighted more heavily than older ones – and, on the deep learning side, ETSformer, a forecasting method we introduced in a previous post that combines ideas from the classical ETS approach with the modern Transformer framework.
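As a deliberately minimal illustration of that weighting idea – a simple exponential smoothing sketch rather than the full ETS model, with an arbitrary smoothing parameter `alpha` – the forecast is an exponentially weighted average of past observations:

```python
import numpy as np

def ses_forecast(y, alpha=0.3):
    """One-step-ahead forecast via simple exponential smoothing: an exponentially
    weighted average of past observations, with recent values weighted more heavily."""
    level = y[0]
    for value in y[1:]:
        level = alpha * value + (1 - alpha) * level   # blend the new observation into the level
    return level

daily_sales = np.array([100, 102, 98, 101, 103, 99, 104], dtype=float)
print(ses_forecast(daily_sales))   # forecast for the next day
```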
However, the class of methods we will focus on in this post is time-index models. Rather than taking past observations as inputs, these models take as input a time-index feature (think minute-of-hour, day-of-week, etc.), and predict the value of the time-series at that time index. Time-index models are trained on historical data, and perform forecasting by being queried at future time indices.
Some classical examples of time-index models include Prophet, an open-source forecasting tool specialized for business forecasting, and Gaussian processes.
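To make the time-index idea concrete, here is a toy sketch (with made-up data and hand-picked features – real tools such as Prophet are far more sophisticated): a simple function of the time index is fit on historical data, and forecasting amounts to querying that function at future time indices.

```python
import numpy as np

# Toy time-index model: learn y = f(time_index) from history, forecast by querying f.
t_train = np.arange(100)
y_train = 0.5 * t_train + 10 * np.sin(2 * np.pi * t_train / 24)   # trend + daily cycle

def time_features(t):
    """Features built purely from the time index: intercept, trend, daily periodicity."""
    return np.column_stack([np.ones_like(t, dtype=float), t,
                            np.sin(2 * np.pi * t / 24), np.cos(2 * np.pi * t / 24)])

coef, *_ = np.linalg.lstsq(time_features(t_train), y_train, rcond=None)   # fit on history

t_future = np.arange(100, 124)               # the next 24 time indices
y_forecast = time_features(t_future) @ coef  # forecast = query the model at future indices
```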
While collecting large amounts of data is something that we typically seek to do in machine learning workflows, over time, the system which generates this data may undergo some change. For example, as a product becomes more popular, it may generate significantly higher daily sales compared to previous years. Or, the CPU utilization patterns of a particular server may change significantly if it was assigned a different application to run.
This phenomenon results in a non-stationary time-series – where the patterns of the collected data change over time. This poses a problem when we try to apply machine learning on top of such data, since these ML techniques work best with identically distributed data (where the patterns observed in the data remain the same).
As the system undergoes change, two problems with the data cause our models' performance to degrade – covariate shift and conditional distribution shift (as shown in Figure 1). The majority of existing methods are historical-value models, which suffer from these issues.
Figure 1. An illustration of the covariate shift and conditional shift problems. In the three distinct phases, covariate shift is illustrated by the average levels of the time-series shifting upwards, while conditional distribution shift is illustrated by the middle phase having a different pattern (an upward sloping trend), while the first and last phases have the same horizontal trend pattern.
Covariate shift occurs when the statistics of the time-series values change. Imagine, for example, that the average daily sales of hand sanitizer was 100 during 2019, but during the pandemic in 2020, average daily sales shot up to 1,000! A model being used in this scenario would not know how to handle this, since it has never seen input values so large before.
Conditional distribution shift occurs when the process generating the data changes. Historical-value models attempt to predict future values based on past values. For example, before the pandemic, the daily sales of hand sanitizer were mostly static – if yesterday's sales were 100, today's sales would also be around 100. However, as the pandemic was building up and people started to realize the importance of hand sanitizer, today's sales could be twice those of yesterday! This is a conditional distribution shift, which a static model trained on old data is not able to account for.
To address the limitations of existing methods, we propose a new method for non-stationary time-series forecasting called DeepTime.
Our approach extends classical time-index models into the deep learning paradigm. With DeepTime, we are the first to show how deep time-index models can be used for time-series forecasting, addressing problems inherent in long sequences of time-series data.
DeepTime leverages a novel meta-learning formulation of the forecasting task to overcome the issue of neural networks being too expressive (which results in overfitting the data).
This formulation also enables DeepTime to overcome the two problems of covariate shift and conditional distribution shift, which plague existing historical-value models.
The key to our new approach is the introduction of a novel “forecasting as meta-learning” framework for deep time-index models, which achieves two important outcomes, described next.
While classical time-index methods manually specify the relationship between the time-index features and output values (e.g., linearly increasing over time, or even a periodic repeating pattern), we utilize deep time-index models, where we replace the pre-specified function with a deep neural network. This allows us to learn these relationships from data, rather than manually specifying them.
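As a rough sketch of what this looks like in code (layer sizes and input normalization are illustrative; the exact DeepTime architecture is described in our paper), the pre-specified function is simply replaced by an MLP that maps a time index to a value:

```python
import torch

# Rough sketch of a deep time-index model: an MLP mapping a (normalized) time index
# to the time-series value at that index. Layer sizes here are illustrative only.
class TimeIndexMLP(torch.nn.Module):
    def __init__(self, hidden_dim=256, output_dim=1):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(1, hidden_dim), torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, hidden_dim), torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, t):          # t: (num_time_steps, 1) normalized time indices
        return self.net(t)

model = TimeIndexMLP()
t = torch.linspace(0, 1, 96).unsqueeze(-1)   # query the model over a window of time indices
y_hat = model(t)                             # predicted values at those indices
```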
However, doing so naively leads to poor forecasts, as seen in Figure 2a. The reason: deep neural networks are too expressive (which leads to overfitting the data), and learning on historical data does not guarantee good forecasts.
We can overcome this problem by introducing a meta-learning formulation, which achieves better results (as shown in Figure 2b).
Figure 2. Two graphs that show ground truth (actual time-series values) and predictions made by a deep time-index model. Graph (a): A deep time-index model trained by simple supervised learning. Graph (b): A deep time-index model trained with a meta-learning formulation (our proposed approach). The region with “reconstruction” is the historical data used for training. Both methods manage to reconstruct the ground truth data with high accuracy. However, in the forecast region, the model trained with simple supervised learning performs poorly, whereas the model trained with meta-learning (our DeepTime approach) performs forecasting successfully.
Figure 3 gives an overview of the forecasting as meta-learning methodology:
Figure 3. An overview of DeepTime’s “forecasting as meta-learning” framework. Given a long time-series dataset (top), it is split into M tasks, each assumed to be locally stationary. Given a task, the lookback window (green points) is treated as the support set, which the model adapts to. The forecast horizon (blue points) is treated as the query set, which the model is evaluated on. The deep time-index model consists of the final layer, called the ridge regressor (green box), and the rest of it (blue box) which is treated as a feature extractor.
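The task construction itself is simple. The following sketch, with hypothetical window lengths, shows how a long series can be split into tasks, each pairing a lookback window (support set) with a forecast horizon (query set):

```python
import numpy as np

def make_tasks(series, lookback=96, horizon=24):
    """Split a long series into non-overlapping tasks (window lengths are hypothetical).
    Each task pairs a lookback window (support set) with a forecast horizon (query set)."""
    tasks, window = [], lookback + horizon
    for start in range(0, len(series) - window + 1, window):
        support = series[start : start + lookback]          # the model adapts to this
        query = series[start + lookback : start + window]   # the model is evaluated on this
        tasks.append((support, query))
    return tasks

series = np.random.randn(1200).cumsum()   # a toy non-stationary series
print(len(series), "observations ->", len(make_tasks(series)), "tasks")
```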
Our deep time-index model is instantiated as a deep neural network, which takes time-index values as inputs, and outputs the time-series value at that time index. However, since deep neural networks are models with a large number of parameters to learn, performing meta-learning (which requires an inner and outer learning loop) on the whole model can be very slow and memory intensive. To address this, we've come up with a model architecture that truncates the meta-learning process: only the final layer of the network – the ridge regressor – is adapted to each task in the inner loop, and this adaptation has a fast closed-form solution; the rest of the network acts as a feature extractor whose parameters are shared across tasks and learned in the outer loop.
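Here is a minimal sketch of that closed-form inner loop (shapes, dummy inputs, and the regularization strength are illustrative): given features extracted from the lookback-window and forecast-horizon time indices, the ridge regressor's weights are obtained in a single step, and because that step is differentiable, outer-loop gradients can still flow back into the feature extractor:

```python
import torch

def ridge_inner_loop(phi_support, y_support, phi_query, lam=1.0):
    """Closed-form ridge-regression inner loop (shapes and lam are illustrative).
    phi_support: (L, d) features of lookback-window time indices
    y_support:   (L, 1) observed values over the lookback window
    phi_query:   (H, d) features of forecast-horizon time indices"""
    d = phi_support.shape[-1]
    # One-step solution: W = (Phi^T Phi + lam * I)^(-1) Phi^T Y
    gram = phi_support.T @ phi_support + lam * torch.eye(d)
    w = torch.linalg.solve(gram, phi_support.T @ y_support)
    # The solve is differentiable, so outer-loop gradients reach the feature extractor.
    return phi_query @ w   # forecasts over the horizon

phi_support, y_support = torch.randn(96, 32), torch.randn(96, 1)   # dummy features/values
phi_query = torch.randn(24, 32)
forecast = ridge_inner_loop(phi_support, y_support, phi_query)     # shape (24, 1)
```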
With this formulation, DeepTime is able to overcome the issues of covariate shift and conditional distribution shift which arise for historical-value models in non-stationary environments. DeepTime first sidesteps the problem of covariate shift, since it takes time-index features as inputs, rather than the time-series values. Next, using the idea of adapting to locally stationary distributions, meta-learning adapts to the conditional distribution of each task, resolving the problem of conditional distribution shift.
Now that we have described the different components of DeepTime, and how it tackles the problem of non-stationary forecasting, let's see how it holds up in some experiments on both synthetic data and real-world data. Does this meta-learning formulation on deep time-index models really allow it to compete head to head with existing methods, and how does its efficiency compare?
Figure 4. Predictions of DeepTime on three unseen functions for each function class. The orange dotted line represents the split between the lookback window and forecast horizon.
On synthetic data, DeepTime is able to extrapolate to unseen functions containing new patterns it was not given access to in the training data. As visualized in Figure 4, DeepTime was trained on three families of functions – linear patterns, cubic patterns, and sums of sinusoids. When presented with new functions it had not seen before, it adapted to the lookback window (before the orange dotted line) and extrapolated the ground-truth patterns accurately (after the orange dotted line)!
On six real-world time-series datasets across a range of application domains and different forecast horizons, DeepTime achieves state-of-the-art performance on 20 out of 24 settings (based on the mean squared error metric)! DeepTime also proves to be highly efficient, beating all existing baselines in both memory usage and running time.
See our research paper for a more detailed explanation of our empirical results, including a table that shows comparisons with several competing baselines.
DeepTime's use of ridge regression helps ensure that predicted values are closer to the actual values, and enables our framework to obtain an exact one-step solution rather than an approximate iterative one. This is one of the computational impacts of DeepTime: the inner-loop problem is tractable, so we can compute the exact solution in closed form. In contrast, most existing methods rely on iterative optimization, which only guarantees that the estimates are close to the actual solution – there is no guarantee of obtaining it exactly.
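For readers who want the formula behind this claim, the inner-loop ridge-regression step has the standard closed-form solution (using the notation of the sketch above: $\Phi$ for features of the lookback-window time indices, $Y$ for the observed values, and $\lambda$ for the regularization strength):

$$W^{*} = (\Phi^{\top}\Phi + \lambda I)^{-1}\Phi^{\top}Y$$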
In short, one of the primary benefits of DeepTime is that we now have a time-series forecasting method that is faster and more accurate than other methods, and ultimately more useful.
Turning to the economic and business impacts, DeepTime's more accurate forecasts lead to better downstream decisions, such as resource allocation (when used for sales forecasting) or data center planning.
In addition, our method's superior efficiency over existing computationally heavy deep learning methods could lower the carbon footprint of leveraging forecasting models in enterprises. In the age of information overload and Big Data, where enterprises are interested in forecasting hundreds of thousands to millions of time-series, large models that require more computation lead to magnified power consumption at such scale compared to more efficient models.
Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post (see links below). Connect with us on social media and our website to get regular updates on this and other research projects.
Gerald Woo is a Ph.D. candidate in the Industrial Ph.D. Program at Singapore Management University and a researcher at Salesforce Research Asia. His research focuses on deep learning for time-series, including representation learning and forecasting.
Chenghao Liu is a Senior Applied Scientist at Salesforce Research Asia, working on AIOps research, including time series forecasting, anomaly detection, and causal machine learning.
Donald Rose is a Technical Writer at Salesforce AI Research, specializing in content creation and editing for multiple projects — including blog posts, video scripts, newsletters, media/PR material, social media, and writing workshops. His passions include helping researchers transform their work into publications geared towards a wider audience, leveraging existing content in multiple media modes, and writing think pieces about AI.