DeepTime: Using Deep Time-Index Meta-Learning to Improve Non-Stationary Time-Series Forecasting

TL;DR: The performance of existing time-series forecasting methods can degrade due to non-stationarity, where the statistical distribution of time-series data changes over time. Our new DeepTime method overcomes non-stationarity issues by leveraging a “forecasting as meta-learning” framework on deep time-index models. DeepTime achieves competitive accuracy on the long-sequence time-series forecasting benchmark, while also being the most efficient model. The end result: faster, more accurate forecasts; state-of-the-art performance; and a highly efficient method that could lower the carbon footprint of leveraging forecasting models at scale within enterprises.


Background

Before diving into our main discussion, let’s review a few important concepts at the core of the work described in this blog post – which is, in a nutshell, about how to handle non-stationary data in time-series forecasting. (For a detailed look at time-series forecasting and its use cases, we encourage you to check out our previous post about ETSformer.)

Time-Series Forecasting

A time-series is a series of data measurements over time – a sequential collection of numerical data, typically collected over regular time intervals. Some examples of time-series include total sales of a particular product from an e-commerce platform for each day, or the CPU utilization of a server in a data center recorded every minute.
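
As a quick illustration, here is a tiny, made-up example of what such a regularly sampled series looks like in code (the numbers are invented purely for illustration):

```python
import pandas as pd

# A toy daily-sales series: one value per day, collected at regular intervals.
# The values are made up purely for illustration.
sales = pd.Series(
    [112, 118, 121, 130, 127, 135, 142],
    index=pd.date_range("2022-01-01", periods=7, freq="D"),
    name="daily_sales",
)
print(sales)
```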

The task of forecasting – predicting future values based on past values – is critical in many business and scientific applications. Forecasts of key indicators help us make important downstream decisions, such as how much inventory of a product to stock, or how to better allocate resources in a data center.

Advances in IT infrastructure have boosted our ability to collect ever larger volumes of such data – at higher sampling rates, and over longer periods of time – yielding extremely long time-series datasets. While the ability to collect more data is usually an upside, especially when it comes to machine learning (ML), we shall see that collecting time-series data from a dynamic system with many moving parts can pose some challenges when we try to apply ML techniques – and, in particular, forecasting.

Non-Stationarity: When a Time Series Changes Over Time

Our primary focus in the work presented in this post is the problem of non-stationarity in time-series forecasting. To understand what that means, consider these two opposite scenarios:

  • Stationarity refers to a time series whose values stay within a fixed range and whose statistical patterns remain regular. That is, the statistical properties of the data (such as the mean or variance) remain unchanged over time for a stationary time series.
  • Non-stationarity, in contrast, is a phenomenon where the statistical distribution of the time-series data does not stay fixed. In a non-stationary time series, the data values and their statistics shift over time – the mean, the variance, the standard deviation, any of these quantities may change, as illustrated by the sketch below.
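
To make the distinction concrete, here is a small illustrative sketch (not taken from our experiments) comparing a stationary series with one whose mean drifts upward over time:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Stationary: noise around a fixed mean -- its statistics do not change over time.
stationary = rng.normal(loc=10.0, scale=1.0, size=n)

# Non-stationary: the mean drifts upward, so early and late segments look different.
non_stationary = rng.normal(loc=10.0, scale=1.0, size=n) + np.linspace(0.0, 20.0, n)

for name, series in [("stationary", stationary), ("non-stationary", non_stationary)]:
    first_half, second_half = series[: n // 2], series[n // 2 :]
    print(f"{name:15s} mean: {first_half.mean():6.2f} -> {second_half.mean():6.2f}")
```

For the stationary series, the mean of the first and second halves is essentially identical; for the non-stationary series, it is not.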

Meta-Learning: Learning to Learn, Faster

ML models often require large amounts of training data to perform well, whereas humans tend to learn new ideas faster and more efficiently. For example, humans can learn quickly from a small number of examples; a child who has seen several pictures of birds and cats will quickly be able to differentiate between them.

Meta-learning is a technique that aims to achieve the kind of quick learning exhibited by humans, by using an inner and outer learning loop paradigm:

  • The inner learning loop learns very quickly from a small set of examples, called the support set.
  • The outer learning loop ensures that the inner loop can perform this fast adaptation on new support sets. It does this by evaluating the adapted model on a query set – a set containing examples that are similar to, but distinct from, those in the support set.

In other words, this approach learns an initial model that serves as a good starting point: start from this initialization, then quickly adapt to any new task. The sketch below illustrates the inner/outer loop pattern.
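
To make the inner/outer loop pattern concrete, here is a heavily simplified, generic meta-learning sketch. It is not DeepTime’s algorithm – the toy task generator and the linear model are hypothetical placeholders:

```python
import torch

torch.manual_seed(0)

def make_task():
    """Hypothetical task generator: a random linear function y = a*x + b,
    split into a small support set and a query set."""
    a, b = torch.randn(2)
    x = torch.randn(20, 1)
    y = a * x + b
    return (x[:10], y[:10]), (x[10:], y[10:])  # (support set, query set)

model = torch.nn.Linear(1, 1)                  # the shared "initial model"
outer_opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(100):                        # outer loop: learn a good initialization
    (xs, ys), (xq, yq) = make_task()

    # Inner loop: quickly adapt a copy of the model to the support set.
    fast = {k: v.clone() for k, v in model.named_parameters()}
    for _ in range(3):
        pred = xs @ fast["weight"].t() + fast["bias"]
        inner_loss = torch.nn.functional.mse_loss(pred, ys)
        grads = torch.autograd.grad(inner_loss, list(fast.values()), create_graph=True)
        fast = {k: v - 0.1 * g for (k, v), g in zip(fast.items(), grads)}

    # Outer loop: evaluate the adapted copy on the query set, then update
    # the shared initialization so that future adaptations work better.
    pred_q = xq @ fast["weight"].t() + fast["bias"]
    outer_loss = torch.nn.functional.mse_loss(pred_q, yq)
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
```

The inner loop adapts a copy of the model to each support set with a few gradient steps, while the outer loop updates the shared initialization based on how well the adapted copy performs on the query set.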

Time-Series Forecasting Methods: Historical-Value and Time-Index Models

Many existing time-series methods belong to the family of historical-value models. These are models that take as input past observations of the time-series of interest, and predict the future values of that time-series.

Some classical historical-value models include ETS (ExponenTial Smoothing), which says that forecasts are weighted averages of past observations, with recent observations weighted more heavily than older ones – and on the deep learning side, ETSformer, a forecasting method we introduced in a previous post that combines ideas from the classical ETS approach with the modern Transformer framework.
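
As a rough sketch of the exponential-smoothing idea (a minimal illustration, not the full ETS model family):

```python
def exponential_smoothing_forecast(values, alpha=0.5):
    """One-step-ahead forecast as a weighted average of past observations,
    with more recent observations weighted more heavily."""
    forecast = values[0]
    for y in values[1:]:
        forecast = alpha * y + (1 - alpha) * forecast
    return forecast

print(exponential_smoothing_forecast([100, 102, 98, 105, 110]))
```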

However, the class of methods we focus on in this post is time-index models. Rather than taking past observations as inputs, these models take as input a time-index feature (think minute-of-hour, day-of-week, etc.), and predict the value of the time-series at that time-index. Time-index models are trained on historical data, and perform forecasting by being queried at future time-indices.
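
The two families differ mainly in what the model consumes at prediction time. Below is a schematic sketch of the two interfaces, using hypothetical placeholder models purely for illustration:

```python
import numpy as np

def historical_value_forecast(past_values: np.ndarray, horizon: int) -> np.ndarray:
    """Historical-value models map past observations to future values.
    Here, a naive 'repeat the last value' rule stands in for a real model."""
    return np.repeat(past_values[-1], horizon)

def time_index_forecast(model, future_time_indices: np.ndarray) -> np.ndarray:
    """Time-index models map a time-index (e.g., a normalized timestamp) to a value,
    so forecasting means querying the fitted model at future time-indices."""
    return model(future_time_indices)

def toy_trend_model(t: np.ndarray) -> np.ndarray:
    """A toy time-index model that was (hypothetically) fit to a linear trend."""
    return 2.0 * t + 5.0

print(historical_value_forecast(np.array([10.0, 11.0, 12.0]), horizon=3))
print(time_index_forecast(toy_trend_model, np.arange(3, 6)))
```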

Some classical examples of time-index models include Prophet, an open-source forecasting tool specialized for business forecasting, and Gaussian processes.

Problem: Long Sequences = Non-stationarity = Poor Performance

While collecting large amounts of data is something that we typically seek to do in machine learning workflows, over time, the system which generates this data may undergo some change. For example, as a product becomes more popular, it may generate significantly higher daily sales compared to previous years. Or, the CPU utilization patterns of a particular server may change significantly if it was assigned a different application to run.

This phenomenon results in a non-stationary time-series – where the patterns of the collected data change over time. This poses a problem when we try to apply machine learning on top of such data, since these ML techniques work best with identically distributed data (where the patterns observed in the data remain the same).

As the system undergoes change, two data problems cause our models to degrade – covariate shift and conditional distribution shift (as shown in Figure 1). The majority of existing methods are historical-value models, which suffer from both issues.

Figure 1. An illustration of the covariate shift and conditional distribution shift problems. Across the three distinct phases, covariate shift is illustrated by the average level of the time-series shifting upwards, while conditional distribution shift is illustrated by the middle phase having a different pattern (an upward-sloping trend), whereas the first and last phases share the same horizontal trend pattern.

Covariate shift occurs when the statistics of the time-series values change. Imagine, for example, that the average daily sales of hand sanitizer were 100 during 2019, but during the pandemic in 2020, average daily sales shot up to 1,000! A model in use in this scenario would not know how to handle this, since it has never seen input values so large before.

Conditional distribution shift occurs when the process generating the data changes. Historical-value models attempt to predict future values based on past values. For example, before the pandemic, the daily sales of hand sanitizer were mostly static: if yesterday’s sales were 100, today’s sales would also be around 100. However, as the pandemic built up and people started to realize the importance of hand sanitizer, today’s sales could be twice those of yesterday! This is a conditional distribution shift, which a static model trained on old data is not able to account for.

Time for DeepTime: Our “Deep Time-Index Meta-Learning” Solution

To address the limitations of existing methods, we propose a new method for non-stationary time-series forecasting called DeepTime.

Our approach extends classical time-index models into the deep learning paradigm. With DeepTime, we are the first to show how deep time-index models can be used for time-series forecasting, addressing problems inherent in long sequences of time-series data.

DeepTime leverages a novel meta-learning formulation of the forecasting task to overcome the issue of neural networks being too expressive (which results in overfitting the data).

This formulation also enables DeepTime to overcome the two problems of covariate shift and conditional distribution shift, which plague existing historical-value models.

Deeper Dive

The key to our new approach is the introduction of a novel “forecasting as meta-learning” framework for deep time-index models, which achieves two important outcomes:

  • Enables deep time-index models to effectively learn the relationship between time-index and time-series values, directly from data
  • Overcomes the problems of covariate shift and conditional distribution shift to excel on non-stationary time-series forecasting.

How DeepTime Works: A Closer Look

While classical time-index methods manually specify the relationship between the time-index features and output values (e.g., linearly increasing over time, or even a periodic repeating pattern), we utilize deep time-index models, where we replace the pre-specified function with a deep neural network. This allows us to learn these relationships from data, rather than manually specifying them.
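
As a minimal sketch of this idea, the snippet below trains a small neural network to map a normalized time-index directly to a value, using plain supervised learning on historical data. This toy model is a hypothetical stand-in, not DeepTime’s actual architecture:

```python
import torch

class ToyTimeIndexModel(torch.nn.Module):
    """A toy deep time-index model: maps a (normalized) time-index t to a value y(t)."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(1, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1),
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        return self.net(t)

# Historical data: time-indices normalized to [0, 1] and their observed values.
t_hist = torch.linspace(0, 1, 100).unsqueeze(-1)
y_hist = torch.sin(6.28 * t_hist) + 0.1 * torch.randn_like(t_hist)

model = ToyTimeIndexModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(500):                       # plain supervised learning on history
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(t_hist), y_hist)
    loss.backward()
    opt.step()

# Forecast by querying the model at future time-indices (> 1).
with torch.no_grad():
    t_future = torch.linspace(1.0, 1.2, 20).unsqueeze(-1)
    print(model(t_future).squeeze())
```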

However, doing so naively leads to poor forecasts, as seen in Figure 2a. The reason: deep neural networks are too expressive (which leads to overfitting the data), and learning on historical data does not guarantee good forecasts.

We can overcome this problem by introducing a meta-learning formulation, which achieves better results (as shown in Figure 2b).

Figure 2. Two graphs that show ground truth (actual time-series values) and predictions made by a deep time-index model. Graph (a): A deep time-index model trained by simple supervised learning. Graph (b): A deep time-index model trained with a meta-learning formulation (our proposed approach). The region with “reconstruction” is the historical data used for training. Both methods manage to reconstruct the ground truth data with high accuracy. However, in the forecast region, the model trained with simple supervised learning performs poorly, whereas the model trained with meta-learning (our DeepTime approach) performs forecasting successfully.

Forecasting as Meta-Learning

Figure 3 gives an overview of the forecasting as meta-learning methodology:

  • Our proposed framework tackles non-stationarity via the locally stationary distribution assumption – that is, although the long sequence as a whole may be non-stationary, we can assume that nearby time steps still share the same patterns and follow the same distribution, with that distribution changing only slowly across time.
  • Thus, we can split a long time-series into segments (called tasks), each of which we assume to be stationary.
  • In each task, the time-series is again split into a lookback window (the historical data) and a forecast horizon (the values which we want to predict).
  • In our meta-learning framework, we treat the lookback window as the support set, and the forecast horizon as the query set. This means we want our model to quickly adapt to values in the lookback window, before extrapolating across the forecast horizon (see the sketch after this list).
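
Here is a small sketch of how a long series might be segmented into such tasks; the window lengths and the toy series are arbitrary choices for illustration:

```python
import numpy as np

def make_tasks(series: np.ndarray, lookback: int, horizon: int, stride: int):
    """Split a long time-series into tasks, each assumed to be locally stationary.
    Each task is a (support, query) pair: the lookback window and the forecast horizon."""
    tasks = []
    window = lookback + horizon
    for start in range(0, len(series) - window + 1, stride):
        segment = series[start : start + window]
        support = segment[:lookback]   # lookback window -> support set
        query = segment[lookback:]     # forecast horizon -> query set
        tasks.append((support, query))
    return tasks

# A toy non-stationary series: a sinusoid with an upward drift.
series = np.sin(np.linspace(0, 20, 500)) + np.linspace(0, 2, 500)
tasks = make_tasks(series, lookback=48, horizon=24, stride=72)
print(f"{len(tasks)} tasks, each with {tasks[0][0].size} support and {tasks[0][1].size} query points")
```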

Figure 3. An overview of DeepTime’s “forecasting as meta-learning” framework. Given a long time-series dataset (top), it is split into M tasks, each assumed to be locally stationary. Given a task, the lookback window (green points) is treated as the support set, which the model adapts to. The forecast horizon (blue points) is treated as the query set, which the model is evaluated on. The deep time-index model consists of the final layer, called the ridge regressor (green box), and the rest of the model (blue box), which is treated as a feature extractor.

Efficient Meta-Learning

Our deep time-index model is instantiated as a deep neural network, which takes time-index values as inputs and outputs the time-series value at each time-index. However, since deep neural networks have a large number of parameters to learn, performing meta-learning (which requires an inner and an outer learning loop) over the whole model can be very slow and memory intensive. To address this, we designed the model architecture so that only a small part of it needs to be adapted in the inner loop:

  • As seen in Figure 3, the deep time-index model is separated into two parts: the final layer (the ridge regressor) and the rest of the model (the feature extractor).
  • The key idea is to apply the inner-loop adaptation step of meta-learning only to the final layer (the ridge regressor). Because ridge regression has a closed-form solution, this adaptation can be computed exactly and efficiently during training (see the sketch below).
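
To see why this is efficient, note that the inner-loop adaptation becomes a single linear solve over the extracted features rather than an iterative optimization. Below is a simplified sketch with hypothetical shapes, not the actual implementation:

```python
import torch

def ridge_inner_loop(phi_support: torch.Tensor,
                     y_support: torch.Tensor,
                     phi_query: torch.Tensor,
                     lam: float = 1.0) -> torch.Tensor:
    """Closed-form inner-loop adaptation: fit the final (ridge regressor) layer on the
    support set, then predict on the query set.

    Solves  W = (Phi^T Phi + lam * I)^-1 Phi^T Y  exactly, in a single step."""
    d = phi_support.shape[-1]
    gram = phi_support.t() @ phi_support + lam * torch.eye(d)
    weights = torch.linalg.solve(gram, phi_support.t() @ y_support)  # exact, one step
    return phi_query @ weights

# Toy example: features produced by the (frozen) feature extractor, hypothetical shapes.
phi_support = torch.randn(48, 16)   # 48 lookback time steps, 16-dimensional features
y_support = torch.randn(48, 1)
phi_query = torch.randn(24, 16)     # 24 forecast-horizon time steps
print(ridge_inner_loop(phi_support, y_support, phi_query).shape)  # torch.Size([24, 1])
```

Because this linear solve is differentiable, the outer loop can backpropagate through it to train the feature extractor end to end.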

With this formulation, DeepTime is able to overcome the issues of covariate shift and conditional distribution shift which arise for historical-value models in non-stationary environments. DeepTime first sidesteps the problem of covariate shift, since it takes time-index features as inputs, rather than the time-series values. Next, using the idea of adapting to locally stationary distributions, meta-learning adapts to the conditional distribution of each task, resolving the problem of conditional distribution shift.

Results

Now that we have described the different components of DeepTime, and how it tackles the problem of non-stationary forecasting, let's see how it holds up in some experiments on both synthetic data and real-world data. Does this meta-learning formulation on deep time-index models really allow it to compete head to head with existing methods, and how does its efficiency compare?

Figure 4. Predictions of DeepTime on three unseen functions for each function class. The orange dotted line represents the split between the lookback window and forecast horizon.

On synthetic data, DeepTime is able to extrapolate on unseen functions, containing new patterns which it was not given access to in the training data. As visualized in Figure 4, DeepTime was trained on three families of sequences – linear patterns, cubic patterns, and sums of sinusoids. When presented with new functions it had not seen before (the region before the orange dotted line), it was able to extrapolate the ground-truth patterns accurately (the region after the orange dotted line)!

On six real-world time-series datasets across a range of application domains and forecast horizons, DeepTime achieves state-of-the-art performance on 20 out of 24 settings (based on the mean squared error metric)! DeepTime also proves to be highly efficient, beating all existing baselines in both memory and running-time cost.

See our research paper for a more detailed explanation of our empirical results, including a table that shows comparisons with several competing baselines.

Impacts: Why DeepTime Matters

DeepTime's use of ridge regression helps ensure that predicted values are closer to the actual values, and enables our framework to obtain an exact one-step solution rather than an approximate iterative solution. This is one of the computational impacts of DeepTime: it represents a better way to come up with solutions in the time-series forecasting domain. In the DeepTime framework, we can get exact estimates – the actual values (solution) of the problem; in other words, the problem is tractable. In contrast, most existing methods use an iterative approach that can only ensure estimated values are close to the actual values – the numerical solution they find is still only approximate, with no guarantee of obtaining the actual values (solution) of the problem.

In short, one of the primary benefits of DeepTime is that we now have a time-series forecasting method that is faster and more accurate than other methods, and ultimately more useful.

Turning to the economic and business impacts, enabling more accurate predictions means DeepTime can provide more accurate forecasts that lead to better downstream decisions, such as resource allocation (when used for sales forecasting) or data center planning.

In addition, our method’s superior efficiency over existing computationally-heavy deep learning methods could lower the carbon footprint of leveraging forecasting models in enterprises. In the age of information overload and Big Data, where enterprises are interested in forecasting hundreds of thousands to millions of time-series, large models that require more computation lead to magnified power consumption at such scale, compared to more efficient models.

The Bottom Line

  • Improvements in IT infrastructure have led to the collection of longer sequences of time-series data.
  • However, these long sequences of data are susceptible to non-stationarity – a scenario where the environment that generates the data undergoes some change, and the patterns in the data change across time. Non-stationarity poses a challenge for existing time-series forecasting methods, due to the covariate shift and conditional distribution shift problems.
  • With our new approach, DeepTime, we propose to solve this issue by leveraging deep time-index models and a meta-learning formulation of the forecasting task. Time-index models sidestep the problem of covariate shift by taking time-indices as inputs, and the meta-learning formulation adapts the model to the current locally stationary distribution.
  • DeepTime has achieved state-of-the-art performance across multiple real-world time-series datasets, and is highly efficient compared to many modern baselines.
  • One of the primary benefits of DeepTime is its ability to come up with faster, more accurate forecasts, which lead to better downstream decisions, such as resource allocation (when used for sales forecasting) or data center planning. Plus, our method’s superior efficiency over existing computationally-heavy deep learning methods could lower the carbon footprint of leveraging forecasting models at scale.
  • We have released our code to facilitate further research and industrial applications of DeepTime for time-series forecasting.

Explore More

Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post (see links below). Connect with us on social media and our website to get regular updates on this and other research projects.

About the Authors

Gerald Woo is a Ph.D. candidate in the Industrial Ph.D. Program at Singapore Management University and a researcher at Salesforce Research Asia. His research focuses on deep learning for time-series, including representation learning and forecasting.

Chenghao Liu is a Senior Applied Scientist at Salesforce Research Asia, working on AIOps research, including time series forecasting, anomaly detection, and causal machine learning.

Donald Rose is a Technical Writer at Salesforce AI Research, specializing in content creation and editing for multiple projects — including blog posts, video scripts, newsletters, media/PR material, social media, and writing workshops. His passions include helping researchers transform their work into publications geared towards a wider audience, leveraging existing content in multiple media modes, and writing think pieces about AI.

Glossary

  • Non-stationarity: Describes a system that has non-stationary time-series data.
  • Non-stationary: A characteristic of time-series data. A time-series is said to be non-stationary when its statistical distribution changes or shifts over time.
  • Locally stationary: An assumption made regarding long, non-stationary sequences. Contiguous subsequences are assumed to be stationary, meaning that the statistical distribution does not change when shifted within that subsequence. In other words, a time-series sequence may be globally non-stationary, yet locally stationary.
  • Expressivity: Refers to how the architectural properties of a neural network (depth, width, layer type) affect the resulting functions it can compute, and its ensuing performance. In other words, the term typically refers to what kinds of functions or data the model can fit. A neural network is more expressive compared to a simple linear model, but being too expressive is not desired because it leads to overfitting the data, which means the learned model will not be general enough (won't perform well on real-world data that wasn't seen during training).
  • Ridge regression: A model tuning method, which helps ensure that a model's predicted values are closer to the actual values. It enables our framework to obtain exact estimation with a one-step solution rather than an approximate estimation with an iterative solution. The end result: a framework that makes faster and more accurate predictions. Note: Exact estimate means we can get the actual values (solution) of the problem. In other words, the problem is tractable. Approximate estimate means there is no guarantee of obtaining the actual values (solution) of a problem. Most existing methods use an iterative approach that can only ensure estimated values are close to the actual values, but the numerical solution it finds is still only approximate.