FSNet Learns Deep Time-Series Forecasting Models On the Fly, Adapts to Nonstationary Environments


AUTHORS: Chenghao Liu, Quang Pham, Doyen Sahoo, Donald Rose

TL;DR: Nonstationary data, which changes its statistical properties over time, can make time series forecasting difficult. Despite the recent success of deep learning techniques for time series forecasting tasks, these methods are not scalable for applications where data arrives sequentially in a stream. We developed a new method for deep time-series forecasting called FSNet (Fast and Slow Learning Network), which can learn deep forecasting models on the fly in a nonstationary environment, and can successfully handle concept drift issues arising from the dynamics of such an environment. Empirical studies on real and synthetic datasets validate FSNet’s efficacy and robustness.


Before diving into our main discussion, let’s review the important concepts that are at the core of our work. In this post, our main focus is on the problem of online and deep time-series forecasting. (For a more detailed exposition on time-series forecasting and its use cases, please check out our previous post on ETSformer.)

Deep Time-Series Forecasting

A time series is a sequence of observations recorded over time. Time series forecasting, the task of predicting future values given historical records, plays a key role in many real-life problems, such as weather forecasting, energy consumption, system tracking and monitoring, and more.
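To make "predicting future values given historical records" concrete, forecasting is typically cast as supervised learning by slicing the series into lookback windows and the future values that follow them. Here is a minimal sketch (the window sizes and values are illustrative):

```python
# Slice a 1-D series into (lookback window -> future horizon) pairs,
# the standard supervised framing of time series forecasting.

def make_windows(series, lookback, horizon):
    """Return (input window, target window) pairs from a series."""
    pairs = []
    for t in range(len(series) - lookback - horizon + 1):
        x = series[t : t + lookback]                        # historical records
        y = series[t + lookback : t + lookback + horizon]   # future values
        pairs.append((x, y))
    return pairs

series = [10, 12, 13, 15, 14, 16, 18, 17]
pairs = make_windows(series, lookback=4, horizon=2)
print(pairs[0])  # ([10, 12, 13, 15], [14, 16])
```

A forecasting model then learns the mapping from each input window to its target window.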

Traditional methods, built on domain expertise, provide one means of learning temporal patterns in a data-driven way; deep learning (DL) is now also being applied in this area. With increasing data availability and computational resources, we have recently witnessed notable achievements in leveraging DL techniques for time series forecasting tasks, because DL provides some advantages. Compared to traditional forecasting methods, DL models alleviate the need for manual feature engineering and model design, and can learn hierarchical representations and more complex dependencies.

A Tale of Two Learning Methods: Offline (Batch) vs. Online (Incremental)

In many real-world applications, live time series data grows and evolves rapidly. This requires the forecasting model to update itself in a timely manner to avoid the concept drift issue.

However, deep learning models follow the traditional batch learning paradigm, which requires re-training on the entire dataset whenever new training samples arrive. This is a major issue: such an inefficient approach is neither scalable nor practical for learning from continuous data streams.

Figure 1. An overview of the online learning framework. Instead of re-training from scratch at every time step where we receive new data points, online learning frameworks are designed to continuously update the model in an incremental way.

Unlike traditional offline learning paradigms, online learning is designed to learn models incrementally from data that arrives sequentially. Models can be updated instantly and efficiently via the online learner when new training data arrives, overcoming the drawbacks of traditional batch learning.
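As a toy illustration of the online paradigm (not FSNet's model), here is a one-parameter linear forecaster, y_hat = w * x, updated with a single gradient step the moment each new (x, y) pair arrives:

```python
# Online learning sketch: the model is updated incrementally per sample,
# with no re-training from scratch at any step.

def online_forecaster(stream, lr=0.01):
    """Consume a stream of (x, y) pairs, updating w after each one."""
    w = 0.0
    squared_errors = []
    for x, y in stream:
        y_hat = w * x                  # predict before the label is seen
        err = y_hat - y
        squared_errors.append(err ** 2)
        w -= lr * err * x              # single-sample gradient step
    return w, squared_errors

# Stream where y = 2 * x: the weight drifts toward 2 as data arrives.
stream = [(x, 2.0 * x) for x in [1.0, 2.0, 1.5, 3.0, 2.5] * 40]
w, squared_errors = online_forecaster(stream)
print(round(w, 2), squared_errors[-1] < squared_errors[0])
```

Each update touches only the newest sample, which is what makes the approach scalable on streams; the challenge discussed next is that deep networks converge slowly under exactly this one-sample-at-a-time regime.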

For example, in our cloud monitoring system, the forecasting model predicts CPU and memory usage for the next 24 hours. Such predictions can help decision makers dynamically allocate cloud resources in advance, to ensure high availability to customers while reducing the operational cost. If we observe new customer behaviors, the deployed forecasting model is inevitably required to adapt to this changing environment. Fortunately, with the help of online learning, the model can automatically and efficiently adapt to this new change – without the high cost (in both time and space) of offline re-training.

A Good Online Deep Time-Series Forecasting Model is Hard to Find: Challenges of Learning Deep Forecasters On the Fly

Now that we know the benefit of online learning for time series forecasting, you may wonder: can we make small changes to the optimizer of deep forecasting models to support online updates?

The answer is not so simple. We think training deep forecasters online remains challenging for two major reasons:

  • Slow convergence, hard to handle concept drift: First, naively training deep neural networks on data streams converges slowly, because offline training benefits such as mini-batches and multi-epoch training are unavailable. In addition, when concept drift happens, such a cumbersome model requires many more training samples to learn the new concepts. Overall, while deep neural networks possess powerful representation learning capabilities, they lack a mechanism to facilitate successful learning on data streams.
  • Inefficient learning of recurring patterns: Second, time series data often exhibit recurrent patterns where one pattern could become inactive and then re-emerge in the future. Since deep networks suffer from the catastrophic forgetting phenomenon, they cannot retain prior knowledge, resulting in inefficient learning of recurring patterns, further hindering the overall performance.

The upshot: online time-series forecasting with deep models presents a promising yet challenging problem. Can these challenges be overcome? Read on to find out (hint: yes!).

Our New Approach: FSNet (Fast and Slow Learning Network)

To address the above limitations, we developed FSNet (Fast and Slow Learning Network) - a new approach designed to forecast time series on the fly and handle nonstationary time series data.

FSNet in a Nutshell

Here are some of the main features and contributions of our FSNet framework:

  • Forecasts on the fly: Can handle streaming data and can forecast based on live time-series information
  • Adapts to both changing and repeating patterns: Augments a standard deep neural network with fast-adaptation capabilities, enabling it to deal simultaneously with both abruptly changing and repeating patterns in time series
  • Can hone its backbone: Improves on the slowly-learned backbone by dynamically balancing fast adaptation to recent changes and retrieving similar old knowledge. FSNet achieves this mechanism via an interaction between two complementary components: an adapter to monitor each layer's contribution to the loss, and an associative memory to support remembering, updating, and recalling repeating events.
  • Overcomes catastrophic forgetting: Deep neural networks tend to completely and abruptly forget previously learned information upon learning new information; this is not the case with FSNet. Thanks to its associative memory, FSNet always has previously learned knowledge to draw from.

Inspired by Continual Learning

A key innovation in our approach is to reformulate online time series forecasting as an online, task-free, continual learning problem. Continual learning aims to balance the following two objectives:

  • Utilize past knowledge to facilitate fast learning of current patterns
  • Maintain and update already acquired knowledge.

We found that these two objectives closely match the aforementioned challenges of online forecasting with deep models, so we developed an efficient online time series forecasting framework inspired by the Complementary Learning Systems (CLS) theory, a neuroscience framework for continual learning. CLS theory suggests that humans can learn continually thanks to the interactions between the hippocampus and the neocortex: the hippocampus rapidly encodes specific experiences, and interacts with the neocortex to consolidate, recall, and update those experiences into a more general representation, which supports generalization to new experiences.

Motivated by this fast-and-slow learning in humans, FSNet brings the CLS principle to machine learning – enhancing deep neural networks with complementary components that support fast learning and adaptation for online time-series forecasting.

Deeper Dive: How FSNet Works

Key Elements: Adapter + Memory

Our new framework employs two important elements that warrant special attention:

  • A per-layer adapter models the temporal information between consecutive samples, which allows each intermediate layer to adjust itself more efficiently with limited data samples, especially when concept drift happens
  • An associative memory stores important, recurring patterns observed during training; when encountering repeating events, the adapter interacts with its memory to retrieve and update the previous actions to facilitate fast learning of such patterns.

Consequently, the adapter can model changes in temporal patterns to facilitate learning under concept drift, while its interactions with the associative memory allow the model to quickly remember and keep improving its learning of recurring patterns.

Note that FSNet does not explicitly detect concept drifts; instead, it always improves the learning of the current samples – whether they come from a fixed distribution, a gradually changing one, or even an abruptly changing one.

Component Design Overview

Figure 2 gives an overview of FSNet’s components. FSNet achieves fast adaptation to abrupt changes through its per-layer adapter, and facilitates learning of recurring patterns via a sparse associative memory interaction.

Figure 2. An overview of FSNet. (a) A standard TCN backbone (green) with (b) dilated convolution stacks (blue). (c) A stack of convolution filters (yellow). Each convolution filter in FSNet is equipped with an adapter and associative memory to facilitate fast adaptation to both old and new patterns by monitoring the backbone's gradient EMA.

Fast and Slow Learning

Let’s consider the online learning setting when FSNet encounters new data points.

Slow learning refers to the standard weight updates of the neural network, indicated by the convolution filters in Figure 2(c) and the arrow on the right side. As discussed earlier, standard neural networks converge slowly when updated with only one sample at a time (the online streaming-data scenario).

Fast learning refers to the whole adapter-plus-memory module, shown in Figure 2(c) with the blue arrows on the left side, which directly generates the update rule for the base model’s parameters.

Fast Adaptation Mechanism

Recent works have demonstrated a shallow-to-deep principle, whereby shallower networks can adapt quickly to changes in data streams and learn more efficiently with limited data. It is therefore beneficial in such scenarios to learn with a shallow network first and then gradually increase its depth.

Motivated by this, we propose to monitor and modify each layer independently to learn the current loss better. Specifically, we implement an adapter that maps each layer's recent gradients to a smaller, more compact set of transformation parameters used to adapt that layer of the deep network.

In online training, because of the noise and nonstationarity of time series data, the gradient of a single sample can fluctuate strongly and introduce noise into the adaptation coefficients. We therefore use an Exponential Moving Average (EMA) of the backbone's gradients to smooth out the noise of online training and to capture the temporal information in the time series.
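As a small illustration, the EMA update blends the running gradient average with each new single-sample gradient; the smoothing coefficient and the gradient values below are illustrative, not FSNet's tuned settings:

```python
# Exponential moving average of per-sample gradients: noisy spikes are
# damped, leaving a smoother signal for the adapter to consume.

def ema_update(ema, grad, gamma=0.9):
    """Blend the running average with the newest single-sample gradient."""
    return [gamma * e + (1.0 - gamma) * g for e, g in zip(ema, grad)]

ema = [0.0, 0.0]
# An alternating, fluctuating gradient stream for a 2-parameter layer.
for grad in [[4.0, -2.0], [0.0, 0.0], [4.0, -2.0], [0.0, 0.0]]:
    ema = ema_update(ema, grad)
print([round(v, 3) for v in ema])
```

The smoothed values change far less between steps than the raw gradients do, which is what stabilizes the adaptation coefficients derived from them.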

Remembering Recurring Events with an Associative Memory

In time series, old patterns may reappear, and it is imperative to leverage our past actions to improve learning outcomes. We think it is necessary to learn repeating events to further support fast adaptations.

In FSNet, we use meta information to represent how we adapted to a particular pattern in the past; storing and retrieving the appropriate meta information can facilitate learning the corresponding pattern when it reappears in the future.

In particular, we implement an associative memory to store the meta information for the adaptation of repeating events encountered during learning. Since interacting with the memory at every step is expensive and susceptible to noise, we propose to trigger this interaction only when a substantial representation change happens. When a memory interaction is triggered, the adapter queries and retrieves the most similar transformations in the past via an attention read operation.
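A minimal sketch of such an attention read is shown below. A dense softmax over all memory slots is used here for brevity, whereas FSNet's interaction is sparse, and the stored coefficients are illustrative:

```python
import math

# Attention read over an associative memory: score each stored
# adaptation vector against the current query by dot product, softmax
# the scores, and return the attention-weighted blend.

def attention_read(query, memory):
    scores = [sum(q * m for q, m in zip(query, row)) for row in memory]
    mx = max(scores)                       # subtract max for stability
    weights = [math.exp(s - mx) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Rows most similar to the query dominate the retrieved transformation.
    return [sum(w * row[i] for w, row in zip(weights, memory))
            for i in range(len(query))]

memory = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # past adaptation coefficients
retrieved = attention_read([1.0, 0.1], memory)
print([round(v, 2) for v in retrieved])
```

Because the read is a weighted blend, the retrieved transformation leans toward the past adaptations that most resemble the current one, which is exactly what helps when an old pattern re-emerges.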

Empirical Results

Now that we have described each component of FSNet, let’s see how it holds up in some experiments on both synthetic data and real-world data for online time-series forecasting.

We are happy to report that FSNet achieves significant improvements over typical baselines on both synthetic and real-world datasets. It can handle various types of concept drift, and achieves faster and better convergence.

Figure 3. Evolution of the cumulative loss during training (smaller is better).

Figure 3 provides some details, showing the convergence behavior of the methods considered. Interestingly, the sharp peaks in the loss curves suggest that concept drift occurs in most datasets. Moreover, such drifts appear at the early stage of learning, mostly within the first 40% of the data, while the remaining data are quite stationary.

This result suggests that traditional batch training is often evaluated too optimistically, since the model is tested only on the last data segment. We observed promising results from FSNet on most datasets, with significant improvements over the baselines.

In addition, we find the ECL and Traffic datasets more challenging, since they contain missing values and their values can vary significantly within and across dimensions. This sheds light on the remaining challenges of online time-series forecasting; addressing them could further improve performance.

Potentially Big Impacts: Why FSNet’s Method Matters

Combining online learning and deep learning into a new method is a promising yet challenging direction for time series forecasting. FSNet augments a neural network backbone with two key components:

  • An adapter for adapting to the recent changes; and
  • An associative memory to handle recurrent patterns.

So, FSNet should have a positive impact on the field of online deep time-series forecasting:

  • Overcomes the deep network limitation of slow convergence on data streams, which arises because offline training strategies such as mini-batches and multi-epoch training no longer apply. This issue becomes more severe when concept drift happens, since a cumbersome model would require many more training samples to learn the new concepts.
  • Overcomes the deep network limitation of catastrophic forgetting, caused by not retaining prior knowledge, which results in inefficient learning of recurring patterns. In contrast, FSNet does store previous experience (like us humans, it has a memory), which enables the model to learn recurring patterns efficiently.

Taking a step back to consider the “Big Picture”, it is possible that, as important as improving time series forecasting is, the FSNet research may ultimately have an even wider impact – and not just in one but in two fields of science:

  • Machine Learning: The FSNet approach of combining an associative memory and an adapter with deep neural networks could be a model for how future ML systems are built. For example, why let your deep network suffer from catastrophic forgetting when adding associative memory can reduce or eliminate this issue? If similar performance gains are observed in future DL models designed with FSNet’s deep-Net + memory + adapter architecture, perhaps this may herald a paradigm shift in deep learning model design. Associative memory might become standard in DL models, if it proves to overcome other limitations of DL, as it did for time series forecasting.
  • Human Learning: We noted how the design of FSNet was inspired by the Complementary Learning Systems (CLS) theory, a neuroscience framework for continual learning in humans. Perhaps the FSNet research may lead to inspiration and insights in the other direction, helping to improve models of how humans learn.

The Bottom Line

  • Recently, deep learning has achieved great success in time series forecasting tasks. Concept drift in real-life time series data, along with the need to scale to ever-growing data streams, has motivated the study of online deep time-series forecasting methods.
  • However, training deep forecasting methods online is challenging, since training deep neural networks on data streams converges slowly and deep neural networks suffer from the catastrophic forgetting phenomenon.
  • Our goal is to bring the online learning paradigm to deep neural networks for time series forecasting. To achieve this, we developed FSNet, which innovatively formulates online time-series forecasting as an online, task-free continual learning problem. We borrow the idea of continual learning in humans – addressing catastrophic forgetting with an associative memory, and consolidating old and new knowledge in the spirit of CLS. FSNet addresses fast adaptation to abrupt changes via its per-layer adapter, and facilitates learning of recurring patterns via a sparse associative memory interaction.
  • FSNet has achieved promising performance across multiple real-world time-series datasets, and is robust to various types of concept drifts.
  • We have released our code to facilitate further research and industrial applications of FSNet for online and deep time-series forecasting.

Explore More

Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post (see links below). Connect with us on social media and our website to get regular updates on this and other research projects.

About the Authors

Chenghao Liu is a Senior Applied Scientist at Salesforce Research Asia, working on AIOps research, including time series forecasting, anomaly detection, and causal machine learning.

Quang Pham was an intern at Salesforce Research Asia, working on online time-series forecasting, and is currently a Ph.D. candidate at Singapore Management University’s School of Computing and Information Systems. His research interests include continual learning and deep learning.

Doyen Sahoo is a Senior Manager, AI Research at Salesforce Research Asia. Doyen leads several projects pertaining to AI for IT Operations or AIOps, working on both fundamental and applied research in the areas of Time-Series Intelligence, Causal Analysis, Log Analysis, End-to-end AIOps (Detection, Causation, Remediation), and Capacity Planning, among others.

Donald Rose is a Technical Writer at Salesforce AI Research, specializing in content creation and editing. He works on multiple projects — including blog posts, video scripts, newsletters, media/PR material, social media, and writing workshops. His passions include helping researchers transform their work into publications geared towards a wider audience, leveraging existing content in multiple media modes, and writing think pieces about AI.


Glossary

  • Concept drift - Occurs when the statistical properties of the target variable change over time, in unforeseen ways. This causes problems for offline models because the predictions become less accurate as time passes.
  • Catastrophic forgetting - A characteristic of deep neural networks: the tendency to completely and abruptly forget previously learned information upon learning new information.
  • Continual learning - Also known as incremental learning, or life-long learning. The concept of learning a model from a sequence of tasks without forgetting knowledge obtained from preceding tasks, where the data in the old tasks are not available anymore when training on new data.
  • Complementary Learning Systems (CLS) - CLS suggests that the brain can achieve complex behaviors via two learning systems: the hippocampus and the neocortex. The hippocampus focuses on fast learning of pattern-separated representations of specific experiences. Through the memory consolidation process, the hippocampus’s memories are transferred to the neocortex over time to form a more general representation that supports long-term retention and generalization to new experiences. The two learning systems (fast and slow) always interact to facilitate both fast learning and long-term remembering. Our FSNet is inspired in part by the CLS model of human learning.