ETSformer: Exponential Smoothing Transformers for Time-Series Forecasting

Authors: Gerald Woo, Chenghao Liu, Donald Rose


TL;DR: We developed a new time-series forecasting model called ETSformer that leverages the power of two frameworks. By combining the classical intuition of seasonal-trend decomposition and exponential smoothing with modern transformers – as well as introducing novel exponential smoothing and frequency attention mechanisms – ETSformer achieves state-of-the-art performance.


Background

Before diving into our main discussion, let’s review a couple of important concepts that are at the core of our work. (For a review of other key terms, see our Glossary.)

Time-Series Forecasting

Time-series forecasting is concerned with the prediction of the future based on historical information, specifically for numerical data collected in a temporal or sequentially ordered manner. The accurate forecasting of such data yields many benefits across a variety of domains. For example, in e-commerce, the accurate forecasting of sales demand allows companies to optimize supply chain decisions, and create better pricing strategies.

AIOps is another important domain where forecasting plays an essential role. IT operations generate immense amounts of time-series data, and analyzing this trove of information with the help of artificial intelligence and machine learning can greatly boost operational efficiency.

Two AIOps tasks that benefit from more accurate forecasting are anomaly alerts and capacity planning:

  • Detecting anomalies and triggering notifications in IT operations is important because it allows us to quickly detect any faults and take immediate remedial action. Many proactive anomaly detection tools rely on accurate forecasting models, creating alerts when the actual measurements deviate from the forecasted values by a large amount.
  • Capacity planning is another critical task in IT operations. By leveraging accurate forecasting models, smart algorithms can dynamically allocate cloud resources to active tasks, ensuring high availability to customers while minimizing the costs associated with these resources.
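
The alerting idea in the first bullet can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `anomaly_alerts`, the fixed threshold, and the toy measurements are hypothetical and do not come from any particular AIOps tool.

```python
import numpy as np

def anomaly_alerts(actual, forecast, threshold: float) -> np.ndarray:
    """Return the time indices where |actual - forecast| exceeds the threshold."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    deviation = np.abs(actual - forecast)
    return np.flatnonzero(deviation > threshold)

# Toy example (made-up values): one measurement is far from its forecast.
actual = [50.0, 52.0, 95.0, 51.0]
forecast = [51.0, 51.0, 52.0, 50.0]
alerts = anomaly_alerts(actual, forecast, threshold=10.0)
```

In practice the threshold would likely be tuned or derived from the forecast's uncertainty; the more accurate the forecast, the fewer false alerts a rule like this tends to raise.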

Time-Series Prior Knowledge: Exponential Smoothing & Decomposition

Exponential smoothing is a family of methods motivated by the idea that forecasts are a weighted average of past data (observations), and the weights decay (decrease) exponentially as we go further into the past (older data).

  • In other words, more recent data get weighted more highly than older data, reflecting the view that the more recent past should be considered more important or relevant for making new predictions or identifying current trends – a reasonable assumption.
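
To make the weighting concrete, here is a minimal sketch of simple exponential smoothing in Python. The smoothing parameter `alpha`, the choice to initialize the level with the first observation, and the toy data are assumptions made for illustration.

```python
import numpy as np

def exponential_weights(alpha: float, n: int) -> np.ndarray:
    """Weights placed on the n most recent observations, newest first."""
    return alpha * (1.0 - alpha) ** np.arange(n)

def simple_exponential_smoothing(x, alpha: float) -> float:
    """One-step-ahead forecast via the recursion
    level = alpha * x_t + (1 - alpha) * previous_level."""
    level = float(x[0])  # initialize with the first observation (one common choice)
    for value in x[1:]:
        level = alpha * float(value) + (1.0 - alpha) * level
    return level

weights = exponential_weights(alpha=0.5, n=4)
forecast = simple_exponential_smoothing([10.0, 12.0, 11.0, 13.0], alpha=0.5)
```

Each weight is a constant fraction of the one before it, which is exactly the exponential decay described above. A larger `alpha` makes the forecast react faster to recent data, while a smaller one averages over a longer history.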

Time-series data can contain a wide variety of patterns, of which trend and seasonality are two distinctive categories or components that many real-world time series exhibit. It is often helpful to split time-series data into such constituents, and this is known as time-series decomposition.

  • Decomposition of a time series into trend and seasonality components has played a key role in the development of forecasting methods – it allows us to analyze each component individually, leading to more accurate predictions, rather than having to analyze the raw time-series, which would otherwise be too complex.
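
As a rough illustration of decomposition, the sketch below applies a classical additive scheme: a moving average estimates the trend, and per-phase averages of the detrended series estimate the seasonal pattern. This is a textbook-style baseline, not ETSformer's learned decomposition; the function name, the odd-period restriction, and the synthetic series are assumptions.

```python
import numpy as np

def decompose_additive(x: np.ndarray, period: int):
    """Split x into trend + seasonal + residual (odd period assumed)."""
    # Trend: centered moving average over one full period.
    # Edge padding keeps the output the same length as the input,
    # at the cost of less reliable values near both ends.
    kernel = np.ones(period) / period
    padded = np.pad(x, period // 2, mode="edge")
    trend = np.convolve(padded, kernel, mode="valid")

    # Seasonal: average detrended value at each position in the cycle,
    # shifted so the pattern sums to zero over one period.
    detrended = x - trend
    phase = np.arange(len(x)) % period
    pattern = np.array([detrended[phase == k].mean() for k in range(period)])
    pattern -= pattern.mean()
    seasonal = pattern[phase]

    residual = x - trend - seasonal
    return trend, seasonal, residual

# Synthetic series: linear trend plus a weekly (period 7) cycle.
t = np.arange(70)
series = 0.5 * t + 3.0 * np.sin(2 * np.pi * t / 7)
trend, seasonal, residual = decompose_additive(series, period=7)
```

Away from the edges, the moving average cancels the weekly cycle and recovers the linear trend, which is why each component can then be analyzed on its own.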

This decomposition, along with an exponentially-weighted decay, are examples of incorporating prior knowledge of time-series structures into forecasting models, and the benefits of doing so are clear, given the popularity and forecasting prowess of these methods.


Problem: Time-Series Forecasting Methods Ignore Prior Knowledge

Now that we know why time-series forecasting is so important, you may be wondering: how do we actually forecast the future? In the age of big data, where we have access to copious amounts of time-series metrics (for example, minute-level measurements from a data center across the span of a year), simple statistical models no longer cut it. Instead, we look towards powerful machine learning and deep learning models, which can ingest these large amounts of data, learn the salient patterns, and make accurate long-term forecasts.

However, time-series data is usually noisy and non-stationary (its statistical properties fluctuate over time). In addition, existing approaches tend to be too general: methods such as general-purpose machine learning models may incorporate some form of prior knowledge, but it is not specialized for time-series structures such as trend and seasonality. All of this can lead to suboptimal modeling of temporal patterns and inaccurate long-term forecasts.


Our Solution: Exponential Smoothing Transformers (ETSformer)

To address the limitations of existing methods, we propose a new method for time-series forecasting called ETSformer. You could think of our approach as “exponential smoothing transformers” – our model is essentially a transformer, extended with extra capabilities designed to tailor it to processing time-series information. Inspired by classical exponential smoothing methods, ETSformer combines their power with that of transformers to achieve state-of-the-art performance.

Since our new approach combines elements of two powerful techniques, the name embodies this fruitful combination: “ETS” comes from the extension of exponential smoothing methods to consider state space models (Error, Trend, and Seasonal) – and can also be thought of as an abbreviation for ExponenTial Smoothing – while “former” comes from transformer.

ETSformer in a nutshell: Transformers, Transformed

Our new approach:

  • Brings the time-tested ideas of seasonal-trend decomposition and exponential weighting into the modern transformers framework. Seasonality and trend are critical components of time-series data, and ETSformer bakes these time-series priors into the architecture of a transformer model.
  • This leads to forecasts that are a composition of human-interpretable level, growth, and seasonality components.
  • The result: a deep learning model that’s effective, efficient, and interpretable.

Deep Dive: How ETSformer Works

Generating Forecasts: Inspired by classical ETS methods

Figure 1 provides a visual overview of how ETSformer generates its forecasts:

  • Step 1: ETSformer first decomposes the input time-series into Seasonal and Trend components – the latter being a composition of the Level and Growth terms. The reason for this step: extracting intermediate representations lets us extrapolate (see next step) on a more granular level than the raw input, which ultimately results in more accurate forecasts.
  • Step 2: ETSformer now extrapolates the two intermediate components into the future. The Seasonal component extracts salient periodic patterns and extrapolates them. The Trend component first estimates the current level of the time-series, and subsequently adds a damped growth term to generate trend forecasts. (The damping is used to avoid overestimating future growth – in other words, to avoid being overly optimistic and assuming a trend will continue unabated.)
  • Step 3: The last step generates a final forecast through the composition or recombining of the (now-extrapolated) Seasonal and Trend components.

Figure 1. An overview of how ETSformer generates forecasts: first decomposition (red down-arrow) of input data into Seasonal and Trend components, then extrapolation of these two components, and finally composition (red up-arrow), which recombines them into a final forecast over the forecast horizon.
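
The three steps can be sketched with a classical damped-trend forecast. This is a simplified illustration in NumPy, not ETSformer's learned modules: the function name `compose_forecast`, the fixed damping factor `phi`, and the toy values for level, growth, and seasonal pattern are all assumptions made for the example.

```python
import numpy as np

def compose_forecast(level, growth, seasonal_pattern, horizon, phi=0.9):
    """Trend (level + damped growth) plus a repeated seasonal pattern.

    Assumes seasonal_pattern[0] is the seasonal value for the first
    forecast step; phi in (0, 1) is a hypothetical fixed damping factor.
    """
    steps = np.arange(1, horizon + 1)
    damped_steps = np.cumsum(phi ** steps)  # phi + phi^2 + ... + phi^h
    trend = level + damped_steps * growth   # extrapolated trend
    pattern = np.asarray(seasonal_pattern, dtype=float)
    seasonal = pattern[(steps - 1) % len(pattern)]  # extrapolated seasonality
    return trend + seasonal                 # composition

forecast = compose_forecast(level=100.0, growth=2.0,
                            seasonal_pattern=[1.0, -1.0], horizon=3, phi=0.5)
```

With `phi` below 1, the growth contribution approaches `growth * phi / (1 - phi)` instead of increasing without bound, which is the sense in which damping guards against overestimating future growth.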


System Architecture: Transformer at its Core


Our system’s architecture is essentially a transformer, consisting of an encoder and a decoder, each of which plays a key role in the three main steps:

  • Decomposition: The encoder is responsible for taking in the time-series, and extracting the level, growth, and seasonality components from this input time-series.
  • Extrapolation: These components are then passed on to the decoder, which extrapolates them into the future.
  • Composition: Before exiting the decoder, these extrapolated components are fused into a single forecast of the future.

Figure 2. How ETSformer’s components operate. A lookback window (graph, bottom-center) is processed by the encoder-decoder transformer architecture (dark gray box) to produce a forecast (graph, upper right). The encoder comprises multiple layers; each performs seasonality, growth, and level extraction via our novel Frequency Attention, Exponential Smoothing Attention, and Level modules. The decoder comprises multiple G+S stacks; each performs extrapolation on the seasonality and growth components, via the Frequency Attention and Growth Damping modules (lower right).


The encoder performs seasonality, growth, and level extraction:

  • Seasonality Extraction via Frequency Attention: The encoder first extracts seasonality via our novel frequency attention mechanism, transforming the time-series representations into the frequency domain, where the salient frequencies corresponding to the seasonality present in the time-series are extracted.
  • Growth Extraction via Exponential Smoothing Attention: The deseasonalized time-series is passed to our novel exponential smoothing attention, which extracts the growth component.
  • Level Extraction via Level Module.
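
One way to picture exponential smoothing as an attention mechanism is as a fixed, lower-triangular weight matrix in which each time step attends to earlier steps with exponentially decaying weights. The sketch below is a loose, single-channel illustration of that idea rather than the module used in ETSformer: the function names, the scalar `alpha`, the `initial_growth` term, and the choice to smooth first differences are assumptions.

```python
import numpy as np

def es_attention_matrix(length: int, alpha: float) -> np.ndarray:
    """A[t, j] = alpha * (1 - alpha)**(t - j) for j <= t, and 0 otherwise."""
    lag = np.arange(length)[:, None] - np.arange(length)[None, :]
    decay = alpha * (1.0 - alpha) ** np.maximum(lag, 0)
    return np.where(lag >= 0, decay, 0.0)

def extract_growth(deseasonalized, alpha: float, initial_growth: float = 0.0):
    """Exponentially smoothed step-to-step change of a deseasonalized series."""
    diffs = np.diff(np.asarray(deseasonalized, dtype=float))
    weights = es_attention_matrix(len(diffs), alpha)
    # The initial state's influence decays as (1 - alpha)**(t + 1).
    init = initial_growth * (1.0 - alpha) ** np.arange(1, len(diffs) + 1)
    return weights @ diffs + init

# Toy input: a straight line with slope 2, so the growth estimate should approach 2.
series = 2.0 * np.arange(50) + 5.0
growth = extract_growth(series, alpha=0.3)
```

Because the weights depend only on the time lag and not on the content of the inputs, such a matrix can be built without comparing queries and keys, which is part of what makes this style of attention cheap to compute.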

The decoder then performs extrapolation into the future:

  • Seasonality Extrapolation via Frequency Attention: Using our frequency attention mechanism, the extracted seasonality representations can be extrapolated into the future, as visualized in Figure 3.
  • Growth Extrapolation via Growth Damping: Growth may not persist linearly into the future, especially when forecasting across long time periods, and assuming that it does can lead to unrealistic predictions. The growth damping module addresses this by giving the model the option to let growth forecasts taper off further into the future; whether and how much to taper is learned from data.
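
The seasonality bullet can be illustrated with a small frequency-domain sketch: transform the lookback window with a Fourier transform, keep the few strongest frequencies, and evaluate the corresponding sinusoids at future time steps. This is a simplified, single-series illustration rather than ETSformer's actual module; the function name `extrapolate_seasonality`, the `top_k` parameter, and the synthetic signal are assumptions.

```python
import numpy as np

def extrapolate_seasonality(x: np.ndarray, top_k: int, horizon: int) -> np.ndarray:
    """Extend the top_k strongest periodic components of x into the future."""
    n = len(x)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n)      # cycles per time step
    amplitude = np.abs(spectrum)
    amplitude[0] = 0.0              # ignore the constant (zero-frequency) term
    keep = np.argsort(amplitude)[-top_k:]

    t = np.arange(n, n + horizon)   # future time indices
    forecast = np.zeros(horizon)
    for k in keep:
        # A real sinusoid shows up as a conjugate pair of bins, hence 2 / n;
        # the Nyquist bin (even n only) has no partner, hence 1 / n.
        is_nyquist = (n % 2 == 0) and (k == n // 2)
        scale = (1.0 if is_nyquist else 2.0) / n
        phase = np.angle(spectrum[k])
        forecast += scale * np.abs(spectrum[k]) * np.cos(2 * np.pi * freqs[k] * t + phase)
    return forecast

def signal(t):
    """Synthetic seasonal signal with periods of 12 and 6 time steps."""
    return 3.0 * np.sin(2 * np.pi * t / 12) + 1.5 * np.cos(2 * np.pi * t / 6)

future = extrapolate_seasonality(signal(np.arange(48)), top_k=2, horizon=12)
```

On this clean synthetic signal the extrapolation continues both cycles exactly. On real data, keeping only a few dominant frequencies also acts as a filter that discards much of the noise.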

Results

Combining two different approaches into a new method that yields new insights or benefits is a time-honored technique in science, but what about this particular domain? Now that we’ve described the different components of ETSformer, and how they tie together, we’d like to show that our approach is not just a good idea in theory. Does combining classical exponential smoothing techniques with a transformer architecture actually deliver measurably better performance?

We’re happy to report that the answer is Yes! ETSformer achieves state-of-the-art performance on six real-world time-series datasets from a range of application domains – including traffic forecasting, weather forecasting, and financial time-series forecasting. Our method beats baselines in 22 out of 24 settings (based on the MSE – mean squared error – metric) across these datasets and across different forecasting lengths (how far ahead into the future the model forecasts). See our research paper for a more detailed explanation of our empirical results and comparisons with competing baselines.
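
For readers unfamiliar with the metric, MSE is the average of the squared differences between forecasted and actual values, so lower is better. A minimal sketch follows; the function name and the toy numbers are illustrative only and are not results from our experiments.

```python
import numpy as np

def mean_squared_error(actual, predicted) -> float:
    """Average of squared differences between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))

# Toy values: errors of 0, -0.5, and 1 give squared errors of 0, 0.25, and 1.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
```

Because the errors are squared, a few large misses are penalized more heavily than many small ones.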

Another positive result: ETSformer achieves interpretable decompositions of the forecasted quantities – exhibiting a clear trend and seasonal pattern, rather than noisy, inaccurate decompositions. As shown in Figure 3, given a time-series and the true underlying decompositions of seasonality and trend (we have this information on synthetic data), ETSformer can reconstruct these underlying components better than a competing method. ETSformer successfully forecasts interpretable level, trend (level + growth), and seasonal components, as observed in the trend and seasonal components closely tracking the ground truth patterns. In contrast, the competing approach, Autoformer, struggles to disambiguate between trend and seasonality.

Figure 3. Visualization of time-series forecasts on a synthetic dataset by ETSformer, compared to a baseline approach (Autoformer) and ground truth. Top: seasonality & trend (the two components combined, non-decomposed). Middle: trend component (decomposed). Bottom: seasonal component (decomposed). In all three cases, ETSformer matches ground truth better than Autoformer.

Impact

ETSformer's state-of-the-art performance provides evidence that combining ETS techniques with a transformer-based architecture can yield real-world benefits. Combining other classical methods with the power of transformers might bring equally good (perhaps even greater) benefits, and would seem like a fruitful avenue to explore in future research.

It's also important to note that ETSformer generates forecasts based on a composition of interpretable time-series components. This means we can visualize each component individually, and understand how seasonality and trend affect the forecasts. This interpretability is a key feature because, in general, we want the decisions or results of AI systems to be explainable to the greatest extent possible.

The Bottom Line

  • Time-series forecasting is the prediction of future values based on historical information, and has important implications across a wide range of scientific and business applications.
  • However, it’s a difficult task, due to the properties of time-series data. Two key issues: the high amount of noise, and fluctuation (non-stationary values) across time. Another common issue in modern forecasting architectures: not incorporating time-series structures as prior knowledge; this is a missed opportunity, since such knowledge can be useful for generating forecasts.
  • With our new method, ETSformer, we propose to rectify this latter issue by combining two great ideas in one: introducing the time-tested concepts of time-series decomposition and exponential smoothing into the modern, powerful framework of transformers.
  • ETSformer performs three main steps (decomposition, extrapolation, composition) to generate a final forecast. ETSformer’s encoder and decoder transform input data into output forecasts – extracting and extrapolating the level, growth, and seasonality components along the way.
  • ETSformer can construct interpretable, seasonal-trend decomposed forecasts, and proved its efficacy by achieving state-of-the-art performance across a range of time-series forecasting applications and datasets.
  • We have released our code to facilitate further research and industrial applications of ETSformer for time-series forecasting.

Explore More

Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post (see links below). Connect with us on social media and our website to get regular updates on this and other research projects.

About the Authors

Gerald Woo is a Ph.D. Candidate in the Industrial PhD Program at Salesforce Research. His research focuses on time-series modeling with deep learning.

Chenghao Liu is a Senior Applied Scientist at Salesforce Research Asia, working on AIOps research, including time series forecasting, anomaly detection, and causal machine learning.

Donald Rose is a Technical Writer at Salesforce AI Research. He earned his Ph.D. in Computer Science at UC Irvine, and specializes in content creation for multiple projects — such as blog posts, video scripts, newsletters, media/PR material, tutorials, and workshops. He enjoys helping researchers transform their work into publications geared to a wider audience and writing think pieces about AI.

Glossary

  • AIOps - Artificial Intelligence for IT Operations: the application of artificial intelligence, machine learning, and analytics to enhance IT operations.
  • Anomaly detection - Identification of rare events or occurrences in an environment or dataset. An anomaly can be defined as an observation that rarely occurs or is inconsistent with the rest of the dataset.
  • Auto-scaling - A method used in cloud computing to dynamically adjust the amount of computational resources (physical machines, virtual machines, pods, etc.) assigned to a computational task.
  • Exponential smoothing - A family of methods motivated by the idea that forecasts are a weighted average of past data (observations), and the weights should decay (decrease) exponentially as we go further into the past (older data). In other words, more recent data should get weighted more highly than older data, reflecting the view that the more recent past should be considered more important or relevant for making new predictions or identifying current trends – a reasonable assumption.
  • Seasonality - A characteristic of time-series data. A time-series is said to contain seasonality if it has repeating patterns which occur at regular intervals. Example: temperature fluctuations in regions with continental climate (dropping in winter, rising in summer).
  • Transformer - A neural network architecture built on attention. In its original encoder-decoder form, it encodes an input sequence (such as a sentence) into vector representations, then decodes those representations into a different sequence. The overall effect: an input sequence is “transformed” into an output sequence. Attention lets the model refer to and use other positions in the sequence (for a decoder, the earlier ones) to help produce a good output. Used most often in NLP applications. Transformer examples: T5; BERT.
  • Trend - A characteristic of time-series data. A time-series is said to contain trend when there is a long-term pattern of increasing or decreasing values. More complex trends are possible, such as an increase, followed by stagnation. Trend can be further broken down into level and growth components – where level is the average value over a time period, and growth is the change in value over that time period.