Meet Merlion: An End-to-End Easy-to-Use Machine Learning Library for Time Series Applications


AUTHORS: Huan Wang, Aadyot Bhatnagar, Doyen Sahoo, Wenzhuo Yang, Steven Hoi, Caiming Xiong, Donald Rose

TL;DR: Time series data is a critical source of insights for many applications, including IT Operations, Quality Management, Financial Analytics, and Inventory & Sales Management. While a variety of dedicated packages and software exist, engineers and researchers still face several daunting challenges when they try to experiment with or benchmark time-series analysis algorithms. The steep learning curve of disparate programming interfaces for different models, together with the process of selecting and training a model, data compatibility requirements, and intricate evaluation metrics, limits the accessibility of such packages for a broad audience of potential users.

To address these issues, and combine several key functions into a single tool, we developed Merlion: a Python library for time series intelligence. Merlion provides an end-to-end machine learning framework that includes loading and transforming data, building and training models, post-processing model outputs, and evaluating model performance. It supports various time series learning tasks, including forecasting, anomaly detection, and change-point detection for both univariate and multivariate time series. This library helps solve a range of problems by providing engineers and researchers a one-stop solution to rapidly develop models for their specific time series needs, and benchmark them across multiple time-series datasets. Instead of having to learn and deploy multiple tools, you can do it all within a single, powerful framework.

Background: Review of Key Concepts

Before we dive into how Merlion works, let’s give some background and context by briefly explaining some key concepts related to what Merlion does, and why they are important.

Time Series Analysis

If one or more variables or data points (e.g., periodic measurements) are changing in value over time, you have a time series; if you want to study the causes or trends of these data changes over time, welcome to the world of time series analysis. Time series analysis involves answering key questions as one strives to understand the data changing over time, such as:

  • What trends in the data points can we spot, and which are most important?
  • Are any outside actors, events, or processes influencing the data, or causing the data to change?
  • Which factors are affecting variables, from one point to the next, over time? Which factors have the biggest influence?
  • Where is the data headed; what changes to the data are likely to come in the future (forecasting)?
  • Have any data points or variables entered an abnormal range (anomaly detection)? What’s causing the anomaly?

Some key terms to know:

  • Time series: a sequence of data points (observations) ordered in time, showing how variables are changing over time.
  • Time series analysis: the study of what’s causing data points to change over time, including identifying current or past trends in data, detecting anomalies in the data, and forecasting where data values are headed in the future.
  • Univariate time series: a series involving just one variable changing over time.
  • Multivariate time series: a series of two or more time-dependent variables, where each can be influenced by its own past values and by the other variables.

The analysis process ideally starts by determining how each variable normally changes over time (the normal pattern), so that any anomaly (a deviation from the normal pattern) can be more easily detected.

Anomaly Detection

An anomaly is something out of the ordinary — a change in data readings that indicates a system is currently in an abnormal state (or about to go into it), or being influenced by a different process or outside factor.

The ability to spot an anomaly in time series data is very important in many different contexts. Examples include:

  • A signal that something is wrong with a device that is providing periodic data on its operational “health”
      • Finding this kind of anomaly can help you diagnose when and how the device behaved incorrectly, and what actions to take to fix the problem and eliminate the anomaly
  • An indication that an event has occurred which you need to take action on (for instance, a surprising reading from a camera or a sudden spike in heat measured near a door, suggesting the potential presence of an intruder).

On a graph of data, an anomaly could be a sudden spike away from a trend line and then a rapid return to that trend line. However, not all spikes or deviations from a trend line are anomalies. A spike might not be indicating any behavior that is abnormal or generated by a separate process that should be uncovered. In other words, it may be normal for data to deviate from a trend line; it depends on the dataset. For example, a regularly occurring spike in some data value may be a normal feature of a system, and nothing to be concerned about.

So how can you tell if a spike is normal, or abnormal? Something to be ignored, or something to take action on? How do you spot the anomaly in the first place? Anomaly detection can be like finding a needle in a haystack -- a difficult task, especially for humans — which is why AI and machine learning are ideal for tackling this task.
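
One simple way to separate a routine spike from an unusual one is to compare each point with what is typical for its position in a repeating cycle. The sketch below is only an illustration under assumptions that do not come from Merlion: the period is known and fixed, the data is made up, and the cutoff of 5 is arbitrary. The function name `seasonal_residuals` is hypothetical.

```python
from statistics import median


def seasonal_residuals(values, period):
    """Subtract the typical (median) value for each position in the cycle."""
    baselines = [median(values[phase::period]) for phase in range(period)]
    return [v - baselines[i % period] for i, v in enumerate(values)]


# Made-up data: a spike to 10 recurs every 4th step (normal for this system),
# and one irregular spike to 9 appears at index 9.
values = [1, 1, 1, 10, 1, 1, 1, 10, 1, 9, 1, 10, 1, 1, 1, 10]
residuals = seasonal_residuals(values, period=4)

# Arbitrary cutoff chosen for this toy example.
flagged = [i for i, r in enumerate(residuals) if abs(r) > 5]
print(flagged)  # [9]
```

The recurring spikes to 10 are larger than the irregular spike to 9, yet only the latter is flagged, because it deviates from what is normal at that point in the cycle.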


Forecasting

Forecasting is important, especially for businesses, in order to determine how a known quantity, or variable, will change in the future. For example, businesses track their sales data, and it is crucial to be able to forecast what future sales will be, in order to take corrective action if one’s model is forecasting a dip in sales (or forecasting a change in a factor that can influence sales; for example, weather forecasts can influence expected crop yield if you sell farm produce).

In short, forecasting lets you predict what the future will hold (at least for some data), understand the causes behind the forecast, and take corrective action now in an attempt to change that forecast for the better (that is, prevent a negative outcome in the future from occurring, or reduce the negative effect -- or make a predicted positive effect even greater).


Hyperparameters

In machine learning, a model will feature both parameters (which are determined during learning) and hyperparameters, which are set before learning even begins. Hyperparameters are tunable parameters whose values are used to control the learning process, and can directly influence how successful the model training process is.

Examples of hyperparameters include: the learning rate, the topology and size of a neural network, the k in k-nearest neighbors, the number of decision tree branches, and mini-batch size.

Analogy: if modeling test taking, one parameter would be test scores (determined during testing); hyperparameters would be the number of questions and how much time is given to finish the test (both of which are decided before the test begins).


AutoML

In Merlion, automated machine learning (AutoML) is used for automated hyperparameter tuning and model selection. In other words, AutoML automates some aspects of machine learning, making life easier for researchers (one reason why Merlion is in high demand on GitHub).
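
To make the idea of automated hyperparameter selection concrete, here is a minimal sketch that is not Merlion's AutoML implementation. It assumes a toy exponential-smoothing forecaster whose smoothing factor `alpha` is the hyperparameter (set before training) and whose running `level` is the learned parameter. The function names and the candidate grid are hypothetical.

```python
def one_step_error(values, alpha):
    """Mean absolute one-step-ahead error of exponential smoothing."""
    level = values[0]  # learned parameter, updated as data arrives
    total = 0.0
    for v in values[1:]:
        total += abs(v - level)                 # forecast for this step was `level`
        level = alpha * v + (1 - alpha) * level
    return total / (len(values) - 1)


def select_alpha(values, candidates):
    """Pick the candidate hyperparameter with the lowest error (a plain grid search)."""
    return min(candidates, key=lambda a: one_step_error(values, a))


# Made-up, steadily rising series: following the data closely works best here.
history = [1, 2, 3, 4, 5, 6]
best = select_alpha(history, candidates=[0.1, 0.5, 0.9])
print(best)  # 0.9
```

A grid search like this is the simplest form of tuning; the point is only that the user supplies data and a search space, and the selection itself is automated.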

Ensemble Learning

Merlion enables easy-to-use ensembles that combine the outputs of multiple models to achieve more robust performance.


Benchmarking

Benchmarking means measuring the performance of a system (for example, the models you have rapidly developed using Merlion) along one or more metrics. With Merlion, you can benchmark models across multiple time series datasets.

The Problem: Companies Must Get Better at Prediction (System Uptime, Outages)

One of the most important tasks that organizations need to succeed at is predicting system availability. Accurately predicting the health of our systems, and identifying potential issues with those systems, is vitally important at Salesforce. Our company must run 24/7; constant uptime is in our DNA, an essential part of our brand. Hence, it’s crucial for us to predict when any of our systems might go down. Not taking steps to predict downtimes would expose the company to undue risk.

The Upside of Uptime: Anomaly Detection and Forecasting Benefit Business

To help the company accurately predict system availability, Salesforce employs the twin time-tested time-series techniques of forecasting and anomaly detection. For example, one key to maintaining system availability is being able to detect and forecast anomalies in real-time metrics such as CPU utilization, average paging time, and request rate.

In general, anomaly detection and forecasting offer a number of business benefits, including:

  • reducing the mean time to detect or remediate incidents
  • minimizing service disruption
  • enabling capacity planning for host machines
  • offering business insights.

The Method: Employ AIOps and Time Series Techniques

Given the main problem to be solved (accurately predicting system availability and potential future outages or other negative events), organizations need to determine which method(s) should be used to solve it. In this case, the problem falls under the domain of AIOps -- and, more specifically, time series analysis.

AIOps: Applying AI to Improve IT Operations

Artificial Intelligence for Operations, or AIOps, could be thought of as the practice of applying analytics and machine learning to big data in order to automate and improve IT operations; in other words, improving the operational efficiency of a company, using AI tools and techniques. AIOps employs a range of techniques to get results, one of the most important being time series analysis.

Time Series Analysis: Studying Variables that Change over Time, to Predict Future Values

System uptime depends on certain variables being in an acceptable range at any given time — a time series task. Hence, time series analysis is crucial to predicting system availability. Since it is so important, time series analytics (one of several techniques utilized in AIOps) has been widely adopted in Salesforce’s products.

Real-World Problems: Pain Points in Time Series Experimentation and Evaluation

While we have identified time series analysis as key to solving the problem of accurately predicting system availability, a number of subproblems arise when employing time series techniques in the real world. Let’s look at some examples.

Interfacing with Diverse Models and Datasets

Practitioners want to try a variety of algorithms, but in order to use them, a significant amount of effort is required just to understand the interface for each one. It is also challenging to conform diverse datasets to a standard format for ease of benchmarking algorithms. In addition, time-series applications generally require extensive pre-processing (resampling, alignment, aggregation, normalization, etc.) or post-processing (thresholding, alert suppressions, and normalization), which are also expensive and time-consuming.
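
As an illustration of the kind of pre-processing mentioned above, the following sketch resamples irregularly timed readings onto a fixed grid by averaging within time buckets. It is a simplified stand-in written with plain Python, not Merlion's data layer; the function name `resample_mean` and the sample readings are made up for this example.

```python
def resample_mean(points, bucket_seconds):
    """Average (timestamp_seconds, value) pairs into fixed-width time buckets."""
    buckets = {}
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets.setdefault(bucket_start, []).append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Made-up readings taken at irregular times (seconds since start).
raw = [(0, 1.0), (30, 3.0), (65, 5.0), (130, 7.0)]
per_minute = resample_mean(raw, bucket_seconds=60)
print(per_minute)  # [(0, 2.0), (60, 5.0), (120, 7.0)]
```

Even this small step involves choices (bucket width, aggregation function, handling of empty buckets) that must be repeated for every dataset, which is part of why a shared data layer saves effort.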

Dealing with Diverse Metrics and Evaluation Pipelines

Time-series literature provides abundant metrics to evaluate the performance of models, and many of them are applicable in different application scenarios. However, some metrics are tricky to implement, and many academic evaluations may not even be applicable to real-world industrial application scenarios.
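
Even the most common anomaly detection metrics require some care. The sketch below computes point-wise precision, recall, and F1 from binary labels, including the edge cases where a denominator is zero. It is an illustrative helper with a hypothetical name, not one of Merlion's evaluation metrics, and point-wise scoring is only one of several conventions used in practice.

```python
def precision_recall_f1(predicted, actual):
    """Point-wise metrics for binary anomaly labels (1 = anomaly, 0 = normal)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)      # correctly flagged anomalies
    fp = sum(1 for p, a in pairs if p and not a)  # false alarms
    fn = sum(1 for p, a in pairs if not p and a)  # missed anomalies
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: two true anomalies, both found, plus one false alarm.
predicted = [0, 1, 1, 0, 1]
actual = [0, 1, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(predicted, actual)
print(round(precision, 3), recall, round(f1, 3))  # 0.667 1.0 0.8
```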

Choosing Models and Hyperparameters

Models used in time-series forecasting and anomaly detection often require expert knowledge of complex hyperparameters in order to use them effectively. Furthermore, different models have different pros and cons, and sometimes there is a need to combine multiple models together - yet many AI/ML tools are not designed to deal with such ensemble models.

Project Problems in Practice: What a Typical Time Series Application Project Looks Like

While working with different product teams, we found ourselves facing common issues across various projects. To better understand the difficulties faced in these industrial applications, let’s look at some of the steps involved in a typical application scenario:

  1. I am an engineer / data scientist / applied scientist working on projects related to anomaly detection on CPU metrics. My data only has a very limited number of labeled anomalies, so the evaluation is challenging. Also, the metrics or settings I am using to evaluate the algorithm are slightly different from those reported in academic papers.
  2. I would like to try open source time-series datasets, which contain more labels, to give me more confidence in my algorithm’s results - but there are too many of them, and they are not organized well enough for me to use easily.
  3. I would like to choose the best algorithm for my project, but it is a pain to implement the whole benchmarking pipeline against all popular industrial solutions on all datasets.
  4. I find the algorithms are sensitive to hyperparameters, but I’m not an expert in this. I’d love some AutoML features to help me choose the best hyperparameters (or let AutoML choose them automatically). If not, I need some default hyperparameters that work reasonably well.

The Upshot: Time Series Analytics is Useful, but Applying It in Practice can be Hard

Although many tools have been created for time-series analytics, it still takes a lot of background knowledge and substantial effort to build up a proper benchmarking environment that is compatible with most of the popular algorithms.

If only there were a way to make time series analysis easier to use, while retaining its power -- combining several standard machine learning methods into a single, accessible-to-all tool.

Wait — there is!

One Solution to Rule Them All: Merlion, An Easy-to-Use All-in-one ML Time Series Tool

In order to address the issues discussed above, we collaborated with various product teams on building a tool that would be easy to use and combine several AI/ML methods in one, in order to help Salesforce achieve two broad goals:

  • Improve normal operations: increasing overall operational efficiency at the company, and
  • Avoid abnormal operations: accurately predicting, detecting, and fixing errors; maintaining system availability.

In other words, we wanted an intelligent tool that would apply the techniques of AIOps and time series analysis to help increase upside potential (make normal operations function even better), while also forecasting downside scenarios where something might go wrong in order to help keep systems running and in a good state (reduce risk - avoid bad outcomes).

The result is the Merlion Repository: an easy-to-use machine learning library for time-series forecasting and anomaly detection. The goal of the Merlion repo is to provide a standardized experimentation platform that is accessible to anyone interested in time-series analytics. Through our Merlion repo, we simplify some of the most time-consuming and difficult time-series tasks so one can start experimenting on time series quickly and easily - often by writing just a few lines of code.

Merlion in Action: How it Benefits Salesforce (Two Real-World Applications)

While one of the goals of the open source Merlion project is to help any organization benefit from its powerful features, we didn’t just develop this framework for the research community at large; Salesforce uses and benefits from Merlion as well. The tool has helped the company in multiple areas, and we are confident this positive outcome will be repeated at other organizations who apply this multi-function tool to their own set of problems.

Here are just a couple of examples of how Merlion benefits Salesforce:

Application 1: Improving the Tool that Improves Performance of the Salesforce Platform

Warden AIOps is an Application Performance Management (APM) platform used by developers to address performance issues on the Salesforce platform. The overall goal of Warden AIOps is to detect and fix any issues that are negatively impacting performance before they affect our customers. Benefits include improving performance and availability for Salesforce customers and reducing fatigue for operators.

Two ways in which Merlion helps Warden AIOps:

  • Merlion has been used to benchmark various models and ensembles on metadata produced by Salesforce machines, such as database or application CPU utilization, and paging time.
  • Merlion models have been deployed in the production environment to monitor machines’ health and detect anomalies. Using Merlion, we are able to increase both the precision and recall of anomaly detection in the machine monitoring system of Warden AIOps. The time to detect incidents is also greatly reduced (find faster = fix faster).

Application 2: Proactive Throttle Prediction Improves Customer Experience

When a machine’s load reaches a certain limit, we must throttle some customers’ apps to reduce the load. Currently, customers get notified only after their apps are throttled. In this ongoing new effort, with the help of Merlion, we are forecasting possible resource overflow ahead of time (before throttling is enforced), so customers can be notified early.

Merlion’s Modules: Five Easy Pieces

We’ve seen that Merlion is helping Salesforce with crucial tasks like anomaly detection and forecasting. But how does the tool actually work? Let’s look beneath the hood, to see how Merlion is structured and what its main components are.

Merlion is an end-to-end tool, designed to let users handle all of the primary tasks in the machine learning pipeline, from start to finish, as shown by its five-layer modular architecture:

  • Data Layer
      • Can load a wide range of datasets
      • Interoperability (plays nice) with Pandas
      • Does common data pre-processing transforms
  • Models
      • Anomaly detection: deep models, forecasting-based methods, statistical methods
      • Forecasting: tree ensemble models, statistical methods
      • AutoML: automatic hyperparameter selection
  • Post-Processing
      • Calibration: interpretable anomaly scores
      • Thresholding: practical rules for noise reduction
  • Ensembles and Model Selection
      • Provide strong performance that is robust across diverse datasets
      • For both anomaly detection and forecasting
  • Evaluation Pipeline
      • Flexible pipelines to simulate evaluation settings
      • From offline prediction to live deployment with model re-training.

Making Time Series Experimentation Accessible to All

For any tool to be successful (that is, likely to get adopted and used to solve real-world problems), it must not only be powerful but relatively easy to use and understand as well. This is one aspect of Merlion that makes it stand out. Not only is it powerful, but it’s designed with some key features to help make it accessible to anyone interested in time series analytics:

Compact Code

One can start experimenting on time series by writing just a few lines of code. For example, one can train a default model for anomaly detection and make predictions in just 10 lines of code.

Easy-to-access Datasets

Users can import many different time-series datasets -- with just a single line of code.

Consistent APIs Across Models

Through consistent APIs, users may try out different algorithms while keeping their experimentation script unchanged. For example, the model initializations are almost the same for all models.
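
The benefit of a consistent interface can be shown with a small sketch. The two toy detectors below are hypothetical and are not Merlion classes; the method names `train` and `score` are assumptions made for this example. Because both expose the same methods, the experiment function does not change when the model is swapped.

```python
from statistics import mean, median, pstdev


class MeanDetector:
    """Toy detector: distance from the training mean, in standard deviations."""

    def train(self, values):
        self.center = mean(values)
        self.scale = pstdev(values)

    def score(self, values):
        return [abs(v - self.center) / self.scale for v in values]


class MedianDetector:
    """Toy detector: distance from the training median, in median absolute deviations."""

    def train(self, values):
        self.center = median(values)
        self.scale = median(abs(v - self.center) for v in values)

    def score(self, values):
        return [abs(v - self.center) / self.scale for v in values]


def run_experiment(model, train, test, threshold=3.0):
    """The experiment script relies only on train() and score()."""
    model.train(train)
    return [i for i, s in enumerate(model.score(test)) if s > threshold]


train, test = [1, 2, 3, 4, 5], [3, 10]
for model in (MeanDetector(), MedianDetector()):
    print(type(model).__name__, run_experiment(model, train, test))
```

Both toy models flag the second test point. Note that these toy classes would divide by zero on constant training data; a real implementation would need to guard against that.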

Add Pre-processing Procedures Before the Time Series is Fed into the Model

Users can add data pre-processing transforms such as difference, exponential moving average, moving percentile, and lag transforms to an anomaly detection model.
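
To show what two of these transforms do to the data, here is a minimal plain-Python sketch of a difference transform followed by an exponential moving average. These are hypothetical helper functions written for illustration, not Merlion's transform classes, and the smoothing factor of 0.5 is an arbitrary choice.

```python
def difference(values):
    """x[t] - x[t-1]; removes a slowly changing level or trend."""
    return [b - a for a, b in zip(values, values[1:])]


def exponential_moving_average(values, alpha):
    """Smooth the series; a larger alpha follows the data more closely."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


def apply_transforms(values, transforms):
    """Apply each transform in order before handing the data to a model."""
    for transform in transforms:
        values = transform(values)
    return values


raw = [1, 3, 6, 10]
prepared = apply_transforms(
    raw, [difference, lambda v: exponential_moving_average(v, alpha=0.5)]
)
print(prepared)  # [2, 2.5, 3.25]
```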

No single model can perform well across all time series and use cases, so it’s important to provide users the flexibility to choose from a broad suite of heterogeneous models. Merlion does just that, implementing many diverse models for both anomaly detection and forecasting.

The algorithms that Merlion currently supports for anomaly detection include isolation forest, random cut forest (by AWS), spectral residual (by Microsoft), dynamic baseline, ZMS, variational autoencoder, deep auto-encoding Gaussian mixture model, deep point anomaly detector, LSTM-encoder-decoder-based anomaly detector, and simple statistical threshold. We also support forecast-based anomaly detectors.

For forecasting, Merlion supports ARIMA, SARIMA, Prophet (by Facebook), ETS, vector AR, random forest, gradient boosted tree, and LSTM, as well as our own homegrown MSES smoother.


Post-processing and Calibration

Merlion users have the ability to specify post-processing rules in the model configuration. For example, with Aggregated Alarms as the post-rule for anomaly detection, the model can be set to fire an alarm only if the raw anomaly score exceeds a specified threshold (e.g., 4-sigma, or 4 standard deviations from the mean), and to suppress all subsequent alarms for a user-specified period after the first alarm (e.g., two hours). The purpose of alert suppression is to avoid generating alerts too often, which could lead to alert fatigue (for example, customers who receive 10 alerts in 1 minute might start ignoring them).

By default, calibration is also enabled for all models - that is, the anomaly scores returned can be interpreted as z-scores (standard deviation units). In our current example, we set a detection threshold of 4 to specify that we would only like to generate an alarm for a 4-sigma or greater event.
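
The two post-processing steps described above can be sketched in a few lines. This is a simplified illustration, not Merlion's calibrator or alarm rule: it assumes calibration is a plain z-score against scores seen during training, timestamps are in seconds, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def calibrate(raw_scores, train_scores):
    """Express raw scores in standard-deviation units of the training scores."""
    mu, sigma = mean(train_scores), pstdev(train_scores)
    return [(s - mu) / sigma for s in raw_scores]


def aggregate_alarms(timestamps, z_scores, threshold=4.0, suppress_seconds=7200):
    """Fire on scores at or beyond the threshold; mute repeats for a fixed period."""
    alarms, last_alarm = [], None
    for t, z in zip(timestamps, z_scores):
        if abs(z) < threshold:
            continue  # not a 4-sigma event
        if last_alarm is not None and t - last_alarm < suppress_seconds:
            continue  # still inside the suppression window
        alarms.append(t)
        last_alarm = t
    return alarms


# Made-up scores: training scores have mean 1.0 and standard deviation of about 0.71.
train_scores = [0.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0, 1.0]
timestamps = [0, 600, 1200, 1800, 9000]
raw_scores = [1.0, 5.0, 6.0, 1.5, 4.5]

z_scores = calibrate(raw_scores, train_scores)
print(aggregate_alarms(timestamps, z_scores))  # [600, 9000]
```

The high score at 1200 seconds is suppressed because it falls within two hours of the alarm at 600 seconds, while the one at 9000 seconds fires because the suppression window has passed.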

Benchmarking and Simulation of a Deployment Environment

One of Merlion's key features is an evaluation pipeline that simulates the live deployment of a model on historical data. This enables you to compare models on the datasets relevant to them, under conditions they may encounter in a production environment. Our evaluation pipeline proceeds as follows:

  1. Train an initial model on recent historical training data (designated as the training split of the time series)
  2. At a regular interval (e.g., once per day), retrain the entire model on the most recent data. This can be either the entire history of the time series or a more limited window (e.g., four weeks).
  3. Obtain the model's predictions (anomaly scores or forecasts) for the time series values that occur between re-trainings. You may customize whether this should be done in batch (predicting all values at once), streaming (updating the model's internal state after each data point without fully re-training it), or some intermediate cadence.
  4. Compare the model's predictions against the ground truth (labeled anomalies for anomaly detection, or the actual time series values for forecasting), and report quantitative evaluation metrics.
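
The four steps above can be sketched as a walk-forward loop. The example below is a toy under stated assumptions, not Merlion's evaluation pipeline: the "model" is simply the mean of its training history, predictions between retrainings are made in batch, and the only metric reported is mean absolute error. The function name and parameters are hypothetical.

```python
def simulate_deployment(values, train_size, retrain_every, window=None):
    """Walk forward through `values`, retraining a mean forecaster periodically."""
    predictions, actuals = [], []
    for start in range(train_size, len(values), retrain_every):
        # Steps 1-2: (re)train on the full history or on a limited recent window.
        history = values[:start] if window is None else values[max(0, start - window):start]
        model = sum(history) / len(history)
        # Step 3: predict, in batch, every value that arrives before the next retraining.
        batch = values[start:start + retrain_every]
        predictions.extend([model] * len(batch))
        actuals.extend(batch)
    # Step 4: compare predictions with the ground truth.
    mae = sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)
    return predictions, mae


# Made-up series with a level shift halfway through.
series = [1, 1, 1, 1, 3, 3, 3, 3]
preds_full, mae_full = simulate_deployment(series, train_size=4, retrain_every=2)
preds_recent, mae_recent = simulate_deployment(series, train_size=4, retrain_every=2, window=2)
print(preds_recent, mae_recent)  # [1.0, 1.0, 3.0, 3.0] 1.0
```

On this series the limited window adapts to the level shift faster than the full history does, which is the kind of trade-off such a simulation is meant to expose before a model is deployed.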

Merlion has two quick benchmark scripts for anomaly detection and forecasting, respectively. Users can evaluate any model on any dataset.

Ensembles: Combining Multiple Models

Merlion users can also easily construct ensemble models from existing models. An ensemble model combines more than one model in order to improve predictive power (the ability to predict an outcome).

One example of using ensembles to solve data science problems is the random forest algorithm, which utilizes multiple CART models (the algorithm creates multiple CART trees and combines the predictions).
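
As a minimal sketch of how an ensemble can combine model outputs, the function below merges forecasts from several models step by step, using either the median or the mean. It is an illustration with a hypothetical function name and made-up forecasts, not Merlion's ensemble classes.

```python
from statistics import median


def combine_forecasts(forecasts, how="median"):
    """Combine per-model forecasts (equal-length lists) one time step at a time."""
    if how == "median":
        combiner = median
    else:
        combiner = lambda xs: sum(xs) / len(xs)  # plain mean
    return [combiner(step) for step in zip(*forecasts)]


# Made-up two-step forecasts from three models; the third is far off at step one.
forecasts = [[10, 11], [12, 13], [50, 12]]
print(combine_forecasts(forecasts, how="median"))  # [12, 12]
print(combine_forecasts(forecasts, how="mean"))    # [24.0, 12.0]
```

The median ignores the outlying forecast of 50 while the mean is pulled toward it, which shows why the choice of combiner affects how robust an ensemble is.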

Comparison: Merlion vs. Other Libraries

The table below provides a visual overview of how Merlion's features compare to other libraries for time series anomaly detection and/or forecasting. Note how all 8 boxes (the entire set of 8 desired features listed in the far left column) are checked for Merlion, whereas none of the other tools (listed across the top of the table) have all of the boxes checked.

In short, if you want to reap the full benefits of employing all 8 of these important features in a single tool, Merlion is the only library to choose (and its popularity on GitHub shows that many people agree).

The Bottom Line

Researchers and engineers alike can rely on Merlion. By combining multiple useful functions in a single easy-to-use tool, Merlion provides a complete end-to-end solution for a wide range of machine learning time series tasks.

The Merlion framework provides several key benefits, including:

  • All-in-one design: comes with more built-in features/functions than other ML tools
  • A unified interface across all models and datasets
  • Pre- and post-processing layers
  • Anomaly score calibration to improve interpretability
  • AutoML for hyperparameter tuning and model selection
  • An evaluation framework that simulates model retraining
  • Support for ensembles (combining multiple models)
  • An easy-to-use visualization module
  • Can be done on-prem (with a local machine) with your own data
  • Increased security, since there is no need to share your data.

Explore More

Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post (links below). Connect with us on social media and our mailing list to get regular updates on this and other research projects.

About the Authors

Huan Wang is a Research Director at Salesforce Research. He earned his Ph.D. degree in Computer Science from Yale, and received the best paper award at the Conference on Learning Theory (COLT) in 2012. At Salesforce, he works on deep learning theory, reinforcement learning, time series analytics, operational and data intelligence. Previously, he was a senior applied scientist at Microsoft AI Research, a research scientist at Yahoo, an adjunct professor at NYU’s School of Engineering teaching machine learning, and an adjunct professor at Baruch College teaching algorithm design.

Aadyot Bhatnagar is a Senior Research Engineer at Salesforce Research. He has broad research interests in machine learning, with prior works in speech, NLP, and computer vision, and he enjoys bridging the gap between research and production. Aadyot is the lead developer of the Merlion Repo.

Doyen Sahoo is a Senior Research Scientist at Salesforce Research Asia, working on AIOps research and development for enhancing operational efficiency at Salesforce. His research interests include machine learning, both fundamental and applied, online learning, and computer vision.

Wenzhuo Yang is a Senior Applied Researcher at Salesforce Research Asia, working on AIOps research and applied machine learning research, including causal machine learning, explainable AI, and recommender systems.

Steven C.H. Hoi is Managing Director of Salesforce Research Asia and oversees Salesforce's AI research and development activities in APAC. His research interests include machine learning and a broad range of AI applications.

Caiming Xiong, a VP / Managing Director of AI Research at Salesforce, leads the effort to build state-of-the-art AI technologies, publish in top academic conferences, innovate, collaborate, and embed our work across Salesforce clouds to accelerate the building of AI products.

Donald Rose is a Technical Writer at Salesforce AI Research. Specializing in content creation and editing, Dr. Rose works on multiple projects, including blog posts, video scripts, news articles, media/PR material, social media, writing workshops, and more. He also helps researchers transform their work into publications geared towards a wider audience.


Acknowledgements

The Merlion repository is the outcome of a collaboration between Salesforce Research and several Salesforce product teams, including Monitoring Cloud, Warden AI, and Service Protection. Here is the full list of authors: Aadyot Bhatnagar, Paul Kassianik, Chenghao Liu, Tian Lan, Wenzhuo Yang, Rowan Cassius, Doyen Sahoo, Devansh Arpit, Sri Subramanian, Gerald Woo, Amrita Saha, Arun Kumar Jagota, Gokulakrishnan Gopalakrishnan, Manpreet Singh, K C Krithika, Sukumar Maddineni, Daeki Cho, Bo Zong, Yingbo Zhou, Caiming Xiong, Silvio Savarese, Steven Hoi, and Huan Wang.

We would also like to thank Denise Perez, Donald Rose, Gang Wu, Feihong Wu, Vera Serdiukova, Zachary Taschdjian, and MJ Jones for their help in setting up the webpage and UX designs.

References

Prophet: Taylor, S. J., & Letham, B. Forecasting at scale.

Random Cut Forest: Guha, S., Mishra, N., Roy, G., & Schrijvers, O. (2016, June). Robust random cut forest based anomaly detection on streams. In International Conference on Machine Learning (pp. 2712-2721).

Isolation Forest: Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining (pp. 413-422). IEEE.