TL;DR: PyRCA is an open-source machine learning library designed for Root Cause Analysis (RCA) in IT operations. It offers a comprehensive framework that helps users identify complex causal dependencies among metrics and automatically locate the root causes of incidents. The library provides a unified interface for multiple commonly used RCA models, covering both graph construction and scoring tasks. PyRCA is intended to serve as a one-stop RCA solution for IT operations staff, data scientists, and researchers, enabling them to develop, evaluate, and deploy RCA models to real-world applications quickly and efficiently.
With more and more internet applications being deployed on the cloud, ensuring the quality of cloud systems and user experience has become increasingly crucial. Incidents in these systems can lead to a poor user experience and significant economic loss. To address this issue, one effective approach is to build a monitoring system that collects and tracks Key Performance Indicators (KPIs) from running applications. Any anomalies in these KPI metrics can then be treated as incidents. When an incident occurs, engineers typically gather all related metrics and investigate their behaviors to identify the root cause or clues for further diagnosis.
Figure 1. An example of an e-commerce system, which includes web service, payment service, delivery service, shopping service and database service. The left figure illustrates the dependencies between each service, while the right figure displays the causal graph of response time for each service, obtained by inverting the dependency graph.
For instance, in an e-commerce system as illustrated in Figure 1, if the response time of the web service significantly increases, engineers would investigate the response time of upstream services in the causal graph. If the response time of a database service is high, it may suggest that the prolonged response time of the web service is caused by the database service, and the incident can be mitigated by restarting the database server. However, modern cloud systems often comprise a large number of components connected through complex dependencies, running in a distributed environment. With thousands or more KPI metrics to explore for each incident, manually checking all potentially relevant metrics can be time-consuming, labor-intensive, and error-prone for engineers. Therefore, an automated RCA toolbox is highly desirable.
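To make the Figure 1 idea concrete, here is a minimal sketch of how a dependency graph can be inverted into a causal graph and then searched for candidate root causes. The service names follow the example above, but the exact edges are an assumption on our part (the figure itself defines the real ones), and the helper names `invert` and `causal_ancestors` are hypothetical, not part of PyRCA.

```python
from collections import deque

# Assumed dependency graph: service -> services it calls.
DEPENDS_ON = {
    "web": ["payment", "delivery", "shopping"],
    "payment": ["database"],
    "delivery": ["database"],
    "shopping": ["database"],
    "database": [],
}


def invert(graph):
    """Reverse every edge: 'A depends on B' becomes 'B causally affects A'."""
    inverted = {node: [] for node in graph}
    for caller, callees in graph.items():
        for callee in callees:
            inverted[callee].append(caller)
    return inverted


def causal_ancestors(causal_graph, target):
    """Return every node with a directed path to `target`, nearest first."""
    # Build a parent lookup (effect -> direct causes) from the causal graph.
    parents = {node: [] for node in causal_graph}
    for cause, effects in causal_graph.items():
        for effect in effects:
            parents[effect].append(cause)
    # Breadth-first search backwards from the target metric.
    seen, order, queue = {target}, [], deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


CAUSAL_GRAPH = invert(DEPENDS_ON)
CANDIDATES = causal_ancestors(CAUSAL_GRAPH, "web")
print(CANDIDATES)
```

With only five services this search is trivial to do by hand. The point of automating it is that the same traversal still works when the graph has thousands of metrics.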
We have developed PyRCA, a Python library for root cause analysis, to serve both industrial and academic use cases. PyRCA is the first open-source RCA library that provides an end-to-end framework covering data loading, causal graph discovery, root cause localization, and visualization of RCA results. It supports multiple causal graph construction and root cause scoring models. It also comes with a GUI dashboard for conducting RCA interactively, which better matches how RCA is carried out in real-world scenarios. PyRCA's key features are:
We will continue improving PyRCA in the future to make it more comprehensive and easier to use in real-world applications.
Figure 2. The framework of RCA. The left figure shows a production system containing a large number of services with complex interdependencies. The cloud monitoring system collects details of each service request and status in a streaming way. The right figure demonstrates the RCA pipeline.
In Figure 2, we demonstrate an example of how RCA works when integrated with a production system. A typical production system comprises multiple services with complex inter-dependencies. To ensure the reliability of the entire system, a monitoring system is integrated to periodically collect various measures that track the health of each service. When the anomaly detection module detects anomalous metrics, they usually indicate a failure in the corresponding service, which can severely impact user experience. The anomaly detection module then automatically triggers the root cause localization task for the RCA module.
The RCA module leverages multiple metrics from the production system and expert knowledge to construct a causal graph. By considering the anomalous metrics and their dependencies in the graph, the RCA module can calculate root cause scores and present the results to site reliability engineers to assist them in subsequent remediation actions. In summary, given the anomalous metrics, the RCA objective is to localize the top-K metrics that are most likely to be their root cause.
Figure 3. The main architecture of PyRCA.
PyRCA's design principles ensure that the library is flexible, extensible, and easy to use. It provides a unified framework for RCA, allowing users to apply multiple models and visualize results. Models can be customized through configuration files, and the library can be extended with new graph construction and scoring methods. The interactive dashboard also makes it easier to incorporate expert knowledge and present results, making PyRCA accessible to a wide range of users.
Figure 4. Example of YAML file for configurable expert knowledge.
The PyRCA library API consists of three main components. First, the input layer loads metric data in the 'pandas.DataFrame' format and parses expert knowledge from a configuration file in YAML format; an example of such a file is shown in Figure 4. Second, the model layer implements a wide range of models for anomaly detection, causal graph construction, and root cause scoring. Lastly, the output layer supports visualization of causal graphs and evaluation of root cause analysis results. It also includes a data simulation tool for empirical analysis. Together, these three components provide users with a comprehensive and flexible RCA framework that is easy to use and customize.
Figure 5. The taxonomy of RCA models.
A common type of RCA model involves two steps. The first step is to construct causal graphs based on observed metrics and domain knowledge, while the second step is to extract anomalous subgraphs or paths based on observed anomalies. Typically, these causal graphs can be reconstructed from the topology of a specific application, which is obtained from log analysis and trace analysis. However, when service or call graphs are not available or only partially available, constructing the topology graph of the production system can be challenging. In such cases, causal discovery models can be useful in constructing the causal graph that describes the causal relationships between the observed metrics in a data-driven way. This approach is particularly useful when investigating the relationships between monitored metrics rather than API calls.
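As a rough illustration of the second step, the sketch below takes a causal graph that is assumed to be already built and acyclic, plus a set of anomalous metrics, and extracts the anomalous paths leading into an alerting metric. It uses a simple heuristic (an anomalous node with no anomalous direct cause is a root cause candidate). This is a simplified stand-in for illustration, not one of PyRCA's scoring models, and the graph, metric names, and function names are hypothetical.

```python
# Assumed causal graph, stored as effect -> list of direct causes.
PARENTS = {
    "web": ["payment", "shopping"],
    "payment": ["database"],
    "shopping": ["database"],
    "database": [],
}

# Metrics currently flagged by an anomaly detector (assumed input).
ANOMALOUS = {"web", "payment", "database"}


def root_cause_candidates(parents, anomalous):
    """Anomalous nodes none of whose direct causes are anomalous."""
    return sorted(
        node
        for node in anomalous
        if not any(p in anomalous for p in parents.get(node, []))
    )


def anomalous_paths(parents, anomalous, start):
    """Cause-to-effect paths ending at `start` that visit only anomalous nodes."""
    paths = []

    def walk(node, suffix):
        bad_parents = [p for p in parents.get(node, []) if p in anomalous]
        if not bad_parents:
            # No anomalous cause upstream: the path starts here.
            paths.append([node] + suffix)
            return
        for parent in bad_parents:
            walk(parent, [node] + suffix)

    walk(start, [])
    return paths


ROOTS = root_cause_candidates(PARENTS, ANOMALOUS)
PATHS = anomalous_paths(PARENTS, ANOMALOUS, "web")
print(ROOTS, PATHS)
```

Here the shopping service is healthy, so the only anomalous path into the web service runs through the payment service back to the database.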
Although two-phase RCA models, which involve constructing a causal graph followed by extracting anomalous subgraphs or paths, offer powerful explainability, the runtime of causal graph construction algorithms can be a limiting factor. In the worst-case scenario, the runtime can be exponential in the number of variables (nodes), which can hinder their application in real-world scenarios. On the other hand, one-phase RCA models directly handle normal and abnormal data to output the root causes and have the ability to efficiently handle thousands or even millions of metrics. We show the comparison of these two types of models in Figure 5.
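To give a feel for the one-phase style, here is a minimal sketch that skips graph construction entirely: it compares each metric's incident window against its normal baseline and ranks metrics by a z-score-like shift. This is a deliberately simple stand-in chosen for illustration, not a specific PyRCA model, and the metric names, sample values, and `rank_root_causes` function are assumptions.

```python
from statistics import mean, pstdev


def rank_root_causes(normal, abnormal, k=3):
    """Rank metrics by how far the incident-window mean drifts from the baseline."""
    scores = {}
    for name, baseline in normal.items():
        mu = mean(baseline)
        sigma = pstdev(baseline) or 1e-9  # avoid division by zero for flat metrics
        scores[name] = abs(mean(abnormal[name]) - mu) / sigma
    # Highest score first, truncated to the top-k metrics.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical samples collected before and during an incident.
NORMAL = {
    "db_latency": [10, 12, 11, 9, 10],
    "web_latency": [100, 102, 98, 101, 99],
    "cpu": [50, 52, 48, 51, 49],
}
ABNORMAL = {
    "db_latency": [40, 42, 41],
    "web_latency": [110, 112, 111],
    "cpu": [50, 51, 49],
}

TOP = rank_root_causes(NORMAL, ABNORMAL, k=2)
print([name for name, _ in TOP])
```

Each metric is scored independently in a single pass over the data, which is why this family of methods scales to very large numbers of metrics. The trade-off is that the output is a ranking without the causal explanation a graph provides.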
Figure 6. The interactive dashboard of PyRCA (Data Analysis Tab)
Figure 7. The interactive dashboard of PyRCA (Causal Graph Discovery Tab)
PyRCA offers a user-friendly dashboard app that can be launched by running ‘python -m pyrca.tools’. The app consists of several tabs, including "Data Analysis", shown in Figure 6. In this tab, users can upload their metric data and visualize all the metrics, along with basic statistics such as means and variances. The time series data should be in CSV format, where the first column is the timestamp and the other columns represent the metrics. Users can also adjust the hyperparameters of the stats-threshold-based anomaly detectors. PyRCA comes with a basic stats-based anomaly detector, pyrca.outliers.stats, which can be used to detect anomalous spikes in the data. If this detector is not suitable for a specific use case, users can explore the other anomaly detectors offered by Merlion.
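The expected CSV layout and the idea behind a stats-threshold detector can be sketched as follows. This is a standalone illustration using only the Python standard library. It does not reproduce the API or the exact rule of pyrca.outliers.stats, and the column names, sample values, and the `load_metrics` and `detect_spikes` helpers are hypothetical.

```python
import csv
import io
from statistics import mean, pstdev

# Hypothetical metric file: first column is the timestamp, the rest are metrics.
SAMPLE_CSV = """timestamp,web_latency,db_latency
2023-01-01 00:00,100,10
2023-01-01 00:01,101,11
2023-01-01 00:02,99,9
2023-01-01 00:03,100,10
2023-01-01 00:04,102,10
2023-01-01 00:05,98,11
2023-01-01 00:06,100,9
2023-01-01 00:07,101,10
2023-01-01 00:08,99,60
2023-01-01 00:09,100,10
"""


def load_metrics(text):
    """Parse CSV text into a list of timestamps and a dict of metric columns."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    timestamps, columns = [], {name: [] for name in header[1:]}
    for row in reader:
        if not row:
            continue  # skip blank lines
        timestamps.append(row[0])
        for name, value in zip(header[1:], row[1:]):
            columns[name].append(float(value))
    return timestamps, columns


def detect_spikes(values, n_sigma=3.0):
    """Return indices whose value lies more than n_sigma deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) > n_sigma * sigma]


TIMESTAMPS, METRICS = load_metrics(SAMPLE_CSV)
# The sample is short, so a looser threshold than the default is used here.
SPIKES = {name: detect_spikes(vals, n_sigma=2.0) for name, vals in METRICS.items()}
print(SPIKES)
```

The threshold plays the same role as the hyperparameter exposed in the dashboard: lowering it flags more points, raising it flags fewer.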
The "Causal Graph Discovery" tab, shown in Figure 7, allows users to construct causal graphs estimated from metric data. First, users upload the metric data and the optional domain knowledge file in YAML format. Then, they select the metric data to build the graph describing the dependency relationships between different metrics. Users can set the hyperparameters of the causal graph construction algorithm and the domain knowledge file path, and click the "Run" button to generate the initial version of the causal graph. Next, they can manually check for any missing or incorrect links in the generated graph. If the graph has errors, users can add additional constraints such as root/leaf nodes and required/forbidden links in the "Edit Domain Knowledge" card. After the new constraints are added, they can refine the causal graph by clicking the "Run" button again. If the causal graph is satisfactory, users can download and save it for future RCA model deployment.
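The effect of these constraints can be sketched in a few lines. In the example below, the four constraint types mentioned above (root nodes, leaf nodes, required links, forbidden links) are applied to a list of discovered edges as a post-processing filter. The dictionary keys, node names, and `apply_constraints` function are hypothetical and do not reproduce PyRCA's YAML schema. A real causal discovery algorithm would typically take such constraints into account during the search rather than afterwards.

```python
# Hypothetical domain knowledge, mirroring the four constraint types.
KNOWLEDGE = {
    "root_nodes": ["database"],  # nothing may point into these
    "leaf_nodes": ["web"],       # these may not point to anything
    "required": [("database", "payment")],
    "forbidden": [("payment", "delivery")],
}

# Hypothetical edges (cause, effect) from a first causal discovery run.
DISCOVERED = [
    ("payment", "web"),
    ("payment", "delivery"),
    ("web", "database"),
    ("delivery", "web"),
]


def apply_constraints(edges, knowledge):
    """Drop edges that violate the constraints, then add the required ones."""
    roots = set(knowledge.get("root_nodes", []))
    leaves = set(knowledge.get("leaf_nodes", []))
    forbidden = set(knowledge.get("forbidden", []))
    kept = {
        (src, dst)
        for src, dst in edges
        if (src, dst) not in forbidden
        and dst not in roots    # root nodes have no incoming links
        and src not in leaves   # leaf nodes have no outgoing links
    }
    kept.update(knowledge.get("required", []))
    return sorted(kept)


REFINED = apply_constraints(DISCOVERED, KNOWLEDGE)
print(REFINED)
```

This mirrors the edit-and-rerun loop in the dashboard: each added constraint removes or adds specific links, and the graph is regenerated until it matches what the engineer knows about the system.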
In real-world applications, causal discovery methods may struggle to produce accurate causal graphs because of data issues. The app addresses this with a user-friendly interface that allows causal graphs to be edited and revised interactively.
Salesforce AI invites you to dive deeper into the concepts discussed in this blog post (see links below). Connect with us on social media and our website to get regular updates on this and other research projects.
Chenghao Liu is a Senior Applied Scientist at Salesforce AI Asia, working on AIOps research, including time series forecasting, anomaly detection, and causal machine learning.
Wenzhuo Yang is a Lead Applied Researcher at Salesforce AI Asia, working on AIOps research and applied machine learning research, including causal machine learning, explainable AI, and recommender systems.
Doyen Sahoo is a Senior Manager, Salesforce AI Asia. Doyen leads several projects pertaining to AI for IT Operations or AIOps, working on both fundamental and applied research.
Steven HOI is the Managing Director of Salesforce Research Asia and oversees Salesforce's AI research and development activities in APAC. His research interests include machine learning and a broad range of AI applications.