BIRCH for Anomaly Detection with InfluxDB
By
Anais Dotis-Georgiou /
Product, Use Cases, Developer
Jul 10, 2020
In this tutorial, we’ll use the BIRCH (balanced iterative reducing and clustering using hierarchies) algorithm from scikit-learn with the ADTK (Anomaly Detection Tool Kit) package to detect anomalous CPU behavior. We’ll use the InfluxDB 2.0 Python Client to query our data in InfluxDB 2.0 and return it as a Pandas DataFrame.
This tutorial assumes that you have InfluxDB and Telegraf installed and configured on your local machine to gather CPU stats. To easily gather system stats on your local machine, install InfluxDB and automatically configure Telegraf to add the System plugin.
We recommend running the code in this blog inside a virtual environment with Python 3.6+. The requirements.txt for this project looks like:
adtk==0.6.2
pandas==0.23.4
scikit-learn==0.23.1
You'll also need the influxdb-client package, which provides the InfluxDB 2.0 Python Client.
A brief explanation of BIRCH
BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised clustering algorithm optimized for high performance on large datasets. It's also good at reducing noise in the dataset to find meaningful patterns and produce accurate models. Like the more popular k-means clustering algorithm, it groups points around centroids, but it first compresses the data into a tree of compact summaries (clustering features), so it needs only a single pass over the data.
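To get a feel for how BIRCH behaves before pointing it at CPU data, here is a minimal sketch on synthetic points. The data, the threshold value, and the variable names are assumptions made up for illustration; only the Birch class and its threshold and n_clusters parameters come from scikit-learn.

```python
import numpy as np
from sklearn.cluster import Birch

# Synthetic data (an assumption for illustration): two tight blobs plus two stray points.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(100, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(100, 2))
strays = np.array([[10.0, -10.0], [10.2, -9.8]])
points = np.vstack([blob_a, blob_b, strays])

# threshold caps the radius of each subcluster in the tree of summaries;
# n_clusters sets how many final clusters the subclusters are merged into.
model = Birch(threshold=0.5, n_clusters=3)
labels = model.fit_predict(points)

# Count how many points landed in each cluster.
cluster_ids, counts = np.unique(labels, return_counts=True)
print(dict(zip(cluster_ids.tolist(), counts.tolist())))
```

The stray points end up in a cluster of their own that is much smaller than the other two, which is the property the rest of this tutorial relies on.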
An introduction to ADTK and scikit-learn
ADTK (Anomaly Detection Tool Kit) is a Python package for unsupervised anomaly detection for time series data. According to the documentation, “This package offers a set of common detectors, transformers and aggregators with unified APIs, as well as pipe classes that connect them together into a model. It also provides some functions to process and visualize time series and anomaly events.” I enjoy this package a lot because it’s easy to use, well-documented, and the modules are efficient and lightweight.
Steps to use BIRCH for time series anomaly detection
Step One: Import dependencies
import pandas as pd
from influxdb_client import InfluxDBClient
from adtk.detector import MinClusterDetector
from adtk.visualization import plot
from sklearn.cluster import Birch
Step Two: Use the InfluxDB 2.0 Python Client to query the data and return a DataFrame
- Gather the authorization and query parameters (tokens, organizations, buckets) and store them in variables.
- Create a Flux query to gather CPU data from a local machine. The Flux query uses pivot() and drop() to transform our data into the right shape. Of course, this data transformation can also be performed with Pandas.
- Pass those variables into the client object and instantiate the client.
- Use the query_data_frame() method to return our data as a Pandas DataFrame.
token = "<your token>"
org = "<your organization>"
# A local InfluxDB 2.0 instance serves plain HTTP unless TLS has been configured
client = InfluxDBClient(url="http://localhost:9999", token=token, org=org)
query = '''from(bucket: "your-bucket")
|> range(start: 2020-06-18T18:00:00Z , stop: 2020-06-20T02:00:00Z)
|> filter(fn: (r) => r["_measurement"] == "cpu")
|> filter(fn: (r) => r["_field"] == "usage_system")
|> pivot(rowKey:["_time"], columnKey: ["cpu"], valueColumn: "_value")
|> drop(columns:["_start", "_stop", "host", "_field", "_measurement"])'''
query_api = client.query_api()
df = query_api.query_data_frame(query)
df.head()
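As mentioned in the steps above, the pivot() and drop() work can be done in Pandas instead of Flux. The following is a rough sketch of that alternative, not the approach used in this tutorial. The sample rows, host name, and variable names are hypothetical stand-ins for what the query would return without the last two Flux lines.

```python
import pandas as pd

# Hypothetical long-format rows, shaped like Flux output before pivot():
# one row per (timestamp, cpu) pair.
long_df = pd.DataFrame({
    "_time": pd.to_datetime([
        "2020-06-18T18:00:00Z", "2020-06-18T18:00:00Z",
        "2020-06-18T18:00:10Z", "2020-06-18T18:00:10Z",
    ]),
    "cpu": ["cpu0", "cpu1", "cpu0", "cpu1"],
    "_value": [1.2, 0.8, 1.5, 0.7],
    "_field": ["usage_system"] * 4,
    "_measurement": ["cpu"] * 4,
    "host": ["my-host"] * 4,
})

# Rough equivalent of pivot(rowKey: ["_time"], columnKey: ["cpu"], valueColumn: "_value").
# Selecting only "_value" leaves the other columns behind, much like drop().
wide_df = long_df.pivot(index="_time", columns="cpu", values="_value")
wide_df.columns.name = None  # remove the leftover "cpu" label on the column axis
print(wide_df)
```

One difference worth noting: this version already has the timestamps as the index, whereas the Flux version returns _time as an ordinary column.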
Step Three: Transform and prepare the data
To prepare the DataFrame, `df`, for consumption by the ADTK MinClusterDetector, perform the remaining data transformations:
- Convert the time column to a datetime object and make it the index.
- Drop any extraneous columns.
- Use head() to return the first five rows of the transformed DataFrame.
df["_time"] = pd.to_datetime(df["_time"].astype(str))
df = df.drop(columns=["result", "table"])
df = df.set_index("_time")
df.head()
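Before handing the DataFrame to a detector, it can help to confirm that the index really is a sorted, duplicate-free datetime index and to see how many values are missing. The helper below is a hypothetical sketch using plain Pandas; the function name and the sample data are assumptions, not part of the original tutorial.

```python
import pandas as pd

def check_time_index(frame: pd.DataFrame) -> dict:
    """Summarize index and value properties that commonly trip up time series tools."""
    return {
        "is_datetime_index": isinstance(frame.index, pd.DatetimeIndex),
        "is_sorted": bool(frame.index.is_monotonic_increasing),
        "has_duplicate_timestamps": bool(frame.index.has_duplicates),
        "missing_values": int(frame.isna().sum().sum()),
    }

# Hypothetical stand-in for the prepared DataFrame from this step.
sample = pd.DataFrame(
    {"cpu0": [1.2, 1.5, None], "cpu1": [0.8, 0.7, 0.9]},
    index=pd.to_datetime([
        "2020-06-18 18:00:00", "2020-06-18 18:00:10", "2020-06-18 18:00:20",
    ]),
)
report = check_time_index(sample)
print(report)
```

ADTK also ships a validate_series helper in adtk.data that performs similar checks on the index, which may be a better fit if you are already using the package.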
Step Four: Use the ADTK MinClusterDetector function to apply sklearn's BIRCH algorithm on our dataframe
- Instantiate MinClusterDetector with our desired scikit-learn clustering algorithm, BIRCH, and specify the number of clusters, n_clusters=10.
- According to the docs, the MinClusterDetector "function treats multivariate time series as independent points in a high-dimensional space, divides them into clusters, and identifies values in the smallest cluster as anomalous. This may help capture outliers in high-dimensional space".
  Please note:
  - You can pass any scikit-learn clustering algorithm as the model into MinClusterDetector as long as it has a fit and a predict method.
  - The cluster number value was chosen somewhat arbitrarily for this tutorial. The success of the ADTK BIRCH model was examined through visual inspection for various n_clusters values. The model appeared to fit the data and flag anomalies successfully at n_clusters=10.
  - Because BIRCH is an unsupervised learning technique, optimizing the number of clusters requires an analysis of how changes in cluster number affect anomaly detection accuracy. This analysis requires obtaining a labeled dataset to measure the accuracy of our model against changes in cluster number. I didn't have a labeled anomalous CPU stats dataset, so I didn't perform this analysis. However, choosing the number of clusters is both critical to employing clustering algorithms effectively and fairly simple. To learn more about cluster number selection, please take a look at this article.
- Use the fit_detect() method to detect anomalies and store the resulting series in a variable.
- Plot the DataFrame along with the anomalies with ADTK's plot function. Specify the plot attributes.
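To make the smallest-cluster idea from the docs concrete, here is a minimal sketch of that logic written directly against scikit-learn. It is not ADTK's source code: the function name, the synthetic data, and the threshold value are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import Birch

def smallest_cluster_flags(frame: pd.DataFrame, model) -> pd.Series:
    """Flag rows that fall in the least-populated cluster (illustrative sketch)."""
    model.fit(frame.values)
    labels = pd.Series(model.predict(frame.values), index=frame.index)
    smallest = labels.value_counts().idxmin()  # label of the rarest cluster
    return labels == smallest

# Hypothetical data: 60 ordinary readings and 3 unusual ones across two CPUs.
rng = np.random.default_rng(1)
values = rng.normal(loc=1.0, scale=0.05, size=(63, 2))
values[30:33] = [[4.0, 0.1], [4.1, 0.1], [3.9, 0.2]]
index = pd.date_range("2020-06-18 18:00:00", periods=63, freq=pd.Timedelta(seconds=10))
frame = pd.DataFrame(values, columns=["cpu0", "cpu1"], index=index)

flags = smallest_cluster_flags(frame, Birch(threshold=0.2, n_clusters=2))
print(flags[flags].index.tolist())  # timestamps flagged as anomalous
```

Each row is treated as one point whose coordinates are the per-CPU values at that timestamp, so a row is flagged when the CPUs jointly look unusual. ADTK packages this kind of logic behind fit_detect(), so the tutorial's own code is just the lines that follow.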
min_cluster_detector = MinClusterDetector(Birch(n_clusters=10))
anomalies = min_cluster_detector.fit_detect(df)
plot(df, anomaly=anomalies, ts_linewidth=1, ts_markersize=3, anomaly_color='red', anomaly_alpha=0.3, curve_group='all')
Figure: Applying BIRCH to a DataFrame to model time series anomalies across CPUs
Together, MinClusterDetector and BIRCH detect anomalies around 06-18 20:00 and 06-19 18:00 (month-day hour). Let's zoom in on some of the anomalies.
# anomaly_alpha is an opacity, so it must lie between 0 and 1
plot(df[500:525], anomaly=anomalies[500:525], ts_linewidth=2, ts_markersize=4, anomaly_color='red', anomaly_alpha=0.3, curve_group='all')
Figure: A closer look at the anomalies detected by BIRCH and ADTK
Here, cpu0 behaves differently from the rest of the processors. Specifically, cpu0 (orange) trends upward while the other processors trend downward.
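Returning to the choice of n_clusters: without labeled anomalies, one common label-free heuristic is to compare silhouette scores across candidate values. This is a different check from the accuracy analysis described earlier, and a good silhouette score does not guarantee good anomaly detection. The data, the candidate range, and the threshold in this sketch are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated groups of 2-D points.
rng = np.random.default_rng(2)
centers = [[0.0, 0.0], [4.0, 4.0], [8.0, 0.0]]
points = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 2)) for c in centers])

# Score each candidate cluster count; higher silhouette means tighter, better-separated clusters.
scores = {}
for k in range(2, 7):
    labels = Birch(threshold=0.1, n_clusters=k).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real CPU data the scores are unlikely to be this clear-cut, so treat the result as one input alongside visual inspection rather than a final answer.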
Conclusion on incorporating BIRCH anomaly detection with InfluxDB
While this tutorial focused on using MinClusterDetector, the ADTK package has several other effective and lightweight detectors for anomaly detection. I encourage you to review them and find those that fit your time series use case. Naturally, you might be wondering: "How can I apply the ADTK package to my time series data in a continuous fashion with InfluxDB?"
Your options include:
- Using the Telegraf execd processor plugin to continuously run an external Python program and pipe metrics into it
- Writing an InfluxDB task and using the http.post() Flux function to trigger the execution of a Python script on a set schedule
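For the first option, the external program receives metrics on standard input and writes processed metrics to standard output. Below is a minimal sketch of what such a program's core might look like, assuming the metrics arrive as line protocol, one record per line. The function names and the is_anomalous callback are hypothetical, and the parsing is deliberately simplified.

```python
import io

def add_field(line: str, key: str, value: str) -> str:
    """Append key=value to the field set of one line protocol record.

    Simplifying assumption: no escaped spaces in tags or string fields,
    so the record splits cleanly into "measurement,tags fields timestamp".
    """
    parts = line.strip().split(" ")
    if len(parts) < 2:
        return line.strip()  # not a record this sketch understands; pass it through
    parts[1] = f"{parts[1]},{key}={value}"
    return " ".join(parts)

def process(stream_in, stream_out, is_anomalous) -> None:
    """Read records line by line, add an anomaly flag, and write them back out."""
    for line in stream_in:
        if not line.strip():
            continue
        flag = "true" if is_anomalous(line) else "false"
        stream_out.write(add_field(line, "anomaly", flag) + "\n")
        stream_out.flush()  # emit each metric promptly rather than buffering

# Demo with in-memory streams and a placeholder detector that flags nothing.
demo_in = io.StringIO("cpu,cpu=cpu0 usage_system=1.5 1592503200000000000\n")
demo_out = io.StringIO()
process(demo_in, demo_out, is_anomalous=lambda record: False)
print(demo_out.getvalue(), end="")
```

In a real script you would call process(sys.stdin, sys.stdout, ...) and replace the placeholder callback with a detector, for example one that keeps a rolling window of recent values and runs an ADTK model over it.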
I hope this tutorial inspires you to integrate ADTK and anomaly detection into your InfluxDB-powered solution. You can share your thoughts, concerns, or questions in the comments section, on our community site, or in our Slack channel. We’d love to get your feedback and help you with any problems you run into!