Time Series Forecasting with Python and Facebook Kats
Oct 12, 2022
This article was written by Vidhi Chugh. Scroll down for the author’s bio.
Time series analysis is the study of a sequence of data points and records that are collected over a constant period. The analysis indicates how a variable or a group of variables has changed and helps in discovering underlying trends and patterns.
Time series data is generally used for forecasting problems by predicting the likelihood of future data based on historical information. Weather forecasts, stock price predictions, and industry growth forecasts are some of the most popular applications of time series analysis.
Recent advancements in machine learning algorithms, like long short-term memory (LSTM) and Prophet, have led to significant improvements in forecast accuracy.
In this article, you’ll learn how to use InfluxDB to store and access time series data with its Python API and analyze it using the Facebook Kats library.
What is the Facebook Kats Toolkit?
There are a number of Python libraries used for analyzing data, including sktime, Prophet, Facebook Kats, and Darts. This post will focus on the Kats library because it’s a lightweight and easy-to-use framework frequently used for analyzing time series data. It offers various functionalities, including the following:
- Forecasting: The Kats library provides a range of tools, including forecasting algorithms, ensembles, a meta-learning algorithm with hyperparameter tuning, backtesting, and empirical prediction intervals.
- Detection: It detects patterns like trends, seasonalities, anomalies, and change points (a short sketch of the detection API follows this list).
- Auto feature engineering and embedding: The tsfeatures module in Kats autogenerates time series features for supervised learning algorithms.
- Utilities: The Kats library provides time series simulators for learning and experimentation.
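As a quick taste of the detection and feature-extraction functionality, here is a minimal sketch that simulates a series with a level shift, runs Kats’ CUSUMDetector on it, and extracts features with TsFeatures. The toy data is illustrative, and exact return types may vary slightly between Kats versions:

import numpy as np
import pandas as pd
from kats.consts import TimeSeriesData
from kats.detectors.cusum_detection import CUSUMDetector
from kats.tsfeatures.tsfeatures import TsFeatures

# Build a toy daily series whose mean jumps halfway through (illustrative data)
times = pd.date_range("2021-01-01", periods=60, freq="D")
values = np.concatenate([np.random.normal(10, 1, 30), np.random.normal(20, 1, 30)])
ts = TimeSeriesData(pd.DataFrame({"time": times, "value": values}))

# Change point detection: CUSUMDetector flags where the mean shifts
change_points = CUSUMDetector(ts).detector()
print(change_points)

# Automatic feature extraction for downstream supervised models
features = TsFeatures().transform(ts)
print(features)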
What is InfluxDB?
Time series analysis requires a database suitable for storing and retrieving data effectively and efficiently. Here, you’ll use InfluxDB, one of the leading platforms for building time series applications. It’s a high-performance engine that is open source, has a vast community, and is easy to use. In addition, it can be hosted locally or in the cloud.
Implementing time series forecasting with Python and Facebook Kats
In the following tutorial, you’ll work through a simple forecasting example, using InfluxDB to store the data and Facebook Kats to analyze it.
Prerequisites
Before you begin, you should have a basic understanding of Python syntax and command line/terminal commands.
All the code for this tutorial is available in this GitHub repository.
Connect to an InfluxDB Cloud instance
To get started with the InfluxDB Cloud instance, visit InfluxData’s website and click on Get InfluxDB in the upper-right-hand corner:
Select Use it for Free for the cloud-only account interface:
Then you’ll be taken to a sign-up page, where you can sign up by entering the necessary information:
You need to select a cloud service provider (Amazon Web Services (AWS) was selected here) to store your InfluxDB time series data. You don’t need to be familiar with any of these services, as InfluxDB abstracts away the underlying complexities. Once you’ve selected your cloud provider, add your company name and agree to the terms:
Now you’re ready to begin setting up your database. Choose the plan of your preference. In this instance, a free plan is sufficient:
After selecting your plan, you’ll be taken to a Get Started screen that lists a number of programming languages. Select Python for this demo:
On the next page, you can watch a video that shows you how to set up InfluxData. Click Next once you’re done:
Set up your local machine
To access InfluxDB through Python, you need to install the influxdb-client library on your machine. You can do this by running the following command in the terminal:
pip3 install influxdb-client
Please note: pip3 installs libraries for Python 3.x, which is what this tutorial uses.
Generate your API token from the web interface by navigating to API Tokens > Generate API Token > All Access API Token. You’ll be using an All Access API Token for this tutorial, though you can also choose to generate a Custom API Token if you want to choose the authentication level of the user.
Run the following command in the terminal or command line to add your token as an environment variable:
export INFLUXDB_TOKEN="your token"
The token is kept out of the code as an environment variable so that the code can be shared across teams without exposing credentials.
From here, you’ll be using a Python IDE or a Jupyter Notebook to write and read data to InfluxDB.
Write data to InfluxDB
To write data to InfluxDB, you’ll need access to some data. Here, you’ll use the Air Passengers data set, a classic data set of monthly airline passenger counts from 1949 to 1960.
To begin, install the pandas library using the following command:
pip3 install pandas
Create a file named writePassengerData.py and paste in the following code. Make sure to put AirPassengers.csv in the same directory:
writePassengerData.py
import pandas as pd
import influxdb_client, os, time
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

# Read the token from the INFLUXDB_TOKEN environment variable set earlier
token = os.environ.get("INFLUXDB_TOKEN")
org = "your influx db org name"
url = "your influx db custom url"
bucket = "your influx bucket name"

client = influxdb_client.InfluxDBClient(url=url, token=token, org=org)
write_api = client.write_api(write_options=SYNCHRONOUS)

df = pd.read_csv('AirPassengers.csv')

# Write each CSV row as a point in the "passengers" measurement
for i in df.index:
    point = (
        Point("passengers")
        .tag("month", df.iloc[i, 0])
        .field("passengers", df.iloc[i, 1])
    )
    write_api.write(bucket=bucket, org=org, record=point)
    time.sleep(1)  # separate points by 1 second
In the previous code block, make sure the INFLUXDB_TOKEN environment variable holds the token generated on the cloud platform, and fill in the org name entered earlier. Here, you import the necessary libraries, such as pandas for reading CSV data and influxdb_client for writing data to the InfluxDB Cloud instance. Then you declare string variables to hold information like the token, org, URL, and bucket. Next, you instantiate the client using InfluxDBClient() and activate the write API using the write_api() method. The CSV file is read using pandas.read_csv and stored in a data frame object. Then you iterate over each row in the data frame, create a temporary point object, and write the point object as a record to InfluxDB.
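As a side note, writing one point per second keeps the auto-assigned timestamps distinct, but it is slow for larger data sets. A hypothetical faster variant, sketched below, stamps each point explicitly from the CSV’s month column and writes the whole batch in one call. Keep in mind that with explicit 1949–1960 timestamps, the Flux query used later in this tutorial would need a much wider range (for example, range(start: 0)) to find the data:

from datetime import datetime

# Build all points up front, stamping each with the month from the CSV
# (assumes the first column holds strings like "1949-03")
points = []
for i in df.index:
    ts = datetime.strptime(df.iloc[i, 0], "%Y-%m")
    points.append(
        Point("passengers")
        .tag("month", df.iloc[i, 0])
        .field("passengers", int(df.iloc[i, 1]))
        .time(ts, WritePrecision.S)
    )

# One batched write instead of one call (plus a sleep) per point
write_api.write(bucket=bucket, org=org, record=points)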
Next, run the file from your terminal:
python3 writePassengerData.py
Verify that the data was written to your InfluxDB Cloud bucket by going to InfluxDB Cloud and clicking on Buckets. Then click on your bucket and verify that the measurement named passengers is available:
Read data from InfluxDB
To read data from InfluxDB, create a file named timeSeriesAnalysis.ipynb and include the following code in it:
timeSeriesAnalysis.ipynb
import influxdb_client, os, time
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

# Read the token from the INFLUXDB_TOKEN environment variable set earlier
token = os.environ.get("INFLUXDB_TOKEN")
org = "your influx db org name"
url = "your influx db custom url"

client = influxdb_client.InfluxDBClient(url=url, token=token, org=org)
query_api = client.query_api()

query = """from(bucket:"your influx db bucket name")
|> range(start: -1000m)
|> filter(fn: (r) => r._measurement == "passengers")
|> mean()"""

tables = query_api.query(query, org=org)

# Parse each record into a {'month': ..., 'passengers': ...} dictionary
results = []
for table in tables:
    for record in table.records:
        results.append({'month': record.values.get('month'), record.get_field(): record.get_value()})
In this code block, you reimport all the libraries (as demonstrated earlier). Next, you declare the variables that hold information like the token, org, and URL. Then you instantiate the client using InfluxDBClient() and activate the query API using the query_api() method.
Please note: In database terminology, querying is used as a synonym for reading.
Once you’ve activated the query API, you can add details like the bucket and measurement names to the Flux query. Using the query API, you fetch the data into the tables object, iterate over the tables, parse the required information, and append it to a list.
After running this code, you’ll get a list of dictionaries containing your data that looks like this:
[{'month': '1949-03', 'passengers': 132.0},
{'month': '1954-04', 'passengers': 227.0},
{'month': '1952-06', 'passengers': 218.0},
{'month': '1956-07', 'passengers': 413.0},
{'month': '1958-06', 'passengers': 435.0},
{'month': '1955-09', 'passengers': 312.0},
{'month': '1956-02', 'passengers': 277.0},
{'month': '1958-01', 'passengers': 340.0},
{'month': '1954-11', 'passengers': 203.0},
{'month': '1959-07', 'passengers': 548.0}]
Now it’s time to make this readable by converting it to a pandas DataFrame.
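If you’d rather skip the manual parsing loop, the Python client can also hand you a pandas DataFrame directly via query_api.query_data_frame(). Here’s a sketch; the exact result columns, such as _value and the month tag, depend on your Flux query:

# Run the same Flux query, but get a DataFrame back directly
raw_df = query_api.query_data_frame(query, org=org)
air_passengers_df = (
    raw_df[["month", "_value"]]
    .rename(columns={"month": "time", "_value": "value"})
    .sort_values("time")
    .reset_index(drop=True)
)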
Import libraries for time series forecasting
Now that you have read your data, it’s time to begin analyzing it.
In order to carry out the analysis, you need to install and import the following libraries:
pip3 install numpy
pip3 install kats
pip3 install statsmodels
pip3 install matplotlib
pandas, NumPy, and Matplotlib are used for data manipulation and visualization, while the warnings module ships with the Python standard library and doesn’t need to be installed. You also import the SARIMA, Holt-Winters, and Prophet models from Kats for time series analysis, along with TimeSeriesData, which converts a standard data frame to a time series object consumable by the Kats library:
timeSeriesAnalysis.ipynb
import pandas as pd
import numpy as np
import sys
import matplotlib.pyplot as plt
import warnings
import statsmodels.api as sm
from kats.models.sarima import SARIMAModel, SARIMAParams
from kats.models.holtwinters import HoltWintersParams, HoltWintersModel
from kats.models.prophet import ProphetModel, ProphetParams
from kats.consts import TimeSeriesData
Convert data to a time series format
To convert the data to a time series object, begin by converting the results list to a data frame. Sort the data frame values by month in ascending order, and rename the columns from “month” and “passengers” to “time” and “value,” respectively (Kats expects these column names). Finally, convert the data frame object to a TimeSeriesData object:
timeSeriesAnalysis.ipynb
air_passengers_df = pd.DataFrame(results)
air_passengers_df.sort_values('month', inplace=True)
air_passengers_df.columns = ["time", "value"]
air_passengers_ts = TimeSeriesData(air_passengers_df)
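A quick optional sanity check: TimeSeriesData parses the time column into pandas timestamps and exposes the series through its .time and .value attributes, and to_dataframe() converts it back for inspection:

# Confirm the conversion: timestamps should be parsed and values numeric
print(air_passengers_ts.time.head())
print(air_passengers_ts.value.head())
print(air_passengers_ts.to_dataframe().head())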
Check for stationarity
Now, it’s time to check whether the time series is stationary, which is a prerequisite for modeling time series data:
timeSeriesAnalysis.ipynb
plt.figure(figsize=(35,20))
fig = plt.plot(air_passengers_df['time'], air_passengers_df["value"])
plt.xticks(rotation=90)
plt.show()
The time series is non-stationary, as the mean and variance are not constant.
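One way to make the drifting mean and growing spread visible on the chart itself is to overlay rolling statistics. Here’s a short sketch, where the 12-month window is an assumption chosen to match the annual seasonality:

# Overlay a 12-month rolling mean and standard deviation
rolling = air_passengers_df["value"].rolling(window=12)
plt.figure(figsize=(35, 20))
plt.plot(air_passengers_df["time"], air_passengers_df["value"], label="observed")
plt.plot(air_passengers_df["time"], rolling.mean(), label="rolling mean")
plt.plot(air_passengers_df["time"], rolling.std(), label="rolling std")
plt.xticks(rotation=90)
plt.legend()
plt.show()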
You can confirm this with the Augmented Dickey-Fuller test:
timeSeriesAnalysis.ipynb
from statsmodels.tsa.stattools import adfuller
X = air_passengers_df["value"]
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
A high p-value suggests that we cannot establish stationarity for this time series, as you can see from the output of the ADF test:
ADF Statistic: 0.815369
p-value: 0.991880
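Since you’ll run this test again after differencing, a small helper that wraps adfuller keeps the interpretation consistent. The 5% significance threshold below is a conventional choice, not something mandated by the test:

def adf_report(series, alpha=0.05):
    """Run the Augmented Dickey-Fuller test and print a verdict."""
    stat, pvalue = adfuller(series.dropna())[:2]
    verdict = "stationary" if pvalue < alpha else "non-stationary"
    print(f"ADF statistic: {stat:.6f}, p-value: {pvalue:.6f} -> likely {verdict}")

adf_report(air_passengers_df["value"])          # raw series
adf_report(air_passengers_df["value"].diff())   # differenced series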
You can make the series stationary by differencing, as you can see when you run the following code:
timeSeriesAnalysis.ipynb
plt.figure(figsize=(35,20))
fig = plt.plot(air_passengers_df['time'], air_passengers_df["value"].diff())
plt.xticks(rotation=90)
plt.show()
As you can see, the Augmented Dickey-Fuller test on the differenced series gives better results:
timeSeriesAnalysis.ipynb
result = adfuller(X.diff()[1:])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
Your output will look like this:
ADF Statistic: -2.829267
p-value: 0.054213
Now that the series is close to stationary (the p-value of 0.054 sits just above the conventional 5% threshold), it’s time to plot the autocorrelation function (ACF) and partial autocorrelation function (PACF) charts.
ACF and PACF plots
ACF and PACF charts help determine the moving average (MA) lag (q) and the autoregressive (AR) lag (p):
timeSeriesAnalysis.ipynb
fig, ax = plt.subplots(2,1)
fig.set_figheight(15)
fig.set_figwidth(15)
fig = sm.graphics.tsa.plot_acf(air_passengers_df["value"].diff()[1:], lags=50, ax=ax[0])
fig = sm.graphics.tsa.plot_pacf(air_passengers_df["value"].diff()[1:], lags=50, ax=ax[1])
plt.show()
Based on the following PACF and ACF charts, p = 2 and q = 1 are good values to begin with. The hyperparameters can be tuned further using grid search and an out-of-bag (OOB) sample, as sketched below:
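For completeness, here is one minimal way such a grid search might look, using statsmodels’ SARIMAX with AIC as the selection criterion; the parameter ranges and the fixed seasonal order are assumptions for illustration (Kats also ships its own hyperparameter-tuning utilities):

import itertools

# AIC-based grid search over (p, d, q) with a fixed seasonal order
best_aic, best_order = float("inf"), None
for p, d, q in itertools.product(range(3), range(2), range(3)):
    try:
        fit = sm.tsa.statespace.SARIMAX(
            air_passengers_df["value"],
            order=(p, d, q),
            seasonal_order=(1, 0, 1, 12),
        ).fit(disp=False)
    except Exception:
        continue  # some combinations fail to converge
    if fit.aic < best_aic:
        best_aic, best_order = fit.aic, (p, d, q)

print(f"Best (p, d, q) by AIC: {best_order}, AIC = {best_aic:.2f}")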
SARIMA model
To train the SARIMA model and generate predictions, start by declaring the p, d, q, and m parameters, where d = 1 for linearly trending data and m is the seasonality index of twelve months (air travel seasonality). Instantiate a SARIMA model object using the training data and parameters, and fit the model. Then predict using the trained model, and plot the data and predictions as shown here:
timeSeriesAnalysis.ipynb
# Declare SARIMA parameters - use the ACF/PACF charts and grid search
params = SARIMAParams(p=2, d=1, q=1, seasonal_order=(1, 0, 1, 12), trend='ct')
# Train the SARIMA model
m = SARIMAModel(data=air_passengers_ts, params=params)
m.fit()
# Forecast the next 30 months
fcst = m.predict(steps=30, freq="MS")
# Visualize the predictions
m.plot()
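Before looking at the plot, it can help to inspect the raw forecast. Kats models return a DataFrame from predict(), typically with time, fcst, fcst_lower, and fcst_upper columns (exact column names may vary across Kats versions):

# Peek at the forecast values and prediction-interval bounds
print(fcst.head())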
Though the model is able to identify seasonality and autocorrelation, it does not recognize the increasing range in the data, leading to a high error variance:
Holt-Winters model
The Holt-Winters model overcomes this shortcoming of the SARIMA model by capturing the increasing range in seasonality, and it generates predictions with higher confidence:
timeSeriesAnalysis.ipynb
# Declare parameters for the Holt-Winters model
params = HoltWintersParams(trend="add", seasonal="mul", seasonal_periods=12)
# Fit a Holt-Winters model
hw_model = HoltWintersModel(data=air_passengers_ts, params=params)
hw_model.fit()
# Forecast the next 30 months
fcst = hw_model.predict(steps=30, alpha=0.1)
# Plot the predictions
hw_model.plot()
Prophet time series model
The Prophet time series model takes a decomposition-based approach and can improve the predictions further. The steps are similar to the two examples discussed previously:
timeSeriesAnalysis.ipynb
# Declare parameters for the Prophet model - choose between additive
# and multiplicative seasonality; multiplicative gives better results here
params = ProphetParams(seasonality_mode='multiplicative')
# Fit a Prophet model instance
model = ProphetModel(air_passengers_ts, params)
model.fit()
# Forecast the next 30 months
fcst = model.predict(steps=30, freq="MS")
# Visualize the predictions
model.plot()
The Prophet model’s predictions have the lowest error variance and, thus, the highest confidence.
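The comparisons above are visual. To attach numbers to them, Kats also ships backtesting utilities. The sketch below uses BackTesterSimple to compute the mean absolute percentage error (MAPE) on a holdout for the Prophet model; the 75/25 split is an arbitrary choice, and the same pattern applies to SARIMAModel and HoltWintersModel with their respective params:

from kats.utils.backtesters import BackTesterSimple

# Hold out the last 25% of the series and score forecasts with MAPE
backtester = BackTesterSimple(
    error_methods=["mape"],
    data=air_passengers_ts,
    params=ProphetParams(seasonality_mode="multiplicative"),
    train_percentage=75,
    test_percentage=25,
    model_class=ProphetModel,
)
backtester.run_backtest()
print(f"Prophet MAPE: {backtester.errors['mape']:.4f}")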
Conclusion
In this article, you learned about the significance of time series data and some of its key applications across multiple industries. You also learned how to set up an InfluxDB Cloud instance and write CSV data to it using the Python API. From there, you read data from InfluxDB for analysis with Facebook Kats. The article concluded with a step-by-step walkthrough of three popular time series algorithms, namely SARIMA, Holt-Winters, and Prophet, along with a comparison of their performance.
If you’re looking for a platform for building and operating time series applications, check out InfluxDB. InfluxDB is open source and empowers developers to build and deploy transformative monitoring, analytics, and IoT applications faster and to scale. The platform can handle massive volumes of time series data produced by networks, IoT devices, apps, and containers.
About the author
Vidhi Chugh is an award-winning AI/ML innovation leader and a leading expert in data governance with a vision to build trustworthy AI solutions. She works at the intersection of data science, product and research teams to deliver business value and insights at Walmart Global Tech India. Learn more on her LinkedIn page and Medium profile.