I have an awesome textbook on statistics. It covers most statistical things, but one thing you will not find in it is anything on time series. Time series are different, and that makes them really interesting to me. The x-axis is always time, and the y-axis is the thing (KPI) you are measuring. Because of that, you need different models to predict outcomes.

In these models, the y-axis values are compared to other values on the y-axis (the lag values). For example, this month’s revenue is compared to last month’s revenue and the revenue of the month before, going all the way back to the beginning of the dataset. The chosen comparisons are based on the statistically significant lags. Last month’s revenue might be statistically significant to the current month’s revenue, but revenue two months ago may not be (but revenue from three months ago might be).
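As a minimal sketch of what "lag values" means, the Python snippet below pairs each month with the value one and two months earlier. The revenue figures and the `lagged_pairs` helper are made up for illustration; they are not from any library.

```python
# Hypothetical monthly revenue in £K, oldest month first.
revenue = [80, 100, 80, 95, 110, 105]


def lagged_pairs(series, lag):
    """Pair each value with the value `lag` periods before it."""
    return [(series[t], series[t - lag]) for t in range(lag, len(series))]


# Each tuple is (current month, lagged month).
print(lagged_pairs(revenue, 1))  # each month next to last month
print(lagged_pairs(revenue, 2))  # each month next to two months ago
```

A time series model then asks which of these pairings carry a statistically significant relationship.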

In this post, I will give an overview of one type of time series model and describe how it works in simple language. In future posts, I will describe how to implement the models in BigQueryML and in R.

## ARIMA Models

Without further ado, let me introduce you to ARIMA models. The acronym stands for auto-regressive (AR) integrated (I) moving-average (MA). ARIMA models can be broken down into AR models and MA models. The two can also be combined into ARMA models, which leave out the integration (I).

ARIMA models are a form of linear regression, but they differ from other linear regression models because the x-axis is time. However, like other linear regression models, the y-axis has to hold values that can be aggregated, as opposed to classification models, where the y-axis cannot be aggregated.

Examples of y-axis values that can be aggregated are revenue, visitors, or email subscribers. An example of values in a classification model would be what percentage of visitors are from a given traffic source. In that case, the y-axis would be traffic sources. Just to make it clearer between the two, in a regression model you would like to know how revenue changes over time; in a classification model you would like to know how the traffic source mix changes over time. An ARIMA model works if you would like to know how revenue changes over time. It does not work if you would like to know how the traffic source mix changes over time.

There are machine learning packages for ARIMA models and they can do the bulk of the work for you. However, you should understand the basics of the models because it will help you to understand and statistically describe the underlying data (for example, is there white noise or seasonality in the data).

## Why You Should Understand Statistics

For years, I aggregated, summarized and analyzed data without statistically understanding it. This is what I suspect most Data Analysts do. They don’t describe the data, they report it. That means they don’t understand correlations between the variables, white noise, seasonality, or trending. This is before you even get to formulating prediction models.

In my humble opinion, Data Analysts should understand statistics so they can at least understand and statistically describe the data. Besides, stats is awesome and powerful!

## Stationarity

The first step in using an ARIMA model is to check if the data is stationary. Stationary processes are easier to analyze than non-stationary processes. A stationary process looks flat over time with no discernible trend. Statistically, a stationary process has a constant mean and variance (or standard deviation) and no seasonality.

It is likely that most of the time series datasets that we would be interested in, like revenue by month, would not be stationary. What do we do? We have to use a statistical method to transform the data into a stationary process.

One statistical method to transform the data is differencing. Differencing is the process of taking the value for this time period and subtracting the value from a previous time period (the lag value). Using the difference will create a more stationary process as the outcome.

For example, if we’re looking at revenue by month, then subtracting last month from this month would give you the value that you would plot. If revenue last month was £80K and this month is £100K, then the difference is £20K. Let’s say that the next month is again £80K. Then the value that you would plot would be -£20K.

If you were to plot the above, you would likely find that the mean would be the same, regardless of whether you took just a section (local) or the entire dataset (global).
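Here is a minimal Python sketch of differencing, using the three months from the example plus a made-up trending series. The `difference` helper is my own name for illustration.

```python
def difference(series, lag=1):
    """Subtract the value `lag` periods back from each value."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# The three months from the example above, in £K.
revenue = [80, 100, 80]
print(difference(revenue))  # [20, -20]

# A made-up series with a steady upward trend.
trending = [10, 12, 14, 16, 18, 20]
diffs = difference(trending)
print(diffs)  # [2, 2, 2, 2, 2]

# The mean of the differenced series is the same locally and globally.
print(sum(diffs[:2]) / 2, sum(diffs) / len(diffs))  # 2.0 2.0
```

The original trending series has a mean that keeps climbing, while the differenced series sits at a constant level.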

There are other methods to achieve stationarity, like subtracting the rolling mean or applying transformations like the log or square root.
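As a rough illustration of why a log transformation can help, the sketch below uses made-up revenue that grows by 10% every month. Plain differences keep getting bigger, but the differences of the logged values are constant.

```python
import math

# Made-up revenue that grows 10% every month.
revenue = [100 * 1.1 ** month for month in range(6)]

# Plain differences keep growing, so their mean is not constant.
plain_diffs = [revenue[t] - revenue[t - 1] for t in range(1, len(revenue))]

# Taking the log first turns constant percentage growth into constant steps.
logged = [math.log(value) for value in revenue]
log_diffs = [logged[t] - logged[t - 1] for t in range(1, len(logged))]

print([round(d, 2) for d in plain_diffs])
print([round(d, 4) for d in log_diffs])
```

Real data will not be this tidy, so treat this as a picture of the idea rather than a recipe.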

## Unit Roots

A unit root is a property of a time series that tells us it is not stationary. In the discussion above on stationarity, I described a stationary process and how to solve the issue of non-stationarity. However, it can be difficult to determine if a dataset is stationary in the first place.

The simplest way to picture it is a model where the current value is the previous value multiplied by a coefficient, plus a random error. If that coefficient equals 1, the dataset has a unit root and is not stationary. If it is less than 1 (in absolute terms), then it is likely stationary; it does not have a unit root. However, if it is greater than 1, then it has an explosive root. Explosive roots are rare and usually unrealistic. So if you have an explosive root, you probably have other issues with your data. We are then left with addressing only the unit root case.
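To see why 1 is the dividing line, here is a minimal Python sketch. It assumes the simplest possible model, where each value is the previous value times a coefficient (called `phi` here, my own shorthand), and follows a one-off shock of size 1 with no further noise.

```python
def shock_path(phi, steps=10):
    """Follow a one-off shock of size 1 through y_t = phi * y_{t-1}."""
    path = [1.0]
    for _ in range(steps):
        path.append(phi * path[-1])
    return path


# phi < 1: the shock fades away (stationary).
# phi = 1: the shock never fades (unit root).
# phi > 1: the shock keeps growing (explosive root).
for phi in (0.5, 1.0, 1.05):
    final = shock_path(phi)[-1]
    print(f"phi={phi}: shock after 10 periods = {final:.3f}")
```

With a unit root, every past shock stays in the series forever, which is why the mean and variance do not settle down.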

## Augmented Dickey-Fuller Test

The next question regarding unit roots is how do you test for them. One of the most popular methods is the augmented Dickey-Fuller test. This expands on the original Dickey-Fuller test by including lagged differencing terms: the change from the period before last to last period, the change in the period before that, etc. These extra terms soak up any leftover correlation in the errors.

An augmented Dickey-Fuller test (ADF) uses hypothesis testing to determine if a unit root is present. It begins with the assumption that there is a unit root in the dataset (the null hypothesis). The test produces a test statistic and a p-value. If the p-value is above your chosen significance level (commonly 0.05), the null hypothesis is not rejected and you treat the dataset as having a unit root. If the p-value is below that level (the test statistic is more negative than the critical value), the null hypothesis is rejected and it is determined that there is not a unit root. Keep in mind that failing to reject the null hypothesis is not proof of a unit root; it only means the data does not provide enough evidence against one.
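To make the mechanics less mysterious, here is a rough Python sketch of the original (non-augmented) Dickey-Fuller regression with a constant term. It regresses each period's change on the previous value and reports the t-statistic of that coefficient. Several things here are assumptions for illustration: the `dickey_fuller_stat` helper is my own, the data is simulated, and the cutoff of roughly -2.86 is the commonly tabulated 5% large-sample critical value for this version of the test. A statistics package would add the lagged difference terms and compute a p-value for you.

```python
import math
import random


def dickey_fuller_stat(series):
    """t-statistic for gamma in: change_t = alpha + gamma * y_{t-1} + error_t."""
    x = series[:-1]                                                  # y_{t-1}
    z = [series[t] - series[t - 1] for t in range(1, len(series))]   # change_t
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    gamma = sxz / sxx
    alpha = z_bar - gamma * x_bar
    rss = sum((zi - alpha - gamma * xi) ** 2 for xi, zi in zip(x, z))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return gamma / std_err


# Simulated data: white noise (stationary) and its running total (a random walk).
random.seed(42)
noise = [random.gauss(0, 1) for _ in range(500)]
random_walk = []
level = 0.0
for shock in noise:
    level += shock
    random_walk.append(level)

CRITICAL_5PCT = -2.86  # assumed approximate 5% critical value, constant-only case
for name, series in [("white noise", noise), ("random walk", random_walk)]:
    stat = dickey_fuller_stat(series)
    verdict = "reject unit root" if stat < CRITICAL_5PCT else "cannot reject unit root"
    print(f"{name}: stat={stat:.2f} -> {verdict}")
```

The statistic is compared against special Dickey-Fuller critical values rather than the usual t-table, which is one more reason to lean on a package in practice.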

## White Noise & Seasonality

Other than unit roots, white noise and seasonality can also impact the modeling of time series. Even if you are not interested in forecasting, wouldn’t it be great to know whether your dataset has seasonality or is white noise, and to have tools to check for these instead of just eyeballing a graph? For modeling the data, it’s critical to know if either of these exists in your dataset.

White noise time series are ones where there is no correlation between the current value and the lagged values. That means that none of the lagged values help in predicting the current value. The autocorrelation function of a white noise time series will return values close to zero at every lag other than lag zero.

Seasonality occurs in a dataset when there is a fixed and known pattern to the data. This usually occurs due to time of year or day of the week. A time series decomposition function can be used to measure whether there is seasonality and how strong the seasonality is.
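As a very rough sketch of the idea behind decomposition, the Python snippet below averages a made-up quarterly series by its position in the year and compares each average to the overall mean. This is not a full decomposition: it assumes there is no trend, and the `seasonal_indices` helper is my own invention.

```python
def seasonal_indices(series, period):
    """Average each position in the cycle and subtract the overall mean."""
    overall = sum(series) / len(series)
    indices = []
    for position in range(period):
        values = series[position::period]  # e.g. every first quarter
        indices.append(sum(values) / len(values) - overall)
    return indices


# Two years of made-up quarterly data with the same pattern each year.
quarterly = [10, 20, 30, 40, 10, 20, 30, 40]
print(seasonal_indices(quarterly, period=4))  # [-15.0, -5.0, 5.0, 15.0]
```

Large, consistent gaps between the positions suggest seasonality; values all near zero suggest there is none.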

## ACF

The Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) help to understand the relevance of lags (values in past time periods) in AR, MA, and ARMA models. They help to describe which lags are relevant and which are not.

First, understand that the AR model assumes that the current value is dependent on the previous values, the lagged values. The MA model assumes that the current value is dependent on the errors in the lagged predicted values as well as the error in the current value.

ACF plots the correlation coefficient at each lag to determine if the observed values in the series are random (white noise) or correlated. It can describe how correlated the values are to the lagged values and answer whether the time series can be successfully modeled in an MA model.

The correlation coefficient measures how correlated the values are. A -1 means the values have a perfectly negative relationship. Conversely, a +1 means the values have a perfectly positive relationship. A zero means that the values have no linear correlation at all. For example, when plotted, lag zero (the current time period) has a value of +1 because it is perfectly correlated to itself. Generally, the further back in time a value is from the current value, the less correlated it is.
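Here is a minimal Python sketch of the sample autocorrelation, run on two simulated series: white noise, and a series where each value carries over 80% of the previous one. The `acf` helper and both series are assumptions for illustration. The band of 1.96 divided by the square root of the number of observations is the usual rough 95% threshold for spotting correlations that stand out from white noise.

```python
import math
import random


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [value - mean for value in series]
    total = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - lag] for t in range(lag, n)) / total
        for lag in range(max_lag + 1)
    ]


random.seed(0)
white_noise = [random.gauss(0, 1) for _ in range(500)]

# A series where each value carries over 80% of the previous one.
carry_over = [white_noise[0]]
for shock in white_noise[1:]:
    carry_over.append(0.8 * carry_over[-1] + shock)

# Rough 95% band: correlations inside it look like white noise.
band = 1.96 / math.sqrt(len(white_noise))
print(f"band: +/-{band:.3f}")
print("white noise:", [round(r, 2) for r in acf(white_noise, 4)])
print("carry-over: ", [round(r, 2) for r in acf(carry_over, 4)])
```

Lag zero is always 1. For the white noise series the other lags should hover around zero, while the carry-over series should start high and fade gradually.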

## PACF

The PACF plot can answer whether the data can be modeled in an AR model or not. It measures the correlation between a value and each lagged value while controlling for the effects of other lagged values.

For example, for a given current value, we calculate the effect of the previous period, then we calculate the effect of the period before that, while controlling for the effect of that previous period. We are isolating the partial correlation of each lagged value to the current value, removing any effects of the more recent lags that sit between that lagged value and the current value.
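One standard way to get partial autocorrelations from ordinary autocorrelations is the Durbin-Levinson recursion. The Python sketch below is a bare-bones version of it, with a helper name (`pacf_from_acf`) of my own choosing. As input it uses the theoretical ACF of a process where each value depends only on the previous one with a coefficient of 0.6, so the autocorrelation at lag k is 0.6 to the power k.

```python
def pacf_from_acf(rho):
    """Durbin-Levinson recursion.

    rho[k] is the autocorrelation at lag k (rho[0] == 1).
    Returns the partial autocorrelations for lags 1..len(rho) - 1.
    """
    pacf = []
    prev = []  # coefficients from the previous step of the recursion
    for k in range(1, len(rho)):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# Theoretical ACF when each value depends only on the previous one (0.6).
rho = [0.6 ** k for k in range(5)]
print("ACF: ", [round(r, 3) for r in rho])                  # fades gradually
print("PACF:", [round(p, 3) for p in pacf_from_acf(rho)])   # only lag 1 stands out
```

The ACF fades slowly because lag 2 is correlated with the current value through lag 1. Once lag 1 is controlled for, the partial autocorrelation at lag 2 and beyond drops to zero.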

## AR, MA & ARMA

The AR (Autoregressive) model uses only the lagged values in the dataset. PACF is used to determine which lags in the dataset are statistically significant. Those lags are then used to calculate current and future values.
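As a minimal sketch, here is an AR model with a single lag fitted by ordinary least squares in Python. The series is made up and built exactly from "2 plus half of the previous value", so the fit should recover those two numbers. The `fit_ar1` helper is hypothetical, and a real model would use whichever lags the PACF flags as significant.

```python
def fit_ar1(series):
    """Least-squares fit of y_t = c + phi * y_{t-1}."""
    x = series[:-1]  # lagged values
    y = series[1:]   # current values
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    phi = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum(
        (a - x_bar) ** 2 for a in x
    )
    c = y_bar - phi * x_bar
    return c, phi


# Made-up series built exactly from y_t = 2 + 0.5 * y_{t-1}, starting at 10.
series = [10.0]
for _ in range(7):
    series.append(2 + 0.5 * series[-1])

c, phi = fit_ar1(series)
next_value = c + phi * series[-1]  # one-step-ahead forecast
print(round(c, 3), round(phi, 3), round(next_value, 3))
```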

An MA (Moving Average) model is a little harder to understand. It takes the errors (residuals) from past time periods in the same dataset and uses them to calculate the present and future values. It is basically forecasting the last period’s value and comparing it to the actual value. That difference is the error. It then adjusts the present and future values by a weighted amount of those past forecast errors. Despite the name, this is not the same thing as a rolling average.
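The error-feedback loop can be sketched in a few lines of Python. This assumes an MA model with one error term and hand-picked parameters: `mu` is the long-run mean and `theta` is the weight on last period's error. In a real model both would be estimated from the data, and the helper name and numbers are made up.

```python
def ma1_forecasts(series, mu, theta):
    """One-step-ahead forecasts: forecast_t = mu + theta * error_{t-1}."""
    forecasts = []
    prev_error = 0.0  # assume there is no error before the first observation
    for actual in series:
        forecast = mu + theta * prev_error
        forecasts.append(forecast)
        prev_error = actual - forecast  # the error feeds the next forecast
    return forecasts


# Made-up revenue in £K with assumed parameters.
revenue = [100, 104, 98]
print(ma1_forecasts(revenue, mu=100, theta=0.5))  # [100.0, 100.0, 102.0]
```

The second month came in 4 above its forecast, so the third forecast is nudged up by half of that error.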

An ARMA model combines the AR and MA models. Both the previous lags and the residuals are considered in forecasting present and future values.

## ARIMA

With the previous models, you first need to handle stationarity issues yourself. ARIMA models combine AR and MA, like the ARMA model. However, they also handle stationarity issues by adding differencing for non-stationary datasets.

In these models, we need to specify the hyperparameters. These are the autoregressive order (p), the differencing order (d), and the moving average order (q). The notation looks like this: ARIMA(p,d,q).

For example, an ARIMA(1,1,1) would contain one AR term, one round of differencing, and one MA term. An ARIMA(2,0,0) would essentially be an AR model with two AR terms, and an ARIMA(0,0,2) would essentially be an MA model with two MA terms.
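To show what the "I" step does, here is a minimal Python sketch of the simplest case, an ARIMA(0,1,0) with a constant (a random walk with drift). It differences the series, forecasts the differences with their mean, and then adds the forecasted differences back onto the last actual value. The data and the helper name are made up for illustration.

```python
def forecast_with_differencing(series, steps):
    """Difference once, forecast the mean change, then undo the differencing."""
    diffs = [series[t] - series[t - 1] for t in range(1, len(series))]
    mean_change = sum(diffs) / len(diffs)
    forecasts = []
    level = series[-1]
    for _ in range(steps):
        level += mean_change  # integrate: add the forecasted change back
        forecasts.append(level)
    return forecasts


# Made-up revenue in £K; the changes are +20, -20, +30, so the mean is +10.
revenue = [80, 100, 80, 110]
print(forecast_with_differencing(revenue, steps=2))  # [120.0, 130.0]
```

With AR or MA terms added, the forecasted change would come from those terms instead of a simple mean, but the add-it-back step stays the same.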

## SARIMA

But what if there is seasonality in the dataset? That is where a seasonal ARIMA (SARIMA) comes in. A SARIMA model adds four seasonal elements to the three in the ARIMA model. They are the seasonal autoregressive order (P), the seasonal difference order (D), the seasonal moving average order (Q) and the number of time steps for a single seasonal period (m).

The notation for a seasonal ARIMA would look like this: SARIMA(p,d,q)(P,D,Q)m.
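The seasonal difference order (D) works like regular differencing, except that each value is compared to the same point in the previous season, m periods back. A minimal Python sketch with made-up quarterly data (m = 4) and a hypothetical helper:

```python
def seasonal_difference(series, m):
    """Subtract the value from the same point in the previous season."""
    return [series[t] - series[t - m] for t in range(m, len(series))]


# Three years of made-up quarterly data: the same seasonal shape,
# shifted up by 2 each year.
quarterly = [10, 20, 30, 40, 12, 22, 32, 42, 14, 24, 34, 44]
print(seasonal_difference(quarterly, m=4))  # [2, 2, 2, 2, 2, 2, 2, 2]
```

The seasonal pattern disappears, leaving only the year-over-year change.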

## Conclusion

This post provides a general overview of ARIMA models in simple language. Time series models are interesting and very cool, but there is a lot to consider and understand when using them. Luckily, there are machine learning packages and functions that can do a lot of the work for you.

If you want to go deeper into ARIMA models and other time series models, check out this playlist by ritvikmath.

In my next few posts, I will go into more detail and explain how to use ARIMA models in BigQuery and in R. Stay tuned!

P.S. If you are unfamiliar with BigQuery, check out this post. If you are unfamiliar with R, here is a page from the R Foundation.