A “semi-auto” way to determine parameters for a SARIMA model

More aptly: how I learned to get out of the black box of auto_arima

George Tsang
12 min read · Jan 25, 2022

I have been spending nights learning time-series analysis, which was not covered in the college program I recently finished. As soon as I read into it, I understood why it was not in the curriculum: it could easily take up a semester and still only scratch the surface. That being said, there are also countless articles (like this complete guide by Marco Peixeiro) and tutorials on YouTube (e.g., ritvikmath) showing quite a few ways to apply it. Most of them can be grouped into two methods:

  1. determining parameters through some tests like Augmented Dickey-Fuller (ADF) test, autocorrelation function (ACF) plot and partial autocorrelation function (PACF) plot; and
  2. determining parameters through pmdarima, a very powerful Python package dedicated to the ARIMA model and its variations. The auto_arima function is basically a grid search.

While automation is attractive, I prefer to learn, at least at a high level, the principles before applying them. While learning to code and implement the model, I realized it would take forever to difference the series in every possible way and interpret the ACF and PACF plots of each series one by one. I then thought to myself, “is there a ‘semi-auto’ way to sieve parameters?” Using what I knew at that point and my intuition, I came up with the following method.

In essence, there are 4 steps:

  1. Difference the time series with all possible combinations of orders of differencing, where d and D can each be 0, 1, or 2, and filter away the non-stationary series. Check the possible values of d and D among the remaining series. This assumes the length of a season (m) is known when doing seasonal differencing.
  2. Calculate the ACF and PACF for each stationary series. Pick out the values that are significant in each series and sum their absolute values by lag.
  3. Plot the sums of ACF and PACF by lag in bar charts. Then pick the few lags with the largest values as candidates for p and q. For P and Q, a bar at the mth lag (and potentially at multiples of m) should be present. For example, if m = 12, there could be bars at the 24th, 36th, etc. lags too. Pick the few lags with the largest values as candidates for P or Q.
  4. Generate all possible combinations of parameters with the candidates picked out in the previous steps. Feed all the combinations into the SARIMA algorithm. Then, evaluate the results as usual.

If these sound confusing to you, no worries; they will be explained by the code snippets and charts below.

Before jumping into the code, I would like to point out that the number of candidates for each parameter is up to you. The more candidates you choose, the more combinations there will be and the more time is needed. It really depends on the computational power at your disposal. If you have a crappy computer like mine and still want to run the process on your local machine, I would suggest no more than 3 candidates per parameter. Assuming m is fixed and each of the other six parameters has 3 possible values, there would be 3⁶ = 729 combinations. That is quite a lot (but still reasonable to me) and would take quite some time to fit the models. Of course, you can lower the maximum number of iterations, but you might end up with more models that cannot converge. In my example, I stick with statsmodels’ default setting, i.e., maxiter=50.

Background: The data are excerpts from Table 16–10–0045–01 Lumber, production, shipments and stocks, monthly, inactive (x 1,000) from Statistics Canada. I used this dataset as part of my capstone project on the impact of commercial logging on biodiversity in British Columbia, Canada. At that time, I was using total hardwood and softwood production (abbreviated and referred to as “total wood production” below) as a proxy for logging activity. However, there are data gaps between June 2013 and June 2018. I was looking for a way to fill the gaps but did not know how. I observed seasonality in the data and knew SARIMA might help, but I did not have the time to learn it. At that point, I still did not know that pmdarima existed. I ended up using my intuition and some simple math, which worked surprisingly well (at least visually) in preserving seasonality. It was not until I finished the program and started learning SARIMA that I realized total wood production by volume is essentially the same as that of softwood alone. In other words, modelling softwood production is the same as modelling total wood production. Thus, I use softwood production data to build and cross-validate my model in this case. Without further ado, let’s dive into the code!

Load the necessary libraries and datasets

The datasets are preprocessed so that all the data-cleaning steps can be skipped, as they are not the point of this article. This helps keep the article concise.
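For reference, here is a minimal sketch of the setup. The file name and column names are placeholders I made up for illustration, not the actual ones in my notebook.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical file and column names, for illustration only
df = pd.read_csv("lumber_production.csv", parse_dates=["REF_DATE"], index_col="REF_DATE")
softwood = df["softwood_production"]      # total softwood production
total_wood = df["total_wood_production"]  # total hardwood + softwood production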

Exploratory data analysis (EDA)

By overlaying total softwood production and total wood production on the same plot, it is obvious that they are basically the same, except for the data gaps.

Total wood and softwood production in BC by volume

If the seasonality is not clear to you, Seasonal-Trend decomposition using LOESS (STL) should make the seasonal component obvious enough.
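A sketch of the decomposition with statsmodels’ STL, reusing the hypothetical softwood series loaded above:

stl = STL(softwood.dropna(), period=12)   # monthly data, 12-month season
res = stl.fit()
res.plot()
plt.show()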

Components of softwood production after STL

Null values in the series of total wood production first appear in June 2013. Everything from this point onward is discarded to avoid feeding null values into the model. The resulting series is used for all the remaining steps, from creating the train-test split to forecasting. Forecasts will be generated from June 2013 onward and compared to the ground truth, i.e., total softwood production.
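A sketch of that truncation step, under the same assumptions as above (including month-start dates):

# Find the first null in total wood production and keep only what comes before it
first_null = total_wood[total_wood.isna()].index.min()
series = softwood[softwood.index < first_null].asfreq("MS")   # month-start frequency assumed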

Create differenced series

Instead of differencing the series one order at a time and checking for stationarity, I wrote a function to generate differenced series for all possible combinations of orders of differencing. I set the maximum of d and D at 2 because that is enough to make a series stationary in most cases. m is 12 as this is monthly data and the seasonal pattern occurs at a 12-month interval.
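Below is a sketch of such a function; the column-naming convention (e.g., “d1_D1_m12”) mirrors the labels used later in this article, though the exact implementation in my notebook may differ.

def difference_all(series, max_d=2, max_D=2, m=12):
    """Difference `series` with every combination of d and D up to the maxima."""
    out = {}
    for d in range(max_d + 1):
        for D in range(max_D + 1):
            s = series.copy()
            for _ in range(d):
                s = s.diff()        # non-seasonal differencing, d times
            for _ in range(D):
                s = s.diff(m)       # seasonal differencing, D times
            out[f"d{d}_D{D}_m{m if D > 0 else 0}"] = s
    return pd.DataFrame(out)

diffed = difference_all(series, max_d=2, max_D=2, m=12)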

The function returns a dataframe of the original and differenced series.

Differenced series

Check stationarity by ADF test

All the above series are then put through the ADF test to see if stationarity is achieved. The significance level (α) is set at 0.05; series with p-value < 0.05 are considered stationary. As with differencing, I wrote a function to loop each series through the test and store the summarized results as a dataframe.
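A sketch of that loop, assuming the diffed dataframe from the previous step:

def adf_summary(diffed, alpha=0.05):
    """Run the ADF test on every column and summarize the results."""
    rows = {}
    for name, s in diffed.items():
        stat, p_value, *_ = adfuller(s.dropna())
        rows[name] = {"adf_stat": stat, "p_value": p_value, "stationary": p_value < alpha}
    return pd.DataFrame.from_dict(rows, orient="index")

adf_results = adf_summary(diffed)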

The output of the function looks like this:

Summary of series that passed ADF test

The non-stationary series are filtered away, leaving those that passed the test. The indices of the summary dataframe are then used to select the stationary series from the dataframe of differenced series. From these indices, I know that the possible values for d are 1 and 2, while those for D are 0, 1 and 2.
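In code, the filtering could look like this:

# Keep only the series that passed the ADF test
stationary = diffed[adf_results[adf_results.stationary].index]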

Calculate ACF & PACF

The ACF and PACF of each stationary series are then calculated through the loops shown below. Confidence intervals for each function are also calculated. If a value falls outside the boundaries of its respective confidence interval, it is considered significantly different from zero. This is the same as how we interpret spikes in ACF and PACF plots. Since autocorrelation can be negative, I used abs() so that positive values are returned, which makes lags with negative correlation comparable to those with positive correlation. The significant values are then stored as dataframes.
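Here is a condensed sketch of those loops. My notebook stores the significant values as dataframes first; for brevity, this version sums them directly into arrays and also produces the bar charts described below. The cap of 24 lags is my own choice for illustration.

n_lags = 24   # assumption: examine up to two seasons' worth of lags

acf_sum = np.zeros(n_lags + 1)
pacf_sum = np.zeros(n_lags + 1)

for name, s in stationary.items():
    clean = s.dropna()
    for func, total in ((acf, acf_sum), (pacf, pacf_sum)):
        vals, confint = func(clean, nlags=n_lags, alpha=0.05)
        band = confint[:, 1] - vals        # half-width of the CI, centred on zero
        significant = np.abs(vals) > band  # same test as the shaded area in plot_acf/plot_pacf
        significant[0] = False             # ignore lag 0, which is always 1
        total += np.where(significant, np.abs(vals), 0.0)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].bar(range(n_lags + 1), pacf_sum)
axes[0].set_title("Sum of significant |PACF| by lag (p & P)")
axes[1].bar(range(n_lags + 1), acf_sum)
axes[1].set_title("Sum of significant |ACF| by lag (q & Q)")
plt.show()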

The ACF and PACF values at each lag are then summed across series. For example, if “d2_D0_m0” and “d1_D1_m12” both have a significant autocorrelation of 0.7 at the 12th lag, they add up to 1.4. The output of the above block of code is these two charts:

Sum of PACF plot for determining p & P (AR terms) / Sum of ACF plot for determining q & Q (MA terms)

The intuition behind summing ACF and PACF values is almost like voting.

For a particular lag:

  1. the larger the ACF or PACF values, the greater the sum; and
  2. the more often significant values are observed, the greater the sum.

At the risk of stating the obvious:

  1. the smaller the sum of ACF or PACF values, the less important that lag is;
  2. if there is no bar for a certain lag, that means no significant value is recorded in any of the stationary series.

With my crappy laptop, I am going to limit the possible combinations of parameters so that it does not take forever to process. I chose only the three lags with the greatest sum from each plot for p and q respectively, i.e., the 1st, 2nd and 11th lags on both plots. For the seasonal terms, since I already know m = 12, P = 1 and Q = 1 will take care of the 12th lag on both plots.

You may notice there are significant values beyond the 12th lag. However, I believe it would not make much sense to model beyond the length of a season: the model would be overly complicated and prone to overfitting, and I doubt it would be any better.

Build the model

After identifying candidates for all the parameters required to build a SARIMA model, it is time to fit the models. I wrote some functions and loops to generate all possible combinations from the candidates I chose.
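A sketch of that step with itertools.product, using the candidate values identified above:

from itertools import product

p_vals, d_vals, q_vals = [1, 2, 11], [1, 2], [1, 2, 11]
P_vals, D_vals, Q_vals, m_vals = [1], [0, 1, 2], [1], [12]

combos = [((p, d, q), (P, D, Q, m))
          for p, d, q, P, D, Q, m
          in product(p_vals, d_vals, q_vals, P_vals, D_vals, Q_vals, m_vals)]
len(combos)   # 3*2*3*1*3*1*1 = 54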

With these possible values, I ended up with 3*2*3*1*3*1*1 = 54 combinations, which was much less than I expected.

Here comes another function to loop through the combinations, fitting the models one by one. The function returns a dataframe which stores the orders, seasonal orders and four evaluation metrics for each model. This information is useful for model selection at a later stage. I chose mean absolute percentage error (MAPE), root-mean-square error (RMSE), the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) because they are among the most popular and easiest-to-interpret evaluation metrics.
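A sketch of such a function, assuming sklearn for the error metrics and statsmodels’ SARIMAX for the model itself; the details of the actual function in my notebook may differ.

from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

def fit_all(train, test, combos, maxiter=50):
    """Fit a SARIMA model for every (order, seasonal_order) pair and collect metrics."""
    rows, forecasts = [], {}
    for i, (order, seasonal_order) in enumerate(combos):
        try:
            result = SARIMAX(train, order=order,
                             seasonal_order=seasonal_order).fit(maxiter=maxiter, disp=False)
        except Exception:
            continue                            # skip combinations that fail to fit
        fitted = result.fittedvalues            # in-sample predictions (training error)
        forecasts[i] = result.forecast(steps=len(test))
        rows.append({"model_no": i, "order": order, "seasonal_order": seasonal_order,
                     "MAPE": mean_absolute_percentage_error(train, fitted),
                     "RMSE": np.sqrt(mean_squared_error(train, fitted)),
                     "AIC": result.aic, "BIC": result.bic})
    return pd.DataFrame(rows).set_index("model_no"), forecasts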

With everything ready, it is time to create the train-test split and fit the models.
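Something like this, with a hypothetical 80/20 split (the rationale is in the cross-validation section below):

split = int(len(series) * 0.8)
train, test = series.iloc[:split], series.iloc[split:]

model_info, forecasts = fit_all(train, test, combos)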

Model evaluation

The code below shows the top 10 models by each metric. The smaller the value of a particular metric, the better the model in terms of that metric.
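For example (the least_BIC subset is reused in the next snippet):

for metric in ["MAPE", "RMSE", "AIC", "BIC"]:
    print(model_info.nsmallest(10, metric))

least_BIC = model_info.nsmallest(10, "BIC")   # 10 lowest-BIC models, used below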

However, I am not going to show all the outputs here because a lot of them are irrelevant, especially the least-AIC and least-BIC models, because they perform quite poorly. Instead, I will just show you the best model by each metric.

# Best model by each metric, plus the least-MAPE model among the 10 lowest-BIC models
L1 = model_info[model_info.MAPE == model_info.MAPE.min()]
L2 = model_info[model_info.RMSE == model_info.RMSE.min()]
L3 = model_info[model_info.AIC == model_info.AIC.min()]
L4 = least_BIC[least_BIC.MAPE == least_BIC.MAPE.min()]
pd.concat((L1, L2, L3, L4))

The output of the code is as below:

Note: since the least-AIC model also has the least BIC, I decided to show instead the least-MAPE model among the least-AIC/BIC models.

It seems that models with smaller errors tend to have higher AIC and BIC, while models with lower AIC and BIC tend to have bigger errors. Note that the MAPE and RMSE calculated here are training errors.

Let’s plot the results of each model with the following code and see how their predictions compare to the ground truth.
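A rough sketch of one such plot, for a single model picked by training MAPE; the other plots follow the same pattern.

best = model_info.MAPE.idxmin()
plt.figure(figsize=(10, 4))
plt.plot(series.index, series.values, label="observed")
plt.plot(forecasts[best].index, forecasts[best].values, label="forecast")
plt.axvline(test.index[0], color="black", linestyle=":")   # train/test split point
plt.legend()
plt.show()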

Performance of the best models selected by each evaluation metric

The plots include data from both the training and testing sets. The black vertical dotted lines indicate where the training and testing sets were split. Obviously, the least-MAPE and least-RMSE models performed better than the least-AIC model. In this case, I would go forward to cross-validation with the 10 least-MAPE models, which happen to be the 10 least-RMSE models too.

Cross-validation

There are multiple ways to cross-validate a model. Here, I stick with the most basic method, i.e., leaving the last n% of the data out as the testing set. For my dataset, I held out the last 20% of the data. The train-test split was done before fitting the models. Now, I just need to calculate MAPE and RMSE again using the testing set and the forecast results.
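Using the forecasts stored during fitting, the testing-set metrics for the 10 lowest training-MAPE models can be computed roughly like this:

cv = model_info.nsmallest(10, "MAPE").copy()
cv["test_MAPE"] = [mean_absolute_percentage_error(test, forecasts[i]) for i in cv.index]
cv["test_RMSE"] = [np.sqrt(mean_squared_error(test, forecasts[i])) for i in cv.index]
cv.sort_values("test_MAPE")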

The output looks like this:

Cross-validation results of the 10 least MAPE/RMSE models

As you can see, model no. 6 has the lowest testing MAPE and RMSE. On the other hand, model no. 39, which has the lowest training MAPE, only ranks 8th on testing MAPE. Let’s plot the forecast of model no. 6.

Performance of the model with least training MAPE & RMSE

Looks like we have a winner! But wait, according to the ADF test, this model was built on a non-stationary series (d = 1, D = 0)! Actually, among the 10 least MAPE and RMSE models, there is only one built on a stationary series, i.e., model no. 43, which has d = 1 and D = 1. Let’s plot and see how it compares to model no. 6.

Performance of the model built on stationary series

Looks like model no. 6 is slightly better than model no. 43.

Comparison with auto_arima

I mentioned pmdarima and its auto_arima function at the beginning of this article because I feel it is the benchmark for efficiency, ease of use and popularity. This study would not be complete without comparing the models I selected with the models selected by auto_arima.

I tried both exhaustive grid search and stepwise search. For the grid search, I made sure the maximum values of p, d, q, P, D and Q were the same as those in my process. For the stepwise search, I just let it do its job, because that is the essence of auto_arima: automation. The information criterion is set to the out-of-bag error, which is calculated as mean squared error (MSE). This evaluation metric is similar to RMSE, which I also use in my selection process. The out-of-bag sample size is set to the same size as the testing set used in my process.
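A sketch of both searches with pmdarima; the keyword values are my best guesses at the settings described above (in particular the maximum orders), not necessarily the exact ones in my notebook.

import pmdarima as pm

oob_size = len(test)   # same size as the testing set in my process

# Exhaustive grid search, with maximum orders assumed from my candidate values
grid_model = pm.auto_arima(series, seasonal=True, m=12, stepwise=False,
                           max_p=11, max_d=2, max_q=11,
                           max_P=1, max_D=2, max_Q=1, max_order=None,
                           information_criterion="oob", out_of_sample_size=oob_size,
                           suppress_warnings=True, error_action="ignore")

# Stepwise search, left to do its job with mostly default settings
step_model = pm.auto_arima(series, seasonal=True, m=12, stepwise=True,
                           information_criterion="oob", out_of_sample_size=oob_size,
                           suppress_warnings=True, error_action="ignore")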

To compare the models I selected with the models selected by auto_arima, I retrained all the models on the entire series, i.e., training + testing data, and made forecasts up to December 2014. The forecasts were then compared to the ground truth, which is total softwood production. The forecast period is about 1.5 years long and should be enough. Going any further does not make much sense, as forecasts of the far future are based on forecasts of the near future; the trend and seasonal pattern will just repeat themselves.
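A sketch of that comparison. The orders for model no. 6 and no. 43 are placeholders (the real ones come out of the earlier selection step), and the ground truth is the softwood series over the forecast horizon, assuming month-start dates.

# Forecast horizon: June 2013 (first gap month) up to December 2014
horizon = pd.date_range(series.index[-1], "2014-12-01", freq="MS")[1:]

final_candidates = {
    "model_6": ((1, 1, 2), (1, 0, 1, 12)),               # placeholder orders
    "model_43": ((1, 1, 1), (1, 1, 1, 12)),              # placeholder orders
    "grid_search": (grid_model.order, grid_model.seasonal_order),
    "stepwise": (step_model.order, step_model.seasonal_order),
}

rows = []
for label, (order, seasonal_order) in final_candidates.items():
    res = SARIMAX(series, order=order, seasonal_order=seasonal_order).fit(disp=False)
    fc = res.forecast(steps=len(horizon))
    truth = softwood.loc[horizon]                        # ground truth: softwood production
    rows.append({"model": label,
                 "MAPE": mean_absolute_percentage_error(truth, fc),
                 "RMSE": np.sqrt(mean_squared_error(truth, fc)),
                 "AIC": res.aic, "BIC": res.bic})
pd.DataFrame(rows)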

The above block of code creates the train-test split, fits the models and returns this dataframe containing the prediction errors, AIC and BIC of each model.

Prediction errors and other metrics of the models

Voilà! Here I have four of the best models. I am very happy with the result, as the performance of the models I selected is very similar to that of the models selected by auto_arima. Model no. 43 has slightly higher MAPE and RMSE, but lower AIC and BIC, compared to the model selected through exhaustive grid search. Depending on the situation and selection criteria, stepwise search may return a better model.

Lastly, let’s visualize the results of the models.

Performance of the best models selected from both processes

Last but not least, I also timed my process from the differencing step up to the model fitting step: it took ~100 seconds. FYI, exhaustive grid search required ~550 seconds and stepwise search required only ~4 seconds. I am pleased with my results.

Reference

Statistics Canada. Table 16–10–0045–01 Lumber, production, shipments and stocks, monthly, inactive (x 1,000)
DOI: https://doi.org/10.25318/1610004501-eng

Click here for the Jupyter notebook used for this article.

Epilogue

This whole process stemmed from my experience of learning SARIMA and time-series analysis in general, and I enjoyed it a lot. While my process might be an alternative, I did not intend to create some magical algorithm that outperforms auto_arima. auto_arima will still be the way to go if time is of the essence. That being said, I prefer to know the what, why and how of things, and I absolutely do not mind spending time on that.

Edit: there are still a lot of things I would like to touch on but did not, e.g., evaluating models with other cross-validation methods, the bias-variance tradeoff, and other statistical tests for trend and seasonal stationarity. I hope to investigate and cover these in the future.
