In this notebook, we explore the forecasting models from statsmodels: AR, MA, ARMA/ARIMA, SARIMA, VAR and VARMAX. The main motivation behind this task is to understand how to model time series data with these algorithms. The notebook has two sections:
1) univariate time series forecasting
2) multivariate forecasting (forecasting multiple series at the same time).
For univariate forecasting, I have used the 'airline passengers' dataset with all the univariate algorithms (AR, MA, ARMA/ARIMA, SARIMA). Similarly, VAR and VARMAX are used for modelling the multivariate series. Lastly, we figure out which algorithm is best suited to this type of data. In addition, this notebook shows how to apply a stationarity check (whether the series has a unit root or not) with the augmented Dickey-Fuller test.
#import required libraries
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from pylab import rcParams #set figure size
from pmdarima import auto_arima # for finding arima orders
# for metric
from statsmodels.tools.eval_measures import rmse
from sklearn.metrics import mean_squared_error,mean_absolute_error, r2_score
from statsmodels.tsa.stattools import adfuller # for stationarity test
from statsmodels.tsa.statespace.tools import diff
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf, month_plot,quarter_plot #plots
#for creating models
from statsmodels.tsa.ar_model import AR
from statsmodels.tsa.arima_model import ARMA, ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.api import VAR, VARMAX
#ignore warnings
import warnings
warnings.filterwarnings("ignore")
Univariate Time Series Analysis
#load dataset
df= pd.read_csv('Data/airline_passengers.csv', index_col='Month', parse_dates=True)
df.index.freq = 'MS'# since data is monthly start frequency
df.head()
#lets visualize the data
ax = df['Thousands of Passengers'].plot(figsize=(12,5), legend= True, label='Original')
df['Thousands of Passengers'].rolling(window=12).mean().plot(legend= True, label='12-Month-Mean')
df['Thousands of Passengers'].rolling(window=12).std().plot(legend= True, label='12-Month-Std')
ax.autoscale(axis='x',tight=True) # fit to x-axis
ylabel='No. of passengers'
xlabel = 'Months'
ax.set(xlabel=xlabel, ylabel=ylabel) #name axis
plt.show()
The plot clearly shows that this is a non-stationary series, since the mean and standard deviation keep varying over time. We can also see that the series has a clear upward trend. We will perform a statistical test for stationarity later in this work.
Let's look at the general behaviour of the series. From an ETS decomposition, we may get a clue as to whether the time series has trend and seasonality or not.
# lets do seasonal decompostion
rcParams['figure.figsize']=15,7 # set figure size
sea_decom = seasonal_decompose(df['Thousands of Passengers'], model='additive')
sea_decom.plot();
The figures show a clear upward trend and possibly some seasonality.
These plots are classically used for finding 'p' and 'q' for an ARIMA model, but in this notebook we will use the auto_arima() function instead of that classical method. The plots are still useful to check whether the series is correlated with shifted (lagged) copies of itself.
#create auto correaltion plot
plot_acf(df['Thousands of Passengers'], lags=40, title='ACF plot')
plt.show()
The figure shows that there is correlation, as well as seasonality at certain lags.
#create partial autocorrelation plot
plot_pacf(df['Thousands of Passengers'], lags=40, title='PACF plot');
Let's figure out which months and quarters have low and high passenger numbers.
#monthly plot
month_plot(df['Thousands of Passengers']);
#quaterly plot
quarter_plot(df['Thousands of Passengers'].resample(rule='Q').mean());
From both the quarterly and monthly plots, we can see ups and downs in the number of airline passengers. There is a high peak in June and July, and correspondingly in the 3rd quarter.
We have to make the time series stationary before feeding it to the algorithms. From the figures above, it is clearly seen that the series has seasonality and trend. In other words, the series is non-stationary, meaning that its mean, variance and autocorrelation change over time.
But let's run a statistical test to show whether the series is stationary or non-stationary. For this, the augmented Dickey-Fuller test is used. It is a hypothesis test in which the null hypothesis is 'the series is non-stationary (has a unit root)'. We will use a significance level of 0.05 (95% confidence level).
# stationarity check
def adfuller_test(time_series):
    """
    time_series: a single time series
    prints an ADF report and the hypothesis-test conclusion
    """
    # create adfuller report
    result = adfuller(time_series, autolag='AIC')
    # label each value and store as a pandas Series
    labels = ['ADF_test', 'p-value', 'lags used', 'observations']
    out = pd.Series(result[0:4], index=labels)  # create series
    for key, val in result[4].items():  # add critical values
        out[f'critical value ({key})'] = val
    # print the report
    print(out)
    # check the hypothesis
    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis; reject the null hypothesis")
        print("The series has no unit root and is stationary")
    else:
        print("Weak evidence against the null hypothesis; fail to reject the null hypothesis")
        print("The series has a unit root and is non-stationary")
#check stationary
adfuller_test(df['Thousands of Passengers'])
The result tells us that this time series is non-stationary, so we have to make it stationary. I have used the differencing technique for that.
Let's build models with the different algorithms: AR, MA, ARMA/ARIMA and SARIMA.
Note: we already know that this dataset is non-stationary (it shows trend and seasonality), so it is not well suited to AR, MA or ARMA, since those models cannot learn trend and seasonality in the data. But let's see how they do in prediction anyway.
'''
Note: this is just for my own understanding of the differencing order at which the series becomes stationary
'''
# return stationary data at the 95% confidence level (0.05 significance level)
def return_stationary_series(series, k=2):
    '''
    series: time series data
    k: maximum differencing order to try
    returns a stationary time series
    '''
    for d in range(1, k):
        df_stat = diff(series, k_diff=d)  # difference the data d times
        # check the p-value (rounded to 2 decimal places)
        if np.round(adfuller(df_stat)[1], 2) <= 0.05:
            print(f'Now the time series is stationary at {d} difference(s)')
            return df_stat
#return stationary series and plot
stationary_series =return_stationary_series(df['Thousands of Passengers'], 10)
stationary_series.plot(figsize=(17,7));
Now the series looks stationary. We can verify this by calling the adfuller_test() function.
# testing for stationary
adfuller_test(stationary_series)
#convert to dataframe
stat_data = stationary_series.to_frame()
#split test(36 months) and train(remaining months)
train_data_stationary = stat_data.iloc[:-36]
test_data_stationary = stat_data.iloc[-36:]
#original data(non-stationary)
train_data= df.iloc[:-36]
test_data= df.iloc[-36:]
In an autoregression model, the current value of the series is a linear combination of its past values; it is regressed against a set of lagged values of order p. We use the auto_arima() function to find the best 'p' for this dataset.
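For reference, the AR(p) model in equation form, where $\varepsilon_t$ is white noise:

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t$$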
#suggest model
auto_arima(stat_data, seasonal=False).summary() # use stationary data
# AR model
model_ar = AR(train_data_stationary['Thousands of Passengers']) # use stationary data
#best_lag = model_ar.select_order(maxlag= 20,trend='nc', ic='aic') # select best maxlag
#print(f'best lag: {best_lag}')
# fit model with suggested p
model_ar_fit = model_ar.fit(maxlag= 2)
#predict test data
ar_prediction = model_ar_fit.predict(start= test_data_stationary.index[0],
end= test_data_stationary.index[-1],
dynamic=False)
ar_prediction.head()
Since our predicted values are on the 1-differenced scale, we have to transform them back to the original scale before plotting and forecasting. You can obtain this either by 1) summing columns (original values + stationary/predicted values), or 2) adding the most recent value of the training data to the cumulative sum of the predicted values using the cumsum() function.
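As a quick illustration of option 2 (a toy sketch, independent of the notebook's data): first-order differencing is inverted by adding the last known original value to the cumulative sum of the differences.
s = pd.Series([10.0, 12.0, 15.0, 14.0, 18.0])  # toy 'original' series
d1 = s.diff().dropna()                         # first differences
recovered = s.iloc[0] + d1.cumsum()            # last known value + cumsum of differences
print(recovered.tolist())                      # [12.0, 15.0, 14.0, 18.0] == s[1:]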
#create dataframe
test_data_stationary['Passengers(original)'] = test_data['Thousands of Passengers'] # original scale
# store predicted value in dataframe
test_data_stationary['Prediction_1d_AR'] = ar_prediction
# add most recent value from training dataset and do cumsum in stationary values
test_data_stationary['Inverted_1d_AR'] = df['Thousands of Passengers'].iloc[-len(test_data_stationary)-1] + test_data_stationary['Prediction_1d_AR'].cumsum()
#test_data_stationary['Inverted_Values'] = df['Thousands of Passengers'].iloc[-len(test_data_stationary)-1:-1].values + test_data_stationary['Prediction_1d_scale'].values
test_data_stationary.head()
# create plot with test and predicted
test_data.plot(legend= True, label= 'test data')
test_data_stationary['Inverted_1d_AR'].plot(legend= True, label='AR prediction', figsize=(13,5))
plt.show()
Pretty bad, right? We already know that this data has some seasonality, and the AR model can't address that. From the figure above we can also see that this model does very poorly at capturing the peaks and troughs.
# r2_score
r2_result = r2_score(test_data['Thousands of Passengers'], test_data_stationary['Inverted_1d_AR'])
r2_result
# mean absolute error
mean_absolute_error(test_data['Thousands of Passengers'], test_data_stationary['Inverted_1d_AR'])
Looking at the r2_score, it is negative, which means poor prediction. So we can say that AR is too simple a model for this dataset: it doesn't address the trend and seasonality in the series.
# train with whole dataset(stationary)
ar_final_model = AR(stat_data['Thousands of Passengers']).fit(maxlag= 2, ic='aic')
#forecast the next 24 months
ar_forecast = ar_final_model.predict(start=stat_data.index[-1]+1,
                                     end=stat_data.index[-1]+24,
                                     dynamic=False)
#invert to the original scale (anchor on the last observed value)
original_scale = df['Thousands of Passengers'].iloc[-1] + ar_forecast.cumsum()
#create plots
ax= df['Thousands of Passengers'].plot(legend= True, label='Original data', title='Next 24 month Forecast', figsize=(15,7))
original_scale.plot(legend= True, label='AR_Forecast')
ax.legend(loc=2);
Very bad forecasting, right?
An MA model is a linear regression of the current value of the series against the current and previous (unobserved) white-noise error terms.
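In equation form, the MA(q) model, with white-noise errors $\varepsilon_t$:

$$y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}$$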
#fit model
ma_model = ARMA(train_data_stationary['Thousands of Passengers'], order=(0,4)).fit()
#predict test data
ma_prediction = ma_model.predict(start= test_data_stationary.index[0],
end= test_data_stationary.index[-1],
dynamic=False)
# put into dataframe
test_data_stationary['Prediction_1D_MA'] = ma_prediction
#invert data to original scale ( since predicted result is 1 differencing)
test_data_stationary['Inverted_1d_MA'] = df['Thousands of Passengers'].iloc[-len(test_data_stationary)-1] + test_data_stationary['Prediction_1D_MA'].cumsum()
test_data_stationary.head()
#create plots between test and prediction
test_data.plot(legend=True, label = 'Test Data', figsize=(15,7)) # testdata plot
test_data_stationary['Inverted_1d_MA'].plot(legend=True, label = 'MA Predicton')
plt.show()
# r2_score
arma_r2_score = r2_score(test_data['Thousands of Passengers'], test_data_stationary['Inverted_1d_MA'])
arma_r2_score
It is the combination of AR and MA: AR is a linear combination of lagged values of order 'p', and MA uses the past forecast errors of order 'q' in a regression-like model.
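Putting the two parts together, ARMA(p,q) in equation form:

$$y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}$$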
We can either fit the original series directly with an ARIMA model, or obtain the same result by first making the data stationary and passing it to ARMA. We have already created the stationary data, so let's use it with ARMA.
#tune parameter for ARMA
auto_arima(train_data_stationary, seasonal=False).summary()
It suggests an ARMA(2,4) model.
#fit model
arma_model = ARMA(train_data_stationary['Thousands of Passengers'], order=(2,4)).fit()
#predict test data
arma_prediction = arma_model.predict(start= test_data_stationary.index[0],
end= test_data_stationary.index[-1],
dynamic=False)
# put into dataframe
test_data_stationary['Prediction_1D_ARMA'] = arma_prediction
#invert data to original scale ( since predicted result is 1 differencing)
test_data_stationary['Inverted_1d_ARMA'] = df['Thousands of Passengers'].iloc[-len(test_data_stationary)-1] + test_data_stationary['Prediction_1D_ARMA'].cumsum()
test_data_stationary.head()
#create plots between test and prediction
test_data.plot(legend=True, label = 'Test Data',figsize=(15,7)) # testdata plot
test_data_stationary['Inverted_1d_ARMA'].plot(legend=True, label = 'ARMA Predicton')
plt.show()
It looks like this model also doesn't learn much from this dataset; its predictions are almost linear.
test_data_stationary['Inverted_1d_ARMA'].mean()
test_data.mean()
# r2_score
arma_r2_score = r2_score(test_data['Thousands of Passengers'], test_data_stationary['Inverted_1d_ARMA'])
arma_r2_score
# absolute mean error
mean_absolute_error(test_data['Thousands of Passengers'], test_data_stationary['Inverted_1d_ARMA'])
The r2_score is 0.14, so the model still doesn't predict well. That is expected, since the dataset has seasonality, which ARMA can't handle.
# train with whole dataset(stationary)
arma_model = ARMA(stat_data['Thousands of Passengers'], order=(2,4)).fit()
#forecast the next 24 months
arma_forecast = arma_model.predict(start=stat_data.index[-1]+1,
                                   end=stat_data.index[-1]+24,
                                   dynamic=False)
#invert to the original scale (anchor on the last observed value)
original_arma_forecast = df['Thousands of Passengers'].iloc[-1] + arma_forecast.cumsum()
#create plots
ax= df['Thousands of Passengers'].plot(legend= True, label='Original_Data', title='Next 24 months Forecast', figsize=(15,7))
original_arma_forecast.plot(legend= True, label='ARMA_Forecast')
ax.legend(loc=2);
This model combines AR, I and MA, where AR = autoregression, I = integration and MA = moving average. The integration step differences the non-stationary series to make it stationary: ARIMA(p,d,q) is equivalent to ARMA(p,q) applied to the series after differencing it d times. This model addresses trend but not seasonality.
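Equivalently, with the backshift operator $B$ (where $B y_t = y_{t-1}$), ARIMA(p,d,q) fits an ARMA(p,q) model to the differenced series:

$$y'_t = (1-B)^d\, y_t,\qquad \text{e.g. } y'_t = y_t - y_{t-1} \text{ for } d=1$$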
#tune parameter for ARIMA
auto_arima(df['Thousands of Passengers'], seasonal=False).summary()
ARIMA(p,d,q) is the same as ARMA(p,q) after differencing the data with order 'd'. We will see this below.
#build model
arima_mode = ARIMA(train_data['Thousands of Passengers'], order=(2,1,4)).fit()
#predict the test data
arima_prediction = arima_mode.predict(start=test_data.index[0],
end=test_data.index[-1],
typ='levels',
dynamic=False)
#create plots with test_data and prediction
test_data['Thousands of Passengers'].plot(legend=True, label='Test_Set', figsize=(13,7))
arima_prediction.plot(legend=True, label='ARIMA prediction')
plt.show()
# r2_score
arima_r2_score = r2_score(test_data['Thousands of Passengers'], arima_prediction)
arima_r2_score
#mean_absolute
absmean = mean_absolute_error(test_data['Thousands of Passengers'], arima_prediction)
absmean
We got the exact same result from the two models: ARMA(2,4) fitted on the 1-differenced data and ARIMA(2,1,4) fitted on the original (undifferenced) data.
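As a quick sanity check (a sketch using the variables defined above; the two fits may differ by small numerical tolerances):
# the inverted ARMA(2,4) predictions should coincide with the ARIMA(2,1,4) 'levels' predictions
print(np.allclose(test_data_stationary['Inverted_1d_ARMA'].values,
                  arima_prediction.values))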
# train with whole dataset
arima_model_f = ARIMA(df['Thousands of Passengers'], order=(2, 1, 4)).fit()
#forecast next 24 months
arima_prediction_f = arima_model_f.predict(start=len(df),
end=len(df)+23,
typ='levels',
dynamic=False)
#create plots
ax= df['Thousands of Passengers'].plot(title= 'Next 24 months Forecast',legend= True, label='Original', figsize=(15,7))
arima_prediction_f.plot(legend= True, label='ARIMA_Forecast')
ax.legend(loc=2)
plt.show()
This model addresses the seasonality of the series. It has an additional parameter set (P,D,Q)m, where P, D and Q are the seasonal autoregression, differencing and moving-average orders respectively, and 'm' is the number of data points (rows) in each seasonal cycle.
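In operator form (with backshift operator $B$ and, for this monthly data, m = 12), the multiplicative SARIMA(p,d,q)(P,D,Q)m model is:

$$\phi_p(B)\,\Phi_P(B^m)\,(1-B)^d\,(1-B^m)^D\, y_t = \theta_q(B)\,\Theta_Q(B^m)\,\varepsilon_t$$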
#tune parameters for SARIMA; m=12 since one seasonal cycle spans 12 monthly observations
auto_arima(df['Thousands of Passengers'], seasonal=True, m=12).summary()
# use the suggested parameters
# note: the seasonal_order below is an assumed example for monthly data;
# substitute the (P,D,Q,m) that auto_arima suggests above
sarima_mode = SARIMAX(train_data['Thousands of Passengers'],
                      order=(2, 1, 2),
                      seasonal_order=(0, 1, 1, 12)).fit()
#predict the test data
sarima_prediction = sarima_mode.predict(start=len(train_data),
                                        end=len(train_data)+len(test_data)-1,
                                        dynamic=False)
#create plots
ax= df['Thousands of Passengers'].iloc[-80:].plot(legend= True, label='Original',figsize=(15,7))
test_data['Thousands of Passengers'].plot(legend= True, label='Test_data')
sarima_prediction.plot(legend= True, label='SARIMA_prediction')
ax.legend(loc=2);
#r2_score
sarima_r2_score= r2_score(test_data['Thousands of Passengers'], sarima_prediction)
sarima_r2_score
#mean_absolute
absmean = mean_absolute_error(test_data['Thousands of Passengers'], sarima_prediction)
absmean
# test data std
test_data.std()
We can see that the SARIMA model gives a better result than ARIMA, since SARIMA addresses the seasonality that ARIMA lacks.
# train with the whole dataset
sarima_model = SARIMAX(df['Thousands of Passengers'],
                       order=(2, 1, 2),
                       seasonal_order=(0, 1, 1, 12)).fit()  # same assumed seasonal_order as above
#forecast the next 24 months
sarima_prediction = sarima_model.predict(start=len(df),
                                         end=len(df)+23,
                                         dynamic=False)
#create plots
ax= df['Thousands of Passengers'].plot(title= 'Next 24 months Forecast',legend= True, label='Original',figsize=(15,7))
sarima_prediction.plot(legend= True, label='SARIMA_prediction')
ax.legend(loc=2)
plt.show()
Comparing these models (AR, MA, ARMA/ARIMA, SARIMA), the results improve with the SARIMA model. I have also tried this dataset with the Holt-Winters method, which gave a very good result with an additive trend (mean_absolute_error = 25.8, a bit lower than the SARIMA result). You can find the Holt-Winters results in the other Jupyter notebook, 'Time-Series-Analysis (Statsmodels).ipynb'.
Multivariate Time Series Analysis
If we want to predict multiple time series over the same time stamps, we can use the VAR and VARMA models. The series should be correlated with each other, meaning each has influence on the others' predictions. In other words, the series (variables) are endogenous in nature.
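For example, a VAR(1) model for two series regresses each variable on the lagged values of itself and of the other series:

$$y_{1,t} = c_1 + \phi_{11} y_{1,t-1} + \phi_{12} y_{2,t-1} + \varepsilon_{1,t}$$
$$y_{2,t} = c_2 + \phi_{21} y_{1,t-1} + \phi_{22} y_{2,t-1} + \varepsilon_{2,t}$$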
For this task, we will use the air-pollution dataset. It consists of hourly updates of 12 attributes between Jan 1st, 2010 and Dec 31st, 2014, including datetime, pollution (pm2.5), dew point (DEWP), temperature (TEMP), pressure (PRES), wind direction (cbwd), wind speed (Iws), snow (Is) and rain (Ir). The dataset is taken from here.
#load dataset
df_2 = pd.read_csv('Data/airpollution.csv')
df_2.head()
df_2.dtypes
#preprocess data
def preprocess_data(data):
    """
    data: pandas dataframe
    returns a cleaned dataframe
    """
    # build a datetime index from the year/month/day/hour columns
    data.set_index(pd.to_datetime(data[['year', 'month', 'day', 'hour']]), inplace=True)
    data.index.name = 'Date_time'  # index name
    # drop the columns 'No', 'year', 'month', 'day', 'hour'
    data.drop(columns=['No', 'year', 'month', 'day', 'hour'], inplace=True)
    # rename the columns
    data.rename(columns={'pm2.5': 'pollution',
                         'DEWP': 'dewp',
                         'TEMP': 'temp',
                         'PRES': 'press',
                         'cbwd': 'wnd_dir',
                         'Iws': 'wnd_spd',
                         'Is': 'snow',
                         'Ir': 'rain'
                         }, inplace=True)
    # fill NaN with 0 and drop the first 24 rows (column 'pm2.5' is NaN for the first 24 hours)
    data.fillna(0, inplace=True)
    data = data[24:]
    return data
# call the preprocess_data() function
clean_data = preprocess_data(df_2)
clean_data.head()
# check null
clean_data.isnull().sum()
#check data types
clean_data.info()
We can see that no frequency is set on the datetime index. Let's set an hourly frequency, since we have hourly data.
#set hourly frequency on the index
clean_data.index.freq = 'H'
clean_data.info()
# Visualize all time series
fig, axes = plt.subplots(nrows=int(len(clean_data.columns)/2), ncols=2, dpi=400, figsize=(10,7))
for i, ax in enumerate(axes.flatten()):  # loop over all subplots
    data = clean_data[clean_data.columns[i]]
    ax.plot(data, linewidth=1)
    ax.autoscale(axis='x', tight=True)  # tight fit to x-axis
    # decorations
    ax.set_title(clean_data.columns[i])
    ax.tick_params(labelsize=6, colors='r')
plt.tight_layout();  # fit subplots nicely in the figure
From the plots above, pretty much every series looks stationary (no obvious trend or seasonality), but we can't be sure from visual inspection alone. So we will run a statistical test for stationarity later in this work.
#visualize the 'pollution' and 'temp' series
series_to_visualize = ['pollution', 'temp']
for x in series_to_visualize:
    clean_data[x].plot(legend=True, figsize=(10,7))
For this, I have used the ADF test to check whether the series are stationary (no unit root) or non-stationary (unit root). We will take only two series for the rest of this work.
# stationarity check for the series 'pollution' and 'temp'
for x in clean_data[['pollution', 'temp']].columns:
    print(f'Stationary check for {x}:')
    adfuller_test(clean_data[x])
    print()
Based on the result, both series are stationary.
The VAR model is based on the assumption that the time series influence one another, meaning the current value of each series can be predicted from its own past values and the past values of the other series.
Thus, we will use the Granger causality test to find series that influence each other. We run the test both ways, i.e., test(a,b) and test(b,a), to see whether the influence goes in both directions.
from statsmodels.tsa.stattools import grangercausalitytests
# tests causation to pollution by temp
test =grangercausalitytests(clean_data[[ 'pollution', 'temp']], 3)
test
Note: we can see that temp has some influence (causation) on the pollution series.
# tests causation to temp by pollution
test_1 =grangercausalitytests(clean_data[[ 'temp','pollution']], 3)
test_1
Note: from this result, we conclude that pollution has causation on temp. So the two series influence each other.
#last 72 hours for the test set, remaining for the train set
train_endo = clean_data.iloc[:-72, [0,2]] # columns= pollution, temp
test_endo = clean_data.iloc[-72:, [0,2]]
#create VAR model
var_model =VAR(train_endo)
var_result = var_model.fit(maxlags=100, ic='aic')# var lag order selection
var_predica = var_result.forecast(y = train_endo.values[-var_result.k_ar:], steps=len(test_endo)) # requires lag_order previous values
#prediction in dataframe
var_pred_dfa = pd.DataFrame({ 'pollution_pd': var_predica[:, 0],'temp_pd': var_predica[:, 1],})
#set index to match the test period
var_pred_dfa.set_index(test_endo.index, inplace=True)
var_pred_dfa.head(3)
#visualize prediction and test series
var_pred_dfa['temp_pd'].plot( legend=True);
test_endo['temp'].plot(figsize=(10,7), legend=True);
#visualize prediction and test series
var_pred_dfa['pollution_pd'].plot(label= 'pollution_pred', legend=True);
test_endo['pollution'].plot(figsize=(10,7), label = 'original_pollution',legend=True);
#mean absolute error for each series
for x in zip(test_endo.columns, var_pred_dfa.columns):
    print(f'{x[0]} mean_absolute_error:')
    print(np.round(mean_absolute_error(test_endo[x[0]], var_pred_dfa[x[1]]), 3))
    print()
It is the combination of VAR and MA: VAR is a linear combination of lagged values of order 'p', and MA uses the past forecast errors of order 'q' in a regression-like model. Like VAR, this model is used to forecast multiple variables at the same time. To find suitable orders (p and q), we will use the auto_arima() function on each series.
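In vector form, for a k-dimensional series $\mathbf{y}_t$, VARMA(p,q) is:

$$\mathbf{y}_t = \mathbf{c} + \sum_{i=1}^{p} \Phi_i\, \mathbf{y}_{t-i} + \boldsymbol{\varepsilon}_t + \sum_{j=1}^{q} \Theta_j\, \boldsymbol{\varepsilon}_{t-j}$$

where $\Phi_i$ and $\Theta_j$ are $k \times k$ coefficient matrices.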
# find the order (p,q) for each series
for x in clean_data.iloc[:, [0, 2]].columns:
    print(f'Order for {x}')
    print(auto_arima(clean_data[x], maxiter=1000))
    print()
Note: the results above show that 'temp' is non-stationary and 'pollution' is stationary. Let's feed the series to the model without differencing, since the ADF test already showed both series to be stationary at the 95% confidence level.
Since the suggested orders differ between the two series, we will try both and take as the final model the one that gives the lower AIC score.
# try building the model with order=(3,2)
varma_model =VARMAX(train_endo, order=(3,2), trend='c')
varma_result = varma_model.fit(maxiter=1000, disp= False)
varma_result.summary()
# try with order=(3,3)
varma_model_2 =VARMAX(train_endo, order=(3,3), trend='c')
varma_result_2 = varma_model_2.fit(maxiter=1000, disp= False)
varma_result_2.summary()
Note: the model with order (3,3) gave the lower AIC score, so we use it as our final model.
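Before plotting, we compare the AIC scores and produce the 72-step forecast (a minimal sketch: varma_forecast holds the out-of-sample forecast from the selected model; VARMAX results provide a forecast(steps=...) method that returns a DataFrame with the same columns as the training data).
#compare AIC scores and forecast the next 72 hours with the lower-AIC model
print(f'AIC order=(3,2): {varma_result.aic:.2f}')
print(f'AIC order=(3,3): {varma_result_2.aic:.2f}')
varma_forecast = varma_result_2.forecast(steps=len(test_endo))
varma_forecast.head(3)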
#visualize pollution series
plt.figure(figsize=(9,6))
varma_forecast['pollution'].plot(label= 'forecast_pollu', legend=True);
test_endo['pollution'].plot( label= 'original_pollu',legend=True)
plt.show()
#visualize temp series
plt.figure(figsize=(9,6))
varma_forecast['temp'].iloc[:72].plot(label= 'forecast_temp', legend=True)
test_endo['temp'].plot( label= 'original_temp',legend=True)
plt.show()
#mean absolute error for each series
for x in zip(test_endo.columns, varma_forecast.columns):
    print(f'{x[0]} mean_absolute_error:')
    print(np.round(mean_absolute_error(test_endo[x[0]], varma_forecast[x[1]]), 3))
    print()
In this work, I learnt how to model univariate as well as multivariate time series with these statistical models. In addition, the stationarity check and the Granger causality test were performed to check whether the series were suitable to feed directly into the models.