Time Series Analysis and Forecasting using Statsmodels

In this notebook, we explore the forecasting models from statsmodels, for example AR, MA, ARMA/ARIMA, SARIMA, VAR and VARMAX. The main motivation behind this task is to understand how to model time series data with these algorithms. The notebook has two sections:
  1) univariate time series forecasting
  2) multivariate forecasting (forecasting multiple series at the same time).

For univariate forecasting, I have used the 'airline passengers' dataset with all the univariate algorithms (AR, MA, ARMA/ARIMA, SARIMA). Similarly, VAR and VARMAX are used for modelling the multivariate series. Lastly, we figure out which algorithm is most suitable for each data type. In addition, this notebook shows how to check stationarity (whether the series has a unit root or not) with the augmented Dickey-Fuller test.

In [1]:
#import required libraries
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from pylab import rcParams #set figure size

from pmdarima import auto_arima # for finding arima orders

# for metric
from statsmodels.tools.eval_measures import rmse
from sklearn.metrics import mean_squared_error,mean_absolute_error, r2_score

from statsmodels.tsa.stattools import adfuller # for stationarity test
from statsmodels.tsa.statespace.tools import diff 
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf, month_plot,quarter_plot #plots

#for creating models
from statsmodels.tsa.ar_model import AR
from statsmodels.tsa.arima_model import ARMA, ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.api import VAR, VARMAX
In [2]:
#ignore warnings
import warnings
warnings.filterwarnings("ignore")
In [ ]:
 

Section 1

Univariate Time Series Analysis

In [3]:
#load dataset
df= pd.read_csv('Data/airline_passengers.csv', index_col='Month', parse_dates=True)
df.index.freq = 'MS'# since data is monthly start frequency
df.head()
Out[3]:
Thousands of Passengers
Month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121
In [ ]:
 
In [1189]:
#lets visualize the data
ax = df['Thousands of Passengers'].plot(figsize=(12,5), legend= True, label='Original')
df['Thousands of Passengers'].rolling(window=12).mean().plot(legend= True, label='12-Month-Mean')
df['Thousands of Passengers'].rolling(window=12).std().plot(legend= True, label='12-Month-Std')

ax.autoscale(axis='x',tight=True) # fit to x-axis
ylabel='No. of passengers'
xlabel = 'Months'
ax.set(xlabel=xlabel, ylabel=ylabel) #name axis
plt.show()

The plot clearly shows that this is a non-stationary series, since the mean and standard deviation keep varying over time. We can also see that the series has a clear upward trend. We will perform a statistical test for stationarity later in this work.

In [ ]:
 

ETS Decomposition

Let's look at the general behaviour of the series. From the ETS decomposition, we may get a clue as to whether the time series has a trend and/or seasonality.

In [715]:
# let's do the seasonal decomposition
rcParams['figure.figsize']=15,7 # set figure size
sea_decom = seasonal_decompose(df['Thousands of Passengers'], model='additive')
sea_decom.plot();

The figures show a clear upward trend and possibly some seasonality.

In [ ]:
 

Auto/Partial Correlation Plot

These plots are usually used for finding 'p' and 'q' for an ARIMA model, but in this notebook we will use the auto_arima() function instead of this classical method. These plots are also useful for checking whether the series is correlated with shifted (lagged) copies of itself.

In [728]:
#create autocorrelation plot
plot_acf(df['Thousands of Passengers'], lags=40, title='ACF plot')
plt.show()

The figure shows that there is correlation as well as seasonality at certain lags.

In [1204]:
#create partial autocorrelation plot
plot_pacf(df['Thousands of Passengers'], lags=40, title='PACF plot');

Seasonality and quarterly plots

Let's figure out which months and quarters have low and high passenger numbers.

In [1011]:
#monthly plot
month_plot(df['Thousands of Passengers']);
In [1012]:
#quarterly plot
quarter_plot(df['Thousands of Passengers'].resample(rule='Q').mean());

From both the quarterly and monthly plots, we can see ups and downs in the number of airline passengers. The months of June and July show a high peak, as does the 3rd quarter.

In [ ]:
 
In [ ]:
 

Stationarized Time Series

We have to stationarize the time series data before feeding it to the algorithms. From the figures above, it is clear that the series has seasonality and trend. In other words, the series is non-stationary, meaning that the mean, variance and autocorrelation change over time.

But let's run a statistical test to show whether the series is stationary or non-stationary (shows seasonality or trend). For this, the augmented Dickey-Fuller test is used. It is a hypothesis test whose null hypothesis is 'non-stationary'. We will use a significance level of 0.05 (95% confidence level).
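For reference, the regression behind the ADF test has roughly the following form (a simplified sketch; statsmodels picks the lag length k automatically via AIC in the call below):

$$\Delta y_t = \alpha + \gamma\, y_{t-1} + \sum_{i=1}^{k} \delta_i\, \Delta y_{t-i} + \varepsilon_t$$

The null hypothesis is $\gamma = 0$ (the series has a unit root, i.e. it is non-stationary); a p-value at or below 0.05 lets us reject it.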

In [17]:
# stationary check
def adfuller_test(time_series):
    """
    time_series = a single time series
    prints the ADF report and the hypothesis-test conclusion
    """

    # run the augmented Dickey-Fuller test
    result = adfuller(time_series, autolag='AIC')

    # assign each value a name and store them as a pandas Series
    labels = ['ADF_test','p-value','lags used','observations']
    out = pd.Series(result[0:4], index=labels)

    for key, val in result[4].items():  # add the critical values
        out[f'critical value ({key})'] = val

    print(out)

    # check the hypothesis
    if np.round(result[1], 2) <= 0.05:
        print("Strong evidence against the null hypothesis; reject it")
        print("It has no unit root and is stationary")
    else:
        print("Weak evidence against the null hypothesis; fail to reject it")
        print("It has a unit root and is non-stationary")
In [291]:
#check stationary 
adfuller_test(df['Thousands of Passengers'])
ADF_test                  0.815369
p-value                   0.991880
lags used                13.000000
observations            130.000000
critical value (1%)      -3.481682
critical value (5%)      -2.884042
critical value (10%)     -2.578770
dtype: float64
Weak evidence against the null hypothesis; fail to reject it
It has a unit root and is non-stationary

The result tells us that this time series is non-stationary, so we have to make it stationary. I have used differencing to do so.

In [ ]:
 

Models

Let's build models with different algorithms: AR, MA, ARMA/ARIMA and SARIMA.

Note: we already know that this dataset is non-stationary (it shows trend and seasonality), so it is not well suited to AR, MA or ARMA, since they cannot learn trend and seasonality. But let's still see how they do in prediction.

In [16]:
'''
Note: this is just for my own understanding of the differencing order at which the series becomes stationary
'''

# return stationary data with 95% confidence level (0.05 significance level)
def return_stationary_series(series, k=2):
    '''
    series = time series data
    k = maximum number of differencing steps to try
    returns the differenced (stationary) time series data
    '''
    for d in range(1, k):
        df_stat = diff(series, k_diff=d)  # difference the data d times

        # check the p-value (rounded to 2 decimal places)
        if np.round(adfuller(df_stat)[1], 2) <= 0.05:
            print(f'Now the time-series is stationary at {d} difference')
            return df_stat
        
In [294]:
#return stationary series and plot
stationary_series =return_stationary_series(df['Thousands of Passengers'], 10)
stationary_series.plot(figsize=(17,7));
Now the time-series is stationary at 1 difference

Now the series looks stationary. We can verify this by calling the adfuller_test() function.

In [295]:
# testing for stationary
adfuller_test(stationary_series)
ADF_test                 -2.829267
p-value                   0.054213
lags used                12.000000
observations            130.000000
critical value (1%)      -3.481682
critical value (5%)      -2.884042
critical value (10%)     -2.578770
dtype: float64
Strong evidence against the null hypothesis; reject it
It has no unit root and is stationary
In [ ]:
 

Train/Test Split

In [396]:
#convert to dataframe
stat_data = stationary_series.to_frame() 

#split test(36 months) and train(remaining months)
train_data_stationary = stat_data.iloc[:-36] 
test_data_stationary = stat_data.iloc[-36:] 

#original data(non-stationary)
train_data= df.iloc[:-36]
test_data= df.iloc[-36:] 
 
In [ ]:
 

Autoregression (AR) Model

An autoregression model expresses the current value of the series as a linear combination of its past values, i.e. the series is regressed against a set of its own lagged values of order 𝑝. We are using the auto_arima() function to find the best 'p' for this dataset.
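As a quick reference, an AR(p) model can be sketched in standard notation as

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t,$$

where $c$ is a constant and $\varepsilon_t$ is white noise.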

In [397]:
#suggest  model
auto_arima(stat_data, seasonal=False).summary() # use stationary data
Out[397]:
ARMA Model Results
Dep. Variable: y No. Observations: 143
Model: ARMA(2, 4) Log Likelihood -661.292
Method: css-mle S.D. of innovations 23.715
Date: Tue, 19 Nov 2019 AIC 1338.585
Time: 21:04:03 BIC 1362.288
Sample: 0 HQIC 1348.216
coef std err z P>|z| [0.025 0.975]
const 2.6854 0.134 19.979 0.000 2.422 2.949
ar.L1.y 0.9142 0.082 11.131 0.000 0.753 1.075
ar.L2.y -0.7709 0.080 -9.629 0.000 -0.928 -0.614
ma.L1.y -0.8356 0.101 -8.265 0.000 -1.034 -0.637
ma.L2.y 0.3271 0.122 2.691 0.008 0.089 0.565
ma.L3.y 0.3086 0.114 2.699 0.008 0.085 0.533
ma.L4.y -0.8000 0.079 -10.112 0.000 -0.955 -0.645
Roots
Real Imaginary Modulus Frequency
AR.1 0.5929 -0.9724j 1.1389 -0.1628
AR.2 0.5929 +0.9724j 1.1389 0.1628
MA.1 1.0001 -0.0000j 1.0001 -0.0000
MA.2 0.3178 -0.9481j 1.0000 -0.1985
MA.3 0.3178 +0.9481j 1.0000 0.1985
MA.4 -1.2500 -0.0000j 1.2500 -0.5000
In [ ]:
 

Build AR model

In [757]:
# AR model
model_ar = AR(train_data_stationary['Thousands of Passengers']) # use stationary data
#best_lag = model_ar.select_order(maxlag= 20,trend='nc', ic='aic') # select best maxlag
#print(f'best lag: {best_lag}')

# fit model with suggested p
model_ar_fit = model_ar.fit(maxlag= 2) 

#predict test data
ar_prediction = model_ar_fit.predict(start= test_data_stationary.index[0], 
                                     end= test_data_stationary.index[-1], 
                                     dynamic=False)
ar_prediction.head()
Out[757]:
1958-01-01    23.500473
1958-02-01     2.308398
1958-03-01    -3.271831
1958-04-01     0.140311
1958-05-01     2.762978
Freq: MS, dtype: float64
In [ ]:
 

Invert Transformation

Since our predicted values are on the 1-difference scale, we have to transform them back to the original scale before plotting and forecasting. You can obtain this either by 1) summing the columns (original value + stationary/predicted values) or 2) adding the most recent value of the training data to the cumulative sum (cumsum()) of the predicted differences.
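As a small illustration of option 2, here is a minimal toy sketch (made-up numbers, not the airline data) showing that adding the last known value to the cumulative sum of the differences recovers the original scale:

In [ ]:
import pandas as pd

# toy original series and its first difference
original = pd.Series([100, 110, 125, 120, 140])
differenced = original.diff().dropna()  # [10, 15, -5, 20]

# treat original[0] as the most recent known value and the differences as 'predictions'
last_known_value = original.iloc[0]
recovered = last_known_value + differenced.cumsum()
print(recovered.tolist())  # [110.0, 125.0, 120.0, 140.0] -> original values restored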

In [758]:
#add the original-scale values to the dataframe
test_data_stationary['Passengers(original)'] = test_data['Thousands of Passengers'] # original scale 

# store predicted value in dataframe
test_data_stationary['Prediction_1d_AR'] = ar_prediction 

# add the most recent training value to the cumulative sum of the predicted differences
test_data_stationary['Inverted_1d_AR'] = df['Thousands of Passengers'].iloc[-len(test_data_stationary)-1] + test_data_stationary['Prediction_1d_AR'].cumsum()

#test_data_stationary['Inverted_Values'] = df['Thousands of Passengers'].iloc[-len(test_data_stationary)-1:-1].values + test_data_stationary['Prediction_1d_scale'].values
test_data_stationary.head()
Out[758]:
Thousands of Passengers Passengers(original) Prediction_1d_AR Inverted_1d_AR Prediction_1D_MA Inverted_1d_MA Prediction_1D_ARMA Inverted_1d_ARMA
Month
1958-01-01 4.0 340 23.500473 359.500473 34.548432 370.548432 49.208305 385.208305
1958-02-01 -22.0 318 2.308398 361.808871 4.991931 375.540362 28.741069 413.949374
1958-03-01 44.0 362 -3.271831 358.537040 9.462509 385.002871 10.349717 424.299091
1958-04-01 -14.0 348 0.140311 358.677351 -9.970430 375.032441 -34.201682 390.097409
1958-05-01 15.0 363 2.762978 361.440329 2.501115 377.533556 -37.225431 352.871978
In [ ]:
 
In [759]:
# create plot with test and predicted
test_data.plot(legend= True, label= 'test data')
test_data_stationary['Inverted_1d_AR'].plot(legend= True, label='AR prediction', figsize=(13,5))
plt.show()

Pretty bad, right? We already know that this data has some seasonality, and the AR model cannot address that. The figure above also shows that this model does a very poor job of capturing the peaks and troughs.

In [ ]:
 

Evaluate the Model

In [760]:
# r2_score
r2_result = r2_score(test_data['Thousands of Passengers'], test_data_stationary['Inverted_1d_AR'])
r2_result
Out[760]:
-0.026951254751961695
In [761]:
# mean  absolute error
mean_absolute_error(test_data['Thousands of Passengers'], test_data_stationary['Inverted_1d_AR'])
Out[761]:
56.63657955498565
In [ ]:
 

Note:

The r2_score is negative, which means a poor prediction. So we can say that AR is too simple a model for this dataset: it does not address the trend and seasonality in the series.

In [ ]:
 

Forecasting

In [762]:
# train with whole dataset(stationary)
ar_final_model = AR(stat_data['Thousands of Passengers']).fit(maxlag= 2, ic='aic')


#forecast next 24
ar_forecast = ar_final_model.predict(start= stat_data.index[-1]+1, 
                                      end=stat_data.index[-1]+24, 
                                      dynamic=False)

#invert to original scale, starting the cumulative sum from the last observed value
original_scale = df['Thousands of Passengers'].iloc[-1] + ar_forecast.cumsum()



#create plots
ax=  df['Thousands of Passengers'].plot(legend= True, label='Original data', title='Next 24 month Forecast', figsize=(15,7))
original_scale.plot(legend= True, label='AR_Forecast')

ax.legend(loc=2);

Very bad forecasting, right?

In [ ]:
 

Moving Average (MA) Model

An MA model is a linear regression of the current value of the series against the current and previous (observed) white-noise error terms.
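In standard notation, an MA(q) model can be sketched as

$$y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q},$$

where $\mu$ is the series mean and $\varepsilon_t$ is white noise.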

In [404]:
#fit model 
ma_model = ARMA(train_data_stationary['Thousands of Passengers'], order=(0,4)).fit()

#predict test data
ma_prediction = ma_model.predict(start= test_data_stationary.index[0], 
                                     end= test_data_stationary.index[-1],
                                     dynamic=False)

# put into dataframe
test_data_stationary['Prediction_1D_MA'] = ma_prediction

#invert data to original scale ( since predicted result is 1 differencing)
test_data_stationary['Inverted_1d_MA'] = df['Thousands of Passengers'].iloc[-len(test_data_stationary)-1] + test_data_stationary['Prediction_1D_MA'].cumsum()
test_data_stationary.head()
Out[404]:
Thousands of Passengers Passengers(original) Prediction_1d_AR Inverted_1d_AR Prediction_1D_MA Inverted_1d_MA
Month
1958-01-01 4.0 340 23.500473 359.500473 34.548432 370.548432
1958-02-01 -22.0 318 2.308398 361.808871 4.991931 375.540362
1958-03-01 44.0 362 -3.271831 358.537040 9.462509 385.002871
1958-04-01 -14.0 348 0.140311 358.677351 -9.970430 375.032441
1958-05-01 15.0 363 2.762978 361.440329 2.501115 377.533556
In [ ]:
 
In [406]:
#create plots between test and prediction
test_data.plot(legend=True, label = 'Test Data', figsize=(15,7)) # testdata plot
test_data_stationary['Inverted_1d_MA'].plot(legend=True, label = 'MA Predicton')
plt.show()
In [407]:
# r2_score
arma_r2_score = r2_score(test_data['Thousands of Passengers'], test_data_stationary['Inverted_1d_MA'])
arma_r2_score
Out[407]:
0.18479185296233125
In [ ]:
 

Autoregressive Moving Average (ARMA) Model

It is the combination of AR and MA: the AR part is a linear combination of lagged values of order 'p', and the MA part uses the past forecast errors of order 'q' in a regression-like model.

We can either fit the original series directly with an ARIMA model, or obtain the same result by making the data stationary first and passing it to ARMA. We have already converted the data to a stationary series, so let's use that data with ARMA.
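Putting the two parts together, an ARMA(p, q) model can be sketched as

$$y_t = c + \sum_{i=1}^{p} \phi_i\, y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j}.$$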

In [408]:
#tune parameter  for ARMA
auto_arima(train_data_stationary, seasonal=False).summary()
Out[408]:
ARMA Model Results
Dep. Variable: y No. Observations: 107
Model: ARMA(2, 4) Log Likelihood -467.805
Method: css-mle S.D. of innovations 18.250
Date: Tue, 19 Nov 2019 AIC 951.609
Time: 21:05:40 BIC 972.992
Sample: 0 HQIC 960.277
coef std err z P>|z| [0.025 0.975]
const 2.5392 0.157 16.165 0.000 2.231 2.847
ar.L1.y 0.9160 0.078 11.670 0.000 0.762 1.070
ar.L2.y -0.7821 0.073 -10.642 0.000 -0.926 -0.638
ma.L1.y -0.8360 0.109 -7.645 0.000 -1.050 -0.622
ma.L2.y 0.3351 0.151 2.226 0.028 0.040 0.630
ma.L3.y 0.2955 0.125 2.364 0.020 0.051 0.540
ma.L4.y -0.7943 0.110 -7.240 0.000 -1.009 -0.579
Roots
Real Imaginary Modulus Frequency
AR.1 0.5856 -0.9673j 1.1308 -0.1634
AR.2 0.5856 +0.9673j 1.1308 0.1634
MA.1 1.0001 -0.0000j 1.0001 -0.0000
MA.2 0.3153 -0.9490j 1.0000 -0.1989
MA.3 0.3153 +0.9490j 1.0000 0.1989
MA.4 -1.2587 -0.0000j 1.2587 -0.5000

It suggests an ARMA(2, 4) model.

In [409]:
#fit model 
arma_model = ARMA(train_data_stationary['Thousands of Passengers'], order=(2,4)).fit()

#predict test data
arma_prediction = arma_model.predict(start= test_data_stationary.index[0], 
                                     end= test_data_stationary.index[-1],
                                     dynamic=False)

# put into dataframe
test_data_stationary['Prediction_1D_ARMA'] = arma_prediction


#invert data to original scale ( since predicted result is 1 differencing)
test_data_stationary['Inverted_1d_ARMA'] = df['Thousands of Passengers'].iloc[-len(test_data_stationary)-1] + test_data_stationary['Prediction_1D_ARMA'].cumsum()
test_data_stationary.head()
Out[409]:
Thousands of Passengers Passengers(original) Prediction_1d_AR Inverted_1d_AR Prediction_1D_MA Inverted_1d_MA Prediction_1D_ARMA Inverted_1d_ARMA
Month
1958-01-01 4.0 340 23.500473 359.500473 34.548432 370.548432 49.208305 385.208305
1958-02-01 -22.0 318 2.308398 361.808871 4.991931 375.540362 28.741069 413.949374
1958-03-01 44.0 362 -3.271831 358.537040 9.462509 385.002871 10.349717 424.299091
1958-04-01 -14.0 348 0.140311 358.677351 -9.970430 375.032441 -34.201682 390.097409
1958-05-01 15.0 363 2.762978 361.440329 2.501115 377.533556 -37.225431 352.871978
In [412]:
#create plots between test and prediction
test_data.plot(legend=True, label = 'Test Data',figsize=(15,7)) # testdata plot

test_data_stationary['Inverted_1d_ARMA'].plot(legend=True, label = 'ARMA Predicton')
plt.show()

It looks like this model also doesn't learn much from this dataset; its predictions are almost linear.

In [413]:
test_data_stationary['Inverted_1d_ARMA'].mean()
Out[413]:
416.30909725267526
In [1256]:
test_data.mean()
Out[1256]:
Thousands of Passengers    428.5
dtype: float64

Evaluate model

In [1257]:
# r2_score
arma_r2_score = r2_score(test_data['Thousands of Passengers'], test_data_stationary['Inverted_1d_ARMA'])
arma_r2_score
Out[1257]:
0.1418787904840343
In [1301]:
# absolute mean error
mean_absolute_error(test_data['Thousands of Passengers'], test_data_stationary['Inverted_1d_ARMA'])
Out[1301]:
57.26592096364511

It has an r2_score of 0.14, so the model still doesn't predict well. That is expected, since the dataset has seasonality, which ARMA cannot handle.

In [ ]:
 

Future Forecast

In [416]:
# train with whole dataset(stationary)
arma_model = ARMA(stat_data['Thousands of Passengers'], order=(2,4)).fit()


#forecast next 24 months
arma_forecast = arma_model.predict(start=stat_data.index[-1]+1, 
                                      end=stat_data.index[-1]+24, 
                                      dynamic=False)

#invert to original scale, starting the cumulative sum from the last observed value
original_arma_forecast = df['Thousands of Passengers'].iloc[-1] + arma_forecast.cumsum()

#create plots
ax=  df['Thousands of Passengers'].plot(legend= True, label='Original_Data', title='Next 24 months Forecast', figsize=(15,7))
original_arma_forecast.plot(legend= True, label='ARMA_Forecast')
ax.legend(loc=2);
In [ ]:
 

Autoregressive Integrated Moving Average (ARIMA)

This model combines AR, I and MA, where AR = autoregression, I = integration (differencing) and MA = moving average. It converts the non-stationary series to a stationary one internally: ARIMA(p,d,q) is equivalent to ARMA(p,q) after differencing the original series d times. This model addresses trend but not seasonality.
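In other words, with $B$ the backshift operator ($B\,y_t = y_{t-1}$), ARIMA(p, d, q) fits an ARMA(p, q) model to the d-times differenced series:

$$y'_t = (1 - B)^d\, y_t, \qquad \text{ARIMA}(p,d,q)\ \text{on}\ y_t \;\equiv\; \text{ARMA}(p,q)\ \text{on}\ y'_t.$$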

In [417]:
#tune  parameter for ARIMA
auto_arima(df['Thousands of Passengers'], seasonal=False).summary()
Out[417]:
ARIMA Model Results
Dep. Variable: D.y No. Observations: 143
Model: ARIMA(2, 1, 4) Log Likelihood -661.292
Method: css-mle S.D. of innovations 23.715
Date: Tue, 19 Nov 2019 AIC 1338.585
Time: 21:07:36 BIC 1362.288
Sample: 1 HQIC 1348.216
coef std err z P>|z| [0.025 0.975]
const 2.6854 0.134 19.979 0.000 2.422 2.949
ar.L1.D.y 0.9142 0.082 11.131 0.000 0.753 1.075
ar.L2.D.y -0.7709 0.080 -9.629 0.000 -0.928 -0.614
ma.L1.D.y -0.8356 0.101 -8.265 0.000 -1.034 -0.637
ma.L2.D.y 0.3271 0.122 2.691 0.008 0.089 0.565
ma.L3.D.y 0.3086 0.114 2.699 0.008 0.085 0.533
ma.L4.D.y -0.8000 0.079 -10.112 0.000 -0.955 -0.645
Roots
Real Imaginary Modulus Frequency
AR.1 0.5929 -0.9724j 1.1389 -0.1628
AR.2 0.5929 +0.9724j 1.1389 0.1628
MA.1 1.0001 -0.0000j 1.0001 -0.0000
MA.2 0.3178 -0.9481j 1.0000 -0.1985
MA.3 0.3178 +0.9481j 1.0000 0.1985
MA.4 -1.2500 -0.0000j 1.2500 -0.5000
Note:

ARIMA(p,d,q) is the same as ARMA(p,q) after differencing the data with order 'd'. We will see this below.

In [ ]:
 
In [420]:
#build model
arima_mode = ARIMA(train_data['Thousands of Passengers'], order=(2,1,4)).fit()

#predict the test period
arima_prediction = arima_mode.predict(start=test_data.index[0], 
                                     end=test_data.index[-1],
                                     typ='levels',
                                     dynamic=False)

#create plots with test_data and prediction
test_data['Thousands of Passengers'].plot(legend=True, label='Test_Set', figsize=(13,7))
arima_prediction.plot(legend=True, label='ARIMA prediction')
plt.show()
In [ ]:
 

Evaluate

In [1299]:
# r2_score
arma_r2_score = r2_score(test_data['Thousands of Passengers'], arima_prediction)
arma_r2_score
Out[1299]:
0.14187879048403407
In [1300]:
#mean_absolute
absmean = mean_absolute_error(test_data['Thousands of Passengers'], arima_prediction)
absmean
Out[1300]:
57.26592096364512
Observation:

We got exactly the same result from the two models: ARMA(2,4) on the once-differenced data and ARIMA(2,1,4) on the original (undifferenced) data.

ARIMA Forecast

In [422]:
# train with whole dataset
arima_model_f = ARIMA(df['Thousands of Passengers'], order=(2, 1, 4)).fit()


#forecast next 24 months
arima_prediction_f = arima_model_f.predict(start=len(df), 
                                      end=len(df)+23,
                                      typ='levels',
                                      dynamic=False)

#create plots
ax=  df['Thousands of Passengers'].plot(title= 'Next 24 months Forecast',legend= True, label='Original', figsize=(15,7))
arima_prediction_f.plot(legend= True, label='ARIMA_Forecast')
ax.legend(loc=2)
plt.show()
In [ ]:
 

Seasonal Autoregressive Integrated Moving Average (SARIMA(p,d,q)(P,D,Q)m)

This model addresses the seasonality of the series. It has the additional parameters (P,D,Q)m, where P, D and Q are the seasonal autoregressive, differencing and moving-average orders respectively, and 'm' is the number of data points (rows) in each seasonal cycle.
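For monthly data like this, the seasonal period would be m = 12. As a minimal sketch (the seasonal order (1, 1, 1, 12) is chosen purely for illustration, and note that auto_arima only searches seasonal orders when m is passed), a fully seasonal model could be specified like this:

In [ ]:
from statsmodels.tsa.statespace.sarimax import SARIMAX
# from pmdarima import auto_arima
# auto_arima(df['Thousands of Passengers'], seasonal=True, m=12)  # would also search (P,D,Q,12)

# illustrative seasonal model: ARIMA(2,1,2) plus a (1,1,1) seasonal part with a 12-month cycle
sarima_seasonal = SARIMAX(train_data['Thousands of Passengers'],
                          order=(2, 1, 2),
                          seasonal_order=(1, 1, 1, 12)).fit()
sarima_seasonal.summary()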

In [423]:
#tune  parameter for SARIMAX
auto_arima(df['Thousands of Passengers'], seasonal=True).summary()
Out[423]:
Statespace Model Results
Dep. Variable: y No. Observations: 144
Model: SARIMAX(2, 1, 2) Log Likelihood -666.022
Date: Tue, 19 Nov 2019 AIC 1344.044
Time: 21:08:39 BIC 1361.821
Sample: 0 HQIC 1351.268
- 144
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 0.6619 0.193 3.425 0.001 0.283 1.041
ar.L1 1.6479 0.028 58.889 0.000 1.593 1.703
ar.L2 -0.9096 0.025 -36.755 0.000 -0.958 -0.861
ma.L1 -1.9079 0.355 -5.371 0.000 -2.604 -1.212
ma.L2 0.9977 0.371 2.689 0.007 0.270 1.725
sigma2 611.7553 246.970 2.477 0.013 127.704 1095.807
Ljung-Box (Q): 358.18 Jarque-Bera (JB): 2.78
Prob(Q): 0.00 Prob(JB): 0.25
Heteroskedasticity (H): 7.02 Skew: 0.33
Prob(H) (two-sided): 0.00 Kurtosis: 3.18


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [ ]:
 

Build SARIMAX Model

In [425]:
# use the suggested parameters
sarima_mode = SARIMAX(df['Thousands of Passengers'], order=(2, 1, 2)).fit()

#predict the test period
sarima_prediction = sarima_mode.predict(start=len(train_data), 
                                      end=len(train_data)+len(test_data)-1,
                                      typ='levels',
                                      dynamic=False)

#create plots
ax= df['Thousands of Passengers'].iloc[-80:].plot(legend= True, label='Original',figsize=(15,7))
test_data['Thousands of Passengers'].plot(legend= True, label='Test_data')
sarima_prediction.plot(legend= True, label='SARIMA_prediction')
ax.legend(loc=2);
In [ ]:
 

Model Evaluation

In [1307]:
#r2_score
sarima_r2_score= r2_score(test_data['Thousands of Passengers'], sarima_prediction)
sarima_r2_score
Out[1307]:
0.7650099434475135
In [1308]:
#mean_absolute
absmean = mean_absolute_error(test_data['Thousands of Passengers'], sarima_prediction)
absmean
Out[1308]:
31.545468954422674
In [1309]:
# test data std
test_data.std()
Out[1309]:
Thousands of Passengers    79.329152
dtype: float64

Note:

We can see that the SARIMA model gives a better result than ARIMA, since SARIMA addresses seasonality, which ARIMA lacks.

In [ ]:
 

Forecast with SARIMA Model

In [426]:
# train with whole dataset
sarima_model = SARIMAX(df['Thousands of Passengers'], order=(2, 1, 2)).fit()


#forecast next 24 months
sarima_prediction = sarima_model.predict(start=len(df), 
                                      end=len(df)+23, 
                                      dynamic=False)

#create plots
ax=  df['Thousands of Passengers'].plot(title= 'Next 24 months Forecast',legend= True, label='Original',figsize=(15,7))
sarima_prediction.plot(legend= True, label='SARIMA_prediction')
ax.legend(loc=2)
plt.show()

Conclusion:

When we compare these models (AR, MA, ARMA/ARIMA, SARIMA), the SARIMA model gives the best result. I have also tried the Holt-Winters method on this dataset, which performed very well with an additive trend (mean_absolute_error = 25.8, a bit lower than the SARIMA result). You can find the Holt-Winters results in the other Jupyter notebook 'Time-Series-Analysis (Statsmodels).ipynb'.

In [ ]:
 
In [ ]:
 
In [ ]:
 

Section 2

Multivariate Time Series Analysis

VAR(p) and VARMA(p,q)

If we want to predict multiple time series at the same time stamps, we can use the VAR and VARMA models. The series should be correlated with each other, meaning they influence each other's predictions. In other words, the series (variables) are endogenous in nature.
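For two series, for example, a VAR(1) model can be sketched as

$$\begin{aligned} y_{1,t} &= c_1 + a_{11}\, y_{1,t-1} + a_{12}\, y_{2,t-1} + \varepsilon_{1,t} \\ y_{2,t} &= c_2 + a_{21}\, y_{1,t-1} + a_{22}\, y_{2,t-1} + \varepsilon_{2,t} \end{aligned}$$

i.e. each variable is regressed on the lagged values of itself and of the other variable.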

For this task, we will use the air pollution dataset. It consists of hourly records of 12 attributes from Jan 1st, 2010 to Dec 31st, 2014. The attributes are the datetime, pollution (pm2.5), dew point (DEWP), temperature (TEMP), pressure (PRES), wind direction (cbwd), wind speed (Iws), snow (Is) and rain (Ir). The dataset is taken from here.

In [35]:
#load dataset
df_2 = pd.read_csv('Data/airpollution.csv')
df_2.head()
Out[35]:
No year month day hour pm2.5 DEWP TEMP PRES cbwd Iws Is Ir
0 1 2010 1 1 0 NaN -21 -11.0 1021.0 NW 1.79 0 0
1 2 2010 1 1 1 NaN -21 -12.0 1020.0 NW 4.92 0 0
2 3 2010 1 1 2 NaN -21 -11.0 1019.0 NW 6.71 0 0
3 4 2010 1 1 3 NaN -21 -14.0 1019.0 NW 9.84 0 0
4 5 2010 1 1 4 NaN -20 -12.0 1018.0 NW 12.97 0 0
In [37]:
df_2.dtypes
Out[37]:
No         int64
year       int64
month      int64
day        int64
hour       int64
pm2.5    float64
DEWP       int64
TEMP     float64
PRES     float64
cbwd      object
Iws      float64
Is         int64
Ir         int64
dtype: object
In [6]:
#preprocess data
def clean_data(data):
    """
    data: panda dataframe
    return clean dataframe
    """
    
    # make datatime as index 
    data.set_index( pd.to_datetime(data[['year','month', 'day', 'hour']]), inplace= True)
    data.index.name = 'Date_time' # index name
   
    # lets drop columns 'No', 'year', 'month', 'day' ,'hour'
    data.drop(columns=['No', 'year', 'month', 'day' ,'hour'], inplace=True)
    
    
    # rename the columns name
    data.rename(columns={'pm2.5':'pollution',
                         'DEWP':'dewp',
                         'TEMP':'temp',
                         'PRES': 'press',
                         'cbwd': 'wnd_dir',
                         'Iws' :'wnd_spd',
                         'Is': 'snow',
                         'Ir' : 'rain'
                        }, inplace=True)
    
    #fill NaN with 0 and drop the first 24 rows (the pm2.5 column is NaN for the first 24 hours)
    data.fillna(0, inplace=True)
    data = data[24:] 
    
    return data

    
In [7]:
# call clean_data() function
clean_data = clean_data(df_2)
clean_data.head()
Out[7]:
pollution dewp temp press wnd_dir wnd_spd snow rain
Date_time
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1 0
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2 0
In [8]:
# check null
clean_data.isnull().sum()
Out[8]:
pollution    0
dewp         0
temp         0
press        0
wnd_dir      0
wnd_spd      0
snow         0
rain         0
dtype: int64
In [9]:
#check data types
clean_data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 43800 entries, 2010-01-02 00:00:00 to 2014-12-31 23:00:00
Data columns (total 8 columns):
pollution    43800 non-null float64
dewp         43800 non-null int64
temp         43800 non-null float64
press        43800 non-null float64
wnd_dir      43800 non-null object
wnd_spd      43800 non-null float64
snow         43800 non-null int64
rain         43800 non-null int64
dtypes: float64(4), int64(3), object(1)
memory usage: 3.0+ MB

We can see that no frequency is set on the datetime index. Let's set the frequency to hourly, since we have hourly data.

In [10]:
#check data types
clean_data.index.freq = 'H'
clean_data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 43800 entries, 2010-01-02 00:00:00 to 2014-12-31 23:00:00
Freq: H
Data columns (total 8 columns):
pollution    43800 non-null float64
dewp         43800 non-null int64
temp         43800 non-null float64
press        43800 non-null float64
wnd_dir      43800 non-null object
wnd_spd      43800 non-null float64
snow         43800 non-null int64
rain         43800 non-null int64
dtypes: float64(4), int64(3), object(1)
memory usage: 3.0+ MB
In [ ]:
 
In [11]:
# Visualize all time series
fig, axes = plt.subplots(nrows= int(len(clean_data.columns)/2), ncols=2, dpi=400, figsize=(10,7))
for i, ax in enumerate(axes.flatten()): # loop all subplots
    data = clean_data[clean_data.columns[i]]
    ax.plot(data, linewidth=1)
    ax.autoscale(axis='x',tight=True) # tight fit to x-axis
    
    # Decorations
    ax.set_title(clean_data.columns[i])
    ax.tick_params(labelsize=6, colors='r')
    
plt.tight_layout(); # fit suplots nicely in the figure

From the above plots, it looks like pretty much every series is stationary (no obvious trend or seasonality), but we cannot be sure from the plots alone. So we will do a statistical test for stationarity later in this work.

In [ ]:
 
In [646]:
#visualize
series_to_visualize = ['pollution', 'temp']
for x in series_to_visualize:
    clean_data[x].plot(legend=True, figsize=(10,7))
In [ ]:
 

Stationarity Test

For this, I have used the ADF test to check whether each series is stationary (has no unit root) or non-stationary (has a unit root). We will use only two of the series for the rest of this work.

In [1086]:
# stationary check for series 'pollution' and 'temp'
for x in clean_data[['pollution', 'temp']].columns:
    print(f'Stationary check for {x}:')
    adfuller_test(clean_data[x])
    print()
Stationary check for pollution:
ADF_test                  -21.004109
p-value                     0.000000
lags used                  55.000000
observations            43744.000000
critical value (1%)        -3.430499
critical value (5%)        -2.861606
critical value (10%)       -2.566805
dtype: float64
Strong evidence against the null hypothesis; reject it
It has no unit root and is stationary

Stationary check for temp:
ADF_test                   -3.938179
p-value                     0.001771
lags used                  49.000000
observations            43750.000000
critical value (1%)        -3.430499
critical value (5%)        -2.861606
critical value (10%)       -2.566805
dtype: float64
Strong evidence against the null hypothesis; reject it
It has no unit root and is stationary

Based on the result, both series are stationary.

In [ ]:
 

Causation Test using Granger’s Causality

The VAR model is based on the assumption that the time series influence each other, i.e. the current value of a series can be predicted from its own past values and the past values of the other series.

Thus, we will use Granger's causality test to find series that influence each other. Here we run the test in both directions, i.e. on (a, b) and on (b, a), to see whether the two series influence each other.
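To keep the output readable, a small helper like the one below can pull out just the p-values per lag. This is only a sketch: it reads the 'ssr_ftest' entry from the dictionary that grangercausalitytests returns, and the helper name itself is made up here.

In [ ]:
from statsmodels.tsa.stattools import grangercausalitytests

def granger_pvalues(data, cause, effect, maxlag=3):
    """Return the ssr F-test p-value for 'cause -> effect' at each lag."""
    # grangercausalitytests checks whether the SECOND column Granger-causes the FIRST
    results = grangercausalitytests(data[[effect, cause]], maxlag, verbose=False)
    return {lag: round(res[0]['ssr_ftest'][1], 4) for lag, res in results.items()}

# e.g. granger_pvalues(clean_data, cause='temp', effect='pollution')
# should give p-values matching the report below: {1: 0.0048, 2: 0.0, 3: 0.0}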

In [12]:
from statsmodels.tsa.stattools import grangercausalitytests

# tests causation to  pollution by temp
test =grangercausalitytests(clean_data[[ 'pollution', 'temp']], 3)
test
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=7.9588  , p=0.0048  , df_denom=43796, df_num=1
ssr based chi2 test:   chi2=7.9593  , p=0.0048  , df=1
likelihood ratio test: chi2=7.9586  , p=0.0048  , df=1
parameter F test:         F=7.9588  , p=0.0048  , df_denom=43796, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=96.3906 , p=0.0000  , df_denom=43793, df_num=2
ssr based chi2 test:   chi2=192.8033, p=0.0000  , df=2
likelihood ratio test: chi2=192.3801, p=0.0000  , df=2
parameter F test:         F=96.3906 , p=0.0000  , df_denom=43793, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=101.4655, p=0.0000  , df_denom=43790, df_num=3
ssr based chi2 test:   chi2=304.4450, p=0.0000  , df=3
likelihood ratio test: chi2=303.3918, p=0.0000  , df=3
parameter F test:         F=101.4655, p=0.0000  , df_denom=43790, df_num=3
Out[12]:
{1: ({'ssr_ftest': (7.958789095176419, 0.00478759439334035, 43796.0, 1),
   'ssr_chi2test': (7.959334267504612, 0.004784000724375123, 1),
   'lrtest': (7.958611153590027, 0.004785912422502543, 1),
   'params_ftest': (7.958789095128304, 0.004787594393448114, 43796.0, 1.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1c26c9a828>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1c26c9ab38>,
   array([[0., 1., 0.]])]),
 2: ({'ssr_ftest': (96.39062583219199, 1.698062964460352e-42, 43793.0, 2),
   'ssr_chi2test': (192.8032621742445, 1.3592632413122363e-42, 2),
   'lrtest': (192.38013367931126, 1.6795183671528724e-42, 2),
   'params_ftest': (96.39062583218988, 1.698062964460352e-42, 43793.0, 2.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1c186b14a8>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1c1859ea90>,
   array([[0., 0., 1., 0., 0.],
          [0., 0., 0., 1., 0.]])]),
 3: ({'ssr_ftest': (101.46546140882215, 1.8768270218688147e-65, 43790.0, 3),
   'ssr_chi2test': (304.4450431597751, 1.085707707400526e-65, 3),
   'lrtest': (303.3917805660749, 1.8351786278643514e-65, 3),
   'params_ftest': (101.46546140880908, 1.8768270219012471e-65, 43790.0, 3.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1c19bbeb70>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1c26c814a8>,
   array([[0., 0., 0., 1., 0., 0., 0.],
          [0., 0., 0., 0., 1., 0., 0.],
          [0., 0., 0., 0., 0., 1., 0.]])])}

Note: we can see that temp has some influence (causation) on the pollution series.

In [1083]:
# tests causation to temp by pollution
test_1 =grangercausalitytests(clean_data[[ 'temp','pollution']], 3)
test_1
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=6.1561  , p=0.0131  , df_denom=43796, df_num=1
ssr based chi2 test:   chi2=6.1566  , p=0.0131  , df=1
likelihood ratio test: chi2=6.1561  , p=0.0131  , df=1
parameter F test:         F=6.1561  , p=0.0131  , df_denom=43796, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=12.9347 , p=0.0000  , df_denom=43793, df_num=2
ssr based chi2 test:   chi2=25.8724 , p=0.0000  , df=2
likelihood ratio test: chi2=25.8648 , p=0.0000  , df=2
parameter F test:         F=12.9347 , p=0.0000  , df_denom=43793, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=5.8557  , p=0.0005  , df_denom=43790, df_num=3
ssr based chi2 test:   chi2=17.5700 , p=0.0005  , df=3
likelihood ratio test: chi2=17.5665 , p=0.0005  , df=3
parameter F test:         F=5.8557  , p=0.0005  , df_denom=43790, df_num=3
Out[1083]:
{1: ({'ssr_ftest': (6.1561295560460545, 0.013099469644625956, 43796.0, 1),
   'ssr_chi2test': (6.156551247265987, 0.013092623763082732, 1),
   'lrtest': (6.156118593877181, 0.013095826770283996, 1),
   'params_ftest': (6.1561295560303035, 0.013099469644747352, 43796.0, 1.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1ddfe8c588>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1ddfe8c3c8>,
   array([[0., 1., 0.]])]),
 2: ({'ssr_ftest': (12.934726641969696, 2.422022223358371e-06, 43793.0, 2),
   'ssr_chi2test': (25.872406889913403, 2.4092298039642377e-06, 2),
   'lrtest': (25.864768208411988, 2.418449068080844e-06, 2),
   'params_ftest': (12.934726641949485, 2.4220222234016957e-06, 43793.0, 2.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1ddfe8cba8>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1ddfe8ce48>,
   array([[0., 0., 1., 0., 0.],
          [0., 0., 0., 1., 0.]])]),
 3: ({'ssr_ftest': (5.855744336565538, 0.0005409920405330191, 43790.0, 3),
   'ssr_chi2test': (17.570041199490355, 0.0005394212089508309, 3),
   'lrtest': (17.566517855302664, 0.0005403234578787887, 3),
   'params_ftest': (5.855744336555925, 0.0005409920405393607, 43790.0, 3.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1ddfe98978>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1ddfe983c8>,
   array([[0., 0., 0., 1., 0., 0., 0.],
          [0., 0., 0., 0., 1., 0., 0.],
          [0., 0., 0., 0., 0., 1., 0.]])])}

Note: from this result, we conclude that pollution also Granger-causes temp. So the two series influence each other.

Split Train and Test Data

In [1089]:
#latest 72 hours as the test set, the remaining hours as the training set
train_endo = clean_data.iloc[:-72, [0,2]] # columns= pollution, temp
test_endo = clean_data.iloc[-72:, [0,2]]
In [ ]:
 

Build VAR Model

In [1091]:
#create VAR model
var_model =VAR(train_endo)
var_result = var_model.fit(maxlags=100, ic='aic')# var lag order selection
var_predica = var_result.forecast(y = train_endo.values[-var_result.k_ar:], steps=len(test_endo)) # requires lag_order previous values

#prediction in dataframe
var_pred_dfa = pd.DataFrame({ 'pollution_pd': var_predica[:, 0],'temp_pd': var_predica[:, 1],})

#set index
var_pred_dfa.set_index(pd.date_range('2014-12-29 00:00:00', periods= len(test_endo), freq='H'), inplace=True)
var_pred_dfa.head(3)
Out[1091]:
pollution_pd temp_pd
2014-12-29 00:00:00 336.200249 -2.314771
2014-12-29 01:00:00 328.570901 -2.434432
2014-12-29 02:00:00 314.027518 -2.823354
In [ ]:
 
In [1093]:
#visualize prediction and test series
var_pred_dfa['temp_pd'].plot( legend=True);
test_endo['temp'].plot(figsize=(10,7), legend=True);
In [1119]:
#visualize prediction and test series
var_pred_dfa['pollution_pd'].plot(label= 'pollution_pred', legend=True);
test_endo['pollution'].plot(figsize=(10,7), label = 'original_pollution',legend=True);
In [ ]:
 
Evaluate
In [1114]:
#abs mean error
for x in zip( test_endo.columns, var_pred_dfa.columns):
    print(f'{x[0]} mean_absolute_error:')
    print(np.round(mean_absolute_error(test_endo[x[0]], var_pred_dfa[x[1]]),3))
    print()
pollution mean_absolute_error:
100.164

temp mean_absolute_error:
2.358

In [ ]:
 

VARMAX Model

It is the combination of VAR and MA: the VAR part is a linear combination of lagged values of order 'p', and the MA part uses the past forecast errors of order 'q' in a regression-like model. Like VAR, this model is used to forecast multiple variables at the same time. To find the optimal parameters (p and q), we will use the auto_arima() function.

In [1128]:
# find parameter(p,q)
for x in clean_data.iloc[:,[0,2]].columns:
    print(f'Order for {x}')
    print(auto_arima(clean_data[x], maxiter=1000))
    print()
Order for pollution
ARIMA(callback=None, disp=0, maxiter=1000, method=None, order=(3, 0, 2),
      out_of_sample_size=0, scoring='mse', scoring_args=None,
      seasonal_order=(0, 0, 0, 1), solver='lbfgs', start_params=None,
      suppress_warnings=False, transparams=True, trend=None,
      with_intercept=True)

Order for temp
ARIMA(callback=None, disp=0, maxiter=1000, method=None, order=(3, 1, 3),
      out_of_sample_size=0, scoring='mse', scoring_args=None,
      seasonal_order=(0, 0, 0, 1), solver='lbfgs', start_params=None,
      suppress_warnings=False, transparams=True, trend=None,
      with_intercept=True)

Note: the results above show that 'temp' is non-stationary (d = 1) and 'pollution' is stationary (d = 0). Let's feed the series to the model without differencing, since both series were already shown to be stationary by the ADF test at the 95% confidence level.

Since the suggested orders differ between the two series, we will try both and keep the model with the lower AIC score as our final model.
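A quick way to compare the two fits is via the aic (and bic) attributes of the results objects. A minimal sketch, assuming the two fits varma_result and varma_result_2 from the cells below:

In [ ]:
# compare information criteria of the two candidate VARMA fits
for name, res in [('VARMA(3,2)', varma_result), ('VARMA(3,3)', varma_result_2)]:
    print(f'{name}: AIC={res.aic:.1f}, BIC={res.bic:.1f}')
# keep whichever fit has the lower AIC as the final model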

In [ ]:
 
In [1129]:
# try build model with order=(3,2)
varma_model =VARMAX(train_endo, order=(3,2), trend='c')
varma_result = varma_model.fit(maxiter=1000, disp= False)
In [1131]:
varma_result.summary()
Out[1131]:
Statespace Model Results
Dep. Variable: ['pollution', 'temp'] No. Observations: 43728
Model: VARMA(3,2) Log Likelihood -280337.555
+ intercept AIC 560725.110
Date: Mon, 25 Nov 2019 BIC 560942.253
Time: 21:55:22 HQIC 560793.555
Sample: 01-02-2010
- 12-28-2014
Covariance Type: opg
Ljung-Box (Q): 287.49, 6924.85 Jarque-Bera (JB): 24630445.23, 24469.33
Prob(Q): 0.00, 0.00 Prob(JB): 0.00, 0.00
Heteroskedasticity (H): 0.67, 1.12 Skew: -0.71, -0.13
Prob(H) (two-sided): 0.00, 0.00 Kurtosis: 119.26, 6.65
Results for equation pollution
coef std err z P>|z| [0.025 0.975]
intercept 4.3518 0.423 10.293 0.000 3.523 5.180
L1.pollution 0.9010 0.035 25.777 0.000 0.833 0.970
L1.temp -1.1235 0.237 -4.742 0.000 -1.588 -0.659
L2.pollution 0.3664 0.056 6.511 0.000 0.256 0.477
L2.temp 0.0613 0.363 0.169 0.866 -0.651 0.773
L3.pollution -0.2943 0.041 -7.152 0.000 -0.375 -0.214
L3.temp 1.0436 0.163 6.409 0.000 0.724 1.363
L1.e(pollution) 0.0474 0.035 1.347 0.178 -0.022 0.116
L1.e(temp) 0.5382 0.227 2.370 0.018 0.093 0.983
L2.e(pollution) -0.2730 0.043 -6.389 0.000 -0.357 -0.189
L2.e(temp) 0.5793 0.122 4.758 0.000 0.341 0.818
Results for equation temp
coef std err z P>|z| [0.025 0.975]
intercept -0.1679 0.043 -3.866 0.000 -0.253 -0.083
L1.pollution -0.1428 0.008 -18.165 0.000 -0.158 -0.127
L1.temp 2.1278 0.037 56.787 0.000 2.054 2.201
L2.pollution 0.3137 0.010 31.345 0.000 0.294 0.333
L2.temp -1.3697 0.087 -15.788 0.000 -1.540 -1.200
L3.pollution -0.1697 0.005 -36.301 0.000 -0.179 -0.161
L3.temp 0.2409 0.051 4.741 0.000 0.141 0.341
L1.e(pollution) 0.1432 0.008 18.110 0.000 0.128 0.159
L1.e(temp) -0.9669 0.037 -25.946 0.000 -1.040 -0.894
L2.e(pollution) -0.1764 0.005 -36.073 0.000 -0.186 -0.167
L2.e(temp) 0.0298 0.044 0.673 0.501 -0.057 0.116
Error covariance matrix
coef std err z P>|z| [0.025 0.975]
sqrt.var.pollution 27.1564 0.018 1499.340 0.000 27.121 27.192
sqrt.cov.pollution.temp 0.0234 0.006 3.796 0.000 0.011 0.036
sqrt.var.temp 1.3757 0.003 396.399 0.000 1.369 1.383


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [1132]:
# try with order=(3,3)
varma_model_2 =VARMAX(train_endo, order=(3,3), trend='c')
varma_result_2 = varma_model_2.fit(maxiter=1000, disp= False)
In [1134]:
varma_result_2.summary()
Out[1134]:
Statespace Model Results
Dep. Variable: ['pollution', 'temp'] No. Observations: 43728
Model: VARMA(3,3) Log Likelihood -281653.426
+ intercept AIC 563364.852
Date: Mon, 25 Nov 2019 BIC 563616.739
Time: 22:18:04 HQIC 563444.249
Sample: 01-02-2010
- 12-28-2014
Covariance Type: opg
Ljung-Box (Q): 178.45, 9453.30 Jarque-Bera (JB): 26565601.67, 34401.62
Prob(Q): 0.00, 0.00 Prob(JB): 0.00, 0.00
Heteroskedasticity (H): 0.67, 1.14 Skew: -0.63, 0.11
Prob(H) (two-sided): 0.00, 0.00 Kurtosis: 123.74, 7.34
Results for equation pollution
coef std err z P>|z| [0.025 0.975]
intercept 4.5384 0.985 4.606 0.000 2.607 6.470
L1.pollution 1.0123 0.221 4.581 0.000 0.579 1.445
L1.temp -0.6990 2.002 -0.349 0.727 -4.623 3.225
L2.pollution -0.0306 0.322 -0.095 0.924 -0.662 0.601
L2.temp 0.0760 3.215 0.024 0.981 -6.224 6.376
L3.pollution -0.0271 0.132 -0.206 0.837 -0.285 0.231
L3.temp 0.6147 1.272 0.483 0.629 -1.878 3.107
L1.e(pollution) -0.0386 0.221 -0.175 0.861 -0.472 0.394
L1.e(temp) -0.0603 2.005 -0.030 0.976 -3.989 3.869
L2.e(pollution) 0.0061 0.134 0.045 0.964 -0.257 0.270
L2.e(temp) -0.4846 0.888 -0.546 0.585 -2.226 1.256
L3.e(pollution) 0.0233 0.006 3.712 0.000 0.011 0.036
L3.e(temp) -0.4643 0.225 -2.067 0.039 -0.905 -0.024
Results for equation temp
coef std err z P>|z| [0.025 0.975]
intercept 0.1052 0.075 1.409 0.159 -0.041 0.252
L1.pollution 0.0101 0.021 0.483 0.629 -0.031 0.051
L1.temp 1.9077 0.087 21.834 0.000 1.736 2.079
L2.pollution -0.0207 0.029 -0.718 0.472 -0.077 0.036
L2.temp -1.1536 0.133 -8.694 0.000 -1.414 -0.894
L3.pollution 0.0108 0.010 1.133 0.257 -0.008 0.029
L3.temp 0.2358 0.049 4.806 0.000 0.140 0.332
L1.e(pollution) -0.0100 0.021 -0.478 0.633 -0.051 0.031
L1.e(temp) -0.6823 0.088 -7.784 0.000 -0.854 -0.511
L2.e(pollution) 0.0110 0.010 1.153 0.249 -0.008 0.030
L2.e(temp) 0.2845 0.035 8.025 0.000 0.215 0.354
L3.e(pollution) -0.0013 0.001 -1.762 0.078 -0.003 0.000
L3.e(temp) 0.0651 0.018 3.547 0.000 0.029 0.101
Error covariance matrix
coef std err z P>|z| [0.025 0.975]
sqrt.var.pollution 27.1276 0.018 1471.307 0.000 27.091 27.164
sqrt.cov.pollution.temp -0.0283 0.006 -5.040 0.000 -0.039 -0.017
sqrt.var.temp 1.3488 0.003 499.903 0.000 1.344 1.354


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Note: the model with order (3,2) gave the lower AIC score (560725 vs 563365), so we use it as our final model.

In [ ]:
 

Forecast
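The cell that produced varma_forecast is not shown here; a minimal sketch of how it could be created from the selected VARMA fit follows (an assumption, since the original cell is missing). When the model is fitted on a DataFrame, the forecast should come back as a DataFrame with the same column names ('pollution', 'temp').

In [ ]:
#forecast the next 72 hours (the length of the test set) with the selected VARMA fit
varma_forecast = varma_result.forecast(steps=len(test_endo))

#align the forecast index with the test period for plotting and metrics
varma_forecast.index = test_endo.index
varma_forecast.head(3)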

In [ ]:
 
In [1163]:
#visualize pollution series
plt.figure(figsize=(9,6))
varma_forecast['pollution'].plot(label= 'forecast_pollu', legend=True);
test_endo['pollution'].plot( label= 'original_pollu',legend=True)
plt.show()
In [1165]:
#visualize temp series
plt.figure(figsize=(9,6))
varma_forecast['temp'].iloc[:72].plot(label= 'forecast_temp', legend=True)
test_endo['temp'].plot( label= 'original_temp',legend=True)
plt.show()
Evaluate
In [1166]:
#abs mean error
for x in zip( test_endo.columns, varma_forecast.columns):
    print(f'{x[0]} mean_absolute_error:')
    print(np.round(mean_absolute_error(test_endo[x[0]], varma_forecast[x[1]]),3))
    print()
pollution mean_absolute_error:
95.272

temp mean_absolute_error:
7.876

Conclusion:

In this work, I have learnt how to model univariate as well as multivariate time series with these statistical models. In addition, a stationarity check and Granger's causality test were performed to check whether the series are suitable to feed directly into the models or not.

In [ ]: