Time Series Analysis and Forecasting using Statsmodels¶

In this notebook, we explore the forecasting models from statsmodel, for example, AR, MA, ARMA/ARIMA, SARIMA, VAR and VARIMAX. The main motivation behind doing this task is to understand how to model the timeseries data with these algorithms. In this notebook, there is two sections:
1) univariate time series foreceasting
2) multi-variate forecasting(forecasting multiple series at same time).

To do univariate forecasting, I have used the dataset- 'airline passenger' to all univariate algoriths(AR, MA, ARMA/ARIMA, SARIMA). Similarly, VAR and VARIMAX are used for modeling multi-variate series. And, lastly figure out which algorithm is suitable for this data type. In addition, this notebook gives how to apply the stationarity check (series has unit root or not) with augmented Dickey-Fuller Test.

#import required libraries
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from pylab import rcParams #set figure size

from pmdarima import auto_arima # for finding arima orders

# for metric
from statsmodels.tools.eval_measures import rmse
from sklearn.metrics import mean_squared_error,mean_absolute_error, r2_score

from statsmodels.tsa.stattools import adfuller # for stationarity test
from statsmodels.tsa.statespace.tools import diff 
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf, month_plot,quarter_plot #plots

#for creating models
from statsmodels.tsa.ar_model import AR
from statsmodels.tsa.arima_model import ARMA, ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.api import VAR, VARMAX

#ignore warnings
import warnings
warnings.filterwarnings("ignore")

Section 1¶

Univariate Time Series Anlaysis

#load dataset
df= pd.read_csv('Data/airline_passengers.csv', index_col='Month', parse_dates=True)
df.index.freq = 'MS'# since data is monthly start frequency
df.head()

#lets visualize the data
ax = df['Thousands of Passengers'].plot(figsize=(12,5), legend= True, label='Original')
df['Thousands of Passengers'].rolling(window=12).mean().plot(legend= True, label='12-Month-Mean')
df['Thousands of Passengers'].rolling(window=12).std().plot(legend= True, label='12-Month-Std')

ax.autoscale(axis='x',tight=True) # fit to x-axis
ylabel='No. of passengers'
xlabel = 'Months'
ax.set(xlabel=xlabel, ylabel=ylabel) #name axis
plt.show()

From the plot, it clearly shows that this is non-stationary series since the mean and standard deviation keep on variation with resect to time period. Even, we can see that this series has clear upward trend. We will perfome statistical test to test stationary or not later in this work.

ETS Decompostion¶

Lets look the general behaviour in series. From ETS decomposition, we may get idea/clue whether time series has trend, seasonlaity or not.

# lets do seasonal decompostion
rcParams['figure.figsize']=15,7 # set figure size
sea_decom = seasonal_decompose(df['Thousands of Passengers'], model='additive')
sea_decom.plot();

Figures show clear upward trend and may be some seasonlity

Auto/Partial Correlation Plot¶

These plots are usually used for finding the 'p' and 'q' for ARIMA model, but in this notebook instead of using this classical method we will use auto_arima() function. To check whether or not there is correlation with the shiffted samples in the time series, these plots are also useful.

#create auto correaltion plot
plot_acf(df['Thousands of Passengers'], lags=40, title='ACF plot')
plt.show()

Figure shows there is correlation as well as seasonality in certian periods.

#create partial correaltion plot
plot_pacf(df['Thousands of Passengers'], lags=40, title='ACF plot');

Seasonality and quaterly plot¶

Lets figure out, which months and qauterly have low and high passenger.

#monthly plot
month_plot(df['Thousands of Passengers']);

#quaterly plot
quarter_plot(df['Thousands of Passengers'].resample(rule='Q').mean());

From both quaterly and montly plots, we can see there is some up and down in the airplane passenger. For the month of JUNE and JULY there is high peak as same as in 3rd quater.

Stationarized Time Series¶

We have to stationarize the time series data before feeding to alogrithms. From the above figures, it is clearly seen that there is seasonlity and trend in series. In other words, the series is non-statinary meaning that the mean, variance and autocorrelation change over time.

But, lets do statistic to show that the series is stationary or non-stationary(shows seasonality or trend). For this, the augmented Dickey-Fuller Test is used. It tests the hypothesis. Here, the null hyothesis is 'non-stationary'. We will use the significance level of 0.05 percentage (95% confidence level).

# stationary check
def adfuller_test(time_series):
    """
     time_series= single time series data
     returns an ADF report
     
    """
    
    # create adfuller  report
    result = adfuller(time_series, autolag='AIC') 
    
    
    #assigin the each value with name and store as pands series
    labels = ['ADF_test','p-value','lags used','observations']
    out = pd.Series(result[0:4],index=labels) #create series

    for key,val in result[4].items(): # add items(critical values) 
        out[f'critical value ({key})']=val
        
   
    #print out
    print(out)
    
   #check hypothesis 
    if np.round(result[1], 2) <= 0.05:
        print("Strong evidence against  to reject  null hypothesis")
        print("It has no unit root and is stationary")
    else:
        print("Weak evidence against  to reject null hypothesis")
        print("It has unit root and  is non-stationary")

#check stationary 
adfuller_test(df['Thousands of Passengers'])

ADF_test                  0.815369
p-value                   0.991880
lags used                13.000000
observations            130.000000
critical value (1%)      -3.481682
critical value (5%)      -2.884042
critical value (10%)     -2.578770
dtype: float64
Weak evidence against  to reject null hypothesis
It has unit root and  is non-stationary

Result tells that this time series data is non-stationary. So, we have to make it stationary. I have used differencing technique to make it stationary.

Models¶

Lets build the models from different alogrithms, like AR,MA, ARMA/ARIMA, SARIMA.

Note: we know clearly that this dataset is non-stationary(shows trend or seasonality), which is not suitable to model with AR, MA, ARMA since they are not able to learn trend and seasonl in dataset. But also lets try how it do in prediction.

'''
Note: this is just to understand for myself at what differencing the series become stationary
'''

# return stationary data with 95% confidence level(0.05 significance level)
def return_stationary_series(series, k=2):
    '''
    series = time series data
    k = differecing time 
    return  stationary time seres data
    '''
    for d in range(1, k):
        df_stat = diff(df['Thousands of Passengers'], k_diff = d) # differencing data

        # check p-value
        if np.round(adfuller(df_stat)[1]) <= 0.05:  # round with 2 decimal values
            print(f'Now the time-series is stationary at {d} difference')
            return df_stat
            break

#return stationary series and plot
stationary_series =return_stationary_series(df['Thousands of Passengers'], 10)
stationary_series.plot(figsize=(17,7));

Now the time-series is stationary at 1 difference

Now, the series looks as stationary. We can test with calling the function adfuller_test().

# testing for stationary
adfuller_test(stationary_series)

ADF_test                 -2.829267
p-value                   0.054213
lags used                12.000000
observations            130.000000
critical value (1%)      -3.481682
critical value (5%)      -2.884042
critical value (10%)     -2.578770
dtype: float64
Strong evidence against  to reject  null hypothesis
It has no unit root and is stationary

Train_Test split¶

#convert to dataframe
stat_data = stationary_series.to_frame() 

#split test(36 months) and train(remaining months)
train_data_stationary = stat_data.iloc[:-36] 
test_data_stationary = stat_data.iloc[-36:] 

#original data(non-stationary)
train_data= df.iloc[:-36]
test_data= df.iloc[-36:]

Autoregression(AR) Model¶

In this autoregression model, the linear combination of past values of the series are happened. It is calculated against a set of lagged values of order 𝑝 . We are using auto_arima() function to find the best 'p' for AR in this dataset.

#suggest  model
auto_arima(stat_data, seasonal=False).summary() # use stationary data

Build AR model¶

# AR model
model_ar = AR(train_data_stationary['Thousands of Passengers']) # use stationary data
#best_lag = model_ar.select_order(maxlag= 20,trend='nc', ic='aic') # select best maxlag
#print(f'best lag: {best_lag}')

# fit model with suggested p
model_ar_fit = model_ar.fit(maxlag= 2) 

#predict test data
ar_prediction = model_ar_fit.predict(start= test_data_stationary.index[0], 
                                     end= test_data_stationary.index[-1], 
                                     dynamic=False)
ar_prediction.head()

1958-01-01    23.500473
1958-02-01     2.308398
1958-03-01    -3.271831
1958-04-01     0.140311
1958-05-01     2.762978
Freq: MS, dtype: float64

Invert Transformation¶

Since our prediction values are already scaled with 1 differencing, we have to change it back to orginal scale before ploting and forecasting. You can obtained this result either by 1) summing columns (original value + stationary values/predictived values) or 2) just recent values of train data + prediction values using cumsum() function

#create dataframe
test_data_stationary['Passengers(original)'] = test_data['Thousands of Passengers'] # original scale 

# store predicted value in dataframe
test_data_stationary['Prediction_1d_AR'] = ar_prediction 

# add most recent  value from training dataset and do cumsum in stationary values
test_data_stationary['Inverted_1d_AR'] = df['Thousands of Passengers'].iloc[-len(test_data_stationary)-1] + test_data_stationary['Prediction_1d_AR'].cumsum()

#test_data_stationary['Inverted_Values'] = df['Thousands of Passengers'].iloc[-len(test_data_stationary)-1:-1].values + test_data_stationary['Prediction_1d_scale'].values
test_data_stationary.head()

# create plot with test and predicted
test_data.plot(legend= True, label= 'test data')
test_data_stationary['Inverted_1d_AR'].plot(legend= True, label='AR prediction', figsize=(13,5))
plt.show()

Pretty bad, right? We already know that this data has some seasonality. So AR model cann't address that problem. From the above figure also we can see that this model is doing very poor in capturing the peak and down parts.

Evaluate the Model¶

# r2_score
r2_result = r2_score(test_data['Thousands of Passengers'], test_data_stationary['Inverted_1d_AR'])
r2_result

-0.026951254751961695

# mean  absolute error
mean_absolute_error(test_data['Thousands of Passengers'], test_data_stationary['Inverted_1d_AR'])

56.63657955498565

Note:¶

As we look at r2_score values, it is in negative which means poor prediciton. So, we can say that AR is too simple model for this dataset and it doesn't address the trend and seasonality in the series.

Forecasting¶

# train with whole dataset(stationary)
ar_final_model = AR(stat_data['Thousands of Passengers']).fit(maxlag= 2, ic='aic')


#forecast next 24
ar_forecast = ar_final_model.predict(start= stat_data.index[-1]+1, 
                                      end=stat_data.index[-1]+24, 
                                      dynamic=False)

#invert to original scale
original_scale = df['Thousands of Passengers'].iloc[-len(test_data_stationary)-1] + ar_forecast.cumsum()



#create plots
ax=  df['Thousands of Passengers'].plot(legend= True, label='Original data', title='Next 24 month Forecast', figsize=(15,7))
original_scale.plot(legend= True, label='AR_Forecast')

ax.legend(loc=2);

Very bad forecasting, right?

Moving Average(MA) model¶

MA model is a linear regression of the current value of the series against current and previous (observed) white noise error.

#fit model 
ma_model = ARMA(train_data_stationary['Thousands of Passengers'], order=(0,4)).fit()

#predict test data
ma_prediction = ma_model.predict(start= test_data_stationary.index[0], 
                                     end= test_data_stationary.index[-1],
                                     dynamic=False)

# put into dataframe
test_data_stationary['Prediction_1D_MA'] = ma_prediction

#invert data to original scale ( since predicted result is 1 differencing)
test_data_stationary['Inverted_1d_MA'] = df['Thousands of Passengers'].iloc[-len(test_data_stationary)-1] + test_data_stationary['Prediction_1D_MA'].cumsum()
test_data_stationary.head()

#create plots between test and prediction
test_data.plot(legend=True, label = 'Test Data', figsize=(15,7)) # testdata plot
test_data_stationary['Inverted_1d_MA'].plot(legend=True, label = 'MA Predicton')
plt.show()

# r2_score
arma_r2_score = r2_score(test_data['Thousands of Passengers'], test_data_stationary['Inverted_1d_MA'])
arma_r2_score

0.18479185296233125

Autoregression Moving Averages(ARMA) model¶

It is the combinatin of AR and MA. AR is the linear combination of lagged values of order 'p' and MA uses the past forecast errors of order 'q' in a regression-like mode .

We can directly fit original series with ARIMA model or can be obtained the same result by making data stationary at first and pass it with the model ARMA. We have already changed into stationary data. So. lets use that data with ARMA.

#tune parameter  for ARMA
auto_arima(train_data_stationary, seasonal=False).summary()

It suggests ARMA(2, 4) model

#fit model 
arma_model = ARMA(train_data_stationary['Thousands of Passengers'], order=(2,4)).fit()

#predict test data
arma_prediction = arma_model.predict(start= test_data_stationary.index[0], 
                                     end= test_data_stationary.index[-1],
                                     dynamic=False)

# put into dataframe
test_data_stationary['Prediction_1D_ARMA'] = arma_prediction


#invert data to original scale ( since predicted result is 1 differencing)
test_data_stationary['Inverted_1d_ARMA'] = df['Thousands of Passengers'].iloc[-len(test_data_stationary)-1] + test_data_stationary['Prediction_1D_ARMA'].cumsum()
test_data_stationary.head()

#create plots between test and prediction
test_data.plot(legend=True, label = 'Test Data',figsize=(15,7)) # testdata plot

test_data_stationary['Inverted_1d_ARMA'].plot(legend=True, label = 'ARMA Predicton')
plt.show()

Looks this model also doesn't learn much for this dataset. It predict in linear behaviour.

test_data_stationary['Inverted_1d_ARMA'].mean()

416.30909725267526

test_data.mean()

Thousands of Passengers    428.5
dtype: float64

Evaluate model¶

# r2_score
arma_r2_score = r2_score(test_data['Thousands of Passengers'], test_data_stationary['Inverted_1d_ARMA'])
arma_r2_score

0.1418787904840343

# absolute mean error
mean_absolute_error(test_data['Thousands of Passengers'], test_data_stationary['Inverted_1d_ARMA'])

57.26592096364511

It has r2_score = 0.14. Still, model doesn't predict well. And, it is true also since we have dataset having seasonality which cann't handle by ARMA.

Future Forcast¶

# train with whole dataset(stationary)
arma_model = ARMA(stat_data['Thousands of Passengers'], order=(2,4)).fit()


#forecast next 24
arma_forecast = arma_model.predict(start=stat_data.index[-1], 
                                      end=stat_data.index[-1]+24, 
                                      dynamic=False)

#orignal data
original_arma_forecast= df['Thousands of Passengers'].iloc[-len(test_data_stationary)-1] + arma_forecast.cumsum()

#create plots
ax=  df['Thousands of Passengers'].plot(legend= True, label='Original_Data', title='Next 24 months Forecast', figsize=(15,7))
original_arma_forecast.plot(legend= True, label='ARMA_Forecast')
ax.legend(loc=2);

Autoregression Integration Moving Average(ARIMA)¶

This model combines AR, I and MA. Here, AR = Autoregression, I = Integration and MA = Moving Average. This model directly converts the non-stationary series to stationary. This is equavalent to ARMA(p,q) after differencing the original series with d in ARIMA(p,d,q). This model address trend but not seasonality.

#tune  parameter for ARIMA
auto_arima(df['Thousands of Passengers'], seasonal=False).summary()

Note:¶

ARIMA(p,d,q) is same as ARMA(p,q) after differncing data with order 'd'. We can see this later.

#build model
arima_mode = ARIMA(train_data['Thousands of Passengers'], order=(2,1,4)).fit()

#forecast next 24
arima_prediction = arima_mode.predict(start=test_data.index[0], 
                                     end=test_data.index[-1],
                                     typ='levels',
                                     dynamic=False)

#create plots with test_data and prediction
test_data['Thousands of Passengers'].plot(legend=True, label='Test_Set', figsize=(13,7))
arima_prediction.plot(legend=True, label='ARIMA prediction')
plt.show()

Evaluate¶

# r2_score
arma_r2_score = r2_score(test_data['Thousands of Passengers'], arima_prediction)
arma_r2_score

0.14187879048403407

#mean_absolute
absmean = mean_absolute_error(test_data['Thousands of Passengers'], arima_prediction)
absmean

57.26592096364512

Observation:¶

We got the exact same result from the two models ARMA(2,4) with data after differencing '1' and ARIMA(2,1,4) with out differencing data(original).

ARIMA Forecast¶

# train with whole dataset
arima_model_f = ARIMA(df['Thousands of Passengers'], order=(2, 1, 4)).fit()


#forecast next 24 months
arima_prediction_f = arima_model_f.predict(start=len(df), 
                                      end=len(df)+23,
                                      typ='levels',
                                      dynamic=False)

#create plots
ax=  df['Thousands of Passengers'].plot(title= 'Next 24 months Forecast',legend= True, label='Original', figsize=(15,7))
arima_prediction_f.plot(legend= True, label='ARIMA_Forecast')
ax.legend(loc=2)
plt.show()

Seasonal Autoregressive Integrated Moving Averages (SARIMA(p,d,q)(P,D,Q)m)¶

This model addresses the seasonality of series. It has additional paramerter (P,D,Q)m, where P,D,Q represent the seasonal regression, differencing and moving average coefficients respectively and 'm' represents the number of data points (rows) in each seasonal cycle.

#tune  parameter for SARIMAX
auto_arima(df['Thousands of Passengers'], seasonal=True).summary()

Build SARIMAX Model¶

# use the sugessted parameters
sarima_mode = SARIMAX(df['Thousands of Passengers'], order=(2, 1, 2)).fit()

#forecast next 24
sarima_prediction = sarima_mode.predict(start=len(train_data), 
                                      end=len(train_data)+len(test_data)-1,
                                      typ='levels',
                                      dynamic=False)

#create plots
ax= df['Thousands of Passengers'].iloc[-80:].plot(legend= True, label='Original',figsize=(15,7))
test_data['Thousands of Passengers'].plot(legend= True, label='Test_data')
sarima_prediction.plot(legend= True, label='SARIMA_prediction')
ax.legend(loc=2);

Model Evaluate¶

#r2_score
sarima_r2_score= r2_score(test_data['Thousands of Passengers'], sarima_prediction)
sarima_r2_score

0.7650099434475135

#mean_absolute
absmean = mean_absolute_error(test_data['Thousands of Passengers'], sarima_prediction)
absmean

31.545468954422674

# test data std
test_data.std()

Thousands of Passengers    79.329152
dtype: float64

Note:¶

We can see that SARIMA model gives better result than ARIMA since SARIMA model address seasonlity, but lack in ARIMA model.

Forcast with SARIMA model¶

# train with whole dataset
sarima_model = SARIMAX(df['Thousands of Passengers'], order=(2, 1, 2)).fit()


#forecast next 24 months
sarima_prediction = sarima_model.predict(start=len(df), 
                                      end=len(df)+24, 
                                      dynamic=False)

#create plots
ax=  df['Thousands of Passengers'].plot(title= 'Next 24 months Forecast',legend= True, label='Original',figsize=(15,7))
sarima_prediction.plot(legend= True, label='SARIMA_prediction')
ax.legend(loc=2)
plt.show()

Conclusion:¶

When we comapare these models(AR, MA, ARMA/ARIMA, SARIMA), the result get better from SARIMA model. I have also tried this dataset with Holt-Winter method which did very good result with additive trend(mean_absolute_error = 25.8, bit lesser than SARIMA result). You can find the Holt-Winter method result in the another jupyter notebook 'Time-Series-Analysis (Statsmodels).ipynb'

Section2¶

Multi-variate Time series Analysis

VAR(p) and VARMA(p,q)¶

If we want to predict the multiple time series at the same time stamps, we can use VAR and VARMA model. Each series should be correlated with each others meaning they have each other influences in the prediction. In other words, the series(variables) are endogenous in nature.

In for this task, we will use airpollution dataset. This dataset consists of hourly update of 12 attributes between Jan 1st, 2010 to Dec 31st, 2014. The attributes consists of Datetime, pollution(pm2.5) dew point(Dewp), temperature(temp), Pressure (PRES), wind direction(cbwd), wind speed(lws), snow(ls) and rain(lr). Dataset is taken from here.

#load dataset
df_2 = pd.read_csv('Data/airpollution.csv')
df_2.head()

df_2.dtypes

No         int64
year       int64
month      int64
day        int64
hour       int64
pm2.5    float64
DEWP       int64
TEMP     float64
PRES     float64
cbwd      object
Iws      float64
Is         int64
Ir         int64
dtype: object

#perprocess data
def clean_data(data):
    """
    data: panda dataframe
    return clean dataframe
    """
    
    # make datatime as index 
    data.set_index( pd.to_datetime(data[['year','month', 'day', 'hour']]), inplace= True)
    data.index.name = 'Date_time' # index name
   
    # lets drop columns 'No', 'year', 'month', 'day' ,'hour'
    data.drop(columns=['No', 'year', 'month', 'day' ,'hour'], inplace=True)
    
    
    # rename the columns name
    data.rename(columns={'pm2.5':'pollution',
                         'DEWP':'dewp',
                         'TEMP':'temp',
                         'PRES': 'press',
                         'cbwd': 'wnd_dir',
                         'Iws' :'wnd_spd',
                         'Is': 'snow',
                         'Ir' : 'rain'
                        }, inplace=True)
    
    #fill NaN with 0 and drop first 24 rows(since  column: pm2.5 NaN values first 24)
    data.fillna(0, inplace=True)
    data = data[24:] 
    
    return data

# call clean_data() function
clean_data = clean_data(df_2)
clean_data.head()

# check null
clean_data.isnull().sum()

pollution    0
dewp         0
temp         0
press        0
wnd_dir      0
wnd_spd      0
snow         0
rain         0
dtype: int64

#check data types
clean_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 43800 entries, 2010-01-02 00:00:00 to 2014-12-31 23:00:00
Data columns (total 8 columns):
pollution    43800 non-null float64
dewp         43800 non-null int64
temp         43800 non-null float64
press        43800 non-null float64
wnd_dir      43800 non-null object
wnd_spd      43800 non-null float64
snow         43800 non-null int64
rain         43800 non-null int64
dtypes: float64(4), int64(3), object(1)
memory usage: 3.0+ MB

We can see that there is no frequency set in the data time index. Lets set frequency based on hour since we have hourly data.

#check data types
clean_data.index.freq = 'H'
clean_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 43800 entries, 2010-01-02 00:00:00 to 2014-12-31 23:00:00
Freq: H
Data columns (total 8 columns):
pollution    43800 non-null float64
dewp         43800 non-null int64
temp         43800 non-null float64
press        43800 non-null float64
wnd_dir      43800 non-null object
wnd_spd      43800 non-null float64
snow         43800 non-null int64
rain         43800 non-null int64
dtypes: float64(4), int64(3), object(1)
memory usage: 3.0+ MB

# Visualize all time series
fig, axes = plt.subplots(nrows= int(len(clean_data.columns)/2), ncols=2, dpi=400, figsize=(10,7))
for i, ax in enumerate(axes.flatten()): # loop all subplots
    data = clean_data[clean_data.columns[i]]
    ax.plot(data, linewidth=1)
    ax.autoscale(axis='x',tight=True) # tight fit to x-axis
    
    # Decorations
    ax.set_title(clean_data.columns[i])
    ax.tick_params(labelsize=6, colors='r')
    
plt.tight_layout(); # fit suplots nicely in the figure

From the above plots, we can see that pretty much every time series are stationary(not showing trend or seasonality), but not sure about this. So, we will do statistical test to check for stationary later in this work.

#viusalize
series_tovisalize= ['pollution', 'temp']
for x in series_tovisalize:
    clean_data[x].plot( legend='True', figsize=(10,7))

Stationary Test¶

For this, I have used adf test to check whether the series are stationary( has no unit root) or non-stationary(has unit root). We will take only two series to full fill our work.

# stationary check for series 'pollution' and 'temp'
for x in clean_data[['pollution', 'temp']].columns:
    print(f'Stationary check for {x}:')
    adfuller_test(clean_data[x])
    print()

Stationary check for pollution:
ADF_test                  -21.004109
p-value                     0.000000
lags used                  55.000000
observations            43744.000000
critical value (1%)        -3.430499
critical value (5%)        -2.861606
critical value (10%)       -2.566805
dtype: float64
Strong evidence against  to reject  null hypothesis
It has no unit root and is stationary

Stationary check for temp:
ADF_test                   -3.938179
p-value                     0.001771
lags used                  49.000000
observations            43750.000000
critical value (1%)        -3.430499
critical value (5%)        -2.861606
critical value (10%)       -2.566805
dtype: float64
Strong evidence against  to reject  null hypothesis
It has no unit root and is stationary

Based on the result, both series are stationary.

Causation Test using Granger’s Causality¶

Since VAR model is based on the assumption that each time series influences each other meaning the current value of series can be predicted using the past values itself and other series.

Thus, we will use Granger’s Causality Test to find the series those have influenced each other. Here, we are using this test on both way, for example, test of(a,b) and (b,a) to see whether they are influenced each other or not.

from statsmodels.tsa.stattools import grangercausalitytests

# tests causation to  pollution by temp
test =grangercausalitytests(clean_data[[ 'pollution', 'temp']], 3)
test

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=7.9588  , p=0.0048  , df_denom=43796, df_num=1
ssr based chi2 test:   chi2=7.9593  , p=0.0048  , df=1
likelihood ratio test: chi2=7.9586  , p=0.0048  , df=1
parameter F test:         F=7.9588  , p=0.0048  , df_denom=43796, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=96.3906 , p=0.0000  , df_denom=43793, df_num=2
ssr based chi2 test:   chi2=192.8033, p=0.0000  , df=2
likelihood ratio test: chi2=192.3801, p=0.0000  , df=2
parameter F test:         F=96.3906 , p=0.0000  , df_denom=43793, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=101.4655, p=0.0000  , df_denom=43790, df_num=3
ssr based chi2 test:   chi2=304.4450, p=0.0000  , df=3
likelihood ratio test: chi2=303.3918, p=0.0000  , df=3
parameter F test:         F=101.4655, p=0.0000  , df_denom=43790, df_num=3

{1: ({'ssr_ftest': (7.958789095176419, 0.00478759439334035, 43796.0, 1),
   'ssr_chi2test': (7.959334267504612, 0.004784000724375123, 1),
   'lrtest': (7.958611153590027, 0.004785912422502543, 1),
   'params_ftest': (7.958789095128304, 0.004787594393448114, 43796.0, 1.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1c26c9a828>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1c26c9ab38>,
   array([[0., 1., 0.]])]),
 2: ({'ssr_ftest': (96.39062583219199, 1.698062964460352e-42, 43793.0, 2),
   'ssr_chi2test': (192.8032621742445, 1.3592632413122363e-42, 2),
   'lrtest': (192.38013367931126, 1.6795183671528724e-42, 2),
   'params_ftest': (96.39062583218988, 1.698062964460352e-42, 43793.0, 2.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1c186b14a8>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1c1859ea90>,
   array([[0., 0., 1., 0., 0.],
          [0., 0., 0., 1., 0.]])]),
 3: ({'ssr_ftest': (101.46546140882215, 1.8768270218688147e-65, 43790.0, 3),
   'ssr_chi2test': (304.4450431597751, 1.085707707400526e-65, 3),
   'lrtest': (303.3917805660749, 1.8351786278643514e-65, 3),
   'params_ftest': (101.46546140880908, 1.8768270219012471e-65, 43790.0, 3.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1c19bbeb70>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1c26c814a8>,
   array([[0., 0., 0., 1., 0., 0., 0.],
          [0., 0., 0., 0., 1., 0., 0.],
          [0., 0., 0., 0., 0., 1., 0.]])])}

Note: we can see that temp have some influence (casuation) in pollution series

# tests causation to temp by pollution
test_1 =grangercausalitytests(clean_data[[ 'temp','pollution']], 3)
test_1

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=6.1561  , p=0.0131  , df_denom=43796, df_num=1
ssr based chi2 test:   chi2=6.1566  , p=0.0131  , df=1
likelihood ratio test: chi2=6.1561  , p=0.0131  , df=1
parameter F test:         F=6.1561  , p=0.0131  , df_denom=43796, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=12.9347 , p=0.0000  , df_denom=43793, df_num=2
ssr based chi2 test:   chi2=25.8724 , p=0.0000  , df=2
likelihood ratio test: chi2=25.8648 , p=0.0000  , df=2
parameter F test:         F=12.9347 , p=0.0000  , df_denom=43793, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=5.8557  , p=0.0005  , df_denom=43790, df_num=3
ssr based chi2 test:   chi2=17.5700 , p=0.0005  , df=3
likelihood ratio test: chi2=17.5665 , p=0.0005  , df=3
parameter F test:         F=5.8557  , p=0.0005  , df_denom=43790, df_num=3

{1: ({'ssr_ftest': (6.1561295560460545, 0.013099469644625956, 43796.0, 1),
   'ssr_chi2test': (6.156551247265987, 0.013092623763082732, 1),
   'lrtest': (6.156118593877181, 0.013095826770283996, 1),
   'params_ftest': (6.1561295560303035, 0.013099469644747352, 43796.0, 1.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1ddfe8c588>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1ddfe8c3c8>,
   array([[0., 1., 0.]])]),
 2: ({'ssr_ftest': (12.934726641969696, 2.422022223358371e-06, 43793.0, 2),
   'ssr_chi2test': (25.872406889913403, 2.4092298039642377e-06, 2),
   'lrtest': (25.864768208411988, 2.418449068080844e-06, 2),
   'params_ftest': (12.934726641949485, 2.4220222234016957e-06, 43793.0, 2.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1ddfe8cba8>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1ddfe8ce48>,
   array([[0., 0., 1., 0., 0.],
          [0., 0., 0., 1., 0.]])]),
 3: ({'ssr_ftest': (5.855744336565538, 0.0005409920405330191, 43790.0, 3),
   'ssr_chi2test': (17.570041199490355, 0.0005394212089508309, 3),
   'lrtest': (17.566517855302664, 0.0005403234578787887, 3),
   'params_ftest': (5.855744336555925, 0.0005409920405393607, 43790.0, 3.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1ddfe98978>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1ddfe983c8>,
   array([[0., 0., 0., 1., 0., 0., 0.],
          [0., 0., 0., 0., 1., 0., 0.],
          [0., 0., 0., 0., 0., 1., 0.]])])}

Note: From this result, we conclude that pollution has causation to temp. So, both series have influence to each other.

Split Train and Test Data¶

#latest 72 hours for testdset and remaining trainset
train_endo = clean_data.iloc[:-72, [0,2]] # columns= pollution, temp
test_endo = clean_data.iloc[-72:, [0,2]]

Built VAR Model¶

#create VAR model
var_model =VAR(train_endo)
var_result = var_model.fit(maxlags=100, ic='aic')# var lag order selection
var_predica = var_result.forecast(y = train_endo.values[-var_result.k_ar:], steps=len(test_endo)) # requires lag_order previous values

#prediction in dataframe
var_pred_dfa = pd.DataFrame({ 'pollution_pd': var_predica[:, 0],'temp_pd': var_predica[:, 1],})

#set index
var_pred_dfa.set_index(pd.date_range('2014-12-29 00:00:00', periods= len(test_endo), freq='H'), inplace=True)
var_pred_dfa.head(3)

#visualize prediction and test series
var_pred_dfa['temp_pd'].plot( legend=True);
test_endo['temp'].plot(figsize=(10,7), legend=True);

#visualize prediction and test series
var_pred_dfa['pollution_pd'].plot(label= 'pollution_pred', legend=True);
test_endo['pollution'].plot(figsize=(10,7), label = 'original_pollution',legend=True);

Evaluate¶

#abs mean error
for x in zip( test_endo.columns, var_pred_dfa.columns):
    print(f'{x[0]} mean_absolute_error:')
    print(np.round(mean_absolute_error(test_endo[x[0]], var_pred_dfa[x[1]]),3))
    print()

pollution mean_absolute_error:
100.164

temp mean_absolute_error:
2.358

VARMAX model¶

It is the combinatin of VAR and MA. VAR is the linear combination of lagged values of order 'p' and MA uses the past forecast errors of order 'q' in a regression-like mode . As like VAR, this model is also used for multile variables forecast at same time. To find the optimal parameters (p and q), we will use auto_arima() function.

# find parameter(p,q)
for x in clean_data.iloc[:,[0,2]].columns:
    print(f'Order for {x}')
    print(auto_arima(clean_data[x], maxiter=1000))
    print()

Order for pollution
ARIMA(callback=None, disp=0, maxiter=1000, method=None, order=(3, 0, 2),
      out_of_sample_size=0, scoring='mse', scoring_args=None,
      seasonal_order=(0, 0, 0, 1), solver='lbfgs', start_params=None,
      suppress_warnings=False, transparams=True, trend=None,
      with_intercept=True)

Order for temp
ARIMA(callback=None, disp=0, maxiter=1000, method=None, order=(3, 1, 3),
      out_of_sample_size=0, scoring='mse', scoring_args=None,
      seasonal_order=(0, 0, 0, 1), solver='lbfgs', start_params=None,
      suppress_warnings=False, transparams=True, trend=None,
      with_intercept=True)

Note: Above results show that 'temp' is non-stationary and 'pollution' is stationary. Lets feed these series to model with out differencing since two series are already showed stationary by adf test with 95% condifidence level.

Since the suggested orders are different for both series, so we have to try both and take the model as final model which give less AIC score.

# try build model with order=(3,2)
varma_model =VARMAX(train_endo, order=(3,2), trend='c')
varma_result = varma_model.fit(maxiter=1000, disp= False)

varma_result.summary()

# try with order=(3,3)
varma_model_2 =VARMAX(train_endo, order=(3,3), trend='c')
varma_result_2 = varma_model_2.fit(maxiter=1000, disp= False)

varma_result_2.summary()

Note: Model with order(3,3) gave the less AIC score. So, we are using this for our final model.

Forecast¶

#visualize pollution series
plt.figure(figsize=(9,6))
varma_forecast['pollution'].plot(label= 'forecast_pollu', legend=True);
test_endo['pollution'].plot( label= 'original_pollu',legend=True)
plt.show()

#visualize temp series
plt.figure(figsize=(9,6))
varma_forecast['temp'].iloc[:72].plot(label= 'forecast_temp', legend=True)
test_endo['temp'].plot( label= 'original_temp',legend=True)
plt.show()

Evaluate¶

#abs mean error
for x in zip( test_endo.columns, varma_forecast.columns):
    print(f'{x[0]} mean_absolute_error:')
    print(np.round(mean_absolute_error(test_endo[x[0]], varma_forecast[x[1]]),3))
    print()

pollution mean_absolute_error:
95.272

temp mean_absolute_error:
7.876

Conclusion:¶

In this work, I have learnt how to model univariate as well as multivariate time series with the statistical models. In addition, stationary check as well as Granger’s Causality test were perfomed to check the series whether they are sutiable to feed them direclty in the model or not.

	Thousands of Passengers
Month
1949-01-01	112
1949-02-01	118
1949-03-01	132
1949-04-01	129
1949-05-01	121

Dep. Variable:	y	No. Observations:	143
Model:	ARMA(2, 4)	Log Likelihood	-661.292
Method:	css-mle	S.D. of innovations	23.715
Date:	Tue, 19 Nov 2019	AIC	1338.585
Time:	21:04:03	BIC	1362.288
Sample:	0	HQIC	1348.216

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	2.6854	0.134	19.979	0.000	2.422	2.949
ar.L1.y	0.9142	0.082	11.131	0.000	0.753	1.075
ar.L2.y	-0.7709	0.080	-9.629	0.000	-0.928	-0.614
ma.L1.y	-0.8356	0.101	-8.265	0.000	-1.034	-0.637
ma.L2.y	0.3271	0.122	2.691	0.008	0.089	0.565
ma.L3.y	0.3086	0.114	2.699	0.008	0.085	0.533
ma.L4.y	-0.8000	0.079	-10.112	0.000	-0.955	-0.645

	Real	Imaginary	Modulus	Frequency
AR.1	0.5929	-0.9724j	1.1389	-0.1628
AR.2	0.5929	+0.9724j	1.1389	0.1628
MA.1	1.0001	-0.0000j	1.0001	-0.0000
MA.2	0.3178	-0.9481j	1.0000	-0.1985
MA.3	0.3178	+0.9481j	1.0000	0.1985
MA.4	-1.2500	-0.0000j	1.2500	-0.5000

	Thousands of Passengers	Passengers(original)	Prediction_1d_AR	Inverted_1d_AR	Prediction_1D_MA	Inverted_1d_MA	Prediction_1D_ARMA	Inverted_1d_ARMA
Month
1958-01-01	4.0	340	23.500473	359.500473	34.548432	370.548432	49.208305	385.208305
1958-02-01	-22.0	318	2.308398	361.808871	4.991931	375.540362	28.741069	413.949374
1958-03-01	44.0	362	-3.271831	358.537040	9.462509	385.002871	10.349717	424.299091
1958-04-01	-14.0	348	0.140311	358.677351	-9.970430	375.032441	-34.201682	390.097409
1958-05-01	15.0	363	2.762978	361.440329	2.501115	377.533556	-37.225431	352.871978

Ljung-Box (Q):	358.18	Jarque-Bera (JB):	2.78
Prob(Q):	0.00	Prob(JB):	0.25
Heteroskedasticity (H):	7.02	Skew:	0.33
Prob(H) (two-sided):	0.00	Kurtosis:	3.18

	No	year	month	day	hour	pm2.5	DEWP	TEMP	PRES	cbwd	Iws
0	1	2010	1	1	0	NaN	-21	-11.0	1021.0	NW	1.79
1	2	2010	1	1	1	NaN	-21	-12.0	1020.0	NW	4.92
2	3	2010	1	1	2	NaN	-21	-11.0	1019.0	NW	6.71
3	4	2010	1	1	3	NaN	-21	-14.0	1019.0	NW	9.84
4	5	2010	1	1	4	NaN	-20	-12.0	1018.0	NW	12.97

	pollution	dewp	temp	press	wnd_dir	wnd_spd	snow	rain
Date_time
2010-01-02 00:00:00	129.0	-16	-4.0	1020.0	SE	1.79	0	0
2010-01-02 01:00:00	148.0	-15	-4.0	1020.0	SE	2.68	0	0
2010-01-02 02:00:00	159.0	-11	-5.0	1021.0	SE	3.57	0	0
2010-01-02 03:00:00	181.0	-7	-5.0	1022.0	SE	5.36	1	0
2010-01-02 04:00:00	138.0	-7	-5.0	1022.0	SE	6.25	2	0

	pollution_pd	temp_pd
2014-12-29 00:00:00	336.200249	-2.314771
2014-12-29 01:00:00	328.570901	-2.434432
2014-12-29 02:00:00	314.027518	-2.823354

Dep. Variable:	['pollution', 'temp']	No. Observations:	43728
Model:	VARMA(3,2)	Log Likelihood	-280337.555
	+ intercept	AIC	560725.110
Date:	Mon, 25 Nov 2019	BIC	560942.253
Time:	21:55:22	HQIC	560793.555
Sample:	01-02-2010
	- 12-28-2014
Covariance Type:	opg

Ljung-Box (Q):	287.49, 6924.85	Jarque-Bera (JB):	24630445.23, 24469.33
Prob(Q):	0.00, 0.00	Prob(JB):	0.00, 0.00
Heteroskedasticity (H):	0.67, 1.12	Skew:	-0.71, -0.13
Prob(H) (two-sided):	0.00, 0.00	Kurtosis:	119.26, 6.65

Ljung-Box (Q):	178.45, 9453.30	Jarque-Bera (JB):	26565601.67, 34401.62
Prob(Q):	0.00, 0.00	Prob(JB):	0.00, 0.00
Heteroskedasticity (H):	0.67, 1.14	Skew:	-0.63, 0.11
Prob(H) (two-sided):	0.00, 0.00	Kurtosis:	123.74, 7.34

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	2.5392	0.157	16.165	0.000	2.231	2.847
ar.L1.y	0.9160	0.078	11.670	0.000	0.762	1.070
ar.L2.y	-0.7821	0.073	-10.642	0.000	-0.926	-0.638
ma.L1.y	-0.8360	0.109	-7.645	0.000	-1.050	-0.622
ma.L2.y	0.3351	0.151	2.226	0.028	0.040	0.630
ma.L3.y	0.2955	0.125	2.364	0.020	0.051	0.540
ma.L4.y	-0.7943	0.110	-7.240	0.000	-1.009	-0.579

	coef	std err	z	P>\|z\|	[0.025	0.975]
intercept	0.6619	0.193	3.425	0.001	0.283	1.041
ar.L1	1.6479	0.028	58.889	0.000	1.593	1.703
ar.L2	-0.9096	0.025	-36.755	0.000	-0.958	-0.861
ma.L1	-1.9079	0.355	-5.371	0.000	-2.604	-1.212
ma.L2	0.9977	0.371	2.689	0.007	0.270	1.725
sigma2	611.7553	246.970	2.477	0.013	127.704	1095.807

Results for equation pollution
	coef	std err	z	P>\|z\|	[0.025	0.975]
intercept	4.3518	0.423	10.293	0.000	3.523	5.180
L1.pollution	0.9010	0.035	25.777	0.000	0.833	0.970
L1.temp	-1.1235	0.237	-4.742	0.000	-1.588	-0.659
L2.pollution	0.3664	0.056	6.511	0.000	0.256	0.477
L2.temp	0.0613	0.363	0.169	0.866	-0.651	0.773
L3.pollution	-0.2943	0.041	-7.152	0.000	-0.375	-0.214
L3.temp	1.0436	0.163	6.409	0.000	0.724	1.363
L1.e(pollution)	0.0474	0.035	1.347	0.178	-0.022	0.116
L1.e(temp)	0.5382	0.227	2.370	0.018	0.093	0.983
L2.e(pollution)	-0.2730	0.043	-6.389	0.000	-0.357	-0.189
L2.e(temp)	0.5793	0.122	4.758	0.000	0.341	0.818

Results for equation temp
	coef	std err	z	P>\|z\|	[0.025	0.975]
intercept	-0.1679	0.043	-3.866	0.000	-0.253	-0.083
L1.pollution	-0.1428	0.008	-18.165	0.000	-0.158	-0.127
L1.temp	2.1278	0.037	56.787	0.000	2.054	2.201
L2.pollution	0.3137	0.010	31.345	0.000	0.294	0.333
L2.temp	-1.3697	0.087	-15.788	0.000	-1.540	-1.200
L3.pollution	-0.1697	0.005	-36.301	0.000	-0.179	-0.161
L3.temp	0.2409	0.051	4.741	0.000	0.141	0.341
L1.e(pollution)	0.1432	0.008	18.110	0.000	0.128	0.159
L1.e(temp)	-0.9669	0.037	-25.946	0.000	-1.040	-0.894
L2.e(pollution)	-0.1764	0.005	-36.073	0.000	-0.186	-0.167
L2.e(temp)	0.0298	0.044	0.673	0.501	-0.057	0.116