Uncategorized

# Understanding ARIMA time series analysis with R (part 2)

Part1 : Discuss under the stationarity – ARMA
Part2 : Non-stationarity and Seasonality – ARIMA and Seasonal ARIMA  (<- you are here)

# Non-stationary time series

In my previous post, I explained several ideas of AR, MA, ARMA under the condition of stationarity for the beginning of your understanding.

But as you know, almost in the real practical time-series (stock price, sales revenue, etc) is non-stationary model or mixed with stationary and non-stationary. In order to understand the non-stationary mode, it’s important to understand the ideas of stationarity. (Because the idea of non-stationary model is based on the stationary one.)

The idea of non-stationary model is : considering the series of differences , and estimating the stationarity of this series.
If the difference series is given as ARMA(p, q) (i.e, stationary model), we can determine the original . This model is called ARIMA (autoregressive integrated moving average) model, which is denoted by ARIMA(p, 1, q).
In general, ARIMA(p, d, q) is given as : the series of d-th order difference is ARMA(p, q).

As I described in my previous post, all MA model is stationary, and therefore I’ll focus on the AR part for considering non-stationary model.

As you can see below, the mathematical behavior of the non-stationary model is vastly different from the stationary one, even if it’s just a little difference of the parameters . (See below. As I described in my previous post, AR is stationary when .)
Unlike the stationary process, non-stationary process diverges in general. Thus we should change several mathematical approach (ideas) under the non-stationary one.

Sample of (stationary)

Sample of (non-stationary)

Sample of (non-stationary)

Identifying a model is also not easy.
Unlike the stationary process, we cannot use the t-distribution (which is often used in the statistical technique) for estimation under the condition of non-stationarity.

First we consider the following AR expression (1). (See my previous post for .)

The following equation (2) is called the characteristic equation of (1) :

This equation (2) follows :

• If all the absolute values of roots are larger than 1, (1) is stationary.
• If z = 1 is a root of multiplicity d (i.e, the characteristic equation is having ) and the others’ absolute values are all larger than 1, the series of d-th order difference of (1) is stationary.
Especially, if it’s only one unit root (d = 1), the series of  of (1) is stationary. It is called unit root process.

Now let’s consider using the simple case of AR(1), which is having the following equation. In this case, it’s stationary if , and it’s unit root process if .

If is stationary process (i.e, ), we can identify the model using the approach which I described in the previous post. If is unit root process (i.e, ), we can identify the model, because is stationary. If neither of both, you can repeat for the series .
Therefore it’s important to know whether it’s unit root process or stationary process, when the real sample data is given.

Under the unit root process, we cannot use t-distribution for estimation, because it is not stationary. Instead of that, we use Dickey-Fuller test in order to distinguish whether the mode is unit root process.

Here (in this post) I don’t describe details about this test technique, but the outline is :

1. Consider the following continuous stochastic process , where n is the size of given data and is standard deviation of . (Note that .)
Intuitively, this maps  as y on divided by each span .
2. When n is getting larger, converges to Standard Brownian Motion (Standard Wiener Process) on  in stochastic distribution.
3. Estimate and test the original unit root process based on this well-known distribution (Standard Brownian Motion) with sufficiently large n.

An extension of this testing to AR(p) is called Augmented Dickey–Fuller (ADF) test. (Here I don’t explain how to extend this technique.) You can use `adf.test()` (tseries package) in R for ADF test.

When you manually identify a model, you must follow the previous steps. But when you want the result in R, you can simply analyze without any difficulties. Same like the stationary one (see my previous post), you can simply use `auto.arima()` for identifying model.

``````library(forecast)

"C:\Demo\revenue_sample.txt",
col.names = c("DATETIME", "REVENUE"),
quote = """,
fileEncoding="UTF-8",
stringsAsFactors = FALSE)

# analyze
test = auto.arima(y\$REVENUE)
summary(test)``````

Here the “drift” (see the output above) is the trend of . Therefore the drift will be near the gradient of . (The following is the plotting of  and is surely having the linear trend.)

Hence the simulated model of will be given as follows :

and

As I described above, you must care the diffrence between stationarity and non-stationarity, even if you’re using R.
For example, when you predict (forecast) the future values under the non-stationarity, the predicted mean (conditional mean) will not converge to the mean itself (it will always move by drift) and the predicted variance will also grow more and more. (See below.)
Please compare the following predicted plotting (non-stationary result) with the one in my previous post (stationary result).

``````...

# analyze
test = auto.arima(y\$REVENUE)
summary(test)

# predict
pred <- forecast(
test,
level=c(0.85,0.95),
h=50)
plot(pred)``````

Here I don’t explain about the details, but you must also carefully apply some other machine learning algorithms (regression, anomaly detection, …), especially in the non-stationary time-series. (For example, see “Spurious relationship” in Wikipedia.)

# Backshift notation

Before discussing subsequent topics, let me introduce convenient notation called “backshift notation”.

With this notation we denote by . Of course, this doesn’t mean multiplying with some real number (). It’s just the notation !
Same like this, means .

Using this notation, we can easily treat the time-series differences by arithmetic operations.
For example, is meaning the difference of sequence as follows.

means the 2nd order difference as follows.

By repeating this operation, you can get p-th order difference as .

With this notation, ARIMA(1,1,0) is denoted by the following equation, where c is constant.

This notation clarifies the structure of model, because means the part of AR(1), and means the part of .
Thus, when you want to denote ARIMA(1,d,0), you can easily get the following representation.

Now we replace the previous expression (1) as follows with backshift notation.

As you can see, LHS (left-hand-side) in the last expression is the same format with expression (2).
If (2) can be represented by where all root of Y(z) are larger than 1 as absolute values, the series of d-th order difference of (1) is stationary. Because this process is denoted as follows. (Y(B) is stationary and is d-th order difference.)

As you can see here, you can easily get the previous consequence with backshift notation.

# Seasonal ARIMA modeling

With backshift notation, you can easily understand the background of seasonal ARIMA.
First we see the basic concept using simple AR(1) (without no differencing value d).

Suppose the sales revenue is so much affected by the revenue from the same month a year ago (12 months ago) and this relation is written by the following expression. (For simplicity, we don’t write the constant value c.)

Note : means the seasonal part of in AR. We often use upper character for seasonal part in order to distinguish with the non-seasonal part.

Suppose the revenue is also affected by the revenue from a month ago.
Then the mixed model is represented as follows. This is denoted by .

What does that model mean ? (Why multiply ?)

When we transform this representation, you can get the following.

As you can see, means the t’s revenue which is eliminating the effect of 12 months ago. Same like this, means the (t – 1)’s revenue which is eliminating the effect of 12 months ago.
Therefore this expression means “AR(1) model of one month ago, which is eliminating the effect of 12 months ago”.

This is also written by the following representation, and you can see it’s vice versa.

Here I showed the sample with no differences, but when it’s having the difference d, you just multiply (1 – B) by the backshift notation.
For example, is represented as follows :

• is AR(1) of non-seasonal part.
• is AR(1) of seasonal part.
• is the difference of non-seasonal part.
• is the difference of seasonal part.

Note : If you have MA(1), you can write on RHS as follows. (In this post we don’t discuss about MA part in seasonal ARIMA.)

When we transform the equation of , you can find that it is the same as ordinary ARIMA with lag 1, 12, and 13 as follows.
Hence if we don’t know the seasonality, we must estimate with all possible coefficients . (Please imagine if we have seasonality with seasonal lag = 365.)

The following is ACF and PACF in  with R. (See my previous post about ACF and PACF. Here I don’t explain what’s meaning ACF and PACF.)
As you can see in the following result, it’s having the spike at lags in both non-seasonal and seasonal. Thus ACF and PACF can also be used in order to identify a seasonal model.

Autocorrelation (ACF) sample

Partial Autocorrelation (PACF) sample

Let’s see the brief example in R.
Now we start to identify the model and forecast with model. As you can see in the following result, here we get the unexpected result without seasonal estimation. (The result is ARIMA(0,1,1) with drift.)

``````library(forecast)

"C:\Demo\seasonal_sample.txt",
col.names = c("DATETIME", "REVENUE"),
fileEncoding="UTF-8",
stringsAsFactors = FALSE)

# analyze
test <- auto.arima(y\$REVENUE)
summary(test)

# predict and plot
pred <- forecast(
test,
level=c(0.85,0.95),
h=100)
plot(pred)``````

Suppose we know that this dataset has the seasonality with lags 12 (for instance, 12 months).
Now we fix the previous R script to use this known seasonality as follows. (Here I changed the code in bold fonts.)
As you can see below, we can get the desirable result with seasonal ARIMA analysis.

``````library(forecast)

"C:\Demo\seasonal_sample.txt",
col.names = c("DATETIME", "REVENUE"),
fileEncoding="UTF-8",
stringsAsFactors = FALSE)

# analyze with seasonal s=12
test <- auto.arima(ts(data=y\$REVENUE, freq=12))
summary(test)

# predict and plot
pred <- forecast(
test,
level=c(0.85,0.95),
h=100)
plot(pred)``````