Overfitting in Regression and Neural Nets – A Visual Understanding (Overview)

If you are just starting out with machine learning, this post walks through some primitive overfitting examples and explains what you should watch out for and how to avoid it. To help build your intuition, I show several samples with many visual images.
First we look at simple overfitting examples in traditional statistical regression, and in the latter part we discuss the case of neural networks.

A First Look at Overfitting

Let me explain overfitting in machine learning with a brief example dataset. (See the following plot of the sample data.)

sampledat

The following R script fits a linear model to the above dataset (sampledat).
To fit the given data precisely, we use a degree-10 polynomial formula with the poly() function as follows. If some estimated coefficient turns out to be (almost) zero, you might simply regard that term as unnecessary, and conclude "the larger the degree, the better".

fit <- lm(
  y ~ poly(x1, 10, raw = TRUE),
  data = sampledat)

Here is the result of this regression analysis.

As you can see, we get the following equation (1) as the fitted equation for the given data.

Now we plot this equation together with the given dataset. Equation (1) fits the given data very closely, as follows.

Is that really a good result?

In fact, the given dataset (sampledat) was generated by the following R script. As you can see below, the data follows the simple linear relation y = 3 + x1 plus some noise. If new data points are generated by this script, they obviously won't fit equation (1).
That is, the result (equation (1)) is overfitting.

n_sample <- 20
b0 <- 3
b1 <- 1
x1 <- c(1:n_sample)
epsilon <- rnorm(
  n = n_sample,
  mean = 0,
  sd = 0.7)  # noise
y <- b0 + b1 * x1 + epsilon
sampledat <- data.frame(
  x1 = x1,
  y = y
)

Let's look at equation (1) from a bird's-eye view. (See the following plot of equation (1).)
Equation (1) fits only the given 20 data points (the above "sampledat"); it is not a generalized model and does not fit unknown data.

Here we showed a trivial overfitting example for a first understanding, but in real practical cases it is difficult to tell whether a model is overfitting or not.
Our next questions are: How do we detect it? How do we avoid it?

Information Criterion

Now let's say you add an extra parameter to your regression formula, but the likelihood improves only slightly, or stays almost the same. If so, you may suspect that this new parameter is not needed in the regression formula.

In the statistical approach, there are criteria (called "information criteria") for judging model fit on a mathematical basis.
The most famous one is the Akaike Information Criterion (AIC), shown below; the smaller the value, the better the fit.

AIC = -2 ln L + 2k

where k is the number of estimated parameters and L is the maximum likelihood

Note : Both AIC and BIC (Bayesian information criterion) take the form -2 ln L + C k. In AIC, C equals 2 (a constant value). Here we have only 20 survey data points (observations), and it's difficult to tell whether a pattern is noise or not; with BIC, C = ln n, which includes the number of observations n as a parameter.

The following is a plot of the log-likelihood (ln L) and AIC for the previously given dataset (sampledat) as the number of estimated parameters increases. The red line is the log-likelihood and the blue line is the AIC. (You can easily get these values with the logLik() and AIC() functions in R.)

As you can see, the appropriate number of estimated parameters is 3. That is, the simple formula y = b0 + b1 * x1 is a good fit. (See the following note.)

Note : The hidden estimated parameters (like the variance for Gaussian, the shape for Gamma, etc.) must also be counted as estimated parameters for AIC. In this case we are using Gaussian, so we must add the variance to the estimated parameters. (See my previous post "Understanding the basis of GLM Regression" for details.)
For instance, if the formula (equation) is y = b0 + b1 * x1, then the estimated parameters are b0, b1, and the variance (3 parameters).

For instance, if we use the following dataset, the number of parameters must be 4, and the equation (formula) must then be y = b0 + b1 * x1 + b2 * x1^2.

Here we use only a single input (x1), but if you have several input variables, you must also consider the interactions between them.
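For reference, the following is a small sketch (my own addition, not in the original post) of how you might compute these values yourself: it fits polynomials of degree 1 to 10 to sampledat and compares their log-likelihood and AIC.

degrees <- 1:10
fits <- lapply(degrees, function(d) lm(y ~ poly(x1, d, raw = TRUE), data = sampledat))
data.frame(
  degree = degrees,
  logLik = sapply(fits, function(f) as.numeric(logLik(f))),
  AIC    = sapply(fits, AIC))
degrees[which.min(sapply(fits, AIC))]  # usually degree 1, i.e. y = b0 + b1 * x1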

Overfitting in neural networks

Let’s proceed to the neural nets for discussion.

First, remember that too many layers or neurons often cause overfitting. The number of layers in particular strongly affects the model complexity. (Of course, the network must still be deep enough to represent your working model.)
Here again, it's not "the larger, the better".

To simplify our example, let's consider a brief feed-forward neural network with sigmoid activation, two input variables (x1, x2), and one binary output (a value between 0 and 1).
With 1 hidden layer, it can represent models like the one illustrated below. (The model can have several linear boundaries and combinations of them.)

With 2 hidden layers, it can represent more complex models, as illustrated below. (These are combinations of the 1-hidden-layer models.)

If the data contains some noise, the 2-hidden-layer network might overfit as follows.
As you can see here, adding layers can cause overfitting.

Note : Unlike the statistical approach, there is no concrete criterion for deciding the best number of layers or neurons, because there is no common evaluation measure based on a mathematical model.
You must examine and evaluate the trained model with test data or validation data.

Model complexity is also driven by large coefficients. Let's see the next example.

As you know, the sigmoid has a nearly linear part (around the center) and a saturated, binary-like part (at the tails). The linear part produces a smooth fit, but the binary part produces a near step-function fit.
As the weights increase, the binary part becomes stronger than the linear part.
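To see this effect numerically, here is a minimal R sketch of my own (not one of the post's illustrated figures): with the same sigmoid boundary, scaling the weight up by 10 turns the smooth transition into a near step function.

x <- seq(-5, 5, by = 0.1)
plot(x, 1 / (1 + exp(-(1 * x))), type = "l", ylab = "output")  # small weight: smooth transition
lines(x, 1 / (1 + exp(-(10 * x))), lty = 2)                    # large weight: sharp, step-like transition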

For example, let's look at the following illustrated network.

This network results in the following plot (wire frame), where x1 and x2 are the inputs and z is the output. As you can see, it transitions smoothly.

Now let's see the next example.
This network has exactly the same boundary as the previous one, but the coefficients (weights and biases) are much larger.

When we plot the inputs (x1, x2) against the output (z), the transition is much sharper than before.

As the weights increase, and given enough layers and neurons, the network can easily produce very complex models. As a result it overfits and loses generalization.

Large coefficients are easily generated: just train with too many iterations (inputs, epochs, etc.). Train! Train! Train! The growth of the coefficients is driven by gradient descent.
For instance, the following is a simple feed-forward network for recognizing handwritten digits with mxnet in R. This script outputs the variance of each layer's weights.

require(mxnet)
...

# configure network
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")

# train
model <- mx.model.FeedForward.create(
  softmax,
  X=traindata.x,
  y=traindata.y,
  ctx=mx.cpu(),
  num.round = 10,
  learning.rate=0.07)

# dump weights and biases
params <- model$arg.params

# check the variance of weights in each layers
var(c(as.array(params$fc1_weight)))
var(c(as.array(params$fc2_weight)))
var(c(as.array(params$fc3_weight)))

When we set num.round = 100 in this script (instead of 10), we get larger, more widely spread coefficients, as follows.
In this example we chose the learning rate and the number of layers based on experimental results, but the weights will increase more rapidly with higher values of these parameters.

Note : The learning rate also strongly affects the weights and the accuracy. It must be small enough for gradient descent to steadily decrease the error, but large enough to converge as rapidly as possible. (You must decide by experiment.)

epoch = 10

epoch = 100

There are several regularization techniques to mitigate this kind of overfitting in neural networks, as follows.

  • Early Stopping – A method that stops learning when some condition occurs (e.g. when the error becomes higher than at the last check).
  • Penalty (L1, L2) – A method that adds a penalty term (weight decay) to keep the weights from growing during gradient descent (see the sketch after this list).
  • Dropout – A method that randomly drops neurons in each training phase. This avoids the overfitting caused by co-adaptation when the structure is complex, with many layers and neurons. As a result it effectively performs model combination (similar to ensemble learning such as random forests).
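As a sketch of the penalty approach with the mxnet R training call used earlier: the wd argument is passed through to the SGD optimizer as an L2 weight-decay term (the value 0.001 here is only an illustrative assumption, not a recommendation).

model <- mx.model.FeedForward.create(
  softmax,
  X = traindata.x,
  y = traindata.y,
  ctx = mx.cpu(),
  num.round = 100,
  learning.rate = 0.07,
  wd = 0.001)  # L2 weight decay: shrinks the weights at every gradient-descent update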

The supported regularization methods differ between frameworks.
Among the Microsoft libraries, you can implement all of these techniques (early stopping, L1/L2 penalty, dropout) with CNTK, but rxNeuralNet (in MicrosoftML) and NET# do not support them.

# Dropout with CNTK (Python)
...

with default_options(activation=relu, pad=True):
  model = Sequential([
    LayerStack(2, lambda : [
      Convolution((3,3), 64),
      Convolution((3,3), 64),
      MaxPooling((3,3), strides=2)
    ]),
    LayerStack(2, lambda i: [
      Dense([256,128][i]), 
      Dropout(0.5)
    ]),
    Dense(4, activation=None)
  ])
...

 


Understanding the basis of GLM Regression (Logistic, Gaussian, Gamma, etc)

I'm currently working with several ISVs to build their solutions with machine learning algorithms (using R, Python, Azure Machine Learning, etc.).
More advanced and realistic regression (GLMM, Bayesian methods and MCMC simulation, etc.) in the traditional statistical approach is often needed in practical modeling, but it's very important to first understand the basic modeling ideas of GLM (generalized linear models), because the advanced regression techniques build on these basics.

In this post I summarize the essence of these basic modeling ideas with simple R scripts, to help you build your intuition before getting started.

Note : We don't discuss mixed models or the quasi-binomial, quasi-Poisson, or negative binomial families in this post. (We discuss only the basic ones.)

Common Idea for Regression (GLM)

All GLM families (Gaussian, Poisson, etc.) are based on the following common idea.

  • We choose an appropriate mathematical model (Gamma, Gaussian, etc.) depending on what you need.
    For example, when you want to predict human height (response variable) for some school children from weight and age (explanatory variables), it might be better to choose Gaussian, because human height roughly follows a normal (Gaussian) distribution.
  • Every model family has parameters. Next we represent these parameters with the explanatory variables.
    For example, the Gaussian (normal) distribution has the parameters μ (the mean) and σ (the standard deviation). If we suppose σ is some constant value for every observation, μ is the only parameter depending on weight and age (explanatory variables). We suppose μ = b0 + b1 x1 + b2 x2, where x1 is weight and x2 is age. (The expression differs from case to case. Later I'll explain the link function, so please keep reading.)
  • Finally we estimate and determine one set of b0, b1, and b2 from a lot of given data (weight, age, and human height).
    For example, we can compute the approximated probability of the observed human height (because μ is determined by b0, b1, b2, x1, and x2), and finally we can estimate the b0, b1, and b2 which maximize the probability of the actual heights (observed responses) with mathematical techniques.

Now let's look at each regression along this common idea.

Logistic Regression (family = binomial)

Binomial regression with the logit link function is called "logistic regression". Here I show you a brief outline.

Suppose each trial succeeds with probability q. Then it is known that k successes among n independent trials (i.e., n - k failures) occur with the following probability. (This probability distribution is called the binomial distribution.)

P = C(n, k) q^k (1 - q)^(n - k)

where C(n, k) = n! / (k! (n - k)!)

P is a function of n, k, and q. But when n and k are given values, it represents a function of the unknown parameter q alone:

L(q) = C(n, k) q^k (1 - q)^(n - k), where n and k are given values.

When a lot of sets (n_i, k_i) are given, we can define the total probability as the product over all observations, because the multiplication of probabilities means the logical product (all observations occurring together).

Let's consider the following case: we have small fish in a fish tank, and we survey the survival rate depending on the amount of food and the amount of oxygen supplied, as follows.

Alive(y1) Dead(y2) Food(x1) Oxygen(x2)
39 11 300 2000
30 20 150 1200
12 38 100 1000
...

In the previous function we supposed the parameter q is always the same. But in this case the parameter q (the probability of survival) depends on the food (x1) and the oxygen (x2). (This q is driven by the linear predictor b0 + b1 x1 + b2 x2 in the previous illustration.)

Then the model is written as follows. As you can see, this is a function of the parameters b0, b1, b2 for the given data.

That is, we should find the parameters b0, b1, b2 which maximize this likelihood L.

In logistic regression, the following function is often used to express q instead of the raw linear predictor: q = 1 / (1 + exp(-(b0 + b1 x1 + b2 x2))). This is called the logistic function (strictly speaking, the inverse of this function is called the link function), and I describe the link function later. (Please keep reading.)

As a result, the likelihood function of logistic regression is written as follows:

L(b0, b1, b2) = Π_i C(n_i, k_i) q_i^(k_i) (1 - q_i)^(n_i - k_i),  where q_i = 1 / (1 + exp(-(b0 + b1 x1_i + b2 x2_i)))

Let's see the following trivial example in R.
As you can see, we simulate with the values b0 = -15, b1 = 4, and b2 = 2, and the estimated results are -14.95783, 3.98255, and 1.99622. (See the following output.)

library(e1071) # for sigmoid func

#####
# create sample data with rbinom
#####
# y1 : number of alive
# y2 : number of dead
# x1 : food
# x2 : oxygen
#####
n_sample <- 1000
total <- 50
b0 <- -15
b1 <- 4
b2 <- 2
x1 <- runif(
  n = n_sample,
  min = 0,
  max = 5)
x2 <- runif(
  n = n_sample,
  min = 0,
  max = 5)
probs <- sigmoid(b0 + b1 * x1 + b2 * x2)
y1 <- rbinom(
  n = n_sample,
  size = total,
  prob = probs)
y2 <- total - y1
sampledat <- data.frame(
  x1 = x1,
  x2 = x2,
  y1 = y1,
  y2 = y2
)

#####
# Estimate
#####
fit <- glm(
  cbind(y1, y2) ~ x1 + x2,
  data = sampledat,
  family = binomial(link = "logit"))
summary(fit)

Note : The default link function for binomial is "logit", so you do not need to specify link = "logit" explicitly. (Here I specify it explicitly for clarity.)
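As a side check (my own sketch, not in the original post), you can recover roughly the same coefficients by maximizing the binomial log-likelihood directly with optim(); this assumes the sampledat and total objects created by the script above.

negloglik <- function(b) {
  p <- plogis(b[1] + b[2] * sampledat$x1 + b[3] * sampledat$x2)  # sigmoid of the linear predictor
  -sum(dbinom(sampledat$y1, size = total, prob = p, log = TRUE)) # negative binomial log-likelihood
}
optim(c(0, 0, 0), negloglik)$par  # should be close to coef(fit)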

Let's see another example.
The following dataset is similar to the previous one (sampledat), but each row has a z column which identifies "Dead" (0) or "Alive" (1) for each individual fish, instead of the counts y1 and y2.

sampledatA

Survival(z) Food(x1) Oxygen(x2)
1 300 2000
0 100 1000
1 300 2000
0 100 1000
0 300 2000
1 300 2000
1 100 1000
...

This dataset can be rewritten in the following format, which is the same as before, so you could analyze it with the previous code.

sampledatB

Alive(y1) Dead(y2) Food(x1) Oxygen(x2)
3 1 300 2000
1 2 100 1000
...

In R you don't need to convert the dataset; you can just write the model as follows, using the z (0 or 1) column.

## with "Survival" column z (0 or 1)
fit2 <- glm(
  z ~ x1 + x2,
  data = sampledatA,
  family = binomial(link = "logit"))
summary(fit2)

If you're using the Microsoft R runtime, you can use rxLogit() in RevoScaleR or rxFastLinear() in MicrosoftML for scalable, high-performance execution, as follows.
I don't describe the details here, but with rxLogit() you can also specify intermediate values in [0, 1] (0.7, 0.25, etc.) for the response variable, and moreover you can use the Frisch–Waugh–Lovell theorem (see Wikipedia) for categorical data.

fit3 <- rxLogit(
  z ~ x1 + x2,
  data = sampledatA)
summary(fit3)

Poisson Regression (family = poisson)

Suppose some event occurs λ times on average in some time interval. Then the probability P(k) of the same event occurring k times in that interval is known to be as follows. (This is called the Poisson distribution.)

P(k) = λ^k exp(-λ) / k!

Let's consider the following example: we plant some seeds in a garden, and we survey how many seeds germinate depending on the amount of water and fertilizer.
In this model we consider that λ (the average number of germinating seeds) depends on the water (x1) and the fertilizer (x2). That is, the parameter λ is driven by the linear predictor.

In Poisson regression, the following function is often used for λ instead: λ = exp(b0 + b1 x1 + b2 x2). The inverse of this function (i.e., the logarithm) is called the link function, and I describe the details of the link function later. (Please keep reading.)

Suppose a lot of data for k, x1, and x2 are given as follows (i.e., observations (k_i, x1_i, x2_i)).

Germinating(k) Water(x1) Fertilizer(x2)
11 300 2000
8 150 1200
5 100 1000
...

Then we can define the total probability as the product over all observations, as follows. (See above for why we multiply.)
We should find the parameters b0, b1, b2 which maximize this likelihood, using maximum likelihood estimation.

Note : In most cases the log-likelihood, ln L, is used for the estimation instead of the likelihood L itself.

Let's see the following trivial example in R.
As you can see, we simulate with the values b0 = 0.2, b1 = 0.5, and b2 = 0.4, and the estimated results are 0.215383, 0.498815, and 0.402523.

#####
# create sample data with rpois
#####
# y : germinating
# x1 : water
# x2 : fertilizer
#####
n_sample <- 1000
b0 <- 0.2
b1 <- 0.5
b2 <- 0.4
x1 <- runif(
  n = n_sample,
  min = 0,
  max = 5)
x2 <- runif(
  n = n_sample,
  min = 0,
  max = 5)
lambdas <- exp(b0 + b1 * x1 + b2 * x2)
y <- rpois(
  n = n_sample,
  lambda = lambdas)
sampledat <- data.frame(
  x1 = x1,
  x2 = x2,
  y = y
)

#####
# Estimate
#####
fit <- glm(
  y ~ x1 + x2,
  data = sampledat,
  family = poisson(link = "log"))
summary(fit)

Note : The default link function for poisson is "log", so you do not need to specify link = "log" explicitly.
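As a quick check (my own sketch, not in the original post), you can compare the fitted model's prediction with the true generating mean; this assumes the fit object and the b0, b1, b2 values from the script above.

newdat <- data.frame(x1 = 2, x2 = 3)
predict(fit, newdata = newdat, type = "response")  # fitted mean: exp(estimated linear predictor)
exp(b0 + b1 * 2 + b2 * 3)                          # true mean used to generate the data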

If you’re using Microsoft R runtime, you can use rxGlm in RevoScaleR for scaled execution.

fit2 <- rxGlm(
  y ~ x1 + x2,
  data = sampledat,
  family = poisson(link = "log"))
summary(fit2)

Linear Regression (family = gaussian)

This regression is straightforward, and is represented by the following linear expression:

y = b0 + b1 x1 + … + ε  (with Gaussian noise ε)

In the previous binomial and Poisson regressions, maximum likelihood estimation (MLE) was used for the parameter estimation.
But here we can use ordinary least squares (OLS) as follows. (The following assumes that only x1 is used as the explanatory variable.) We take the difference between the actual value and the predicted value; since the difference can be positive or negative, we square each difference.

E = Σ_i ( y_i - (b0 + b1 x1_i) )^2

We should find the parameters b0, b1 which minimize this E.

Why is it called "Gaussian"?
Consider the normal (Gaussian) distribution N(μ, σ^2). Suppose the standard deviation σ (the square root of the variance) is some constant value for every observation; then the probability depends only on the parameter μ (the mean).
Now consider maximum likelihood estimation (MLE) for this Gaussian distribution. Suppose μ = b0 + b1 x1, and suppose each observed y_i falls within some small range of width Δ (Δ is a fixed value). If Δ is very small, the product of the likelihoods can be written as follows. (See above for why we multiply.)

L ≈ Π_i Δ (1 / sqrt(2 π σ^2)) exp( -(y_i - (b0 + b1 x1_i))^2 / (2 σ^2) )

Now we take the logarithm of this likelihood (the log-likelihood):

ln L ≈ Σ_i [ ln Δ - (1/2) ln(2 π σ^2) - (y_i - (b0 + b1 x1_i))^2 / (2 σ^2) ]

This log-likelihood is maximized when we minimize Σ_i (y_i - (b0 + b1 x1_i))^2, because Δ and σ are constants.
That is, maximum likelihood estimation (MLE) for the Gaussian distribution reduces to ordinary least squares (OLS) for linear regression.

For example, it is known that human height follows a normal distribution with some fixed mean and standard deviation. Suppose the average height μ depends on human weight (x1) and that the relation is linear, μ = b0 + b1 x1.
In this case, you can get the regression result with the OLS approach.

Let's see the following trivial example in R.
As you can see, we simulate with the values b0 = 44 and b1 = 0.14, and the estimated results are 44.085248 and 0.139133.

#####
# create sample data
#####
# y : height (inch)
# x1 : weight (pond)
#####
n_sample <- 1000
b0 <- 44
b1 <- 0.14
x1 <- runif(
  n = n_sample,
  min = 45,
  max = 180)
epsilon <- rnorm(
  n = n_sample,
  mean = 0,
  sd = 1.57)  # noise (variance)
y <- b0 + b1 * x1 + epsilon
sampledat <- data.frame(
  x1 = x1,
  y = y
)

#####
# Estimate
#####
fit <- glm(
  y ~ x1,
  data = sampledat,
  family = gaussian(link = "identity"))
summary(fit)

Note : link = "identity" means the link function g(μ) = μ; that is, no specific transformation is applied. (The default link function for gaussian is "identity", so you do not need to specify link = "identity" explicitly.)
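Since the Gaussian MLE reduces to OLS, lm() and glm(..., family = gaussian) should give identical coefficient estimates; a quick sketch assuming the sampledat generated above:

coef(glm(y ~ x1, data = sampledat, family = gaussian))
coef(lm(y ~ x1, data = sampledat))  # same estimates, obtained by ordinary least squares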

If you’re using Microsoft R runtime, you can use rxLinMod in RevoScaleR or rxFastLinear in MicrosoftML for scaled and high-performance execution.

fit2 <- rxLinMod(
  y ~ x1,
  data = sampledat)
summary(fit2)

Gamma Regression (family = Gamma)

Suppose some event occurs r times on average per unit (1) interval. Then the probability density function f(t) of the waiting time t until the s-th occurrence of that event is known to be as follows:

f(t) = r^s t^(s - 1) exp(-r t) / Γ(s)

where Γ is the Gamma function. (Γ(s) = (s - 1)! for any positive integer s, and it extends the factorial to real and complex numbers. See "Wikipedia : Gamma function".)

Note : Keep in mind that this is not a probability itself but a probability density function. When you estimate with MLE, you must integrate it, or take the probability over some fixed small range as I described above (see the Gaussian regression).

This regression is often used to analyze waiting times, time between failures, accident rates, and so on.
There are two parameters, s (the "shape") and r (the "rate"), but in this regression we suppose that s is the same (constant) for every observation.

For example, let's consider the time to failure (breakdown) for some equipment, and suppose this depends on the number of initial bugs (x1) and the number of customer claims (x2). Suppose 1 failure (breakdown) is caused by 10 errors. When the number of initial bugs increases, the errors in production also increase. That is, when x1 (the number of initial bugs) increases, the count of errors per failure stays constant, but r (the count of errors per unit interval) increases. Consequently, the average time to failure decreases.

In Gamma regression we model the average time, which is s / r and is therefore proportional to 1 / r. (1 / r is called the "scale".)
Hence the following inverse link function is often used for Gamma regression, relating the linear predictor to the mean μ = s / r by 1 / μ = b0 + b1 x1 + b2 x2. (I describe the details of the link function later.)

Note : I don't explain the details here, but s (the shape) can also be estimated. (The variance of the Gamma distribution is known to be s / r^2.)

Let's see the following trivial example in R.
As you can see, we simulate with the values b0 = 0.5, b1 = 0.02, and b2 = 0.01, and the estimated results are 0.504847, 0.021330, and 0.008929.

#####
# create sample data with rgamma
#####
# y : time to failure
# x1 : number of initial bugs
# x2 : number of customer claims
#####
n_sample <- 1000
shape <- 10
b0 <- 0.5
b1 <- 0.02
b2 <- 0.01
x1 <- runif(
  n = n_sample,
  min = 1,
  max = 5)
x2 <- runif(
  n = n_sample,
  min = 11,
  max = 15)
rate <- (b0 + b1 * x1 + b2 * x2) * shape
y <- rgamma(
  n = n_sample,
  shape = shape,
  rate = rate)
sampledat <- data.frame(
  x1 = x1,
  x2 = x2,
  y = y
)

#####
# Estimate
#####
fit <- glm(
  y ~ x1 + x2,
  data = sampledat,
  family = Gamma(link = "inverse"))
summary(fit)

When you want to get the shape (s), you can use gamma.shape() in the MASS package as follows. (Then you can reproduce the original Gamma distribution model.)

library(MASS)
...

# you can get shape as follows.
test <- gamma.shape(fit, verbose = TRUE)

If you’re using Microsoft R runtime, you can use rxGlm in RevoScaleR for scaled execution.

fit2 <- rxGlm(
  y ~ x1 + x2,
  data = sampledat,
  family = Gamma(link = "inverse"))
summary(fit2)

About Link Functions

Now it's time to explain the idea of the link function.
In this post I focus on the functions we used in the previous examples.

Log Link Function

Remember the previous Poisson regression example.
Suppose the linear predictor directly models λ (the average number of germinating seeds):

λ = b0 + b1 x1 + b2 x2    (1)

Suppose the seeds germinate 1.5 times as much with enough water and 1.2 times as much with enough fertilizer. When you give both enough water and enough fertilizer, would the seeds germinate 1.5 + 1.2 = 2.7 times as much?
Of course not; the expected value would be 1.5 * 1.2 = 1.8 times as much.
Thus (1) does not represent the actual model.

Now compare with the following model. This is called the log link, because the linear predictor b0 + b1 x1 + b2 x2 equals log λ.

λ = exp(b0 + b1 x1 + b2 x2)

When we rewrite this with c0 = exp(b0), c1 = exp(b1), c2 = exp(b2), we get the following model (2):

λ = c0 * c1^x1 * c2^x2    (2)

As you can see, (1) grows by addition, but (2) grows by multiplication. (With model (1), λ could become negative when x1 or x2 decreases, but it cannot with model (2).)

There is no absolute answer (it is case by case), but you should understand the different properties of these two models and choose the link function that matches your actual modeling.
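To make the multiplicative behavior of the log link concrete, here is a small sketch (my own addition): it assumes fit is the Poisson glm() object from the Poisson regression section above.

b <- coef(fit)
m0 <- predict(fit, newdata = data.frame(x1 = 2, x2 = 3), type = "response")
m1 <- predict(fit, newdata = data.frame(x1 = 3, x2 = 3), type = "response")
m1 / m0       # increasing x1 by one unit multiplies the predicted mean ...
exp(b["x1"])  # ... by exp(b1); it does not add a fixed amount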

Logit Link Function

The logit link is used to represent values that are 0 or 1 (or anywhere in between 0 and 1). Unlike the log link, the result never exceeds 1 for any explanatory variables. This idea is also used in neural networks and other recognition algorithms.

Remember the previous logistic regression example. Now consider the following logical model:

  • Each input x1 (food) and x2 (oxygen) has a weight, w1 and w2 respectively.
  • Compute the weighted total w1 x1 + w2 x2.
  • Compare it with some fixed threshold b. If it exceeds b, the result is 1; otherwise the result is 0.

This model captures an aspect of the actual decision mechanism, but it has one important weakness: it is not continuous (it's a step function).
If we plot the result for some fixed w1, w2, and b, we get the following. As you can see, we cannot apply approximation techniques, because the function is not continuous.

Therefore we introduce the following function, called the "sigmoid":

σ(z) = 1 / (1 + exp(-z))

As you can see below, the sigmoid is very similar to the previous step function, but it is continuous.

Now we set z = b0 + b1 x1 + b2 x2, and we get the probability q (the probability of being alive) as q = 1 / (1 + exp(-(b0 + b1 x1 + b2 x2))).

The logit link function is the inverse of this sigmoid:

logit(q) = log( q / (1 - q) ) = b0 + b1 x1 + b2 x2

Note : Here 1 - q represents the probability of being dead, and the ratio q / (1 - q) is called the odds.
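In R, the sigmoid and the logit are available as plogis() and qlogis(); a quick standalone sketch confirming they are inverses and relating them to the odds:

q <- plogis(-15 + 4 * 2.5 + 2 * 3)  # sigmoid of a linear predictor (here the value is 1)
qlogis(q)                           # logit(q) recovers the linear predictor: 1
log(q / (1 - q))                    # the same value, written as the log of the odds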

Inverse Link Function

The link g(μ) = 1 / μ, where μ is the predicted parameter, is called the inverse link function.
Remember the previous Gamma regression example. The predicted parameter is the mean μ = s / r, where s is the shape and r is the rate, and s is constant. Therefore the inverse link means the following:

1 / μ = r / s = b0 + b1 x1 + b2 x2

As I explained in the Gamma regression, r (the rate) means "the number of occurrences per unit (1) interval". Then 1 / r means "the average interval between occurrences", which is called the "scale".
If the linear predictor is proportional to the scale (i.e., inversely proportional to the rate), you can specify the identity link function (i.e., no specific link function) for the Gamma regression as follows, instead of the inverse link function.

fit <- glm(
  y ~ x1 + x2,
  data = sampledat,
  family = Gamma(link = "identity"))

Offset

When you want to include some input variable without an estimated coefficient (i.e., with a coefficient fixed to 1), you can use an offset.

For instance, let's introduce a new input variable, the "observation count" (x3), into the previous Poisson regression example.
Then the following linear predictor is used:

log λ = b0 + b1 x1 + b2 x2 + log(x3)

In such a case, you can specify the offset as follows.

fit <- glm(
  y ~ x1 + x2,
  data = sampledat,
  family = poisson(link = "log"),
  offset = log(x3))

You might think you could just divide the response y by x3 instead of using an offset, but by using the offset you can analyze the data without changing the variance or other statistical properties.

 

Understanding ARIMA time series analysis with R (part 2)

Part1 : Discuss under the stationarity – ARMA
Part2 : Non-stationarity and Seasonality – ARIMA and Seasonal ARIMA  (<- you are here)

Non-stationary time series

In my previous post, I explained several ideas of AR, MA, ARMA under the condition of stationarity for the beginning of your understanding.

But as you know, almost all practical real-world time series (stock prices, sales revenue, etc.) are non-stationary, or a mixture of stationary and non-stationary behavior. To understand non-stationary models, it's important to first understand the ideas of stationarity, because the non-stationary models are built on the stationary ones.

The idea of the non-stationary model is to consider the series of differences y_t - y_{t-1} and assess the stationarity of that differenced series.
If the differenced series follows ARMA(p, q) (i.e., a stationary model), we can recover the original series y_t. This model is called the ARIMA (autoregressive integrated moving average) model, denoted ARIMA(p, 1, q).
In general, ARIMA(p, d, q) means that the series of d-th order differences is ARMA(p, q).

As I described in my previous post, every MA model is stationary, so I'll focus on the AR part when considering non-stationary models.

As you can see below, the mathematical behavior of a non-stationary model is vastly different from a stationary one, even when the parameters φ differ only slightly. (See below. As I described in my previous post, AR(1) is stationary when |φ1| < 1.)
Unlike a stationary process, a non-stationary process generally diverges. Thus several mathematical approaches (ideas) must change under non-stationarity.

Sample plot (stationary)

Sample plot (non-stationary)

Sample plot (non-stationary)

Identifying a model is also not easy.
Unlike the stationary case, we cannot use the t-distribution (which is often used in statistical testing) for estimation under non-stationarity.

First we consider the following AR(p) expression (1). (See my previous post for the white noise ε_t.)

y_t = c + φ1 y_{t-1} + φ2 y_{t-2} + … + φp y_{t-p} + ε_t    (1)

The following equation (2) is called the characteristic equation of (1):

1 - φ1 z - φ2 z^2 - … - φp z^p = 0    (2)

This equation (2) follows :

  • If the absolute values of all roots are larger than 1, (1) is stationary.
  • If z = 1 is a root of multiplicity d (i.e., the characteristic equation has a factor (1 - z)^d) and all the other roots' absolute values are larger than 1, the series of d-th order differences of (1) is stationary.
    In particular, if there is exactly one unit root (d = 1), the differenced series y_t - y_{t-1} of (1) is stationary. This is called a unit root process.

Now let's consider the simple case of AR(1), which has the equation y_t = c + φ1 y_{t-1} + ε_t. In this case it's stationary if |φ1| < 1, and it's a unit root process if φ1 = 1.

If y_t is a stationary process (i.e., |φ1| < 1), we can identify the model using the approach I described in the previous post. If y_t is a unit root process (i.e., φ1 = 1), we can also identify the model, because the differenced series y_t - y_{t-1} is stationary. If neither holds, you can repeat the procedure for the differenced series.
Therefore, when real sample data is given, it's important to know whether it is a unit root process or a stationary process.

Under a unit root process we cannot use the t-distribution for estimation, because the process is not stationary. Instead, we use the Dickey-Fuller test to determine whether the model is a unit root process.

Here (in this post) I don't describe the details of this test technique, but the outline is:

  1. Consider a continuous stochastic process built from the data: roughly, map the rescaled values y_{⌊nu⌋} / (σ √n) onto u in [0, 1], where n is the size of the given data and σ is the standard deviation of the noise.
  2. As n gets larger, this process converges in distribution to standard Brownian motion (a standard Wiener process) on [0, 1].
  3. Estimate and test the original unit root process based on this well-known distribution (standard Brownian motion) with sufficiently large n.

An extension of this test to AR(p) is called the Augmented Dickey–Fuller (ADF) test. (I don't explain how to extend the technique here.) You can use adf.test() in the tseries package in R for the ADF test.
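For example, a minimal sketch of the ADF test (assuming y$REVENUE is the revenue series loaded in the next script):

library(tseries)

adf.test(y$REVENUE)        # a large p-value: we cannot reject a unit root
adf.test(diff(y$REVENUE))  # a small p-value: the differenced series looks stationary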

When you identify a model manually, you must follow the previous steps. But when you just want the result in R, you can analyze it without any difficulty: as in the stationary case (see my previous post), you can simply use auto.arima() to identify the model.

library(forecast)

# read data
y <- read.table(
  "C:\Demo\revenue_sample.txt",
  col.names = c("DATETIME", "REVENUE"),
  header = TRUE,
  quote = """,
  fileEncoding="UTF-8",
  stringsAsFactors = FALSE)

# analyze
test = auto.arima(y$REVENUE)
summary(test)

Here the "drift" (see the output above) is the trend of the differenced series y_t - y_{t-1}. Therefore the drift will be close to the slope (gradient) of y_t. (The following is a plot of y_t, which clearly has a linear trend.)

Hence the simulated model of y_t is given by the fitted equation with the estimated coefficients and this drift term.

As I described above, you must pay attention to the difference between stationarity and non-stationarity, even when you're using R.
For example, when you predict (forecast) future values under non-stationarity, the predicted mean (conditional mean) does not converge to a fixed mean (it keeps moving by the drift), and the predicted variance keeps growing. (See below.)
Please compare the following forecast plot (non-stationary result) with the one in my previous post (stationary result).

...

# analyze
test = auto.arima(y$REVENUE)
summary(test)

# predict
pred <- forecast(
  test,
  level=c(0.85,0.95),
  h=50)
plot(pred)

Here I don’t explain about the details, but you must also carefully apply some other machine learning algorithms (regression, anomaly detection, …), especially in the non-stationary time-series. (For example, see “Spurious relationship” in Wikipedia.)

Backshift notation

Before discussing the subsequent topics, let me introduce a convenient notation called "backshift notation".

With this notation we denote y_{t-1} by B y_t. Of course, this doesn't mean multiplying y_t by some real number B; it's just notation! Likewise, B^2 y_t means y_{t-2}.

Using this notation, we can easily handle time-series differences with arithmetic operations.
For example, (1 - B) y_t means the first difference of the sequence:

(1 - B) y_t = y_t - y_{t-1}

(1 - B)^2 y_t means the 2nd order difference:

(1 - B)^2 y_t = (y_t - y_{t-1}) - (y_{t-1} - y_{t-2})

By repeating this operation, you get the d-th order difference as (1 - B)^d y_t.
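In R, these repeated (1 - B) operations correspond to the diff() function; a quick standalone check:

x <- c(1, 4, 9, 16, 25)          # example series
diff(x)                          # (1 - B) x : first differences  -> 3 5 7 9
diff(x, differences = 2)         # (1 - B)^2 x : second differences -> 2 2 2
diff(diff(x))                    # the same, applying (1 - B) twice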

With this notation, ARIMA(1,1,0) is written as the following equation, where c is a constant:

(1 - φ1 B)(1 - B) y_t = c + ε_t

This notation clarifies the structure of the model, because (1 - φ1 B) represents the AR(1) part and (1 - B) represents the differencing part.
Thus, when you want to write ARIMA(1,d,0), you easily get the following representation:

(1 - φ1 B)(1 - B)^d y_t = c + ε_t

Now we rewrite the previous expression (1) with backshift notation:

(1 - φ1 B - φ2 B^2 - … - φp B^p) y_t = c + ε_t

As you can see, the operator on the left-hand side has the same form as the characteristic polynomial in expression (2).
If (2) can be factored as Y(z)(1 - z)^d = 0, where all roots of Y(z) have absolute value larger than 1, then the series of d-th order differences of (1) is stationary, because the process can be written as follows (Y(B) is the stationary part and (1 - B)^d is the d-th order difference):

Y(B)(1 - B)^d y_t = c + ε_t

As you can see here, the earlier consequence follows easily with backshift notation.

Seasonal ARIMA modeling

With backshift notation, you can easily understand the background of seasonal ARIMA.
First we see the basic concept using a simple AR(1) (with no differencing, d = 0).

Suppose the sales revenue is strongly affected by the revenue of the same month a year ago (12 months ago), and this relation is written by the following expression. (For simplicity, we don't write the constant value c.)

(1 - Φ1 B^12) y_t = ε_t

Note : Φ1 here is the seasonal AR coefficient. We often use an upper-case character for the seasonal part, to distinguish it from the non-seasonal part.

Suppose the revenue is also affected by the revenue of one month ago.
Then the mixed model is represented as follows. This is denoted ARIMA(1,0,0)(1,0,0)[12].

(1 - φ1 B)(1 - Φ1 B^12) y_t = ε_t

What does that model mean? (Why do we multiply?)

When we expand this representation, we get the following:

y_t - Φ1 y_{t-12} = φ1 (y_{t-1} - Φ1 y_{t-13}) + ε_t

As you can see, (y_t - Φ1 y_{t-12}) is the revenue at time t with the effect of 12 months ago removed. Likewise, (y_{t-1} - Φ1 y_{t-13}) is the revenue at time t - 1 with the effect of 12 months ago removed.
Therefore this expression means "an AR(1) model on last month's value, after eliminating the effect of 12 months ago".

It can also be written in the following form, and you can see the converse reading holds as well:

y_t - φ1 y_{t-1} = Φ1 (y_{t-12} - φ1 y_{t-13}) + ε_t

Here I showed an example with no differencing, but when the model has a difference of order d, you just multiply by (1 - B) in backshift notation.
For example, ARIMA(1,1,0)(1,1,0)[12] is represented as follows:

(1 - φ1 B)(1 - Φ1 B^12)(1 - B)(1 - B^12) y_t = ε_t

  • (1 - φ1 B) is the AR(1) of the non-seasonal part.
  • (1 - Φ1 B^12) is the AR(1) of the seasonal part.
  • (1 - B) is the difference of the non-seasonal part.
  • (1 - B^12) is the difference of the seasonal part.

Note : If you have an MA(1) part, you write it on the right-hand side, e.g. (1 + θ1 B) ε_t. (In this post we don't discuss the MA part of seasonal ARIMA.)

When we expand the equation of ARIMA(1,0,0)(1,0,0)[12], you find that it is the same as an ordinary AR model with lags 1, 12, and 13:

y_t = φ1 y_{t-1} + Φ1 y_{t-12} - φ1 Φ1 y_{t-13} + ε_t

Hence, if we didn't know the seasonality, we would have to estimate coefficients for all possible lags up to 13. (Imagine the case of seasonality with a seasonal lag of 365.)

The following are the ACF and PACF of this seasonal model computed with R. (See my previous post about ACF and PACF; I don't explain their meaning here.)
As you can see in the result, there are spikes at lags in both the non-seasonal and the seasonal parts. Thus ACF and PACF can also be used to identify a seasonal model.

Autocorrelation (ACF) sample

Partial Autocorrelation (PACF) sample

Let's see a brief example in R.
We start by identifying the model and forecasting with it. As you can see in the following result, without seasonal estimation we get an undesirable result. (The result is ARIMA(0,1,1) with drift.)

library(forecast)

# read data
y <- read.table(
  "C:\Demo\seasonal_sample.txt",
  col.names = c("DATETIME", "REVENUE"),
  header = TRUE,
  fileEncoding="UTF-8",
  stringsAsFactors = FALSE)

# analyze
test <- auto.arima(y$REVENUE)
summary(test)

# predict and plot
pred <- forecast(
  test,
  level=c(0.85,0.95),
  h=100)
plot(pred)

Suppose we know that this dataset has seasonality with lag 12 (for instance, 12 months).
Now we fix the previous R script to use this known seasonality as follows. (The change is in the auto.arima() call, which now receives a ts object with frequency 12.)
As you can see below, we get the desirable result with seasonal ARIMA analysis.

library(forecast)

# read data
y <- read.table(
  "C:\Demo\seasonal_sample.txt",
  col.names = c("DATETIME", "REVENUE"),
  header = TRUE,
  fileEncoding="UTF-8",
  stringsAsFactors = FALSE)

# analyze with seasonal s=12
test <- auto.arima(ts(data=y$REVENUE, freq=12))
summary(test)

# predict and plot
pred <- forecast(
  test,
  level=c(0.85,0.95),
  h=100)
plot(pred)

Understanding ARIMA time series analysis with R (part 1)

Part 1 : Discuss under the stationarity – ARMA  (<- you are here)
Part2 : Non-stationarity and Seasonality – ARIMA and Seasonal ARIMA

In the beginning …

Time-series analysis is frequently needed in practical data analysis, and the skill can help any developer solve many real problems.
For example, almost all marketing or social analyses involve some kind of seasonality. Even your application's access logs might show "peak times" (trends in access frequency) or similar patterns.

Time-series analysis rests on a mathematical background. However, throughout this post I focus on explaining why things are needed, to build your intuition quickly. (For many statements I will give only the consequence, without proofs.) I also avoid difficult terminology where possible, but note that some basic mathematical knowledge may be needed to follow along.

Before starting, a couple of notes.
First, time-series analysis has several approaches, such as spectrum analysis (Fourier transform), exponential smoothing, and so on. Here we use ARIMA, which is based on a mathematical model that represents many aspects of real time series. (The famous GARCH model also builds on these ideas.) We use the "forecast" package for ARIMA analysis in R.
Second, there is too much background to cover at once. Therefore, in this post I first focus on the fundamentals and basic ideas of ARIMA under the condition of a stationary process, and in the next post we discuss more practical topics including non-stationarity.

Note : In this post we don't care about the calendar index of time, like "day of week", "quarter", etc. (We simply use 1, 2, 3, … as the time axis.) If you want to handle such time attributes, the timekit package can help you analyze time series.

ARMA Modeling

As I mentioned above, here we assume the stationarity of the time-series data.
"Stationarity" means: the mean (expected value) does not depend on time and is constant, and the autocovariance depends only on the time difference (the lag), not on t. That is, a stationary process satisfies E[y_t] = μ and Cov(y_t, y_{t-k}) = γ_k for all t. (One important note: real time-series data may be measured only once, so the observed values are fixed. But here we consider a mathematical model to fit our time-series data, so we treat y_t not as fixed values but as random variables on a probability space.)

Note : Strictly speaking, the above definition is that of a "weak-sense stationary process". This definition is used in practical statistical analysis more often than "strictly stationary process". When we use the term "stationarity" throughout these posts, it means weak-sense stationarity.
See "Stationary process (Wikipedia)".

Intuitively, it means that the stochastic behavior of the values' motion is the same at any time t. For a brief example, a series whose values keep increasing as time passes is not a stationary process, because the mean is not constant and depends on t. (In this case the differences might follow some steady model, and we discuss that case in the next post.)

Now we consider a mathematical model to fit our time-series data under the condition of stationarity. Let's consider stock price data. (Stock prices are generally non-stationary, but here we suppose they are stationary.)

As you know, the stock price is influenced by yesterday's price: if the (closing) price at t - 1 is y dollars, the price on day t starts at y dollars.
That is, we can assume that the price is determined by the following model, where c and φ1 are constants, and ε_t is randomly distributed noise called "white noise". (For this ε_t, both the mean and the autocovariance equal 0, but it's fine to think of ε_t as Gaussian noise N(0, σ^2) to keep things easy.)

y_t = c + φ1 y_{t-1} + ε_t

Note : If φ1 is a negative value, y_t and y_{t-1} are negatively correlated (when y_{t-1} is large, y_t tends to be negatively large), and the graph will look very jagged.

This model is called AR (autoregressive), and AR(p) is defined in general as follows (so the above stock model is AR(1)):

y_t = c + φ1 y_{t-1} + φ2 y_{t-2} + … + φp y_{t-p} + ε_t

The AR model can represent many aspects of cyclic stationarity.
For example, the following R program plots an AR(2) model with φ1 = 1.8 and φ2 = -0.9. As you can see, it draws a beautiful line with cycles.

data_count <- 300
cons <- 70
y1 <- 10
phi1 <- 1.8
phi2 <- -0.9
e <- rnorm(data_count) * 10
y2 <- cons + (phi1 * y1) + e[2]
y <- data.frame(
  Indexes=c(1, 2),
  Values=c(y1, y2))
for (i in 3:data_count) {
  yi <- cons + (phi1 * y$Values[i-1]) + (phi2 * y$Values[i-2]) + e[i]
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}
plot(y)

Note : You can simply use arima.sim() (in the base stats package) for ARIMA (including ARMA) simulation. But here we write the recursion by hand to aid understanding.
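For reference, a minimal arima.sim() equivalent of the hand-written AR(2) loop above (arima.sim() produces a zero-mean series, so the constant 70 is omitted; sd = 10 mirrors the rnorm(...) * 10 noise):

set.seed(1)
y_sim <- arima.sim(
  model = list(ar = c(1.8, -0.9)),  # AR(2) coefficients (stationary)
  n = 300,
  sd = 10)                          # standard deviation of the white noise
plot(y_sim)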

Remember that not every AR expression is a stationary process; an AR model is stationary only if certain conditions hold.
AR(1) is stationary if and only if |φ1| < 1. AR(2) is stationary if and only if |φ2| < 1, φ1 + φ2 < 1, and φ2 - φ1 < 1.

Now let's consider another model.
A stock price might also be affected by the difference between the open and close prices on the day before: if there was a rapid decline in the stock price on day t, the price might rebound on day t + 1.
In such a case, we can use the following model, called MA (moving average).

y_t = μ + ε_t + θ1 ε_{t-1}

The constant part (μ is the mean) plays the same role as c in AR, but MA depends on the previous noise ε_{t-1} instead of the previous value y_{t-1}.

As with AR, the general MA(q) expression is given as follows.
It is known that MA(q) is always a stationary process.

y_t = μ + ε_t + θ1 ε_{t-1} + θ2 ε_{t-2} + … + θq ε_{t-q}

Let's see how the graph changes with the θ values.
As you can see below, the series transitions smoothly when the θ values are positive, and becomes jagged when they are negative.

Normal distribution (no MA part)

MA(3) with θ1 = 1, θ2 = 1, θ3 = 1

MA(2) with θ1 = -1, θ2 = 1

data_count <- 300
mu <- 70
# sample1
#theta1 <- 0
#theta2 <- 0
#theta3 <- 0
# sample2
theta1 <- 1
theta2 <- 1
theta3 <- 1
# sample3
#theta1 <- -1
#theta2 <- 1
#theta3 <- 0
e <- rnorm(data_count) * 10
y1 <- mu + e[1]
y2 <- mu + e[2] + (theta1 * e[1])
y3 <- mu + e[3] + (theta1 * e[2]) + (theta2 * e[1])
y <- data.frame(
  Indexes=c(1, 2, 3),
  Values=c(y1, y2, y3))
for (i in 4:data_count) {
  yi <- mu + e[i] + (theta1 * e[i-1]) + (theta2 * e[i-2]) + (theta3 * e[i-3])
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}
plot(y, type="l")

Finally, the combined model is called the ARMA model, and it's given as follows.
As you can see below, the former part is AR(p) and the latter is MA(q).

y_t = c + φ1 y_{t-1} + … + φp y_{t-p} + ε_t + θ1 ε_{t-1} + … + θq ε_{t-q}

Estimate – Choose your model and parameters

Now we consider the prediction of time-series data. First we select (identify) the appropriate model and parameters (p, q, φ_i, θ_i, etc.).

When you select the model (AR, MA, ARMA) and the orders (p, q), you should focus on the characteristics of each model. Each model has different characteristics for the autocorrelation (ACF) and the partial autocorrelation (PACF).

Autocorrelation (Corr) is the autocovariance divided by the variance, Corr(k) = Cov(y_t, y_{t-k}) / Var(y_t). By dividing by the variance, it is not affected by the scale (units) of the data.
As noted above, Corr depends only on the lag k (not on t), because Cov depends only on k. That is, the autocorrelation measures the effect (correlation) of lag k.

On the other hand, the partial autocorrelation is the effect of lag k after eliminating the effects of lags 1, 2, …, k-1. It is obtained by solving linear equations for the covariances. (This requires matrix calculations, and we don't give the formal definition of the partial autocorrelation in this post.)

AR(p), MA(q), and ARMA(p, q) models have the following different characteristics for the autocorrelation and the partial autocorrelation.
For example, if the sample partial autocorrelation equals 0 for lags >= 3, there is a high probability that the model is AR(2).

  • AR(p) : As the lag grows, the autocorrelation decreases exponentially (but remains non-0). When the lag is larger than p, the partial autocorrelation equals 0.
  • MA(q) : When the lag is larger than q, the autocorrelation equals 0. As the lag grows, the partial autocorrelation decreases exponentially (but remains non-0).
  • ARMA(p, q) : As the lag grows, both the autocorrelation and the partial autocorrelation decrease exponentially (but remain non-0).

             autocorrelation (ACF)     partial autocorrelation (PACF)
AR(p)        decreases exponentially   0 for lags >= p + 1
MA(q)        0 for lags >= q + 1       decreases exponentially
ARMA(p, q)   decreases exponentially   decreases exponentially

Note : You may wonder why the partial autocorrelation of MA is not 0, even though Corr is 0 for lags >= q + 1.
The reason lies in the difference between the definitions of the autocorrelation and the partial autocorrelation, and it's not so important here.

The following is a sample of plotting the autocorrelation and the partial autocorrelation with R. (The x axis is the lag, and the y axis is the correlation value.)

library(forecast)

data_count <- 10000
cons <- 70
y1 <- 10
phi1 <- 1.8
phi2 <- -0.9
e <- rnorm(data_count) * 10
y2 <- cons + (phi1 * y1) + e[2]
y <- data.frame(
  Indexes=c(1, 2),
  Values=c(y1, y2))
for (i in 3:data_count) {
  yi <- cons + (phi1 * y$Values[i-1]) + (phi2 * y$Values[i-2]) + e[i]
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}

# Autocorrelation plotting
acf(
  y$Values,
  lag.max = 80,
  type = c("correlation"),
  plot = T)
# Partial Autocorrelation plotting
pacf(
  y$Values,
  lag.max = 80,
  plot = T)

Autocorrelation sample of AR(2)

Partial Autocorrelation sample of AR(2)

data_count <- 10000
mu <- 70
theta1 <- 1
theta2 <- -1
e <- rnorm(data_count) * 10
y1 <- mu + e[1]
y2 <- mu + e[2] + (theta1 * e[1])
y <- data.frame(
  Indexes=c(1, 2),
  Values=c(y1, y2))
for (i in 3:data_count) {
  yi <- mu + e[i] + (theta1 * e[i-1]) + (theta2 * e[i-2])
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}
# Autocorrelation plotting
acf(
  y$Values,
  lag.max = 80,
  type = c("correlation"),
  plot = T)
# Partial Autocorrelation plotting
pacf(
  y$Values,
  lag.max = 80,
  plot = T)

Autocorrelation sample of MA(2)

Partial Autocorrelation sample of MA(2)

Note : As you can see, it's quite hard to tell whether a value equals 0 or is decreasing exponentially, because the values are only approximations.
In real estimation, AIC or BIC (information criteria) are often used to validate the choice of model (the number of parameters).
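For example, a minimal sketch comparing candidate orders by AIC with Arima() from the forecast package (it assumes y$Values is the simulated AR(2) series generated in the next script; smaller AIC is better):

library(forecast)

Arima(y$Values, order = c(2, 0, 0))$aic  # AR(2)
Arima(y$Values, order = c(1, 0, 0))$aic  # AR(1)
Arima(y$Values, order = c(1, 0, 1))$aic  # ARMA(1,1)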

For the estimation of the parameter values, you can use estimation techniques like the method of moments, least squares, or maximum likelihood. For example, using the given samples y_1, …, y_n, you can determine the parameter values (c, φ_i, θ_i, etc.) which maximize the likelihood of the observed data. (We assume the noise ε_t follows the normal distribution.)

After you determine the model and parameters, you can examine the sample (the given real data) with the usual techniques of statistical analysis.

With R, you can analyze without any difficulty using auto.arima() in the "forecast" package. The following is a brief example in which we create data with AR(2) (φ1 = 1.8, φ2 = -0.9) and analyze it with auto.arima().
As you can see below, we get a good result close to the true parameters. (Here ARIMA(p, 0, q) means ARMA(p, q).)

library(forecast)

# create time-series data with ARMA(2, 0)
data_count <- 1000
cons <- 70
y1 <- 10
phi1 <- 1.8
phi2 <- -0.9
e <- rnorm(data_count) * 10
y2 <- cons + (phi1 * y1) + e[2]
y <- data.frame(
  Indexes=c(1, 2),
  Values=c(y1, y2))
for (i in 3:data_count) {
  yi <- cons + (phi1 * y$Values[i-1]) + (phi2 * y$Values[i-2]) + e[i]
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}

# analyze
test <- auto.arima(y$Values, seasonal = F)
summary(test)

Let's consider another example with R.
The following data follows sin() (with some noise), and it would be easy to analyze with a spectrum technique, but here we represent it as a similar stationary model with auto.arima().

library(forecast)

data_count <- 200
y <- data.frame(
  Indexes=c(1:data_count),
  Noise=rnorm(data_count))
y$Values <-  150 + (100 * sin(pi * y$Indexes / 10)) + (5 * y$Noise)

plot(x = y$Indexes, y = y$Values)
test <- auto.arima(y$Values, seasonal = F)
summary(test)

The following is an ARMA simulation with the estimated parameters. (As I described above, you can use arima.sim() for simulation, but we don't use that function here, to aid understanding.)
As you can see below, this model seems to closely follow the cycle of the given (real) data.

library(forecast)

data_count <- 200
phi1 <- 1.1690
phi2 <- 0.3636
phi3 <- -0.6958
c <- 149.9833 * abs(1 - phi1)
e <- rnorm(data_count, sd = sqrt(66.94))
y1 <- c + (phi1 * c) + e[1]
y2 <- c + (phi1 * y1) + e[2]
y3 <- c + (phi1 * y2) + (phi2 * y1) + e[3]
y <- data.frame(
  Indexes=c(1, 2, 3),
  Values=c(y1, y2, y3))
for (i in 4:data_count) {
  yi <- c + (phi1 * y$Values[i-1]) + (phi2 * y$Values[i-2]) + (phi3 * y$Values[i-3]) + e[i]
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}
plot(y, type="l")

The above are primitive examples, but now we change the original data as follows.
This data is much more affected by the white noise, compared with the sin() cycle.

y <- data.frame(
  Indexes=c(1:2000),
  Noise=rnorm(2000))
y$Values <-  20 + (10 * sin(y$Indexes / 30)) + (5 * y$Noise)

plot(x = y$Indexes, y = y$Values)
test <- auto.arima(y$Values, seasonal = F)
summary(test)

The following is the result of analyzing this data with auto.arima().

When you simulate from this result, you can see the differences (errors) between the original data and the simulated one.
The noise affects the result of the estimation. (You must account for the possibility of errors in real practical analysis.)

library(forecast)

data_count <- 2000
phi1 <- 0.9865
theta1 <- -0.8565
theta2 <- -0.0053
theta3 <- 0.0187
theta4 <- -0.0188
theta5 <- 0.0864
c <- 20.4084 * abs(1 - phi1)
e <- rnorm(data_count, sd = sqrt(28.21))
y1 <- c + (phi1 * c) + e[1]
y2 <- c + (phi1 * y1) + e[2] + (theta1 * e[1])
y3 <- c + (phi1 * y2) + e[3] + (theta1 * e[2]) + (theta2 * e[1])
y4 <- c + (phi1 * y3) + e[4] + (theta1 * e[3]) + (theta2 * e[2]) + (theta3 * e[1])
y5 <- c + (phi1 * y4) + e[5] + (theta1 * e[4]) + (theta2 * e[3]) + (theta3 * e[2]) + (theta4 * e[1])
y <- data.frame(
  Indexes=c(1, 2, 3, 4, 5),
  Values=c(y1, y2, y3, y4, y5))
for (i in 6:data_count) {
  yi <- c + (phi1 * y$Values[i-1]) + e[i] + (theta1 * e[i-1]) + (theta2 * e[i-2]) + (theta3 * e[i-3]) + (theta4 * e[i-4]) + (theta5 * e[i-5])
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}
plot(y, type="l")

In these examples I used data with trivial sin() cycles for ease of understanding, but remember that our goal is to find the most likely model given the explanatory variables, not to reproduce a similar-looking graph.
Even with the same (correct) model, the simulated graph can look vastly different, because it includes stochastic values ε_t which cause the paths to diverge. The important thing is "modeling", not "drawing".

When you want to take known periodic cycles, trend, or noise into account in your analysis, you can mix in approaches such as extracting the moving trend (with the ma() function), and then analyze only the trend or the remainder. For example, the following original data (see the "data" panel) repeatedly shows a pair of cycles with a small peak and a large peak, and also has a large trend wave that rises and falls. (Here it has been decomposed with the stl() function, as follows.)
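A minimal sketch of such a decomposition (the frequency value 20 is only an illustrative assumption; use the real period of your data):

library(forecast)  # for ma(); stl() is in base R

y_ts <- ts(y$Values, frequency = 20)            # hypothetical period of 20 steps
decomposed <- stl(y_ts, s.window = "periodic")  # split into seasonal / trend / remainder
plot(decomposed)

trend <- ma(y_ts, order = 20)                   # or: extract a moving-average trend directly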

 

Predict the future

When the model and parameters are determined, now you can predict the future values with the generated model.

The predicted values are called "conditional expected values" or "conditional means", because we predict the future values using their distribution conditioned on the given values (the values in the past). (We assume the noise ε_t follows the normal distribution.)

By contrast, the approach of predicting an interval of possible values is called "interval forecasting", for example showing a 95% interval.
This possibility zone (whose width is driven by the variance, the mean squared error (MSE)) grows larger as the forecast moves further into the future. With interval forecasts, you can see these trends visually.

In the actual computation, the probability distribution is often not calculated algebraically; instead, a sufficiently large number of normally distributed values are generated and the distribution is approximated (because of the computing cost). This kind of approach is often used in real computer systems. (See gradient descent, Newton's method, Monte Carlo methods, etc.)

The following is an R example of this prediction (forecast) with a primitive AR(2).
The line is the conditional expected value, the inner range is the 85% interval forecast, and the outer range is the 95% interval forecast.

library("forecast")

# create data with AR(2)
data_count <- 1000
cons <- 70
y1 <- 10
phi1 <- 1.8
phi2 <- -0.9
e <- rnorm(data_count) * 10
y2 <- cons + (phi1 * y1) + e[2]
y <- data.frame(
  Indexes=c(1, 2),
  Values=c(y1, y2))
for (i in 3:data_count) {
  yi <- cons + (phi1 * y$Values[i-1]) + (phi2 * y$Values[i-2]) + e[i]
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}

# estimate the model
test <- auto.arima(y$Values, seasonal = F)
summary(test)


# plot with prediction
pred <- forecast(
  test,
  level=c(0.85,0.95),
  h=100)
plot(pred)

As you can see in the graph, the conditional expected value of the AR model approaches (converges to) the unconditional mean of the model itself, and the MSE converges to the variance.
Therefore it's meaningless to forecast too far into the future, because the forecast simply becomes the same value as the mean of the model.

Next we will discuss more advanced topics about time series analysis including non-stationarity.

Walkthrough of Azure Cosmos DB Graph (Gremlin)

In this post I show you the general tasks for your first use of Azure Cosmos DB Gremlin (graph), and dive a little into the practical usage of graph queries.

This post will also help beginners learn how to use a graph database itself, since Azure Cosmos DB is one of the databases compliant with the popular graph framework "Apache TinkerPop".

Create your database and collection

First you must prepare your database.

With Azure Cosmos DB, you must provision an account, a database, and a collection, just as with the other Azure Cosmos DB APIs.
You can create these objects with the API (REST or SDK), but here we use the Azure Portal UI.

When you create the Azure Cosmos DB account in the Azure Portal, you must select "Gremlin (graph)" as the supported API, as in the following picture.

After you've created your account, create your database and collection with Data Explorer as follows.
The "Database id" is the identifier for a database, and the "Graph id" is the identifier for a collection.

Calling API

Before calling the APIs, copy your account key from the Azure Portal.

Also copy the Gremlin URI from the Azure Portal.

Now you can develop your application with APIs.

As I mentioned before, Azure Cosmos DB is compliant with the TinkerPop open source framework, so you can use many existing TinkerPop-compliant tools. (See "Apache TinkerPop" for language drivers and other tools.)
For example, if you program in PHP, you can easily download and install gremlin-php with the following commands. (You can also use App Service Editor in Azure App Services; there is no need to prepare a local machine for a trial.)

curl http://getcomposer.org/installer | php
php composer.phar require brightzone/gremlin-php "2.*"

The following is a simple PHP program which retrieves all vertices (nodes) in the graph database. (I explain the Gremlin language later.)

Note that :

  • host is the host name of the previously copied gremlin uri.
  • username is /dbs/{your database name}/colls/{your collection name}.
  • password is the previously copied account key.
<?php
require_once('vendor/autoload.php');
use Brightzone\GremlinDriver\Connection;

$db = new Connection([
  'host' => 'graph01.graphs.azure.com',
  'port' => '443',
  'graph' => 'graph',
  'username' => '/dbs/db01/colls/test01',
  'password' => 'In12qhzXYz...',
  'ssl' => TRUE
]);
$db->open();
$res = $db->send('g.V()');
$db->close();

// output all vertices in the db
var_dump($res);
?>

The retrieved result is in the so-called GraphSON format, as follows. In this PHP example, the result is deserialized into a PHP array with the same structure.

{
  "id": "u001",
  "label": "person",
  "type": "vertex",
  "properties": {
    "firstName": [
      {
        "value": "John"
      }
    ],
    "age": [
      {
        "value": 45
      }
    ]
  }
}

You can also use the Azure Cosmos DB graph SDK (.NET, Java, Node.js), which is specific to Azure Cosmos DB.
In particular, if you need Azure Cosmos DB specific operations (e.g. creating or managing databases, collections, etc), it's better to use this SDK.

For example, the following is C# example code using the Azure Cosmos DB Graph SDK. (The Gremlin language is also used here; I explain the details later.)
Note that the endpoint differs from the previous Gremlin URI.

using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;
using Microsoft.Azure.Graphs;
using Newtonsoft.Json;

static void Main(string[] args)
{
  using (DocumentClient client = new DocumentClient(
    new Uri("https://graph01.documents.azure.com:443/"),
    "In12qhzXYz..."))
  {
    DocumentCollection graph = client.CreateDocumentCollectionIfNotExistsAsync(
      UriFactory.CreateDatabaseUri("db01"),
      new DocumentCollection { Id = "test01" },
      new RequestOptions { OfferThroughput = 1000 }).Result;
    // drop all vertices
    IDocumentQuery<dynamic> query1 =
      client.CreateGremlinQuery<dynamic>(graph, "g.V().drop()");
    dynamic result1 = query1.ExecuteNextAsync().Result;
    Console.WriteLine($"{JsonConvert.SerializeObject(result1)}");
    // add vertex
    IDocumentQuery<dynamic> query2 =
      client.CreateGremlinQuery<dynamic>(graph, "g.addV('person').property('id', 'u001').property('firstName', 'John')");
    dynamic result2 = query2.ExecuteNextAsync().Result;
    Console.WriteLine($"{JsonConvert.SerializeObject(result2)}");
  }

  Console.WriteLine("Done !");
  Console.ReadLine();
}

Before building your application with the Azure Cosmos DB SDK, you must install the Microsoft.Azure.Graphs package with NuGet. (Dependent libraries such as Microsoft.Azure.DocumentDB are also installed into your project.)

Interactive Console and Visualize

As I described above, the TinkerPop framework has various open source utilities contributed by the community.

For example, if you want to run Gremlin (queries, etc) in an interactive console, you can use the Gremlin Console.
Please see the official document “Azure Cosmos DB: Create, query, and traverse a graph in the Gremlin console” for details about Gremlin Console with Azure Cosmos DB.

There are also several libraries and tools for visualizing Gremlin-compatible graphs in the TinkerPop framework.

If you're using Visual Studio and Azure Cosmos DB, the following GitHub sample (written as an ASP.NET web project) is very easy to use for visualizing an Azure Cosmos DB graph.

[GitHub] Azure-Samples / azure-cosmos-db-dotnet-graphexplorer
https://github.com/Azure-Samples/azure-cosmos-db-dotnet-graphexplorer

Gremlin Language

As you've seen in the previous programming examples, it's very important to understand the Gremlin language (queries, etc) for practical use.
Let's dive into the Gremlin language, not deeply, but to a practical level of understanding.

First, we simply create vertices (nodes).
The following creates 2 vertices, "John" and "Mary".

g.addV('employee').property('id', 'u001').property('firstName', 'John').property('age', 44)
g.addV('employee').property('id', 'u002').property('firstName', 'Mary').property('age', 37)

The following creates an edge between the 2 vertices John and Mary. (This sample means that John is the manager of Mary.)
As you can see, you can identify the target vertex with the "id" property set above.

g.V('u002').addE('manager').to(g.V('u001'))

In this post, we use the following simple structure (vertices and edges) for our subsequent examples.

g.addV('employee').property('id', 'u001').property('firstName', 'John').property('age', 44)
g.addV('employee').property('id', 'u002').property('firstName', 'Mary').property('age', 37)
g.addV('employee').property('id', 'u003').property('firstName', 'Christie').property('age', 30)  
g.addV('employee').property('id', 'u004').property('firstName', 'Bob').property('age', 35)
g.addV('employee').property('id', 'u005').property('firstName', 'Susan').property('age', 31)
g.addV('employee').property('id', 'u006').property('firstName', 'Emily').property('age', 29)
g.V('u002').addE('manager').to(g.V('u001'))
g.V('u005').addE('manager').to(g.V('u001'))
g.V('u004').addE('manager').to(g.V('u002'))
g.V('u005').addE('friend').to(g.V('u006'))
g.V('u005').addE('friend').to(g.V('u003'))
g.V('u006').addE('friend').to(g.V('u003'))
g.V('u006').addE('manager').to(g.V('u004'))

The following is an example which retrieves vertices with query conditions: it retrieves the employees whose age is greater than 40. (To query edges, use g.E() instead of g.V().)

g.V().hasLabel('employee').has('age', gt(40))

As described above, the retrieved result is in the so-called GraphSON format, as follows.

{
  "id": "u001",
  "label": "employee",
  "type": "vertex",
  "properties": {
    "firstName": [
      {
        "id": "9a5c0e2a-1249-4e2c-ada2-c9a7f33e26d5",
        "value": "John"
      }
    ],
    "age": [
      {
        "id": "67d681b1-9a24-4090-bac5-be77337ec903",
        "value": 44
      }
    ]
  }
}

You can also use logical operations (and(), or()) in a graph query.
For example, the following returns only "Mary".

g.V().hasLabel('employee').and(has('age', gt(35)), has('age', lt(40)))

Next we handle traversals (walking along edges).
The following is a simple traversal example which retrieves Mary's manager. (The result will be "John".)

g.V('u002').out('manager').hasLabel('employee')

Note that the following returns the same result. The outE() step returns the edge element, and inV() then moves to the incoming vertex. (This explicitly traverses the elements vertex -> edge -> vertex.)

g.V('u002').outE('manager').inV().hasLabel('employee')

The following retrieves Mary's manager (i.e. "John") and then all employees who report directly to him.
The result will be "Mary" and "Susan".

g.V('u002').out('manager').hasLabel('employee').in('manager').hasLabel('employee')

If you want to exclude paths with repeated elements, you can use simplePath() as follows. This returns only "Susan", because "Mary" would be a repeated vertex.

g.V('u002').out('manager').hasLabel('employee').in('manager').hasLabel('employee').simplePath()

Now let's consider traversal of the "friend" relation. (See the picture illustrated above.)
As you know, "manager" is a directed relation, but "friend" is effectively undirected: if A is a friend of B, then B is also a friend of A.
In such a case, you can use the both() (or bothE()) step as follows. The following retrieves Emily's friends, and the result is both "Susan" and "Christie".

g.V('u006').both('friend').hasLabel('employee')

If you want to keep traversing until some condition matches, you can use repeat().until().
The following retrieves the reporting path (the chain of direct reports) from "John" to "Emily".

g.V('u001').repeat(in('manager')).until(has('id', 'u006')).path()

The result is “John” – “Mary” – “Bob” – “Emily” as the following GraphSON.

{
  "labels": [
    ...
  ],
  "objects": [
    {
      "id": "u001",
      "label": "employee",
      "type": "vertex",
      "properties": {
        "firstName": [
          {
            "id": "9a5c0e2a-1249-4e2c-ada2-c9a7f33e26d5",
            "value": "John"
          }
        ],
        "age": [
          {
            "id": "67d681b1-9a24-4090-bac5-be77337ec903",
            "value": 44
          }
        ]
      }
    },
    {
      "id": "u002",
      "label": "employee",
      "type": "vertex",
      "properties": {
        "firstName": [
          {
            "id": "8d3b7a38-5b8e-4614-b2c4-a28306d3a534",
            "value": "Mary"
          }
        ],
        "age": [
          {
            "id": "2b0804e5-58cc-4061-a03d-5a296e7405d9",
            "value": 37
          }
        ]
      }
    },
    {
      "id": "u004",
      "label": "employee",
      "type": "vertex",
      "properties": {
        "firstName": [
          {
            "id": "3b804f2e-0428-402c-aad1-795f692f740b",
            "value": "Bob"
          }
        ],
        "age": [
          {
            "id": "040a1234-8646-4412-9488-47a5af75a7d7",
            "value": 35
          }
        ]
      }
    },
    {
      "id": "u006",
      "label": "employee",
      "type": "vertex",
      "properties": {
        "firstName": [
          {
            "id": "dfb2b624-e145-4a78-b357-5e147c1de7f6",
            "value": "Emily"
          }
        ],
        "age": [
          {
            "id": "f756c2e9-a16d-4959-b9a3-633cf08bcfd7",
            "value": 29
          }
        ]
      }
    }
  ]
}

Finally, let's consider the shortest path from "Emily" to "John". We assume that you can traverse either "manager" (directed) or "friend" (undirected) edges.

The following returns the possible paths from "Emily" to "John" connected by either "manager" or "friend".

g.V('u006').repeat(union(both('friend').simplePath(), out('manager').simplePath())).until(has('id', 'u001')).path()

The result is 3 paths:
Emily – Susan – John
Emily – Christie – Susan – John
Emily – Bob – Mary – John

When you want to count the number of elements in each path (the current local traverser), use the count(local) step.

g.V('u006').repeat(union(both('friend').simplePath(), out('manager').simplePath())).until(has('id', 'u001')).path().count(local)

This result is :
3
4
4

Then the following groups the paths by their length (element count), returning the counts together with the paths as follows.

g.V('u006').repeat(union(both('friend').simplePath(), out('manager').simplePath())).until(has('id', 'u001')).path().group().by(count(local))

{
  "3": [
    {
      "labels": [...],
      "objects": [
        {
          "id": "u006",
          ...
        },
        {
          "id": "u005",
          ...
        },
        {
          "id": "u001",
          ...
        }
      ]
    }
  ],
  "4": [
    {
      "labels": [...],
      "objects": [
        {
          "id": "u006",
          ...
        },
        {
          "id": "u003",
          ...
        },
        {
          "id": "u005",
          ...
        },
        {
          "id": "u001",
          ...
        }
      ]
    },
    {
      "labels": [...],
      "objects": [
        {
          "id": "u006",
          ...
        },
        {
          "id": "u004",
          ...
        },
        {
          "id": "u002",
          ...
        },
        {
          "id": "u001",
          ...
        }
      ]
    }
  ]
}

 

[Reference]

TinkerPop Documentation (including language reference)
http://tinkerpop.apache.org/docs/current/reference/

 

Analyze your data in Azure Data Lake with R (R extension)

Azure Data Lake (ADL), which offers virtually unlimited data storage, is a reasonable and cost-effective choice for simple batch-based analysis.
Remember that the data matters more than the program! When analyzing data in your Azure Data Lake Store, you don't need to move or download the data to another host: you can run Python or R code directly on Azure Data Lake Analytics in the cloud.

Here I show you how to use the R extensions with some brief examples based on realistic scenarios.

Note : In this post we consider a simple batch-based scenario. If you need more advanced scenarios with the data in ADL Store, please use ADL Store with Hadoop (HDInsight) together with R Server, Spark, Storm, etc.
See "Benefits of Microsoft R and R Server" in my previous post for more details.

Note : U-SQL development with Python and R is also supported in Visual Studio Code. See “ADL Tools for Visual Studio Code (VSCode) supports Python & R Programming” in team blog. (Added on Nov 2017)

Setting-up

Before starting, you must prepare your Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA) accounts in the Azure Portal. (Please see the Azure documentation.)

Next, on your Azure Data Lake Analytics blade, click [Sample Scripts] and select [Install U-SQL Extensions]. (See the following screenshot.)
This starts the installation of the extensions in your Data Lake Analytics (ADLA) account.

Let's see what this installation actually does.

After the installation is completed, click [Success] and then the [Duplicate Script] button. (The installation is executed as a Data Lake job.)

As you know, Data Lake Analytics is a .NET-based platform and you can extend it with your own custom .NET classes.

The R extension works the same way. As you can see in this script job (see below), the R extension classes (ExtR.dll) are installed into your Data Lake Analytics account. (The Python extensions and the Cognitive Services extensions are also installed.)
As I show you later, you can use these installed classes in your U-SQL scripts.

Note : You can see these installed DLLs in the /usqlext/assembly folder in your ADLS (Data Lake Store).

Let’s get started !

Now it’s ready.

You can find many examples in /usqlext/samples/R on ADLS. (These are the famous iris classification examples.) You can run these U-SQL files (.usql files) with the Azure Portal, Visual Studio, or Visual Studio Code (for example on a Mac), and see the results and how they work. (Here we use Visual Studio.)

For instance, the following retrieves the data in iris.csv and fits a linear regression for the prediction target "Species". (Sorry, this sample is not very useful on its own because it just returns the base64-encoded trained model. I show you a more complete example later.)

The R extension (ExtR.dll) includes a custom reducer (.NET class) named Extension.R.Reducer, and you can use this extension class with the U-SQL REDUCE expression as follows.

REFERENCE ASSEMBLY [ExtR]; // Load library

DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/my/R/Output/test.txt";
DECLARE @myRScript = @"
inputFromUSQL$Species <- as.factor(inputFromUSQL$Species)
lm.fit <- lm(unclass(Species)~.-Par, data=inputFromUSQL)
library(base64enc)
outputToUSQL <-
  data.frame(
    Model=base64encode(serialize(lm.fit, NULL)),
    stringsAsFactors = FALSE)
";

@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @IrisData
  USING Extractors.Csv();

@ExtendedData =
  SELECT 0 AS Par,
       *
  FROM @InputData;

@RScriptOutput = REDUCE @ExtendedData ON Par
  PRODUCE Par int, Model string
  READONLY Par
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @OutputFilePredictions
  USING Outputters.Tsv();

As you can see in this sample code, you can use inputFromUSQL to read the input rows in your R script, and you can assign the result to outputToUSQL to return it to U-SQL. That is, your R script communicates with the U-SQL script through these pre-defined variables.
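
As a side note (this is my own sketch, not part of the official samples), you can debug such an embedded R script locally by emulating these pre-defined variables yourself, assuming you have a local copy of iris.csv with the same columns:

# emulate the variables that the U-SQL R extension provides
inputFromUSQL <- read.csv(
  "iris.csv",
  header = FALSE,
  col.names = c("SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Species"))
inputFromUSQL$Par <- 0  # the pseudo partition column added by the U-SQL SELECT

# run the same script body as in the U-SQL example above
inputFromUSQL$Species <- as.factor(inputFromUSQL$Species)
lm.fit <- lm(unclass(Species) ~ . - Par, data = inputFromUSQL)
library(base64enc)
outputToUSQL <- data.frame(
  Model = base64encode(serialize(lm.fit, NULL)),
  stringsAsFactors = FALSE)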

Instead of using outputToUSQL, you can simply make the data frame the last expression (the R output) in the script. For instance, you can rewrite the above example as follows. (The changed part is the commented-out outputToUSQL assignment, replaced by a plain data.frame expression.)

REFERENCE ASSEMBLY [ExtR]; // Load library

DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/my/R/Output/test.txt";
DECLARE @myRScript = @"
inputFromUSQL$Species <- as.factor(inputFromUSQL$Species)
lm.fit <- lm(unclass(Species)~.-Par, data=inputFromUSQL)
library(base64enc)
#outputToUSQL <-
#  data.frame(
#    Model=base64encode(serialize(lm.fit, NULL)),
#    stringsAsFactors = FALSE)
data.frame(
  Model=base64encode(serialize(lm.fit, NULL)),
  stringsAsFactors = FALSE)
";

@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @IrisData
  USING Extractors.Csv();

@ExtendedData =
  SELECT 0 AS Par,
       *
  FROM @InputData;

@RScriptOutput = REDUCE @ExtendedData ON Par
  PRODUCE Par int, Model string
  READONLY Par
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @OutputFilePredictions
  USING Outputters.Tsv();

We used an inline R script in the above examples, but you can also keep the R script in a separate file, deployed as a resource and referenced by the scriptFile parameter, as follows.

REFERENCE ASSEMBLY [ExtR]; // Load library

DEPLOY RESOURCE @"/usqlext/samples/R/testscript01.R";

DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/my/R/Output/test.txt";

@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @IrisData
  USING Extractors.Csv();

@ExtendedData =
  SELECT 0 AS Par,
       *
  FROM @InputData;

@RScriptOutput = REDUCE @ExtendedData ON Par
  PRODUCE Par int, Model string
  READONLY Par
  USING new Extension.R.Reducer(
    scriptFile:"testscript01.R",
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @OutputFilePredictions
  USING Outputters.Tsv();

Partitioning

By using the REDUCE expression, you can split your analysis workload into partitions. Each partition can be executed in parallel, so you can efficiently process a massive amount of data with this partitioning capability.

To keep things simple, let's consider the following sample data. Here we use the first column as the partition key.

test01.csv

1,1
1,2
1,3
1,4
2,5
2,6
2,7
2,8
3,9
3,10
3,11
3,12

The following is a brief example which calculates the min, max, and mean for each partition.

REFERENCE ASSEMBLY [ExtR];

DECLARE @SrcFile string = @"/sampledat/test01.csv";
DECLARE @DstFile string = @"/sampledat/output01.txt";
DECLARE @myRScript = @"
outputToUSQL <- data.frame(
  CalcType = c(""min"", ""max"", ""mean""),
  CalcValue = c(
    min(inputFromUSQL$Value),
    max(inputFromUSQL$Value),
    mean(inputFromUSQL$Value)
  )
)
";

@ExtendedData =
  EXTRACT PartitionId int,
      Value int
  FROM @SrcFile
  USING Extractors.Csv();

@RScriptOutput = REDUCE @ExtendedData ON PartitionId
  PRODUCE PartitionId int, CalcType string, CalcValue double
  READONLY PartitionId
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @DstFile
  USING Outputters.Tsv();

The following screenshot shows the result of this U-SQL job.
Each partition is executed independently in parallel, and all results are collected by the REDUCE operation. (A local R sketch of the per-partition calculation follows below.)
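
For reference, the following plain R sketch (my own addition, not part of the U-SQL job) reproduces locally what the R script computes for each partition:

# recreate the contents of test01.csv locally
test01 <- data.frame(
  PartitionId = rep(1:3, each = 4),
  Value = 1:12)

# apply the same min / max / mean calculation per partition
by_part <- lapply(split(test01, test01$PartitionId), function(d) {
  data.frame(
    PartitionId = d$PartitionId[1],
    CalcType = c("min", "max", "mean"),
    CalcValue = c(min(d$Value), max(d$Value), mean(d$Value)))
})
do.call(rbind, by_part)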

Note that you have to specify ON {partition key(s)} or ALL when you use the REDUCE clause; you cannot omit it.
So if you don't need partitioning, specify a pseudo partition (the same partition value for all rows) as in the following script.

REFERENCE ASSEMBLY [ExtR];

DECLARE @SrcFile string = @"/sampledat/test01.csv";
DECLARE @DstFile string = @"/sampledat/output01.txt";
DECLARE @myRScript = @"
outputToUSQL <- data.frame(
  CalcType = c(""min"", ""max"", ""mean""),
  CalcValue = c(
    min(inputFromUSQL$Value),
    max(inputFromUSQL$Value),
    mean(inputFromUSQL$Value)
  )
)
";

@ExtendedData =
  EXTRACT SomeId int,
      Value int
  FROM @SrcFile
  USING Extractors.Csv();

@ExtendedData2 =
  SELECT 0 AS Par, // pseudo partition
       *
  FROM @ExtendedData;

@RScriptOutput = REDUCE @ExtendedData2 ON Par
  PRODUCE Par int, CalcType string, CalcValue double
  READONLY Par
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @DstFile
  USING Outputters.Tsv();

Installing packages

The R extension ships with a set of default packages, but you can install extra packages if needed. (See here for the default packages of the R extension; these also include the RevoScaleR package.)

First, download the package file (.zip, .tar.gz, etc) using your local R console. Here we download the well-known SVM package "e1071". (We assume the file name is e1071_1.6-8.tar.gz.)

download.packages("e1071", destdir="C:\\tmp")

Next, upload this package file to a folder in your ADLS (Data Lake Store).
After that, you can deploy this package file in your U-SQL script and install the package in your R script as follows.

REFERENCE ASSEMBLY [ExtR];

DEPLOY RESOURCE @"/sampledat/e1071_1.6-8.tar.gz";

DECLARE @SrcFile string = @"/sampledat/iris.csv";
DECLARE @DstFile string = @"/sampledat/output03.txt";
DECLARE @myRScript = @"
install.packages('e1071_1.6-8.tar.gz', repos = NULL) # installing package
library(e1071) # loading package
# something to analyze !
# (Later we'll create the code here ...)
data.frame(Res = c(""result1"", ""result2""))
";

@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @SrcFile
  USING Extractors.Csv();

@ExtendedData =
  SELECT 0 AS Par,
       *
  FROM @InputData;

@RScriptOutput = REDUCE @ExtendedData ON Par
  PRODUCE Par int, Res string
  READONLY Par
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @DstFile
  USING Outputters.Tsv();

Loading R data

In a real scenario, you might use a pre-trained model for predictions. In such a case, you can create the trained model (R objects) beforehand and load these R objects in your R script in U-SQL.

First, create the trained model using the following script in your local environment. The file "model.rda" will be saved in your local file system.
(Here we're saving it with a script, but you can also use the RStudio IDE.)

library(e1071)
inputCSV <- read.csv(
  file = "C:\\tmp\\iris_train.csv",
  col.names = c(
    "SepalLength",
    "SepalWidth",
    "PetalLength",
    "PetalWidth",
    "Species")
)
mymodel <- svm(
  Species~.,
  data=inputCSV,
  probability = T)
save(mymodel, file = "C:\\tmp\\model.rda")

Note that we assume our training data (iris data) is as follows. (It's the same as the U-SQL extension sample files.) :

iris_train.csv

5.1,3.5,1.4,0.2,setosa
7,3.2,4.7,1.4,versicolor
6.3,3.3,6,2.5,virginica
4.9,3,1.4,0.2,setosa
...

Then upload this generated model (the model.rda file) to a folder in your ADLS (Data Lake Store).

Now everything is ready, so let's jump into the U-SQL.

See the following R script in U-SQL.
This R script loads the previously trained model (model.rda), so you can use the pre-trained R object "mymodel" in your R script.
All you have to do is predict your input data with this model object.

REFERENCE ASSEMBLY [ExtR];

DEPLOY RESOURCE @"/sampledat/e1071_1.6-8.tar.gz";
DEPLOY RESOURCE @"/sampledat/model.rda";

DECLARE @SrcFile string = @"/sampledat/iris.csv";
DECLARE @DstFile string = @"/sampledat/output03.txt";
DECLARE @myRScript = @"
install.packages('e1071_1.6-8.tar.gz', repos = NULL)
library(e1071)
load(""model.rda"")
pred <- predict(
  mymodel,
  inputFromUSQL,
  probability = T)
prob <- attr(pred, ""probabilities"")
result <- data.frame(prob, stringsAsFactors = FALSE)
result$answer <- inputFromUSQL$Species
outputToUSQL <- result
";

@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @SrcFile
  USING Extractors.Csv();

@ExtendedData =
  SELECT 0 AS Par,
       *
  FROM @InputData;

@RScriptOutput = REDUCE @ExtendedData ON Par
  PRODUCE Par int, setosa double, versicolor double, virginica double, answer string
  READONLY Par
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @DstFile
  USING Outputters.Tsv();

[Reference] Tutorial: Get started with extending U-SQL with R
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-r-extensions

 

Easy image classification and image search with pre-trained model on R

In my previous post, I described featurizing (vectorizing) text with a sentiment analysis example.
In this post, I describe featurizing (vectorizing) images, and explain how you can apply it in a real application.

As you will see here, you can get insights from images without doing any labeling or training yourself.

What is “featurizeImage” transformation

The MicrosoftML package in the latest Microsoft R Client and Server (version 9.1 and later) includes new functionality based on pre-trained models. The new transform named "featurizeImage" is one of these.

This "featurizeImage" transform vectorizes an image with a model trained on the famous ImageNet dataset. As you know, ImageNet is a huge image dataset with labels for many features, such as object names with bounding boxes, species, and other annotations. When using "featurizeImage", you can select a residual network (ResNet) or a convolutional network (AlexNet) as the pre-trained model. (In effect, you cannot analyze with "featurizeImage" using labels or attributes which are not in the ImageNet dataset, for example facial recognition with unknown faces.)

You can download nice sample code for this "featurizeImage" transformation from GitHub (see here), and here I show you how to use it, following this sample code. (But I changed several lines of code for my example.)

Our Sample

Now here we use the following pictures which I’ve taken during my vacations before.
Pictures #1 and #5 are landmarks of ancient Greece. #4 and #7 are houses on the sunny island of Santorini. #2, #3, and #6 are animals in a zoological park (in Tokyo) near my home town.

Vectorizing (Featurizing) Images

First, let's just look at the vectorized (featurized) results of these images. (With the new MicrosoftML package, you can simply run the transform and see the result.)
Please see the following code.

library("MicrosoftML")

orgdat <- data.frame(
  Image = c(
    "C:\tmp\001.JPG",
    "C:\tmp\002.JPG",
    "C:\tmp\003.JPG",
    "C:\tmp\004.JPG",
    "C:\tmp\005.JPG",
    "C:\tmp\006.JPG",
    "C:\tmp\007.JPG"),
  stringsAsFactors = FALSE)

vecdat <- rxFeaturize(
  data = orgdat,
  mlTransforms = list(
    loadImage(vars = list(Features = "Image")),
    resizeImage(vars = "Features", width = 224, height = 224),
    extractPixels(vars = "Features"),
    featurizeImage(var = "Features", dnnModel = "resnet101")   
  ))

As you can see, transformation steps are the followings :

  1. loadImage – loading images from each path name (in “Image” column)
  2. resizeImage – resizing all images to 224 x 224. (I explain this size requirement below.)
  3. extractPixels – extracting pixels (vector of number) from the loaded binary images.
  4. featurizeImage – vectorizing with pre-trained model. (CNTK wrapper seems to be used in the model evaluation.)

The result ("vecdat") has 2048 feature columns (vectors), and these values are numeric, as follows.

In this example we're using the ResNet-101 network as the pre-trained model. In this case, the input image size must be 224 x 224.
You can see the table of input image sizes for each available network in the blog post "Microsoft R blog : Image featurization with a pre-trained deep neural network model".

  • ResNet-18 : 224 x 224
  • ResNet-50 : 224 x 224
  • ResNet-101 : 224 x 224
  • AlexNet : 227 x 227

Image Classification with an unsupervised classifier (Clustering)

You can use these featurized image vectors for various kinds of learning. One good example is image classification. (Here we use a clustering approach.)

Here we classify the images using k-means clustering, without any labeling or training.
That is, we take an unsupervised classification approach. All you need are the vectorized features generated above!

Note : The GitHub example (ImageFeaturizer_TrainAndClassifyImage.R) takes the approach of labeling images ("Fish", "Helicopter", etc), training on them, and finally building a classification model from the vectorized features. Here we take a different approach (a clustering algorithm).

Let's see the following example.
This code classifies the images into 3 classes depending on the distances between the image vectors (features).

library("MicrosoftML")

orgdat <- data.frame(
  Image = c(
    "C:\tmp\001.JPG",
    "C:\tmp\002.JPG",
    "C:\tmp\003.JPG",
    "C:\tmp\004.JPG",
    "C:\tmp\005.JPG",
    "C:\tmp\006.JPG",
    "C:\tmp\007.JPG"),
  stringsAsFactors = FALSE)

vecdat <- rxFeaturize(
  data = orgdat,
  mlTransforms = list(
    loadImage(vars = list(Features = "Image")),
    resizeImage(vars = "Features", width = 224, height = 224),
    extractPixels(vars = "Features"),
    featurizeImage(var = "Features", dnnModel = "resnet101")   
  ))

result <- kmeans(vecdat[, -1], 3, nstart = 20)

The result (the variable "result") is as follows.
As you can see, the images are classified fairly well, as we would expect. (class #1 = ancient Greek landmarks, class #2 = the houses of Santorini, class #3 = animals)
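
To see which photo fell into which cluster, you can simply pair the file names with the cluster labels (a small sketch, assuming the orgdat and result objects above; note that the cluster numbers assigned by k-means are arbitrary):

# pair each image path with its assigned cluster
data.frame(Image = orgdat$Image, Cluster = result$cluster)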

Now you can classify your albums (your real photos) without any labeling or training yourself! The pre-trained model frees you from all that difficult work.

Image Matching (Image Search)

The GitHub sample code (ImageFeaturizer_FindMatchingImage.R) also has an example of image matching (search). This sample also doesn't need any labeling or training.
Let's follow this sample code.

Before running it, we prepare another Parthenon photo as the search target, as follows.

First, we use the dist() function to calculate the pairwise Euclidean distances between all the images.

library("MicrosoftML")

orgdat <- data.frame(
  Image = c(
    "C:\tmp\001.JPG",
    "C:\tmp\002.JPG",
    "C:\tmp\003.JPG",
    "C:\tmp\004.JPG",
    "C:\tmp\005.JPG",
    "C:\tmp\006.JPG",
    "C:\tmp\007.JPG"),
  stringsAsFactors = FALSE)

vecdat <- rxFeaturize(
  data = orgdat,
  mlTransforms = list(
    loadImage(vars = list(Features = "Image")),
    resizeImage(vars = "Features", width = 224, height = 224),
    extractPixels(vars = "Features"),
    featurizeImage(var = "Features", dnnModel = "resnet101")   
  ))

fnddat <- data.frame(
  Image = c("C:\Users\tsmatsuz\Desktop\searching.JPG"),
  stringsAsFactors = FALSE)

vec2dat <- rxFeaturize(
  data = fnddat,
  mlTransforms = list(
    loadImage(vars = list(Features = "Image")),
    resizeImage(vars = "Features", width = 224, height = 224),
    extractPixels(vars = "Features"),
    featurizeImage(var = "Features", dnnModel = "resnet101")   
  ))

distVals <- dist(
  rbind(vecdat, vec2dat)[,-1],
  "euclidean")

The result (distVals) looks like the following. It includes the pairwise distances among all 8 images (7 original images + 1 search target):

Here we only want the comparison with the search target ("searching.JPG"); that is, we only need the 8th row of the distance matrix.

i <- attr(distVals, "Size") # value of "Size" must be 8
eucDist <- as.matrix(distVals)[i, -i]

The retrieved result ("eucDist") looks like the following:

As you can see, photo #1 (the following picture) has the minimum distance to the search target ("searching.JPG"), because these 2 photos share several features.
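
To pick out the best match programmatically, you can take the image with the smallest distance (a small sketch, assuming the orgdat and eucDist objects above):

# the original image closest to the search target
best <- which.min(eucDist)
orgdat$Image[best]
eucDist[best]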

 

Another approach for classification or matching is the one-class support vector machine (OC-SVM). For example, if you want to identify whether an image shows an animal or not, you first collect many animal images, then vectorize them, and then you might get an appropriate model with a one-class SVM. (But you must carefully select the kernel function and parameters in this case; a rough sketch follows.)
As you have seen here, you can apply featurized image data to a variety of scenarios.
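
The following is only a rough sketch of that idea with the e1071 package. Here animaldat is assumed to be a data frame of featurized animal images produced by rxFeaturize() as above, newdat a data frame of featurized new images, and the kernel and nu / gamma values are placeholders you would need to tune.

library(e1071)

# fit a one-class SVM on the featurized animal images (dropping the Image path column)
ocsvm <- svm(
  x = animaldat[, -1],
  type = "one-classification",
  kernel = "radial",
  nu = 0.1,
  gamma = 1 / ncol(animaldat[, -1]))

# TRUE = looks like an animal, FALSE = does not
predict(ocsvm, newdat[, -1])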