Now I’m working with several ISVs for building their solutions with some machine learning algorithms (using R, Python, Azure Machine Learning, etc).

More advanced and realistic regression (like GLMM, Bayesian and MCMC simulation, etc) for traditional statistical approach will be often needed in the practical modeling, but it’s very important to understand the basic modeling ideas of GLM (generalized linear models) for your first start, because advanced regression techniques are also based on these basic ones.

In this post I summarize the essence of these basic modeling ideas with simple R scripts. Please build your intuitions for your first start.

Note : Here I don’t discuss about **bayesian** approach and it’s based on only frequentist approach. I don’t also discuss the **mixed model** (such as quasi-binomial, quasi-poisson, negative binomial, etc), or **non-parametric** approaches (using kernel functions, etc) in this post.

Here we discuss only basic ones by parametric and frequentist approaches for your first start.

# Common Idea for Regression (GLM)

All GLM family (Gaussian, Poisson, etc) is based on the following common idea.

- We choose the appropriate mathematical model (Gamma, Gaussian, etc) depending on what you need.

For example, when you want to predict the human height (response variable) for some school children with weight and age (explanatory variables), it might be better to choose Gaussian, because the human height will be on normal distribution (Gaussian distribution). - All model family is having the input parameters. Next we represent this input parameters with explanatory variables.

For example, the Gaussian distribution (normal distribution) is having the parameter (the mean) and (the standard deviation). If we suppose that is some constant value for any observation, is the only parameter depending on weight and age (explanatory variables). We suppose where is weight and is age. (The expression will differ from each cases. Later I’ll explain about the link function, and please proceed to read.) - Finally we estimate and determine one , , and for a lot of given data (weight, age, and human height).

For example, we can get the approximated probability of the human height (because is determined by , , , , and ), and finally you can estimate the appropriate , , and which maximizes the probability of actual height (observed response) with mathematical technique.

Now let’s see the each regression along with this common idea.

# Logistic Regression (family = binomial)

Binomial regression with logit link function is called “Logistic Regression”. Here I show you the brief outline.

Suppose it’s having the probability for each success occurrence. Then it’s known that times success occurrence among independent experiments (i.e, times failure) is having the following probability. (This probability space is called binomial distribution.)

where

is the function by and . But when and are given values, this is representing the function with unknown parameter .

where and are given values.

When a lot of sets of are given, we can define the probability with multiplication as follows. Because the multiplication of probability means the logical product.

Let’s consider the following case : Here we have sample small fishes in the fish tank, and we survey the survival rate depending on the amount of food and the amount of giving oxygen as follows.

```
Alive(y1) Dead(y2) Food(x1) Oxygen(x2)
39 11 300 2000
30 20 150 1200
12 38 100 1000
...
```

In previous function we’re supposing the parameter is always same. But, in this case, the parameter (which means the probability of survive) depends on food and oxygen: (This parameter is the linear predictor in the previous illustration.)

Then the model is written as follows. As you can see, this is the function with parameters for given .

That is, we should find the parameter which maximize .

In logistic regression, the following function is often used as instead of . This is called the logistic link function (strictly speaking, the inverse of the following function is called the link function), and **I describe about the link function later. (Please proceed to read here…)**

As a result, the estimation function of the logistic regression is written as follows :

Let’s see the following trivial example with R.

As you can see, we simulate with the value , and . The estimated results are -14.95783, 3.98255 and 1.99622. (See the following output.)

```
library(e1071) # for sigmoid func
#####
# create sample data with rbinom
#####
# y1 : number of alive
# y2 : number of dead
# x1 : food
# x2 : oxygen
#####
n_sample <- 1000
total <- 50
b0 <- -15
b1 <- 4
b2 <- 2
x1 <- runif(
n = n_sample,
min = 0,
max = 5)
x2 <- runif(
n = n_sample,
min = 0,
max = 5)
probs <- sigmoid(b0 + b1 * x1 + b2 * x2)
y1 <- rbinom(
n = n_sample,
size = total,
prob = probs)
y2 <- total - y1
sampledat <- data.frame(
x1 = x1,
x2 = x2,
y1 = y1,
y2 = y2
)
#####
# Estimate
#####
fit <- glm(
cbind(y1, y2) ~ x1 + x2,
data = sampledat,
family = binomial(link = "logit"))
summary(fit)
```

Note : The default link function of binomial is “logit”. Then you do not need to specify `link = "logit"`

explicitly. (Here I’m specifying explicitly for your understanding.)

Let’s see another example.

The following dataset is similar to the previous one (sampledat), but each row is having z column which identifies “Dead” (0) or “Alive” (1) for each small fishes, instead of and .

sampledatA

```
Survival(z) Food(x1) Oxygen(x2)
1 300 2000
0 100 1000
1 300 2000
0 100 1000
0 300 2000
1 300 2000
1 100 1000
...
```

This dataset is written by the following format, and it’s the same as previous one. Then you can analyze using the previous code.

sampledatB

```
Alive(y1) Dead(y2) Food(x1) Oxygen(x2)
3 1 300 2000
1 2 100 1000
...
```

Using R, you don’t need to convert the dataset and you can just write as follows using z (0 or 1) column.

```
## with "Survival" column z (0 or 1)
fit2 <- glm(
z ~ x1 + x2,
data = sampledatA,
family = binomial(link = "logit"))
summary(fit2)
```

If you’re using Microsoft R runtime, you can use `rxLogit()`

in RevoScaleR or rxFastLinear in MicrosoftML for scaling and high-performance execution as follows.

Here I don’t describe about details, but with `rxLogit()`

you can also specify the middle value in (0.7, 0,25, etc) for the response variable, and moreover you can use Frisch–Waugh–Lovell theorem (see Wikipedia) for categorical data.

```
fit3 <- rxLogit(
z ~ x1 + x2,
data = sampledat2)
summary(fit3)
```

# Poisson Regression (family = poisson)

Suppose some event occurs times in some time interval. Then the probability () of times occurrence of the same event in the same interval is known as follows. (This is called Poisson distribution.)

Let’s consider the following example. : Here we’re planting some seeds in your garden, and we survey how many seeds germinate depending on the amount of water and fertilizer.

In this model, we consider that (the average of germinating seeds) depends on the water () and fertilizer (). That is, the parameter is the linear predictor.

In Poisson Regression, the following function is often used as instead. The inverse of this function (i.e, ) is called the link function, and I describe the details about the link function later. (Please proceed to read here.)

Suppose a lot of data of , and are given as follows. (i.e, )

```
Germinating(k) Water(x1) Fertilizer(x2)
11 300 2000
8 150 1200
5 100 1000
...
```

Then we can define the total probability by multiplication as follows. (See above for the reason of multiplication.)

Here we should find the parameters which maximize the following possibility by the maximum likelihood estimation.

Note : In most cases, (i.e, log-likelihood) is used for the estimation, instead of using .

Let’s see the following trivial example with R.

As you can see, we simulate with the value , and , and the estimated results are 0.215383, 0.498815 and 0.402523.

```
#####
# create sample data with rpois
#####
# y : germinating
# x1 : water
# x2 : fertilizer
#####
n_sample <- 1000
b0 <- 0.2
b1 <- 0.5
b2 <- 0.4
x1 <- runif(
n = n_sample,
min = 0,
max = 5)
x2 <- runif(
n = n_sample,
min = 0,
max = 5)
lamdas <- exp(b0 + b1 * x1 + b2 * x2)
y <- rpois(
n = n_sample,
lambda = lamdas)
sampledat <- data.frame(
x1 = x1,
x2 = x2,
y = y
)
#####
# Estimate
#####
fit <- glm(
y ~ x1 + x2,
data = sampledat,
family = poisson(link = "log"))
summary(fit)
```

Note : The default link function of poisson is “log”. Then you do not need to specify `link = "log"`

explicitly.

If you’re using Microsoft R runtime, you can use rxGlm in RevoScaleR for scaled execution.

```
fit2 <- rxGlm(
y ~ x1 + x2,
data = sampledat,
family = poisson(link = "log"))
summary(fit2)
```

# Linear Regression (family = gaussian)

This regression is straight forward. This regression is represented by the following linear expression.

In the previous binomial and poisson regression, maximum likelihood estimation (MLE) is used for the parameter estimation.

But here, we can use the ordinary least squares (OLS) as follows. (The following is assuming that the only is used for explanatory variables.) Here we estimate the difference between the actual value and predicted value. The difference might be positive and negative. Then we calculate the square for each differences.

We should find the parameters which minimize the following E.

Why is it called “Gaussian” ?

Let’s consider the normal distribution (Gaussian distribution) . Suppose the standard deviation (the square root of variance) is some constant value for any observation, then this probability only depends on the parameter (the average).

Let’s consider the maximum likelihood estimation (MLE) for this Gaussian distribution. Suppose , and is having some small rage with . ( is fixed value.) If is very small, the product of likelihood will be defined as follows. (See above for the reason of multiplication.) :

Now we apply the logarithm for this likelihood (log-likelihood) as follows :

We can get the maximum value with this log-likelihood when we minimize . (Because and are constants.)

That is, the maximum likelihood estimation (MLE) for Gaussian distribution will result in the ordinary least squares (OLS) for the linear regression.

For example, it is known that the human height is on the with some fixed and . Suppose the average height relies on human weight () and we suppose it’s having the linear relation.

In this case, you can get the regression result by OLS approach with .

Let’s see the following trivial example with R.

As you can see, we simulate with the value and , and the estimated results are 44.085248 and 0.139133.

```
#####
# create sample data
#####
# y : height (inch)
# x1 : weight (pond)
#####
n_sample <- 1000
b0 <- 44
b1 <- 0.14
x1 <- runif(
n = n_sample,
min = 45,
max = 180)
epsilon <- rnorm(
n = n_sample,
mean = 0,
sd = 1.57) # noise (variance)
y <- b0 + b1 * x1 + epsilon
sampledat <- data.frame(
x1 = x1,
y = y
)
#####
# Estimate
#####
fit <- glm(
y ~ x1,
data = sampledat,
family = gaussian(link = "identity"))
summary(fit)
```

Note : `link="identity"`

means the link function with . That is, the specific link function is not used. (The default link function of Gaussian is “identity”. Then you do not need to specify `link = "identity"`

explicitly.)

If you’re using Microsoft R runtime, you can use rxLinMod in RevoScaleR or rxFastLinear in MicrosoftML for scaled and high-performance execution.

```
fit2 <- rxLinMod(
y ~ x1,
data = sampledat)
summary(fit2)
```

# Gamma Regression (family = Gamma)

Suppose some event occurs times in unit (1) interval. Then the probability density function () for interval () with times occurrence of the same event is known as follows :

where is Gamma function. ( for any positive integer, and it is an extension for real and complex number. See “Wikipedia : Gamma function“.)

Note : You must keep in mind that it’s not probability itself, but it’s probability density function. When you estimate with MLE, you must use integral or get the probability with some fixed small range as I described above (see Gaussian Regression).

This regression is often used for the analytics for waiting time, time between failure, accident rate or something like that.

There exist two parameters (called “shape” and “rate” respectively), but we suppose is the same (constant) for any observation in this regression.

For example, let’s consider the time to the failure (breakdown) for some equipment, and suppose this occurrence depends on the number of initial bugs () and the number of claims by customers (). Suppose 1 failure (breakdown) is caused by 10 times errors. When the number of initial bugs increases, the errors in production would also increase. That is, when (the number of initial bugs) increases, (the count of errors for failure) is always constant, but (the count of errors for unit interval) increases. On the contrary, the average time decreases.

The linear predictor in Gamma regression is the average time , and it’s proportional to as a result. ( is called “scale”.)

Then the following inverse link function is often used for Gamma regression. (I describe the details about the link function later.)

Note : Here I don’t explain about details, but (shape) can also be estimated. (The variance of Gamma distribution is known as .)

Let’s see the following trivial example with R.

As you can see, we simulate with the value , and , and the estimated results are 0.504847, 0.021330 and 0.008929.

```
#####
# create sample data with rgamma
#####
# y : time to failure
# x1 : number of initial bugs
# x2 : number of customer claims
#####
n_sample <- 1000
shape <- 10
b0 <- 0.5
b1 <- 0.02
b2 <- 0.01
x1 <- runif(
n = n_sample,
min = 1,
max = 5)
x2 <- runif(
n = n_sample,
min = 11,
max = 15)
rate <- (b0 + b1 * x1 + b2 * x2) * shape
y <- rgamma(
n = n_sample,
shape = shape,
rate = rate)
sampledat <- data.frame(
x1 = x1,
x2 = x2,
y = y
)
#####
# Estimate
#####
fit <- glm(
y ~ x1 + x2,
data = sampledat,
family = Gamma(link = "inverse"))
summary(fit)
```

When you want to get the shape (s), you can use the following `gamma.shape()`

in MASS package. (Then you can reproduce the original model by Gamma distribution.)

```
library(MASS)
...
# you can get shape as follows.
test <- gamma.shape(fit, verbose = TRUE)
```

If you’re using Microsoft R runtime, you can use rxGlm in RevoScaleR for scaled execution.

```
fit2 <- rxGlm(
y ~ x1 + x2,
data = sampledat,
family = Gamma(link = "inverse"))
summary(fit2)
```

# About Link Functions

Now it’s time to expalin about the idea of the link function.

In this post I focus on the functions which we used in the previous example.

**Log Link Function**

Remember the previous Poisson Regression example.

Suppose the linear predictor (the average of germinating) as follows :

Suppose the seeds have germinated as many as 1.5 times by the enough water and as many as 1.2 times by the enough fertilizer. When you give both enough water and enough fertilizer, the seeds would germinate as many as 1.5 + 1.2 = 2.7 times ?

Of course, it’s not. The estimated value would be 1.5 * 1.2 = 1.8 times.

Thus (1) is not representing the actual model.

Let’s compare with the following model. This is called log link, because the linear predictor () equals to .

When we replace with , , , we can get the following model (2).

As you can see, (1) is growing by the addition, but (2) is by the multiplication. (With the model (1), it could be negative number when or decreses. But it won’t be with model (2).)

The absolute answer doesn’t exist (it is case by case), but you must understand the difference of properties of these two models and use each link function along with your actual modeling.

**Logit Link Function**

The logit link is used for representing the values which is 0 or 1 (or in the middle between 0 and 1). Unlike log link, this doesn’t exceed 1 for any explanatory variables. This idea is also used in the neural networks or other recognition algorithms in mathematics.

Remember the previous Logistic Regression example. Now we consider the following logical model. :

- Each input (food), (oxygen) is having the weight respectively.
- Get the total weights by
- Compare with some fixed threshold b. If it’s exceed b, the result is 1. Otherwise the result is 0.

This model is representing the aspect of the actual decision mechanism. But one important weakness is, it is not continuous. (It’s discrete.)

If we suppose , the result is plotted as follows. As you can see, we cannot treat approximation approach, because it’s not continuous.

Therefore we introduce the following function called “sigmoid”.

As you can see below, sigmoid is very similar to the previous one, but it’s continuous.

Now we suppose , and we can get the probability (probability of alive) as follows.

The logit link function is the inverse of this sigmoid. :

Note : As you can see, this is representing the probability of dead. This ratio of is called odds.

Here I don’t go so far, but this idea is used for other cases such as :

- You can apply
`tanh()`

for the same effect (continuous results) in the result (y) of range (-1, 1), instead of (0, 1). - You can apply
**softmax function**for the multiple (> 2) classification more than 2 classification. (Softmax is often used for the neural network’s activation function.)

**Inverse Link Function**

is called the inverse link function, where p is the predicted parameter.

Remember the previous Gamma Regression example. The predicted parameter is where s is shape and r is rate, and s is constant. Therefore this is the same meaning as follows :

where

As I explained in Gamma regression, r (rate) means “the occurrence count in unit (1) interval”. Then means “the average interval between occurrence”. This is called “scale”.

If the linear predictor is proportional to the scale (i.e, inversely proportional to the rate), you can specify identity link function (i.e, no specific link function is used) for Gamma Regression as follows instead of inverse link function.

```
fit <- glm(
y ~ x1 + x2,
data = sampledat,
family = Gamma(link = "identity"))
```

# Offset

When you want to specify some input variable without predicted coefficient like , you can use offset.

For instance, now we introduce new input variable “observation count” () in the previous example of Poisson Regression.

Then the following linear predictor will be used.

In such a case, you can use offset as follows.

```
fit <- glm(
y ~ x1 + x2,
data = sampledat,
family = poisson(link = "log"),
offset = log(x3))
```

You might think you can just divide the result by instead of using offset. But you can analyze without changing the variance or some other statistical attributes by using offset.

# Further More …

Here I just showed you each simple mathematical modeling, but the real situation is more complicated and combined with some elements and you need more realistic modeling like mixed models, Gaussian mixture modeling, etc.

Or suppose you try simple coin tossing (Bernoulli distribution) several times and it happened all is head ! In such a case, you will get wrong model which tells 100% “head” for coin tossing trials. With Bayesian approach, these kind of overfitting is suppressed and you can get more reasonable analytics by applying prior probabilities (you know it’s about half possibility for the coin toss trials !).

As I wrote in previous note, we didn’t discuss these advanced techniques here, but this basic ideas will help you understand others topic further more.

Categories: Uncategorized