If you are just starting out in machine learning, this post shows some basic overfitting examples and explains what to watch for and how to avoid it. To build intuition, I walk through several samples with plenty of plots.

First we look at simple overfitting examples in traditional statistical regression; in the latter part we discuss the case of neural networks.

# A First Look at Overfitting

Let me explain overfitting in machine learning with a small example dataset. (See the following plot of the sample data.)

[Figure: plot of the sample data, sampledat]

The following R script fits a linear model to the above dataset (sampledat).

To fit the given data as closely as possible, we build the formula with the `poly()` function as follows. The reasoning goes like this: if the true relationship is a lower-degree polynomial, the coefficients of the unneeded higher-degree terms should simply come out close to zero. So you might think “the larger, the better”.

```
fit <- lm(
  y ~ poly(x1, 10, raw = TRUE),
  data = sampledat)
```

Here is the result of this regression analysis.

As you can see, we get the following equation (1) as the fitted equation for the given data.

Now we plot this equation together with the given dataset. Equation (1) fits the given data very closely, as follows.

Is that really a good result?

In fact, this dataset (sampledat) was generated by the following R script. As you can see below, the data follow $y = 3 + x_1$ plus some Gaussian noise. If new data points are generated by this script, they obviously won’t fit equation (1).

That is, the result (equation (1)) is overfitting.

```
n_sample <- 20
b0 <- 3
b1 <- 1
x1 <- 1:n_sample
epsilon <- rnorm(
  n = n_sample,
  mean = 0,
  sd = 0.7) # noise
y <- b0 + b1 * x1 + epsilon
sampledat <- data.frame(
  x1 = x1,
  y = y
)
```

Let’s look at equation (1) from a bird’s-eye view. (See the following plot of equation (1).)

Equation (1) fits only the 20 given data points (the above “sampledat”); it is not a generalized model, and it doesn’t fit unknown data.
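The gap between fitting the given points and fitting unknown data can also be checked numerically. The following is a minimal Python sketch, not the R analysis above: it generates data from the same rule ($y = 3 + x_1$ plus Gaussian noise with sd 0.7), fits a degree-1 and a degree-10 polynomial by least squares, and compares the error on the training points with the error on freshly generated points. The helper names (`fit_poly`, `mse`, `make_data`, `scale`), the seed, and the rescaling of $x_1$ to $[-1, 1]$ are my own assumptions; the rescaling only improves numerical stability and does not change which curves can be fitted.

```python
import random

def fit_poly(us, ys, degree):
    """Least-squares polynomial fit via the normal equations (X^T X) c = X^T y."""
    k = degree + 1
    rows = [[u ** p for p in range(k)] for u in us]
    # Augmented matrix [X^T X | X^T y]
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, ys))] for i in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    coef = [0.0] * k
    for r in range(k - 1, -1, -1):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, k))
        coef[r] = (m[r][k] - tail) / m[r][r]
    return coef

def mse(coef, us, ys):
    """Mean squared error of the polynomial with coefficients `coef`."""
    preds = [sum(c * u ** p for p, c in enumerate(coef)) for u in us]
    return sum((pred - y) ** 2 for pred, y in zip(preds, ys)) / len(ys)

def make_data(rng, n=20, b0=3.0, b1=1.0, sd=0.7):
    """Same generating rule as the R script: y = b0 + b1 * x1 + noise."""
    xs = list(range(1, n + 1))
    ys = [b0 + b1 * x + rng.gauss(0.0, sd) for x in xs]
    return xs, ys

def scale(xs, n=20):
    """Map x in 1..n onto [-1, 1] (numerical conditioning only)."""
    mid, half = (n + 1) / 2, (n - 1) / 2
    return [(x - mid) / half for x in xs]

rng = random.Random(0)               # arbitrary seed (assumption)
x_train, y_train = make_data(rng)
x_fresh, y_fresh = make_data(rng)    # same rule, new noise
u_train, u_fresh = scale(x_train), scale(x_fresh)

results = {}
for degree in (1, 10):
    coef = fit_poly(u_train, y_train, degree)
    results[degree] = (mse(coef, u_train, y_train), mse(coef, u_fresh, y_fresh))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.3f}, "
          f"fresh MSE = {results[degree][1]:.3f}")
```

The degree-10 fit can never have a larger training error than the straight line, because the straight line is a special case of it. What matters is the second number: one would typically expect the error on fresh data to be worse for the degree-10 fit, although the exact values depend on the random draw.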

This is a trivial overfitting example for a first understanding. In real practical cases, it’s difficult to tell whether a model is overfitting or not.

Our next questions are: how do we detect it, and how do we avoid it?

# Information Criterion

Now suppose you add an extra parameter to your regression formula, but the likelihood improves only slightly or stays almost the same. If so, you may conclude that this new parameter is not needed in the regression formula.

In statistics, there are criteria (called “information criteria”) for judging model fit on a mathematical basis.

The best known is the Akaike Information Criterion (AIC), defined as follows. A smaller value means a better model.

$$\mathrm{AIC} = -2 \ln L + 2k$$

where $k$ is the number of estimated parameters and $L$ is the maximum likelihood.

Note : Both AIC and BIC (Bayesian information criterion) have the form $-2 \ln L + a k$. In AIC, $a$ equals 2 (a constant).

Here we have only 20 observations, and it’s difficult to tell what is noise and what is not. In BIC, $a$ depends on the number of observations $n$ ($a = \ln n$).

The following plot shows the log likelihood ($\ln L$) and AIC for the previous dataset (sampledat) against the number of estimated parameters. The red line is the log likelihood and the blue line is AIC. (You can easily get these values with the `logLik()` and `AIC()` functions in R.)

As you can see, the appropriate number of estimated parameters is 3. That is, the formula $y = b_0 + b_1 x_1$ is a good fit. (See the following note.)

Note : Hidden estimated parameters (such as the variance for Gaussian or the shape for Gamma) must be counted as estimated parameters for AIC. In this case we are using Gaussian, so **we must add the variance to the estimated parameters**. (See my previous post “Understanding the basis of GLM Regression” for details.)

For instance, if the formula (equation) is $y = b_0 + b_1 x_1$, then the estimated parameters are $b_0$, $b_1$, and the variance (3 parameters).

For instance, if we use the following dataset, the number of parameters must be 4. The equation (formula) must then have three coefficients plus the variance, that is, $y = b_0 + b_1 x_1 + b_2 x_1^2$.

Here we use only a single input ($x_1$), but if you have several input variables, you must also consider the interactions between them.
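The criteria in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model is Gaussian, the likelihood is maximized with the variance estimated as the residual sum of squares divided by the number of observations, and the variance is counted as one extra parameter, as described in the note above. The function names and the residual sums of squares in the demo loop are hypothetical, not values from sampledat.

```python
import math

def gaussian_log_likelihood(rss, n):
    """Maximized Gaussian log likelihood, using the ML variance estimate rss / n."""
    sigma2 = rss / n
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(sigma2) + 1.0)

def information_criterion(rss, n, n_coef, penalty):
    """-2 ln L + penalty * k, where k counts the coefficients plus the variance."""
    k = n_coef + 1
    return -2.0 * gaussian_log_likelihood(rss, n) + penalty * k

def aic(rss, n, n_coef):
    return information_criterion(rss, n, n_coef, penalty=2.0)

def bic(rss, n, n_coef):
    return information_criterion(rss, n, n_coef, penalty=math.log(n))

# Hypothetical residual sums of squares for n = 20 observations:
# (number of coefficients, residual sum of squares)
n = 20
for n_coef, rss in [(1, 700.0), (2, 9.0), (3, 8.9)]:
    print(f"k = {n_coef + 1}: AIC = {aic(rss, n, n_coef):.2f}, "
          f"BIC = {bic(rss, n, n_coef):.2f}")
```

In this made-up example, the third coefficient lowers the residual sum of squares only from 9.0 to 8.9, which is not enough to pay for the extra parameter, so the criterion prefers the smaller model.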

# Overfitting in neural networks

Let’s move on to neural networks.

First, remember that too many layers or neurons often cause overfitting. The number of layers in particular has a large effect on model complexity. (Of course, the network must still be deep enough to represent your working model.)

Here too, it’s not “the larger, the better”.

To keep the example simple, consider a small feed-forward neural network with sigmoid activations, two input variables, and one binary output (an output between 0 and 1).

If we have 1 hidden layer, it can represent models like those illustrated below. (The model can have several linear boundaries and combinations of them.)
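To make “several linear boundaries and combinations of them” concrete, here is a minimal Python sketch of a network with one hidden layer. All weights are hand-picked, hypothetical values (nothing is trained): each hidden sigmoid unit implements one linear boundary, and the output unit combines them so that the network responds only where both boundaries are crossed.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def one_hidden_layer(x1, x2, scale=10.0):
    """Two inputs -> two hidden sigmoid units -> one sigmoid output.

    The weights are hypothetical and chosen by hand for illustration.
    """
    h1 = sigmoid(scale * x1)   # hidden unit 1: linear boundary x1 = 0
    h2 = sigmoid(scale * x2)   # hidden unit 2: linear boundary x2 = 0
    # The output is high only when both hidden units are active,
    # which combines the two boundaries into the region x1 > 0 and x2 > 0.
    return sigmoid(scale * (h1 + h2 - 1.5))

for point in [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]:
    print(point, round(one_hidden_layer(*point), 4))
```

Adding more hidden units adds more boundaries to combine, which is where the extra capacity of larger networks comes from.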

If we have 2 hidden layers, it can represent more complex models, as illustrated below. (These are combinations of the 1-layer models.)

If the data contain some noise, a network with 2 hidden layers might overfit as follows.

As you can see here, more layers can cause overfitting.

Note : Unlike the statistical approach, there is no concrete criterion for deciding the best number of layers or neurons, because there is no common evaluation measure based on a mathematical model.

You must examine and evaluate the trained model with test data or validation data.

Model complexity is also increased by large coefficients. Let’s see the next example.

As you know, the sigmoid has a nearly linear part (around its center) and a nearly binary part (the saturated tails), as follows. The linear part can fit smoothly, but the binary part cannot (it fits in a step-like way).

As the weights increase, the binary part becomes stronger than the linear part.

For example, consider the following network.

This network produces the following plot (wire frame), which shows the inputs against the output z. As you can see, the transition is smooth.

Now consider the next example.

This network has exactly the same boundary as the previous one, but the coefficients (weights and bias) are much larger.

When we plot the inputs and the output, the surface becomes sharper than before.

As the weights grow, and given enough layers and neurons, the network can easily produce more complex models. As a result, it overfits and loses generalization.
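The effect of scaling the coefficients can be reproduced with a single sigmoid neuron. In the following Python sketch, the weights (1, 1) and the bias -1 are hypothetical values, not those of the networks illustrated above. Multiplying the weights and the bias by the same factor leaves the boundary $x_1 + x_2 = 1$ where it is, but makes the transition across it much sharper.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def neuron(x1, x2, scale):
    """Single sigmoid neuron with hypothetical weights (1, 1) and bias -1,
    all multiplied by `scale`. The boundary x1 + x2 = 1 does not depend on scale."""
    return sigmoid(scale * (1.0 * x1 + 1.0 * x2 - 1.0))

# Walk across the boundary along the line x1 = x2 = t
for t in (0.3, 0.4, 0.5, 0.6, 0.7):
    small = neuron(t, t, scale=1.0)
    large = neuron(t, t, scale=10.0)
    print(f"t = {t:.1f}: scale 1 -> {small:.3f}, scale 10 -> {large:.3f}")
```

On the boundary itself both versions output exactly 0.5; away from it, the large-coefficient version jumps quickly toward 0 or 1, which is the “binary part” of the sigmoid taking over.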

Large coefficients are easily generated.

All it takes is too many training iterations (inputs, epochs, etc.). Train! Train! Train! The coefficients grow as gradient descent proceeds.

For instance, the following is a simple feed-forward network for recognizing handwritten digits with mxnetR. This script outputs the variance of each layer’s weights.

```
library(mxnet)
...
# configure network
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")
# train
model <- mx.model.FeedForward.create(
  softmax,
  X=traindata.x,
  y=traindata.y,
  ctx=mx.cpu(),
  num.round = 10,  # number of training rounds (epochs)
  learning.rate=0.07)
# dump weights and biases
params <- model$arg.params
# check the variance of the weights in each layer
var(c(as.array(params$fc1_weight)))
var(c(as.array(params$fc2_weight)))
var(c(as.array(params$fc3_weight)))
```

When we set `num.round = 100` in this script (see the `num.round` argument above), we get larger and more widely spread coefficients, as follows.

In this example we chose an appropriate learning rate and number of layers based on experimental results, but the weights grow more rapidly with higher values of these parameters.

Note : The learning rate also strongly affects the weights and accuracy. It must be small enough that gradient descent keeps decreasing the error, but large enough to converge as quickly as possible. (You must decide it experimentally.)

[Figures: weights at epoch = 10 and at epoch = 100]
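The same growth can be reproduced in a far smaller setting than the mxnetR network. The following Python sketch is my own toy example, not the script above: a single sigmoid neuron with one weight, trained by full-batch gradient descent on four hypothetical, perfectly separable points. Because the data are separable, every gradient step pushes the weight further in the same direction, so more epochs always mean a larger weight.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical, linearly separable 1-D data: negative x -> 0, positive x -> 1
XS = [-2.0, -1.0, 1.0, 2.0]
YS = [0.0, 0.0, 1.0, 1.0]

def train(epochs, learning_rate=0.1):
    """Full-batch gradient descent on the cross-entropy loss of y = sigmoid(w * x)."""
    w = 0.0
    for _ in range(epochs):
        # Gradient of the summed cross-entropy loss with respect to w
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(XS, YS))
        w -= learning_rate * grad
    return w

for epochs in (10, 100, 1000):
    print(f"epochs = {epochs:4d}: w = {train(epochs):.3f}")
```

The training error keeps shrinking the whole time, so nothing in the training loop itself signals that the weight has become larger than the data justify.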

There are several regularization techniques to mitigate overfitting in neural networks, as follows.

- Early Stopping – Stop training when some condition occurs (e.g., when the error is higher than at the last check)
- Penalty (L1, L2) – Add a penalty term to the objective evaluated in gradient descent to keep the weights from growing (weight decay penalty)
- Dropout – Randomly drop neurons in each training phase. This prevents overfitting through co-adaptation when the network has a complex structure with many layers and neurons. As a result, it achieves the effect of model combination (similar to ensemble learning such as random forest).
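The first two techniques in the list can be sketched on a toy model in plain Python. This is an illustration under my own assumptions, not the API of any framework: the model is a single sigmoid neuron with one weight, the training points are separable, and the hypothetical validation set contains one deliberately noisy point so that the validation error eventually rises. The L2 penalty adds `l2 * w` to the gradient, and early stopping halts as soon as the validation error is higher than at the last check.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical data as (x, label) pairs
TRAIN = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
VALID = [(-1.5, 0), (1.5, 1), (0.5, 0)]   # the last point is deliberately noisy

def loss(w, data):
    """Mean cross-entropy of y = sigmoid(w * x), written as log(1 + exp(-margin))."""
    return sum(math.log1p(math.exp(-(2 * y - 1) * w * x)) for x, y in data) / len(data)

def grad(w, data, l2):
    """Gradient of the summed cross-entropy plus the L2 penalty (l2 / 2) * w**2."""
    return sum((sigmoid(w * x) - y) * x for x, y in data) + l2 * w

def train(max_epochs, learning_rate=0.1, l2=0.0, early_stopping=False):
    """Return (weight, number of epochs actually kept)."""
    w = 0.0
    last_valid = loss(w, VALID)
    for epoch in range(1, max_epochs + 1):
        w_new = w - learning_rate * grad(w, TRAIN, l2)
        valid = loss(w_new, VALID)
        if early_stopping and valid > last_valid:
            return w, epoch - 1   # keep the weight from before the error rose
        w, last_valid = w_new, valid
    return w, max_epochs

w_plain, _ = train(200)
w_l2, _ = train(200, l2=0.5)
w_early, stopped_at = train(200, early_stopping=True)
print(f"no regularization: w = {w_plain:.3f}")
print(f"L2 penalty:        w = {w_l2:.3f}")
print(f"early stopping:    w = {w_early:.3f} (stopped after {stopped_at} epochs)")
```

Both techniques end with a smaller weight than unregularized training: the penalty by pulling the weight back at every step, early stopping by ending the run before the weight has had time to grow.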

The supported regularization methods differ from framework to framework.

Among Microsoft’s libraries, you can implement all of these regularization techniques (early stopping, L1 and L2 penalties, dropout) with CNTK, but rxNeuralNet (in MicrosoftML) and NET# don’t support them.

```
# Dropout with CNTK (Python)
...
with default_options(activation=relu, pad=True):
    model = Sequential([
        LayerStack(2, lambda : [
            Convolution((3,3), 64),
            Convolution((3,3), 64),
            MaxPooling((3,3), strides=2)
        ]),
        LayerStack(2, lambda i: [
            Dense([256,128][i]),
            Dropout(0.5)
        ]),
        Dense(4, activation=None)
    ])
...
```