If you’re a machine learning beginner, you might think “complex models can represent the real world more accurately” or “more training makes models better and better”. But these ideas have several harmful consequences.
Here I describe how these ideas can badly affect your model from a mathematical point of view. To build your intuition, I won’t use difficult mathematical notation, but will explain with as much visualization as possible.
First we look at primitive overfitting examples with traditional statistical regression, and in the latter part we discuss the case of neural networks.
1. Overfitting in Statistical Modeling (Example Case with Regression)
Model Complexity Mismatch
Let me explain overfitting in machine learning with a brief example. See the following plot of the sample data.
The following R script is a regression example for the above dataset (sampledat).
To fit the given data precisely, here we use a linear model whose formula is a polynomial built with the poly() function, as follows. If the true relationship is simply linear, then you might expect the higher-order coefficients to be estimated as zero. You might think “the more, the merrier”.
fit <- lm( y ~ poly(x1, 10, raw = TRUE), data = sampledat)
But this idea makes the result worse.
Here I show you the result for this regression analysis.
As you can see, we now get the following equation (1) fitting the given data.
Now let us plot this equation with the given dataset. As you can see below, equation (1) fits the given data best.
Figure : Real Overfitting by Linear Model (Degree-10 Polynomial)
Is that really a good result?
In fact, this sample data is generated by the following R script.
As you can see below, this dataset is generated by y = 3 + x1 (i.e., b0 = 3, b1 = 1) with some noise. If new data is generated by this script, the above equation (1) obviously won’t fit it.
That is, equation (1) doesn’t fit unknown data, and the result (equation (1)) is overfitting.
n_sample <- 20
b0 <- 3
b1 <- 1
x1 <- c(1:n_sample)
epsilon <- rnorm(n = n_sample, mean = 0, sd = 0.7) # noise
y <- b0 + b1 * x1 + epsilon
sampledat <- data.frame(x1 = x1, y = y)
Let us extend the range of the horizontal axis (x) to 100 and see equation (1) from a bird’s-eye view. (See the following plot of equation (1).) This is obviously not the expected result.
Equation (1) fits only the given 20 data points (the above “sampledat”); it is not a generalized model.
Figure : Previous Model from a Bird’s-Eye View (Not the Expected Result)
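To make this failure concrete, here is a minimal Python sketch (hypothetical data generated the same way as sampledat, using NumPy instead of R) comparing a degree-10 polynomial fit with a straight-line fit:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Generate 20 points from y = 3 + x + noise, mirroring the R script "sampledat".
rng = np.random.default_rng(0)
n_sample = 20
x = np.arange(1, n_sample + 1, dtype=float)
y = 3.0 + 1.0 * x + rng.normal(0.0, 0.7, n_sample)

# Fit the overfitting degree-10 polynomial and the true-form straight line.
p10 = Polynomial.fit(x, y, deg=10)
p1 = Polynomial.fit(x, y, deg=1)

# On the training points, the complex model fits at least as well...
err10 = float(np.mean((p10(x) - y) ** 2))
err1 = float(np.mean((p1(x) - y) ** 2))

# ...but extrapolated to x = 100 (true value: 3 + 100 = 103) it is far off.
pred10_at_100 = float(p10(100.0))
pred1_at_100 = float(p1(100.0))
print(err10, err1)
print(pred1_at_100, pred10_at_100)
```

The straight-line fit stays near 103 at x = 100, while the degree-10 fit flies far away: that is exactly the bird’s-eye-view failure shown in the figure above.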
Here we showed a trivial overfitting example for your first understanding, but in real practical cases it’s difficult to distinguish whether a model is overfitting or not.
Our next concern is: how to detect it, and how to avoid it?
Note : When the degree gets larger, the values of the coefficients also tend to be large, and this is also one of the causes of overfitting.
Later we discuss the relation between large coefficients and overfitting. (“Overfitting”, “model complexity”, and “large coefficients” are related to each other in both statistical modeling and neural networks.)
How to Avoid Overfitting in Statistical Models – Information Criterion
Now let’s imagine that you add an extra parameter into your regression formula. As you can easily see, the likelihood will improve only slightly or stay almost the same. (See my early post “Understanding the basis of GLM Regression” for details about “likelihood”.)
If so, you may think “this new parameter might not be needed for this regression formula”.
In the statistical approach, there exists a criterion called an “information criterion” to judge this question on a mathematical basis.
The Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC) are often used, and here I illustrate with the simple AIC.
The following is the equation for AIC, where k is the number of estimated parameters and L is the maximum likelihood. The equation tells us: the smaller the value, the better the fit.

AIC = -2 ln L + 2k

Note : Both AIC and BIC (Bayesian information criterion) have the form -2 ln L + c k. In AIC, c equals 2 (a constant value).
Here we have only 20 survey data points (observations), and it’s difficult to tell whether a fluctuation is noise or not. With BIC, the coefficient c = ln n includes the number of observations n as a parameter.
The following is the plot of the log likelihood and AIC for the previous sample data (sampledat). The red line is the log likelihood and the blue line is AIC. (You can easily get these values with the AIC() function in R.)
As you can see, the appropriate number of estimated parameters is 3. That is, the linear formula y = b0 + b1 x is good for fitting. (See the following note.)
Note : The hidden estimated parameters (like the variance for a Gaussian, the shape for a Gamma, etc.) must be counted as estimated parameters for AIC. In this case we are using a Gaussian, so we must add the variance to the estimated parameters. (See my previous post “Understanding the basis of GLM Regression” for details.)
For instance, if the formula (equation) is y = b0 + b1 x, then the estimated parameters are b0, b1, and the variance (3 parameters in total).
Let’s see another example. If we use the following dataset, the number of parameters must be 4, and then the equation (formula) must be y = b0 + b1 x + b2 x^2. (Actually, the following data is generated by a degree-2 equation with noise.)
Here in this example we use only a single input (x1), but if you have several input variables, you must also consider the interactions between them.
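The AIC selection described above can be sketched in a few lines of Python (hypothetical NumPy data generated from a line, counting the variance as an estimated parameter as the note explains):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
n = 20
x = np.arange(1, n + 1, dtype=float)
y = 3.0 + 1.0 * x + rng.normal(0.0, 0.7, n)

def aic_for_degree(deg):
    p = Polynomial.fit(x, y, deg=deg)
    rss = float(np.sum((p(x) - y) ** 2))
    sigma2 = rss / n                               # MLE of the Gaussian variance
    loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    k = (deg + 1) + 1                              # coefficients + the variance
    return 2.0 * k - 2.0 * loglik                  # AIC = -2 ln L + 2k

aics = {deg: aic_for_degree(deg) for deg in range(1, 6)}
best = min(aics, key=aics.get)
print(best)  # with data generated from a line, the minimum is usually at degree 1
```

Adding a degree only wins when it buys more than 2 units of -2 ln L, so on near-linear data the higher-degree models are usually rejected.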
2. Overfitting in Neural Networks
How Model Complexity Occurs in Neural Networks
Let’s proceed with our discussion to neural networks (deep learning models).
As in the previous regression example, overfitting in neural networks is also caused by an overly complicated model. However, there exist other root causes in neural networks. (We will see these aspects later.)
For simplicity of explanation, here we discuss trivial feed-forward (fully connected) neural networks, but it’s essentially the same for other types of models.
First, as you can easily imagine, too many layers or neurons cause overfitting. The number of layers especially affects the complexity. (On the contrary, a simple model with too few layers will cause underfitting. The layers must be appropriate to represent your working model.)
Here too, it’s not “the more, the merrier”.
To simplify our example, let’s consider a brief feed-forward neural network with sigmoid activation, only two input variables (x1, x2), and one binary output (an output between 0 and 1).
If we have 1 hidden layer, it can represent a model as illustrated below. (The model can have several linear boundaries and combinations of them.) The gray region means result “1” and the others mean result “0”.
Figure : Neural Network Representation by 1 Hidden Layer
If we have 2 hidden layers, it can represent more complex models, as illustrated below. These are combinations of the above 1-layer models.
Figure : Neural Network Representation by 2 Hidden Layers
As you can see below, a neural network with 2 hidden layers can easily cause overfitting. (In the following picture, a circle means valid data and a cross means noise data.)
Figure : Intuition for Neural Network Overfitting (1)
Likewise, a neural network with 3 hidden layers consists of combinations of 2-layer models. As the number of layers increases, the complexity of the model representations grows rapidly.
Unlike the statistical approach, there’s no concrete criterion for deciding the best number of layers or neurons, because there is no common evaluation measure based on a mathematical model.
One possible way to avoid this kind of inappropriate model is hyper-parameter tuning. You can examine and evaluate the generated model with test data or validation data (data not used for training). If the results are not as expected, change the number of layers or neurons and try again. (Repeat this iteration.)
Large Coefficients Preventing Model Smoothness
Next you should remember that model complexity is also caused by large coefficients. Let’s see the next example.
As you know, the sigmoid function has the following linear part and binary (saturated) part. The linear part can fit smoothly, but the binary one cannot.
As the weights increase, the binary part becomes stronger than the linear part.
Figure : Neural Network includes both Binary Part and Linear Part
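A tiny Python sketch of this effect (a single sigmoid unit with a hypothetical weight and bias): scaling both the weight and the bias by the same factor keeps the decision boundary fixed but sharpens the transition from the linear part to the binary part.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit(x, scale):
    w, b = 1.0, -0.5            # decision boundary at x = 0.5
    return float(sigmoid(scale * (w * x + b)))

# At the boundary the output is 0.5 regardless of the coefficient scale...
print(unit(0.5, scale=1.0), unit(0.5, scale=10.0))
# ...but away from it, larger coefficients push the output toward 0/1.
print(unit(1.0, scale=1.0))     # gently above 0.5 (linear part)
print(unit(1.0, scale=10.0))    # close to 1.0 (binary part)
```

This is exactly what Model 1 and Model 2 below illustrate: the same boundary, but a sharper surface once the coefficients grow.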
Now let’s see this using actual networks, as follows.
Figure : Neural Network Model 1
This network (Model 1) results in the following plot (wire frame). Here the horizontal axes are the inputs (x1, x2), and the z-axis is the output.
As you can see, it transitions smoothly thanks to the effect of the linear part of the sigmoid function.
Figure : Model 1 is Smoothly Transitioning
Let’s look at the next example.
This network has exactly the same boundary as the previous one, but the coefficients (weights and biases) are larger than Model 1’s.
Figure : Neural Network Model 2
When we plot the inputs (x1, x2) against the output, the surface becomes sharper than before.
Figure : Model 2 is Sharp ! (Though Model 2 has the same boundary as Model 1)
As the weights increase, and given enough layers and neurons, the network can easily produce more complex models. Eventually this causes overfitting and a lack of generalization.
The following shows the intuition for overfitting caused by large coefficients.
Figure : Intuition for Neural Network Overfitting (2)
Harmful Effects of Too Much Training
How does this kind of overfitting occur?
In fact, large coefficients are easily generated.
You just need to train with too many iterations (epochs) on the same training inputs. Each training iteration fits more precisely by slightly increasing the coefficients (this coefficient growth is driven by gradient descent), and this results in a complicated model representation.
Too few training iterations cause a poor model, i.e., underfitting, but too many iterations cause overfitting on the contrary.
Moderate training (matched to the amount of variation in the training data) is needed for the best model. (Avoid “Train! Train! Train!”)
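This coefficient growth can be illustrated with a toy Python experiment (plain gradient descent on a synthetic, over-parameterized least-squares problem, not the MXNet run below): starting from zero weights, the norm of the weight vector keeps growing as the iterations pile up.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 30))    # 30 weights but only 20 observations
y = rng.normal(size=20)

def weight_norm_after(steps, lr=0.005):
    """Run `steps` iterations of gradient descent on ||Xw - y||^2 from w = 0."""
    w = np.zeros(30)
    for _ in range(steps):
        grad = X.T @ (X @ w - y)
        w -= lr * grad
    return float(np.linalg.norm(w))

# More iterations -> larger coefficients, exactly the effect described above.
print(weight_norm_after(10), weight_norm_after(100), weight_norm_after(2000))
```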
For instance, the following is a simple MXNetR example with feed-forward networks for recognizing hand-written digits. At the end of the code, this script outputs the variance of each layer’s weights.
require(mxnet)
...
# configure network
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")
# train
model <- mx.model.FeedForward.create(
  softmax,
  X=traindata.x,
  y=traindata.y,
  ctx=mx.cpu(),
  num.round = 10,
  learning.rate=0.07)
# dump weights and biases
params <- model$arg.params
# check the variance of weights in each layer
var(c(as.array(params$fc1_weight)))
var(c(as.array(params$fc2_weight)))
var(c(as.array(params$fc3_weight)))
As you can see from the following results, we get more widely distributed (larger) coefficients when we change num.round to 100.
In this example we set an appropriate learning rate and number of layers, but the weights increase rapidly when we change this one parameter.
Note : The learning rate also strongly affects the weights and accuracy. The learning rate must be small enough to steadily decrease the loss in gradient descent, but large enough to make it converge as rapidly as possible.
You can decide through a lot of experimentation. (So-called hyper-parameter tuning.)
Figure : Weight variances with epoch = 10 vs. epoch = 100
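The learning-rate trade-off in the note above can be seen even on the toy loss f(w) = w² (gradient 2w), a deliberately simplified Python sketch:

```python
def final_w(lr, steps=50, w0=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w       # gradient of w**2 is 2*w
    return w

print(abs(final_w(0.1)))   # small enough: shrinks toward the minimum at 0
print(abs(final_w(1.1)))   # too large: every step overshoots and w blows up
```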
Some Techniques for Avoiding Overfitting in Neural Networks
So how can we detect overfitting, and how can we avoid it? Here in this post I show you a brief outline of these techniques.
Of course, you can easily find overfitting with validation data or test data (that is, data not used in the training phase). If the accuracy gets worse on the validation data (or the loss grows on the validation data), it could be overfitting.
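As a concrete Python sketch of this check (hypothetical near-linear data, with a polynomial fit standing in for the network): hold out points never used for fitting and compare the errors.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)
x = np.arange(1, 31, dtype=float)
y = 3.0 + x + rng.normal(0.0, 0.7, 30)
x_tr, y_tr = x[:20], y[:20]      # training data
x_va, y_va = x[20:], y[20:]      # validation data, never used for fitting

def mse(p, xs, ys):
    return float(np.mean((p(xs) - ys) ** 2))

fit_simple = Polynomial.fit(x_tr, y_tr, deg=1)
fit_complex = Polynomial.fit(x_tr, y_tr, deg=10)

# The complex model looks better on training data but worse on validation data:
print(mse(fit_complex, x_tr, y_tr), mse(fit_simple, x_tr, y_tr))
print(mse(fit_complex, x_va, y_va), mse(fit_simple, x_va, y_va))
```

A widening gap between training error and validation error is the practical signal of overfitting.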
However, there also exist several regularization techniques to mitigate overfitting in neural networks, as follows.
- Early Stopping :
A method to stop learning when some condition occurs (e.g., when the error is higher than at the last check). As you can see above, this is a technique for preventing too much training.
- Penalty (L1, L2) :
A method to add a penalty term that discourages weight growth (a weight decay penalty) in the gradient-descent objective.
- Dropout :
A method to randomly drop neurons in each training phase. It’s based on model combination, similar to ensemble learning (such as random forests), and prevents overfitting by reducing the co-adaptation of neurons.
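Two of these techniques can be sketched framework-independently in Python (hypothetical numbers; real frameworks provide their own callbacks and regularizers):

```python
def early_stopping_epoch(val_losses, patience=3):
    """Epoch at which training stops: no validation improvement for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

def l2_step(w, grad, lr=0.1, lam=0.01):
    """One gradient-descent step with an L2 penalty: the lam * w term decays the weight."""
    return w - lr * (grad + lam * w)

# Validation loss falls, then keeps rising: stop at epoch 5 instead of epoch 9.
losses = [1.0, 0.8, 0.7, 0.75, 0.9, 1.1, 1.3, 1.5, 1.7, 2.0]
print(early_stopping_epoch(losses))          # 5
print(l2_step(1.0, grad=0.0))                # 0.999: pure decay when grad is 0
```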
The supported regularization methods differ between frameworks (TensorFlow, MXNet, etc.); see each framework’s documentation for details.
For instance, you can implement all of these regularization techniques (early stopping, L1/L2 penalty, dropout) with CNTK as follows, but rxNeuralNet in MicrosoftML doesn’t support them. (Net# itself doesn’t support them.)
# Dropout with CNTK (Python)
...
with default_options(activation=relu, pad=True):
    model = Sequential([
        LayerStack(2, lambda : [
            Convolution((3,3), 64),
            Convolution((3,3), 64),
            MaxPooling((3,3), strides=2)
        ]),
        LayerStack(2, lambda i: [
            Dense([256,128][i]),
            Dropout(0.5)
        ]),
        Dense(4, activation=None)
    ])
...
3. Appendix – The Challenges for Model Explainability
This is a different topic and set of techniques, but it is closely related in terms of preventing wrong models.
As you saw in the previous examples, “accuracy” is not a complete indicator for evaluating your generated model.
Now I show you another famous example. (See the original paper for details.)
In this example, assume there exists some model that classifies huskies and wolves with high accuracy. But what if this model classifies by detecting the snowy background in the image? Should we trust this model in real production?
Figure : Wolf, but classified as husky (right side is explanation) – From “Explaining the Predictions of Any Classifier”
As this example shows, there exist several reasons other than overfitting for invalid models, such as bias in the training data, model mismatch, and so on. This example uses images, but the same applies to tabular and text data.
To find this kind of invalid learning, techniques for explaining model validity have recently emerged.
What is the meaning of model explanation?
As you know, some models are easily understood and explained by humans, such as linear classifiers or decision-tree classifiers. Does that mean these models are good and other, more complicated models are bad?
Of course not.
As I mentioned above, neural networks can represent more complicated models. If the real world is not simple, it’s better to use models that correspond to your real problems.
The challenge of model explanation (model interpretability) is to understand whether you can trust your model or not, even when using more complicated models such as random forests, SVMs, or neural networks.
For instance, in the previous example (the husky/wolf classifier), you could find the model’s invalidity by masking or replacing the snowy background images and comparing the accuracies.
Here I won’t go deeply into model explanation (model interpretability) techniques, but the basic idea is to simplify the model with a local approximation that a human can judge, by showing which features are effective in that local approximation (how much each affects the result, positively or negatively, etc.).
Figure : Intuition for black-box model explanation “LIME” – From “Explaining the Predictions of Any Classifier”
If you’re interested in model interpretability, see the documentation of each implementing technology, such as LIME or SHAP (black-box model explanation) or ELI5 (white-box model explanation).
Update History :
May 2019 Added description for model explanation