

Early stopping


Early stopping is a form of regularization used to avoid overfitting on the training dataset. It keeps track of the validation loss: if the loss stops decreasing for several epochs in a row, training is halted.

The early stopping meta-algorithm below determines the best amount of time to train. It is a general strategy that works well with a variety of training algorithms and ways of quantifying error on the validation set.


Let n be the number of steps between evaluations.
Let p be the "patience," the number of times to observe worsening validation set error before giving up.
Let θ₀ be the initial parameters.
θ ← θ₀
i ← 0
j ← 0
v ← ∞
θ* ← θ
i* ← i
while j < p do
   Update θ by running the training algorithm for n steps.
   i ← i + n
   v′ ← ValidationSetError(θ)
   if v′ < v then
      j ← 0
      θ* ← θ
      i* ← i
      v ← v′
   else
      j ← j + 1
   end if
end while
Best parameters are θ*, best number of training steps is i*.
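
The following is a minimal Python sketch of the meta-algorithm above. The helper functions train_n_steps and validation_error, as well as the default values of n and p, are hypothetical placeholders for an actual training routine and validation metric.

import copy
import math

def early_stopping(theta_0, train_n_steps, validation_error, n=100, p=5):
    """Return the best parameters and the best number of training steps.

    theta_0          -- initial parameters
    train_n_steps    -- hypothetical: runs the training algorithm for n steps
    validation_error -- hypothetical: returns the error of theta on the validation set
    n                -- number of steps between evaluations
    p                -- patience: evaluations to tolerate without improvement
    """
    theta = copy.deepcopy(theta_0)
    best_theta = copy.deepcopy(theta)   # theta*
    best_i = 0                          # i*
    i, j, v = 0, 0, math.inf

    while j < p:
        theta = train_n_steps(theta, n)        # update theta for n training steps
        i += n
        v_new = validation_error(theta)        # v'
        if v_new < v:                          # validation error improved
            j = 0
            best_theta = copy.deepcopy(theta)  # keep a copy of the best parameters
            best_i = i
            v = v_new
        else:                                  # no improvement at this evaluation
            j += 1

    return best_theta, best_i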


In a typical learning setup, the dataset is divided into a training set and a test set. The parameters are updated after each epoch based on the training data, and the trained model is then evaluated on the test set. Generally, the training set error will be lower than the test set error. This gap is a symptom of overfitting: the algorithm effectively memorizes the training data and produces the right results on the training set, so the model becomes overly specific to the training set and fails to produce accurate results on other data, including the test set. Regularization techniques are used in such situations to reduce overfitting and improve the performance of the model on unseen data. Early stopping is a popular regularization technique due to its simplicity and effectiveness.

Regularization by early stopping can be done in two ways: by dividing the dataset into training and test sets and using cross-validation on the training set, or by dividing the dataset into training, validation and test sets, in which case cross-validation is not required. Here, the second case is analyzed (a minimal split is sketched below). In early stopping, the algorithm is trained using the training set, and the point at which to stop training is determined from the validation set. As training proceeds, the training error steadily decreases, while the validation error decreases only up to a point and then starts to increase, because the model begins to overfit the training data. A model with better validation set error can therefore be obtained by using the parameters that give the lowest validation error. Each time the error on the validation set decreases, a copy of the model parameters is stored. When the training algorithm terminates, the stored parameters with the lowest validation set error are returned, rather than the parameters from the last update.
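
As a concrete illustration of the second case, here is a minimal scikit-learn sketch of a training/validation/test split; the synthetic data and the 60/20/20 proportions are arbitrary assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic placeholder data, just to illustrate the split.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# First hold out the test set, then split the remainder into training and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)  # 0.25 of 80% = 20%

# Result: 60% training, 20% validation, 20% test; no cross-validation is needed.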

In regularization by early stopping, we stop training the model when its performance on the validation set starts getting worse: increasing loss, decreasing accuracy, or a poorer value of whatever scoring metric is used. If the error on the training set and the validation set are plotted together, both decrease with the number of iterations until the point where the model starts to overfit; after this point, the training error keeps decreasing while the validation error increases. Even if training continues past this point, early stopping returns the parameters that were in use at this point, so it is equivalent to stopping training there. The returned parameters give the model lower variance and better generalization: the model at the stopping point generalizes better than the model with the lowest training error. Early stopping can be thought of as implicit regularization, in contrast to explicit regularization such as weight decay. It is also efficient: since training halts as soon as validation performance stops improving, it typically requires less training time than other regularization methods. However, repeating the early stopping process many times may result in the model overfitting the validation dataset, just as it can overfit the training data.
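
In practice, most deep learning libraries provide early stopping out of the box. Below is a minimal sketch using Keras's EarlyStopping callback (assuming TensorFlow is installed); the random data, network architecture, and patience value are arbitrary placeholders. Setting restore_best_weights=True corresponds to returning the parameters from the point with the lowest validation loss rather than the last ones.

import numpy as np
import tensorflow as tf

# Random placeholder data standing in for a real dataset.
X_train, y_train = np.random.rand(800, 20), np.random.randint(0, 2, 800)
X_val, y_val = np.random.rand(200, 20), np.random.randint(0, 2, 200)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop when the validation loss has not improved for 5 consecutive epochs and
# restore the weights from the epoch with the lowest validation loss.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=200,
          callbacks=[early_stop],
          verbose=0)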

The number of iterations (or epochs) taken to train the model can be considered a hyperparameter. An optimal value for this hyperparameter can then be found by hyperparameter tuning on the validation set for the best performance of the learning model, as in the sketch below.
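
A minimal sketch of this kind of tuning, using scikit-learn's SGDClassifier; the candidate iteration counts and the synthetic data are arbitrary assumptions made for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Treat the number of training iterations as a hyperparameter and keep the
# value that gives the best validation accuracy.
best_iters, best_score = None, float("-inf")
for n_iter in [5, 10, 20, 50, 100, 200]:
    clf = SGDClassifier(max_iter=n_iter, tol=None, random_state=0)
    clf.fit(X_train, y_train)
    score = clf.score(X_val, y_val)
    if score > best_score:
        best_iters, best_score = n_iter, score

print(f"Best number of iterations: {best_iters} (validation accuracy {best_score:.3f})")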

