The choice of cost function is tightly coupled with the choice of output unit. Most of the time, we simply use the cross-entropy between the data distribution and the model distribution. The choice of how to represent the output then determines the form of the cross-entropy function.
Linear output layers are often used to produce the mean of a conditional Gaussian distribution: $p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y}; \hat{\mathbf{y}}, \mathbf{I})$, where $\hat{\mathbf{y}} = \mathbf{W}^\top \mathbf{h} + \mathbf{b}$ is produced by a layer of linear output units applied to features $\mathbf{h}$.
Maximizing the log-likelihood is then equivalent to minimizing the mean squared error. The maximum likelihood framework makes it straightforward to learn the covariance of the Gaussian too, or to make the covariance of the Gaussian be a function of the input. However, the covariance must be constrained to be a positive definite matrix for all inputs. It is difficult to satisfy such constraints with a linear output layer, so typically other output units are used to parametrize the covariance. Because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms and may be used with a wide variety of optimization algorithms.
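To make this equivalence concrete, the following NumPy sketch (illustrative; the shapes and variable names are my own assumptions, not from the text) computes both the mean squared error and the negative log-likelihood of $\mathcal{N}(\mathbf{y}; \hat{\mathbf{y}}, \mathbf{I})$ for a linear output layer. The two differ only by a factor of one half and an additive constant, so their gradients point in the same direction.

```python
import numpy as np

# Minimal sketch: a linear output layer y_hat = W^T h + b used as the mean
# of a conditional Gaussian with identity covariance. The shapes below are
# arbitrary illustrative choices.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))      # features for a batch of 5 examples
W = rng.normal(size=(4, 3))      # weights of the linear output layer
b = np.zeros(3)                  # biases
y = rng.normal(size=(5, 3))      # regression targets

y_hat = h @ W + b                # linear output layer: predicted Gaussian mean

# Squared error, summed over output dimensions and averaged over the batch.
mse = np.mean(np.sum((y - y_hat) ** 2, axis=1))

# Negative log-likelihood of N(y; y_hat, I), averaged over the batch.
nll = np.mean(0.5 * np.sum((y - y_hat) ** 2, axis=1)
              + 0.5 * y.shape[1] * np.log(2 * np.pi))

# nll == 0.5 * mse + constant, so minimizing MSE maximizes the likelihood.
assert np.isclose(nll, 0.5 * mse + 0.5 * y.shape[1] * np.log(2 * np.pi))
```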
Many tasks require predicting the value of a binary variable $y$. Classification problems with two classes can be cast in this form. The maximum-likelihood approach is to define a Bernoulli distribution over $y$ conditioned on $\mathbf{x}$. A Bernoulli distribution is defined by just a single number. The neural net needs to predict only $P(y = 1 \mid \mathbf{x})$. For this number to be a valid probability, it must lie in the interval $[0, 1]$.
Satisfying this constraint requires some careful design effort. Suppose we were to use a linear unit and threshold its value to obtain a valid probability: $$P(y = 1 \mid \mathbf{x}) = \max\left\{0, \min\left\{1, \mathbf{w}^\top \mathbf{h} + b\right\}\right\}.$$ This would indeed define a valid conditional distribution, but we could not train it very effectively with gradient descent: any time $\mathbf{w}^\top \mathbf{h} + b$ strays outside the unit interval, the gradient of the output with respect to the parameters is zero, so learning cannot make progress. Instead, it is better to use a sigmoid output unit, $\hat{y} = \sigma(\mathbf{w}^\top \mathbf{h} + b)$, which we can think of as first computing $z = \mathbf{w}^\top \mathbf{h} + b$ with a linear layer and then using the sigmoid activation function to convert $z$ into a probability.
We omit the dependence on $\mathbf{x}$ for the moment to discuss how to define a probability distribution over $y$ using the value $z$. The sigmoid can be motivated by constructing an unnormalized probability distribution $\tilde{P}(y)$, which does not sum to 1. We can then divide by an appropriate constant to obtain a valid probability distribution. If we begin with the assumption that the unnormalized log probabilities are linear in $y$ and $z$, we can exponentiate to obtain the unnormalized probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of $z$: $$\log \tilde{P}(y) = yz, \qquad \tilde{P}(y) = \exp(yz), \qquad P(y) = \frac{\exp(yz)}{\sum_{y'=0}^{1} \exp(y'z)} = \sigma\bigl((2y - 1)z\bigr).$$
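As a quick numerical check of this derivation (an illustrative snippet, not part of the text), exponentiating and normalizing the assumed unnormalized log probabilities $\log \tilde{P}(y) = yz$ does recover $P(y) = \sigma((2y - 1)z)$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Start from the assumed unnormalized log probabilities log P~(y) = y * z
# for y in {0, 1}, then exponentiate and normalize over y.
z = 1.3                                            # an arbitrary example logit
unnormalized = np.exp(np.array([0.0, 1.0]) * z)    # P~(y=0), P~(y=1)
P = unnormalized / unnormalized.sum()              # valid Bernoulli distribution

# The normalized probabilities match the sigmoidal form sigma((2y - 1) z).
assert np.isclose(P[1], sigmoid(z))                # y = 1
assert np.isclose(P[0], sigmoid(-z))               # y = 0
```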
When we use other loss functions, such as mean squared error, the loss can saturate anytime $\sigma(z)$ saturates. The sigmoid activation function saturates to 0 when $z$ becomes very negative and saturates to 1 when $z$ becomes very positive. The gradient can then shrink too small to be useful for learning, whether the model has the correct answer or the incorrect answer. For this reason, maximum likelihood is almost always the preferred approach to training sigmoid output units.
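A minimal sketch of that maximum-likelihood loss, written directly in terms of the logit $z$ (the function names below are my own, not the text's): the negative log-likelihood of a sigmoid output unit equals $\mathrm{softplus}((1 - 2y)z)$, which saturates only when the model already has the right answer and keeps a useful gradient when it is confidently wrong.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)) without overflow.
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def bernoulli_nll_from_logit(z, y):
    """Negative log-likelihood of y in {0, 1} under P(y = 1) = sigmoid(z)."""
    return softplus((1.0 - 2.0 * y) * z)

# With z = -1000 and y = 1 (a confident wrong answer), naively computing
# -log(sigmoid(z)) would underflow to -log(0); the softplus form instead
# returns about 1000 and its gradient w.r.t. z is about -1, so learning
# can still correct the mistake.
print(bernoulli_nll_from_logit(-1000.0, 1.0))   # ~1000.0
print(bernoulli_nll_from_logit(1000.0, 1.0))    # ~0.0 (correct and confident)
```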
Any time we wish to represent a probability distribution over a discrete variable with $n$ possible values, we may use the softmax function. Softmax functions are most often used as the output of a classifier, to represent the probability distribution over $n$ different classes.
In the case of binary variables, we wished to produce a single number $\hat{y} = P(y = 1 \mid \mathbf{x})$. Because this number needed to lie between 0 and 1, and because we wanted the logarithm of the number to be well-behaved for gradient-based optimization of the log-likelihood, we chose to instead predict a number $z = \log \tilde{P}(y = 1 \mid \mathbf{x})$. Exponentiating and normalizing then gave us a Bernoulli distribution controlled by the sigmoid function.
To generalize to the case of a discrete variable with $n$ values, we now need to produce a vector $\hat{\mathbf{y}}$, with $\hat{y}_i = P(y = i \mid \mathbf{x})$. We require not only that each element $\hat{y}_i$ be between 0 and 1, but also that the entire vector sums to 1 so that it represents a valid probability distribution.
A linear layer first predicts unnormalized log probabilities: $$\mathbf{z} = \mathbf{W}^\top \mathbf{h} + \mathbf{b},$$ where $z_i = \log \tilde{P}(y = i \mid \mathbf{x})$. The softmax function can then exponentiate and normalize $\mathbf{z}$ to obtain the desired $\hat{\mathbf{y}}$. Formally, the softmax function is given by $$\mathrm{softmax}(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}.$$
As with the logistic sigmoid, the use of the exp function works very well when training the softmax to output a target value $y$ using maximum log-likelihood.
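The exponentiation does call for some care in practice. The sketch below (an illustrative implementation, not code from the text) relies on the identity $\mathrm{softmax}(\mathbf{z}) = \mathrm{softmax}(\mathbf{z} - \max_i z_i)$ and the log-sum-exp trick to evaluate the softmax and its negative log-likelihood without overflow, even for large logits.

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) leaves the result unchanged but prevents overflow.
    shifted = z - np.max(z)
    expz = np.exp(shifted)
    return expz / expz.sum()

def softmax_nll(z, target):
    """-log softmax(z)[target], computed via log-sum-exp for stability."""
    shifted = z - np.max(z)
    return np.log(np.sum(np.exp(shifted))) - shifted[target]

z = np.array([1.0, 2.0, 1000.0])    # logits this large would overflow a naive exp
p = softmax(z)
print(p.sum())                      # 1.0: a valid probability distribution
print(softmax_nll(z, target=2))     # ~0.0: the dominant class is the target
```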