

Deep Feedforward Networks


Example: Learning XOR

The XOR function (“exclusive or”) is an operation on two binary values, x1 and x2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function y = f*(x) that we want to learn. Our model provides a function y = f(x; θ), and our learning algorithm will adapt the parameters θ to make f as similar as possible to f*.

We want our network to perform correctly on the four points X = {[0, 0], [0, 1], [1, 0], [1, 1]}. We will train the network on all four of these points. The only challenge is to fit the training set.
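For concreteness, the four input points and their XOR targets can be written down directly; the following is a minimal sketch in NumPy (the array names are ours, chosen for illustration):

```python
import numpy as np

# The four binary input points, one example per row, and their XOR targets.
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])
y = np.array([0, 1, 1, 0])   # f*(x) = 1 exactly when one of x1, x2 is 1
```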

We can treat this problem as a regression problem and use a mean squared error (MSE) loss function, chosen here to keep the math as simple as possible. In practical applications, MSE is usually not an appropriate cost function for modeling binary data.

Evaluated on our whole training set, the MSE loss function is

$$J(\theta) = \frac{1}{4}\sum_{x \in X}\left(f^*(x) - f(x;\theta)\right)^2$$

Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be

$$f(x; w, b) = x^\top w + b$$

We can minimize J(θ) in closed form with respect to w and b using the normal equations.

After solving the normal equations, we obtain w = 0 and b = 1/2. The linear model simply outputs 0.5 everywhere. Why does this happen? A linear model is not able to represent the XOR function. One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution.
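To check this result numerically, a small least-squares fit reproduces w = 0 and b = 1/2. This is a sketch in NumPy (using `np.linalg.lstsq`; variable names are illustrative):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the bias b is learned alongside w.
X_aug = np.hstack([X, np.ones((4, 1))])

# Minimize the MSE in closed form; equivalent to solving the normal equations.
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = theta[:2], theta[2]
print(w, b)                              # -> w ≈ [0, 0], b ≈ 0.5

print(np.mean((y - X_aug @ theta)**2))   # MSE ≈ 0.25: the model outputs 0.5 everywhere
```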

Specifically, we will introduce a very simple feedforward network with one hidden layer containing two hidden units.

This feedforward network has a vector of hidden units h that are computed by a function f(1)(x; W, c). The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network. The output layer is still just a linear regression model, but now it is applied to h rather than to x. The network now contains two functions chained together, h = f(1)(x; W, c) and y = f(2)(h; w, b), with the complete model being f(x; W, c, w, b) = f(2)(f(1)(x)).

What function should f(1) compute? Linear models have served us well so far, and it may be tempting to make f(1) linear as well. Unfortunately, if f(1) were linear, then the feedforward network as a whole would remain a linear function of its input, so we must use a nonlinear function to describe the features. Most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function. We use that strategy here, by defining h = g(Wᵀx + c), where W provides the weights of a linear transformation and c the biases.
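To see why a linear f(1) would not help, suppose for illustration that f(1)(x) = Vᵀx + c for some weight matrix V. Composing it with the linear output layer gives

$$f(x) = w^\top\left(V^\top x + c\right) + b = (Vw)^\top x + \left(w^\top c + b\right),$$

which is still an affine function of x, so the composite model can represent nothing that the original linear model could not.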

We describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed. The activation function g is typically chosen to be a function that is applied element-wise, with hᵢ = g(xᵀW:,i + cᵢ), where W:,i is the i-th column of W. In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU, defined by the activation function g(z) = max{0, z}.

We can now specify our complete network as

$$f(x; W, c, w, b) = w^\top \max\{0,\, W^\top x + c\} + b$$
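As a minimal sketch (NumPy; the function name `xor_net` and the variable names are ours, not from the text), the complete network is a single batched forward pass:

```python
import numpy as np

def xor_net(X, W, c, w, b):
    """Forward pass of the one-hidden-layer ReLU network
    f(x; W, c, w, b) = w^T max{0, W^T x + c} + b,
    applied to every row of the design matrix X."""
    H = np.maximum(0, X @ W + c)   # hidden layer: affine transformation followed by ReLU
    return H @ w + b               # output layer: linear regression applied to h
```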

We can now specify a solution to the XOR problem. Let

$$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \qquad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \qquad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix},$$

and b = 0.

We can now walk through the way that the model processes a batch of inputs. Let X be the design matrix containing all four points in the binary input space, with one example per row:

$$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}$$

The first step in the neural network is to multiply the input matrix by the first layer’s weight matrix:

$$XW = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$$

Next, we add the bias vector c, to obtain

$$\begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$$

In this space, all of the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of h for each example, we apply the rectified linear transformation:

$$\begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$$

This transformation has changed the relationship between the examples. They no longer lie on a single line. They now lie in a space where a linear model can solve the problem. We finish by multiplying by the weight vector w:

$$\begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$$

The neural network has obtained the correct answer for every example in the batch.
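The full walkthrough can be reproduced numerically. This sketch (NumPy; variable names are illustrative) plugs in the parameters given above and computes each intermediate, matching the matrices shown step by step:

```python
import numpy as np

# Parameters of the specified solution.
W = np.array([[1, 1],
              [1, 1]])
c = np.array([0, -1])
w = np.array([1, -2])
b = 0

# Design matrix with one example per row.
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

XW = X @ W                   # [[0,0],[1,1],[1,1],[2,2]]
pre_act = XW + c             # [[0,-1],[1,0],[1,0],[2,1]]  (c is broadcast across rows)
H = np.maximum(0, pre_act)   # [[0,0],[1,0],[1,0],[2,1]]   -- rectified linear transformation
y_hat = H @ w + b            # [0, 1, 1, 0]                -- correct XOR output for every example
print(y_hat)
```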

In this example, we simply specified the solution, then showed that it obtained zero error. In a real situation, there might be billions of model parameters and billions of training examples, so one cannot simply guess the solution as we did here. Instead, a gradient-based optimization algorithm can find parameters that produce very little error.


Next Topic: Gradient-Based Learning