Gradient descent and backpropagation
Before we dive into what gradient descent and backpropagation do in the context of neural networks, let's first understand what is meant by an optimization problem.
An optimization problem, briefly, corresponds to the following:
- Minimizing a certain cost
- Maximizing a certain profit
Let's now try to map this to a neural network. What happens if, after getting the output from a feed-forward neural network, we find that its performance is not up to the mark (which is the case almost all the time)? How are we going to enhance the performance of the NN? The answer is gradient descent and backpropagation.
We are going to optimize the learning process of the neural network with these two techniques. But what are we going to optimize? What are we going to minimize or maximize? We require a specific type of cost that we will attempt to minimize.
We will define the cost in terms of a function. Before we define a cost function for the NN model, we will have to decide the parameters of the cost function. In our case, the weights and the biases are the parameters that the NN is trying to learn to give us accurate results (see the information box just before this section). In addition, we will have to calculate the amount of loss that the network incurs at each step of its training process.
For a binary classification problem, a loss function called the cross-entropy loss (in the binary case, the binary cross-entropy loss) is widely used, and we are going to use it. So, what does this function look like?

$$L(\hat{y}, y) = -\left(y \log \hat{y} + (1 - y) \log (1 - \hat{y})\right)$$

Here, y denotes the ground truth or true label (remember the response variable, y, in the training set) of a given instance, and ŷ denotes the output as yielded by the NN model. This function is convex in nature, which is just perfect for convex optimizers such as gradient descent.
This is one of the reasons why we didn't pick a simpler but nonconvex loss function. (Don't worry if you are not familiar with terms such as convex and nonconvex.)
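To make the formula concrete, here is a minimal NumPy sketch of the per-instance loss. The function name and the small eps guard against taking log(0) are our own choices for illustration, not something from the book:

```python
import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Per-instance binary cross-entropy loss L(y_hat, y).

    y_hat: the model's output, a probability in (0, 1)
    y:     the ground-truth label, 0 or 1
    eps:   a small constant to keep log() away from zero
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_cross_entropy(0.9, 1))  # ~0.105: confident and correct, small loss
print(binary_cross_entropy(0.9, 0))  # ~2.303: confident but wrong, large loss
```

Notice how the loss punishes a confident wrong prediction far more heavily than it rewards a confident right one; this asymmetry is exactly what pushes the network toward better probabilities.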
We have our loss function now. Keep in mind that this is just for one instance of the entire dataset; it is not the function on which we are going to apply gradient descent. The preceding function is going to help us define the cost function that we will eventually optimize using gradient descent. Let's see what that cost function looks like:

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$

Here, w and b are the weights and biases that the network is trying to learn. The letter m represents the number of training instances, which is 10 in this case. The rest seems familiar. Let's substitute the original form of the loss function, L(), and see what J() looks like:

$$J(w, b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)})\right]$$
The function may look a bit confusing, so take it slowly and make sure you understand it well.
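As a quick sanity check, here is a sketch of J() in NumPy. The ten labels and predictions are made up purely for illustration; only the averaging mirrors the formula above:

```python
import numpy as np

def cost(y_hat, y, eps=1e-12):
    """Cost J: the average binary cross-entropy over all m instances."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.mean(losses)  # the (1/m) * sum(...) part of J(w, b)

# Hypothetical labels and model outputs for m = 10 training instances
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([0.8, 0.2, 0.7, 0.9, 0.1, 0.6, 0.3, 0.2, 0.9, 0.7])
print(cost(y_pred, y_true))  # one number summarizing the network's error
```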
We can finally move toward the optimization process. Broadly, gradient descent is trying to do the following:
- Give us a point where the value of the cost function is as small as possible (this point is called the minimum).
- Give us the right values of the weights and biases so that the cost function reaches that point.
To visualize this, let's take a simple convex function:
Now, say we start the journey at a random point, such as the following:
So, the point at the top-right corner is the point at which we started, and the point indicated by the dotted arrow is the point we wish to arrive at. So, how do we do this in terms of simple computations?
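Since the figure may be hard to picture, a small numerical stand-in can help. Take the simple convex function f(w) = w² (our choice for illustration, not the book's), start at an arbitrary point, and repeatedly step downhill:

```python
# Gradient descent on the convex function f(w) = w**2, whose
# derivative is f'(w) = 2 * w and whose minimum sits at w = 0.
w = 4.0      # an arbitrary starting point (think: top-right of the bowl)
lr = 0.1     # the learning rate
for step in range(10):
    w = w - lr * (2 * w)   # step against the gradient
    print(step, round(w, 4))
# w shrinks toward 0, the minimum, on every iteration
```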
In order to arrive at this point, the following update rule is used:

$$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$$

Here, we are taking the partial derivative of J(w, b) with respect to the weights. We take a partial derivative because J(w, b) also contains b as a parameter. α is the learning rate, which controls how large a step we take in each iteration. This update rule is applied multiple times to find the right values of the weights. But what about the bias values? The rule remains exactly the same; only the equation changes:

$$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$$
These new assignments of the weights and biases are essentially what is referred to as backpropagation, and it is done in conjunction with gradient descent. After computing the new values of the weights and the biases, the whole forward propagation process is repeated until the NN model generalizes well. Note that these rules are just for one single instance, provided that the instance has only one feature. Doing this for several instances that contain several features can be difficult, so we are going to skip that part; however, those who are interested in seeing the fully fledged version may refer to a lecture online by Andrew Ng.
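For that single-instance, single-feature case, the whole loop fits in a few lines. The sketch below assumes a sigmoid output unit, for which the partial derivatives of the binary cross-entropy conveniently reduce to (y_hat - y) * x for the weight and (y_hat - y) for the bias; the function names and values are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update(w, b, x, y, alpha=0.1):
    """Apply both update rules once for a single one-feature instance."""
    y_hat = sigmoid(w * x + b)   # forward pass
    dw = (y_hat - y) * x         # dJ/dw for sigmoid + cross-entropy
    db = y_hat - y               # dJ/db
    w = w - alpha * dw           # w := w - alpha * dJ/dw
    b = b - alpha * db           # b := b - alpha * dJ/db
    return w, b

w, b = 0.0, 0.0
for _ in range(100):             # apply the rules multiple times
    w, b = update(w, b, x=2.0, y=1.0)
print(sigmoid(w * 2.0 + b))      # the output creeps toward the label, 1
```

Each pass through update() is one forward propagation followed by one backward adjustment, which is exactly the forward-then-backprop cycle described above.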
We have covered the necessary fundamental units of a standard neural network, and it was not easy at all. We started by defining neurons, and we ended with backprop (the nerdy term for backpropagation). We have already laid the foundations of a deep neural network. Readers might be wondering whether what we just studied was a deep neural net. As Andriy Burkov says in his book The Hundred-Page Machine Learning Book:
In the next sections, we will learn about the difference between deep learning and shallow learning. We will also take a look at two different types of neural networks—namely, convolutional neural networks and recurrent neural networks.