Convolutional neural networks

Have you ever uploaded a group photo of your friends to Facebook? If so, have you ever wondered how Facebook automatically detects all the faces in the photo just after the upload finishes? The short answer is convolutional neural networks (CNNs).

A feed-forward network generally consists of several fully connected layers, whereas a CNN consists of several convolutional layers, along with other types of sophisticated layers, including fully connected layers. The fully connected layers are generally placed towards the very end and are typically used for making predictions. But what kinds of predictions? In an image-processing and computer-vision context, a prediction task can encompass many use cases, such as identifying the type of object present in an image that is given to the network. But are CNNs only good for image-related tasks? Although CNNs were designed and proposed for image-processing tasks (such as object detection and object classification), they have found uses in many text-processing tasks as well. We are going to study CNNs in an image-processing context, because that is the domain in which they are most popular for the wonders they can work. But before we move on to this topic, it is useful to understand how an image can be represented in terms of numbers.

An image consists of numerous pixels arranged along three dimensions: height x width x depth. For a color image, the depth is generally 3 (one channel each for red, green, and blue), and for a grayscale image, the depth is 1. Let's dig a bit deeper into this. Consider the following image:

The dimension of the preceding image is 626 x 675 x 3, and numerically, it is nothing but a three-dimensional array of numbers. Each pixel represents a particular intensity of red, green, and blue (according to the RGB color system). The image contains a total of 422,550 pixels (626 x 675).

Each pixel is denoted by a list of three values, one each for red, green, and blue. Let's now see what a pixel (the one at the twentieth row and the hundredth column of the matrix of 422,550 pixels) looks like in code:

12, 24, 10
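
In Python, these values could be read out as follows; this is a minimal sketch assuming Pillow and NumPy are installed, and photo.jpg is a hypothetical filename standing in for any color image:

    import numpy as np
    from PIL import Image

    # Load an RGB image as a (height, width, 3) array of 0-255 intensities.
    # 'photo.jpg' is a hypothetical filename; substitute any color image.
    img = np.array(Image.open("photo.jpg").convert("RGB"))
    print(img.shape)    # for the image above: (626, 675, 3)

    # The pixel at the twentieth row and the hundredth column (zero-indexed).
    print(img[19, 99])  # three intensities, for example: [12 24 10]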

Each value corresponds to a particular intensity of red, green, or blue. For the purpose of understanding CNNs, we will work with a much smaller image, in grayscale. Keep in mind that each pixel value in a grayscale image lies between 0 and 255, where 0 corresponds to black and 255 corresponds to white.

The following is a dummy matrix of pixels representing a grayscale image (we will refer to this as an image matrix):
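
For concreteness, here is one such image matrix written out in NumPy; the values are made up purely for illustration, and we will reuse them in the sketches that follow:

    import numpy as np

    # A 4 x 4 dummy grayscale image matrix; every entry lies between
    # 0 (black) and 255 (white). The values are illustrative only.
    image = np.array([
        [12, 24, 10,  5],
        [45, 65, 22,  8],
        [90, 34, 11, 77],
        [60, 50, 40, 30],
    ])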

Before we proceed, let's think intuitively about how we can train a CNN to learn the underlying representations of an image and perform some task with them. Images have a special property inherent to them: pixels that contain a similar type of information generally remain close to each other. Consider the image of a standard human face: the pixels denoting the hair are darker and are closely located in the image, whereas the pixels denoting the other parts of the face are generally lighter and also stay very close to each other. The intensities may vary from face to face, but you get the idea. We can exploit this spatial relationship between the pixels in an image and train a CNN to detect similar pixels and the edges they form between the several regions present in an image (in an image of a face, there are arbitrary edges between the hair, the eyebrows, and so on). Let's see how this can be done.

A CNN typically has the following components (the short sketch after this list shows how they fit together):

  • Convolutional layer
  • Activation layer
  • Pooling layer
  • Fully connected layer
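
To make these components concrete, here is a minimal sketch using Keras; the input shape of 28 x 28 x 1 and the 10 output classes are illustrative assumptions, not values from this chapter:

    from tensorflow.keras import layers, models

    # A minimal CNN mirroring the four components listed above.
    # The 28 x 28 grayscale input and 10 classes are assumptions.
    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3)),               # convolutional layer
        layers.Activation("relu"),               # activation layer
        layers.MaxPooling2D((2, 2)),             # pooling layer
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),  # fully connected layer
    ])
    model.summary()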

At the heart of a CNN sits an operation called convolution (what deep learning calls convolution is, strictly speaking, known as cross-correlation in the computer-vision and image-processing literature). Adrian Rosebrock of PyImageSearch describes the operation as follows:

In terms of deep learning, an (image) convolution is an element-wise multiplication of two matrices followed by a sum.

This quote tells us how an (image) convolution operator works. The matrices mentioned in the quote are the image matrix itself and another matrix known as the kernel. The image matrix is generally larger than the kernel matrix, and the convolution operation slides the kernel over the image from left to right and from top to bottom. Here is an example of a convolution operation involving the preceding dummy matrix and a kernel of size 2 x 2:
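
The same operation can be written out as a minimal NumPy sketch, reusing the dummy image matrix from earlier; the 2 x 2 kernel values here are made up for illustration:

    import numpy as np

    image = np.array([
        [12, 24, 10,  5],
        [45, 65, 22,  8],
        [90, 34, 11, 77],
        [60, 50, 40, 30],
    ])

    # An illustrative 2 x 2 kernel; in a CNN, these values are learned.
    kernel = np.array([
        [1,  0],
        [0, -1],
    ])

    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1  # 3
    out_w = image.shape[1] - kw + 1  # 3

    # Slide the kernel left to right, top to bottom. Each output value
    # is an element-wise multiplication of the kernel and the patch it
    # sits on, followed by a sum.
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(patch * kernel)

    print(output)

Swapping in a different kernel changes the effect; a 3 x 3 kernel filled with 1/9, for instance, averages each neighborhood and blurs the image.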

The kernel matrix actually serves as the weight matrix for the network; to keep things simple, we ignore the bias term for now. It is also worth noting that our favorite image filters (sharpening, blurring, and so on) are nothing but the outputs of certain kinds of convolution applied to the original images. A CNN learns these filter (kernel) values itself so that it can best capture the spatial representation of an image, and it optimizes them using gradient descent and backpropagation. The following figure depicts four convolution operations applied to the image:

Note how the kernel slides and how the convolved pixels are calculated. If we proceed like this, however, the original dimensionality of the image is lost, which can cause information loss. To prevent this, we apply a technique called padding, which retains the dimensionality of the original image. There are many padding techniques, such as replicate padding, zero padding, wrap-around padding, and so on; zero padding is very popular in deep learning. We will now see how zero padding can be applied to the original image matrix so that the original dimensionality of the image is retained:

Zero padding means that the pixel value matrix is padded with zeros on all sides, as shown in the preceding image.
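
Here is a minimal sketch of zero padding with NumPy's pad function, again using the dummy image matrix; note that with a 3 x 3 kernel, a one-pixel border of zeros keeps the output at the original 4 x 4 size:

    import numpy as np

    image = np.array([
        [12, 24, 10,  5],
        [45, 65, 22,  8],
        [90, 34, 11, 77],
        [60, 50, 40, 30],
    ])

    # Add a one-pixel border of zeros on all sides.
    padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
    print(padded.shape)  # (6, 6)

    # Convolving the padded 6 x 6 matrix with a 3 x 3 kernel yields a
    # 4 x 4 output (6 - 3 + 1 = 4), preserving the original size.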

It is important to instruct the network on how the kernel should slide over the image matrix. This is controlled by a parameter called stride, which specifies the number of pixels the kernel moves at each step. The choice of stride depends on the dataset; strides of 1 and 2 are the most common in practice. Let's see how a stride of 1 differs from a stride of 2:
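
The effect of the stride on the output size follows a standard formula, sketched below, where n is the input size along one dimension, k the kernel size, p the padding, and s the stride:

    def conv_output_size(n, k, p, s):
        # Output size along one dimension for input size n, kernel size k,
        # padding p, and stride s (assuming the kernel placements fit evenly).
        return (n + 2 * p - k) // s + 1

    # Our 4 x 4 dummy image with a 2 x 2 kernel and no padding:
    print(conv_output_size(4, 2, 0, 1))  # stride 1 -> 3 (a 3 x 3 output)
    print(conv_output_size(4, 2, 0, 2))  # stride 2 -> 2 (a 2 x 2 output)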

A convolved image typically looks like the following:

The convolved image largely depends on the kernel that is used. The final output matrix is then passed to an activation function, which is applied to the matrix's elements one by one (a sketch of this step closes the section). Another important operation in a CNN is pooling, but we will skip it for now. By now, you should have a good understanding of how a CNN works at a high level, which is sufficient for continuing to follow the book. If you want a deeper understanding of how a CNN works, refer to the blog post at https://www.pyimagesearch.com/2018/04/16/keras-and-convolutional-neural-networks-cnns/.
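
Here is that activation step in NumPy, applying ReLU (one common choice of activation function) to the output of the earlier convolution sketch:

    import numpy as np

    # Output of the earlier convolution sketch; some entries are negative.
    convolved = np.array([
        [-53.,   2.,   2.],
        [ 11.,  54., -55.],
        [ 40.,  -6., -19.],
    ])

    # ReLU sets every negative entry to zero, element by element.
    activated = np.maximum(0, convolved)
    print(activated)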