Understanding convolution
Convolution is a way of extracting features from an image that may allow us to more easily classify it based on known features. Before we get into convolution, let's first take a step back and understand why networks, and our vision for that matter, need to isolate features in an image. Take a look at the following; it's a sample image of a dog, called Sadie, with various image filters applied:
Example of an image with different filters applied
The preceding image shows four versions of the picture: one with no filter, and one each with the edge detection, pixelate, and glowing edges filters applied. In all cases, you as a human can clearly recognize it is a picture of a dog, regardless of the filter applied. Note, however, that in the edge detection case we have eliminated the extra image data that is unnecessary to recognize a dog. By using a filter, we can extract just the features our NN needs to recognize a dog. This is all a convolution filter does, and in some cases, one of those filters could be just a simple edge detection.
A convolution filter is a matrix, or kernel, of numbers that defines a single math operation. The process starts with the kernel placed over the upper-left corner of the image: each kernel value is multiplied by the pixel value beneath it, and the products are summed and set as a single output value. The kernel is then slid across the image in steps whose size is called the stride, repeating the operation at each position, and this operation is demonstrated:
In the preceding diagram, a stride of 1 is being used. The filter being applied in the convolution operation is essentially an edge detection filter. If you look at the result of the final operation, you can see the middle section is now filled with 0s, greatly simplifying any classification task. The less information our networks need to learn, the quicker they will learn, and with less data. Now, the interesting part of this is that the convolution layer learns the filter, that is, the numbers or weights it needs to apply in order to extract the relevant features. This is not so obvious and may be confusing, so let's go over it again. Go back to our previous example and look at how we define the first convolution layer:
x = Conv2D(16, (3, 3), activation='relu', padding='same')(input_img)
In that line of code, we define the first convolution layer as having 16 output filters, meaning the output from this layer is 16 feature maps, one per filter. We then set the kernel size to (3,3), which represents a 3 x 3 matrix, just as in our example. Note how we don't specify the values of the kernel weights; learning those values is, after all, what the network is training to do.
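The sliding multiply-and-sum operation described earlier can be sketched in plain Python. This is a minimal illustration rather than how Keras implements Conv2D: the function name convolve2d, the 5 x 5 image, and the hand-picked edge detection kernel are all assumptions made for this example, and in a real network the kernel values would be learned rather than typed in.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of lists) with no padding.

    Like deep learning libraries, this does not flip the kernel first.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Multiply each kernel value by the pixel beneath it, then sum.
            total = 0
            for a in range(k_h):
                for b in range(k_w):
                    total += image[top + a][left + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# Hypothetical image: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-picked edge detection kernel (an assumption, not a learned filter).
edge_kernel = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]

feature_map = convolve2d(image, edge_kernel, stride=1)
for row in feature_map:
    print(row)  # each row is [-3, 3, 0]

strided_map = convolve2d(image, edge_kernel, stride=2)
```

The output is non-zero only where the window straddles the dark-to-bright edge, and 0 where the window covers a flat region. With a stride of 2, the kernel visits fewer positions, so the output shrinks from 3 x 3 to 2 x 2.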
Let's see how this looks when everything is put together in the following diagram:
Full convolution operation
The output from the first step in convolution is the feature map. One feature map represents a single convolution filter being applied and is generated by applying the learned filter/kernel. In our example, the first layer learns 16 kernels, which in turn produce 16 feature maps; remember that the value of 16 is the number of filters.
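To see what 16 filters means in terms of numbers, the following sketch works out the output shape and the count of trainable weights for a layer like the one above. It assumes a 28 x 28 single-channel input (a guess based on the handwritten digits used in our example), a stride of 1, and one bias per filter; the helper name conv2d_same_summary is hypothetical and not part of Keras.

```python
def conv2d_same_summary(height, width, in_channels, filters, kernel_size):
    """Output shape and trainable-parameter count for a stride-1
    convolution layer that uses 'same' padding and one bias per filter."""
    k_h, k_w = kernel_size
    # 'same' padding with a stride of 1 keeps the width and height;
    # the channel dimension becomes the number of filters.
    output_shape = (height, width, filters)
    # Each filter holds one k_h x k_w kernel per input channel.
    weights = k_h * k_w * in_channels * filters
    biases = filters
    return output_shape, weights + biases


# Assumed input: one 28 x 28 grayscale image.
shape, params = conv2d_same_summary(28, 28, 1, filters=16, kernel_size=(3, 3))
print(shape)   # (28, 28, 16): 16 feature maps
print(params)  # 160 = 3 * 3 * 1 * 16 weights + 16 biases
```

Under these assumptions, the layer learns only 160 numbers, yet it produces 16 full-size feature maps.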
After convolution, we then apply pooling or subsampling in order to collect or gather features into sets. This subsampling further creates new concentrated feature maps that highlight the image's important features we are training for. Take a look back at how we defined the first pooling layer in our previous example:
x = MaxPooling2D((2, 2), padding='same')(x)
In the code, we are subsampling using a pool_size of (2,2). The size indicates the factor by which to down-sample the image in width and height. So a 2 x 2 pool size replaces each 2 x 2 block of a feature map with its maximum value, producing a feature map at half the size in width and height. Pooling works on each feature map independently, so it does not create additional feature maps: after our first layer of convolution and pooling, we still have 16 feature maps, each holding a quarter of the original values. Consider how the feature maps change as they pass through the layers of our simple example:
The feature maps end up as 4 x 4 images, with the number of maps at each stage set by the filter count of the preceding convolution layer. This means we now train our network on a stack of small feature maps, each responding to a different learned feature, for each image we attempt to encode or classify. The extra convolution operations are the cause of the increased training time, but also consider the amount of extra information we are now using to classify our images. Now our network is learning how to identify parts or features of our image, just as we humans identify objects.
For instance, if you were just shown the nose of a dog, you could likely recognize that as a dog. In the same way, our sample network is now identifying parts of the handwritten digits, which, as we now know, dramatically improves performance.
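To make the pooling step concrete, here is a minimal sketch of 2 x 2 max pooling in plain Python. The function name max_pool2d and the sample values are assumptions made for illustration; this is not the Keras implementation of MaxPooling2D.

```python
def max_pool2d(feature_map, pool_size=(2, 2)):
    """Replace each pool-sized block of `feature_map` with its maximum."""
    p_h, p_w = pool_size
    pooled = []
    for top in range(0, len(feature_map), p_h):
        row = []
        for left in range(0, len(feature_map[0]), p_w):
            # Gather the values inside the current window.
            window = [value
                      for r in feature_map[top:top + p_h]
                      for value in r[left:left + p_w]]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4 x 4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 2], [2, 8]]

# Pooling a stack of 16 maps gives back 16 maps, each half the size.
stack = [feature_map for _ in range(16)]
pooled_stack = [max_pool2d(m) for m in stack]
print(len(pooled_stack), len(pooled_stack[0]), len(pooled_stack[0][0]))  # 16 2 2
```

Each map shrinks from 4 x 4 to 2 x 2, while the number of maps in the stack stays the same.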
As we have seen, convolution works well for identifying images, but pooling can disrupt the spatial relationships between features. Therefore, when it comes to games or learning that requires some form of spatial understanding, we prefer to limit pooling or eliminate it altogether. Since it is important to understand when to use and when not to use pooling, we will cover that in more detail in the next section.
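The loss of spatial information can be seen in a small sketch. It assumes 2 x 2 max pooling and two made-up 4 x 4 feature maps that differ only in where a single strong activation sits; the helper max_pool2d is a hypothetical stand-in for a pooling layer, not the Keras one.

```python
def max_pool2d(feature_map, pool_size=(2, 2)):
    """Replace each pool-sized block of `feature_map` with its maximum."""
    p_h, p_w = pool_size
    return [
        [max(value
             for r in feature_map[top:top + p_h]
             for value in r[left:left + p_w])
         for left in range(0, len(feature_map[0]), p_w)]
        for top in range(0, len(feature_map), p_h)
    ]


# The same activation (9) at two different positions in the top-left window.
map_a = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
map_b = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

pooled_a = max_pool2d(map_a)
pooled_b = max_pool2d(map_b)
print(pooled_a)              # [[9, 0], [0, 0]]
print(pooled_a == pooled_b)  # True: the exact position is no longer recoverable
```

After pooling, the two inputs are indistinguishable, which is harmless when we only need to know that a feature is present but costly when its exact position matters.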