What is a Convolutional Neural Network?
In the field of computer vision, one of the most promising innovations of recent years has been the convolutional neural network. CNNs rose to prominence as their misclassification rates dropped dramatically, allowing Alex Krizhevsky's network to win the 2012 ImageNet competition (ILSVRC). Since then, even the biggest tech companies have adopted them: Facebook uses them for auto-tagging, Google for photo search, Amazon for product recommendations, and Pinterest and Instagram for home-feed personalization.
What is meant by a convolutional neural network?
A convolutional neural network (CNN) is an artificial neural network architecture that has been highly successful in machine vision and is widely used in applications that process media such as images, audio and video. Among its most popular applications are facial recognition and document analysis. With a convolutional neural network, a computer can classify what an image shows and identify its content with good probability. Note that these networks are built to analyze images belonging to a specific dataset (animal datasets contain images of animals, facial recognition datasets contain images of faces, vehicle datasets contain images of vehicles, and so on) and to classify the objects in those images. For example, you cannot get a meaningful answer from a convolutional neural network trained only on vehicle images if you ask it to analyze an image of a human face, as it does not understand the shapes and objects that the new image represents. After this brief introduction, let's look at an example of the architecture of this type of neural network.
Convolutional neural network architecture
A convolutional neural network architecture can be formed by a sequence of levels: Input, Conv, ReLU, Pool and Fully Connected. Each of these identifies a level of the convolutional neural network. In more detail:
- Input level: the set of numbers that represents, for the computer, the image to be analyzed, i.e. a grid of pixels. For example, 32 x 32 x 3 indicates the width (32), height (32) and depth (3, the Red, Green and Blue channels of the RGB format) of the image.
- Convolutional level (Conv): the main level of the network. Its goal is to identify patterns, such as curves, angles, circles or squares, in an image with high accuracy. A network usually has more than one convolutional level, and each of them focuses on finding certain characteristics in the image; the greater their number, the greater the complexity of the characteristics they are able to identify.
- ReLU level (Rectified Linear Units): it sets to zero the negative values obtained in the previous levels and is usually placed after the convolutional levels.
- Pool level: it identifies whether the feature under study is present in the previous level. It simplifies and downsamples the image, keeping the characteristics extracted by the convolutional level.
- FC level (or fully connected): connects all the neurons of the previous level in order to assign the image, with a certain probability, to one of the identification classes built up in the previous levels. Each class represents a possible final answer that the computer will give.
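The level sequence above can be sketched with plain NumPy. This is a minimal toy forward pass, not a real framework implementation: the helper names (`conv2d`, `relu`, `max_pool`, `fully_connected`) and the random weights are illustrative assumptions; libraries such as PyTorch or Keras provide these operations as ready-made layers.

```python
import numpy as np

def conv2d(image, kernel):
    """Convolutional level: slide a small filter over the image and
    take the scalar product at every position (stride 1, no padding)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """ReLU level: negative values become 0."""
    return np.maximum(0, x)

def max_pool(x, size=2):
    """Pool level: keep the strongest response in each size x size block."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def fully_connected(x, weights, bias):
    """FC level: every value of the previous level is connected
    to every output class."""
    return x.flatten() @ weights + bias

# Toy forward pass on a random 28 x 28 grayscale "image"
rng = np.random.default_rng(0)
image = rng.random((28, 28))
kernel = rng.standard_normal((3, 3))        # one 3 x 3 filter, random weights
a = max_pool(relu(conv2d(image, kernel)))   # Conv -> ReLU -> Pool
scores = fully_connected(a, rng.standard_normal((a.size, 10)), np.zeros(10))
print(scores.shape)  # (10,) -> one score per class
```

The chain `Conv -> ReLU -> Pool -> FC` mirrors the list of levels above; real networks simply repeat the first three blocks several times before the final FC level.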
For example, with CIFAR-10 (an image dataset used for machine learning and computer vision algorithms), the computer decides among 10 classes what the image represents; based on the results obtained from the previous levels, it chooses the one with the greatest probability.
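This final choice is just an argmax over the class probabilities. A small sketch with made-up probabilities (the values below are illustrative, not the output of a trained network):

```python
import numpy as np

# The 10 CIFAR-10 classes
classes = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]

# Hypothetical probabilities produced by the FC level for one image
probabilities = np.array([0.02, 0.01, 0.05, 0.70, 0.03,
                          0.10, 0.02, 0.04, 0.01, 0.02])

# The computer answers with the class of greatest probability
prediction = classes[np.argmax(probabilities)]
print(prediction)  # cat
```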
Now let's see in a little more detail what these levels represent.
What is a convolutional level?
Suppose that the input of a CNN that accepts images from the MNIST handwritten digit dataset is an image representing a 7 (a very simple example, but it gives the idea). In machine language, this figure is represented by an array of 28 x 28 x 1 pixels (MNIST images are grayscale, so the depth is 1).
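As a quick sketch of this representation (toy values, not a real MNIST sample; a single grayscale channel is assumed):

```python
import numpy as np

# A 28 x 28 image with one channel: each entry is a pixel intensity (0-255)
image = np.zeros((28, 28, 1), dtype=np.uint8)
image[4, 6:22, 0] = 255    # top bar of a crude "7" (toy data)
image[5:24, 20, 0] = 255   # descending stroke, drawn vertically for brevity
print(image.shape)  # (28, 28, 1)
```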
To understand how a CNN works and what it represents, you need to know what filters are and what the stride represents. By filter we generally mean a small matrix of few rows and columns that represents a feature the convolutional level wants to identify, for example a curve or a straight line.
In the first levels, the filter is said to represent a low-level feature because it identifies simple objects such as curves or lines: one convolutional level's filters will identify curves, another's horizontal lines, yet another's circumferences, and so on, until the last levels combine them into figures that represent more complicated objects. In that case, the filter is said to represent a high-level feature because it identifies complex objects, such as a bird's beak, a hand or a face. Let's assume that our filter is a curve detector.
Having decided the characteristic that the filter will identify in the convolutional level, we choose the size of the filter and the number of filters to be used in the level. In this example, we use a 3 x 3 filter (i.e. 3 rows and 3 columns), which for the first convolutional level takes on random values (also called weights).
As for the number, let's assume for simplicity that there is only one filter, even though in reality each level uses several. We then start to analyze the so-called receptive field, which has the same size as the filter (therefore 3 x 3). It is initially the first 3 x 3 pixel block at the top left of the input level. The value that we will obtain at the top left of the next level (Conv1) of the network is the scalar product of the filter with this first block: we multiply, element by element, the 3 x 3 pixels of the receptive field by the 3 x 3 weights of the filter, and then sum the products. This gives us a single number: with a curve-detecting filter, this value will be high in the vicinity of curves and low elsewhere.
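The scalar product just described can be checked with a few lines of NumPy. The pixel values and the curve-like filter below are toy numbers chosen for illustration, not taken from a real image:

```python
import numpy as np

# A 3 x 3 receptive field (pixel values) and a 3 x 3 filter (weights)
receptive_field = np.array([[0,  0,  0],
                            [0, 50, 50],
                            [0, 50,  0]])
curve_filter = np.array([[0, 0, 0],
                         [0, 1, 1],
                         [0, 1, 0]])

# Element-wise products, then the sum of all of them: one single number
value = np.sum(receptive_field * curve_filter)
print(value)  # 150 -> high, because the pixels line up with the filter
```

If the receptive field contained no curve (all zeros where the filter has weight), the same computation would give a value near 0.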
Since the first block is not close to any curve, the result will be 0 (in fact, every pixel in that corner of the input volume has a null value, so the scalar product is zero). The operation above must be repeated for all the blocks that the input image contains: the receptive field is moved a certain step (or stride) to the right each time. For example, assuming a 7 x 7 input volume (rows and columns of the input level), a 3 x 3 filter and a stride of 1, the receptive field (always 3 x 3) moves by one unit to the right, as in the following image:
If the stride were 2, the representation would be as follows:
With a stride of 2, the receptive field shifts 2 units at a time, and the output volume shrinks. For our initial example, we assume a stride of 1 for each convolution. Each pass therefore shifts the receptive field one unit to the right, to analyze the second block (and so on for all the others, until the entire input volume is covered).
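The effect of the stride on the output size follows a simple formula, (n - f) / stride + 1, for an n x n input and an f x f filter with no padding. A short sketch:

```python
def output_size(n, f, stride):
    """Number of positions an f x f filter can occupy along one side
    of an n x n input, moving `stride` pixels at a time (no padding)."""
    return (n - f) // stride + 1

print(output_size(7, 3, 1))   # 5 -> a 5 x 5 output volume
print(output_size(7, 3, 2))   # 3 -> a 3 x 3 output volume
print(output_size(28, 3, 1))  # 26 -> the 26 x 26 map of the 28 x 28 example
```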
After sliding the receptive field over all positions, we get a matrix of numbers of 26 x 26 x 1. The size is 26 x 26 because there are 676 different positions that a 3 x 3 filter can occupy in a 28 x 28 image like the one proposed at the beginning.
The depth, on the other hand, is equal to 1 because we used only one filter in this example.
The set of values obtained by following the procedure just described is called an activation map (represented in this case by the 26 x 26 x 1 matrix). If we had more filters, the same procedure would have to be repeated once for each of the level's n filters, obtaining n activation maps.
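The n-activation-map idea can be sketched in NumPy as follows. The random filter weights stand in for a freshly initialized level, and the `conv2d` helper is a toy stride-1 implementation written for this sketch:

```python
import numpy as np

def conv2d(image, kernel):
    """Stride-1 scalar products of the filter over every receptive field."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28))              # 28 x 28 grayscale input
n = 8                                     # number of filters in the level
filters = rng.standard_normal((n, 3, 3))  # n filters with random weights

# One activation map per filter: n maps of 26 x 26 values
activation_maps = np.stack([conv2d(image, f) for f in filters])
print(activation_maps.shape)  # (8, 26, 26)
```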
Author: Vicki Lezama