Image classification using Convolutional Neural Networks (CNNs)


This article discusses image classification: how images are assigned to different categories using deep learning. Several deep learning techniques are used to classify different types of data, such as text and images. This article explains Convolutional Neural Networks, which are efficient at classifying images, and how they work.

Image Classification and Object Recognition

Image classification and object recognition are probably the most well-known problems in computer vision. As humans, our ability to make decisions depends heavily on our vision; providing computers with similar abilities would allow them to exercise the same power. Image classification consists of assigning an image to one of many different categories; for example, an image of an animal can be classified into its appropriate category. In recent years a lot of research has been done in the field of image classification, and classification accuracy has improved with the introduction of new deep learning algorithms. Object detection is a related computer vision and image processing technology that detects occurrences of semantic objects of a certain class, such as humans, buildings, or cars, in digital images and videos.

Such accuracy in image classification and object recognition is obtained with the help of deep learning models. Deep learning is an artificial intelligence technique that imitates the workings of the human mind in processing data and creating patterns useful for decision making. It is a subset of machine learning within artificial intelligence (AI), whose networks are capable of learning from unstructured or unlabelled data without supervision. Common techniques include deep learning based methods such as convolutional neural networks, and feature-based methods using edges, gradients, histograms of oriented gradients (HOG), Haar wavelets, and local binary patterns.

Deep learning

Computer vision is a hot buzzword in this era of technology. Computer vision is a field of artificial intelligence and computer science that aims to give computers a visual understanding of the world, and it is one of the main components of machine understanding. Its applications are vast, ranging from industrial machine vision systems that, say, inspect bottles speeding by on a production line, to research into artificial intelligence and robots that can understand the world around them.

Deep learning is a technique through which we can attain high-performing results in computer vision. It is a machine learning technique that teaches computers to do what comes naturally to humans: learn by example. Deep learning has contributed tremendously to groundbreaking research and discoveries such as driverless cars, which are able to identify different kinds of signs on the street and to distinguish people from non-living objects. It also has applications in electronic devices like phones, tablets, TVs, and hands-free speakers, where it is responsible for voice control. It is achieving results that were not possible before.

In deep learning, a computer model is designed in such a way that it learns to classify objects directly from text, sound, or images. Deep learning models are capable of achieving state-of-the-art accuracy, and can sometimes even exceed human performance. For this purpose, deep learning models have to be provided with large sets of labelled data and are trained with the help of neural networks that contain many layers. Deep learning requires large amounts of labelled data; for example, driverless car development requires millions of images and thousands of hours of video. It also requires substantial computing power: high-performance GPUs have a parallel architecture that is efficient for deep learning.

Working of Deep Learning

Most deep learning models make use of neural networks, and hence they are called deep neural networks.

The term “deep” refers to the number of hidden layers in the neural network: the more hidden layers, the deeper the network. Traditional neural networks contained only 2–3 hidden layers, while deep networks can have as many as 150.

Deep learning models are trained by using large sets of labelled data and neural network architectures that learn features directly from the data without the need for manual feature extraction.

Convolutional Neural Networks

Convolutional Neural Networks have a range of applications in image and video recognition, recommender systems, and natural language processing. Like ordinary neural networks, CNNs consist of neurons with learnable weights and biases. Each neuron in the CNN works as an input/output combination: it receives several inputs, sums them, passes the sum through an activation function, and responds with an output. The whole network has a loss function, and all the steps that apply to ordinary neural networks apply to CNNs as well. CNNs have been very fruitful in document recognition. The main difference between an ordinary neural network and a CNN is that a CNN consists of several convolution layers; along with these, it may also contain several pooling layers, and at the end there are fully connected layers leading to the output layer. Since the CNN is a supervised learning algorithm, back-propagation is used to learn the parameters of the different layers. Because a CNN may contain a huge number of layers, it had long been applied only to relatively small images in the literature; with the increasing computational power of GPUs, it is now possible to train a deep convolutional neural network on a large-scale image dataset. Indeed, in the past several years, CNNs have been successfully applied to scene parsing, feature learning, visual recognition, and image classification.

Working of Convolutional Neural Networks

Step – 1 The Convolution:

ConvNets derive their name from the “convolution” operator. The main aim of the convolution is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. Every image can be considered as a matrix of pixel values. In CNN terminology, a small matrix (for example 3×3) is called a ‘filter’, ‘kernel’, or ‘feature detector’, and the matrix formed by sliding the filter over the image and computing the dot product is called the ‘Convolved Feature’, ‘Activation Map’, or ‘Feature Map’. The filter slides over the image by a fixed number of pixels, called the ‘stride’ (here, 1 pixel); at every position, element-wise multiplication between the two matrices is computed, and the products are summed to give a single number, which forms one element of the output matrix.
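The sliding-window operation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the 5×5 image and 3×3 filter values below are made up for the example, and real CNNs learn the filter values during training.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and sum the element-wise products."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    feature_map = []
    for i in range(0, out_size * stride, stride):
        row = []
        for j in range(0, out_size * stride, stride):
            # Element-wise multiply the window by the kernel, then sum:
            # this sum is one element of the feature map.
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(k) for b in range(k)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map

image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
print(convolve2d(image, kernel))  # -> [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

A 3×3 filter with stride 1 over a 5×5 image yields a 3×3 feature map, since the filter fits in only three positions along each dimension.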

Step - 2 Pooling:

Spatial pooling, also called subsampling or downsampling, reduces the dimensionality of each feature map while retaining the most important information. Spatial pooling can be of different types: max, average, sum, etc. In the case of max pooling, we define a spatial neighbourhood, for example a 2×2 window, and take the largest element from the rectified feature map within that window. Instead of taking the largest element we could also take the average or the sum of all elements in that window; in practice, max pooling has been shown to work better.
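Max pooling can likewise be sketched in plain Python. The example below uses a 2×2 window with stride 2, halving each spatial dimension; the feature-map values are illustrative only.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep only the largest value in each size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[i + a][j + b]
                for a in range(size) for b in range(size)
            ]
            # Replacing max() with sum(), or with sum()/len(window),
            # gives sum pooling or average pooling instead.
            row.append(max(window))
        pooled.append(row)
    return pooled

fmap = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(max_pool(fmap))  # -> [[6, 8], [3, 4]]
```

Note that each 2×2 block of the input contributes exactly one value to the output, so a 4×4 feature map shrinks to 2×2 while the strongest activations survive.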

Step - 3 Fully Connected Layers:

The fully connected layer is a traditional multi-layer perceptron that uses a softmax activation function in the output layer. The term “fully connected” implies that every neuron in the previous layer is connected to every neuron in the next layer. The output from the convolutional and pooling layers represents high-level features of the input image. The purpose of the fully connected layer is to use these features to classify the input image into various classes based on the training dataset. Apart from classification, adding a fully connected layer is also a cheap way of learning non-linear combinations of these features: most of the features from the convolutional and pooling layers may be good for the classification task on their own, but combinations of those features might be even better. The sum of the output probabilities from the fully connected layer is 1. This is ensured by using softmax as the activation function in its output layer: the softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.
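The softmax squashing described above is easy to write down directly. The sketch below subtracts the maximum score before exponentiating, a standard trick for numerical stability that does not change the result; the input scores are illustrative.

```python
import math

def softmax(scores):
    """Map arbitrary real-valued scores to probabilities summing to 1."""
    m = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]     # all values now positive
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # three values between 0 and 1; largest score -> largest probability
print(sum(probs))  # 1.0 up to floating-point rounding
```

Because exponentiation preserves ordering, the class with the highest raw score always receives the highest probability, which is why the predicted class is simply the arg-max of the softmax output.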
