Before CNNs, the standard way to train a neural network to classify images was to flatten the image into a list of pixels and pass it through a feed-forward neural network to output the image’s class. The problem with flattening is that the essential spatial structure of the image is discarded.
In 1989, Yann LeCun and team introduced Convolutional Neural Networks — the backbone of Computer Vision research for the last 15 years! Unlike feedforward networks, CNNs preserve the 2D nature of images and are capable of processing information spatially!
In this article, we are going to go through the history of CNNs for Image Classification tasks: starting from the early research years in the 90’s, moving to the golden era of the mid-2010s when some of the most ingenious Deep Learning architectures were conceived, and finally discussing the latest trends in CNN research as CNNs compete with attention and Vision Transformers.
Check out the YouTube video that explains all the concepts in this article visually with animations. Unless otherwise specified, all the images and illustrations used in this article were created by me while making the video version.
At the heart of a CNN is the convolution operation. We scan a small filter across the image and calculate the dot product between the filter and the image patch at each overlapping location. The resulting output, called a feature map, captures how strongly and where the filter’s pattern is present in the image.
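To make this concrete, here is a minimal NumPy sketch of that scan-and-dot-product; the tiny image, the vertical-edge filter, and the no-padding, stride-1 scan are just illustrative choices of mine.

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel over the image (stride 1, no padding) and
    take the dot product at every overlapping location."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # dot product
    return feature_map

# Illustrative example: a vertical-edge filter responds strongly
# where the image changes from dark columns to bright columns.
image = np.zeros((6, 6))
image[:, 3:] = 1.0                         # right half is bright
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)  # simple vertical-edge detector
print(convolve2d(image, kernel))           # peaks around the vertical edge
```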
In a convolution layer, we train multiple filters that extract different feature maps from the input image. When we stack multiple convolutional layers in sequence with some non-linearity, we get a convolutional neural network (CNN).
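Here is a minimal sketch of such a network in PyTorch; the channel counts, kernel sizes, and the ReLU nonlinearity are arbitrary choices for illustration, not anything specific from the papers discussed here.

```python
import torch
import torch.nn as nn

# A minimal CNN sketch: each Conv2d layer trains several filters, and
# stacking conv layers with a nonlinearity in between gives a CNN.
tiny_cnn = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(in_channels=8, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
)

x = torch.randn(1, 1, 28, 28)   # a batch with one 28x28 grayscale image
print(tiny_cnn(x).shape)        # torch.Size([1, 16, 28, 28]): 16 feature maps
```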
So each convolution layer simultaneously does 2 things —
1. spatial filtering with the convolution operation between images and kernels, and
2. combining the multiple input channels into a new set of output channels.
The vast majority of CNN research has been about modifying or improving just these two operations.
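Both of these roles are visible in a single conv layer’s weight tensor. In the PyTorch sketch below (with arbitrary channel counts of my choosing), every output channel owns one small spatial kernel per input channel and then sums across the input channels:

```python
import torch.nn as nn

# One conv layer with 3 input channels (e.g. RGB) and 16 output channels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)

# Weight shape: (out_channels, in_channels, kernel_h, kernel_w)
# -> each of the 16 output maps applies a 5x5 spatial filter to every
#    input channel (spatial filtering) and sums the results across the
#    3 input channels (channel combination).
print(conv.weight.shape)  # torch.Size([16, 3, 5, 5])
```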
The 1989 Paper
This 1989 paper showed how to train non-linear CNNs from scratch using backpropagation. The network takes 16×16 grayscale images of handwritten digits as input and passes them through two convolutional layers, each with 12 filters of size 5×5. The filters move with a stride of 2 during scanning; strided convolution is a useful way to downsample the input image. After the conv layers, the output feature maps are flattened and passed through two fully connected layers that output the probabilities for the 10 digits. Using the softmax cross-entropy loss, the network is optimized to predict the correct labels for the handwritten digits. A tanh nonlinearity is applied after each layer, allowing the learned feature maps to be more complex and expressive. With just 9,760 parameters, this was a very small network compared to today’s networks, which contain hundreds of millions of parameters.
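To make the architecture easier to picture, here is a rough PyTorch sketch that follows the description above. The padding, the 30-unit hidden layer, and the exact connectivity are my assumptions; the original paper used a sparser connectivity pattern between feature maps, so its parameter count (9,760) comes out slightly lower than this sketch’s.

```python
import torch
import torch.nn as nn

# A rough sketch of the 1989 digit classifier described above.
net = nn.Sequential(
    nn.Conv2d(1, 12, kernel_size=5, stride=2, padding=2),   # 16x16 -> 8x8
    nn.Tanh(),
    nn.Conv2d(12, 12, kernel_size=5, stride=2, padding=2),  # 8x8 -> 4x4
    nn.Tanh(),
    nn.Flatten(),                # 12 * 4 * 4 = 192 features
    nn.Linear(192, 30),          # hidden size is an assumption
    nn.Tanh(),
    nn.Linear(30, 10),           # one score per digit class
)

x = torch.randn(1, 1, 16, 16)    # one 16x16 grayscale digit
print(net(x).shape)              # torch.Size([1, 10])
print(sum(p.numel() for p in net.parameters()))  # ~10k parameters
```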
Inductive Bias
Inductive Bias is a concept in Machine Learning where we deliberately introduce specific rules and constraints into the learning process to steer our models away from purely generic solutions and toward ones that reflect our human-like understanding of the problem.
When humans classify images, we also do spatial filtering: we look for common patterns, form multiple intermediate representations, and then combine them to make our prediction. The CNN architecture is designed to replicate just that. In feedforward networks, each pixel is treated like its own isolated feature because every neuron connects to all the pixels; in CNNs there is far more parameter-sharing because the same filter scans the entire image. Inductive biases also make CNNs less data-hungry: they get local pattern recognition for free from the network design, whereas feedforward networks need to spend their training cycles learning it from scratch.
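To put the parameter-sharing argument into numbers, here is a small comparison (PyTorch again, reusing the 16×16 input and 12 filters from the 1989 example purely for illustration): a fully connected layer producing the same number of outputs needs hundreds of times more weights than the conv layer, because the conv layer reuses one small filter everywhere.

```python
import torch.nn as nn

# Conv layer: 12 filters of 5x5 scanned over a 16x16 image (stride 2),
# producing 12 feature maps of 8x8 = 768 output values.
conv = nn.Conv2d(1, 12, kernel_size=5, stride=2, padding=2)

# Fully connected layer producing the same 768 outputs from the
# flattened 16x16 = 256 pixels, with no weight sharing.
fc = nn.Linear(16 * 16, 12 * 8 * 8)

print(sum(p.numel() for p in conv.parameters()))  # 312 parameters
print(sum(p.numel() for p in fc.parameters()))    # 197,376 parameters
```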