Computer vision is concerned with the theory and technology for building artificial systems that obtain information from images or multi-dimensional data.
Object detection is used almost everywhere these days. The use cases are endless: object tracking, video surveillance, pedestrian detection, anomaly detection, people counting, self-driving cars, face detection, and the list goes on.

A convolutional neural network (CNN) is mainly used for image classification, while an R-CNN, with the R standing for region, is used for object detection. A typical CNN can only tell you the class of the objects in an image, not where they are located.

Fast R-CNN and Faster R-CNN were developed to speed up object detection.
R-CNN takes a huge amount of time to train and cannot run in real time, since it takes many seconds to process each test image.

Fast R-CNN

The approach to Fast R-CNN is similar to the R-CNN algorithm. But, instead of feeding the region proposals to the CNN, we feed the input image to the CNN to generate a convolutional feature map.
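It is faster than R-CNN because you don't have to feed 2000 region proposals to the convolutional neural network every time. Instead, the convolution is run only once per image, and every region proposal reuses the resulting feature map.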
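
To make the shared computation concrete, here is a minimal NumPy sketch of pooling several regions from one feature map. The function name `roi_max_pool`, the single-channel feature map, the 2x2 output grid, and the example proposals are all assumptions made for illustration; a real Fast R-CNN uses a multi-channel map produced by a trained backbone.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=2):
    """Max-pool one region of a shared feature map to a fixed-size grid.

    feature_map: array of shape (H, W), a single channel for simplicity.
    roi: (x0, y0, x1, y1) in feature-map coordinates, end-exclusive.
         Assumed to span at least output_size cells in each dimension.
    """
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    # Split the region into output_size x output_size bins of near-equal size.
    row_edges = np.linspace(0, region.shape[0], output_size + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], output_size + 1).astype(int)
    pooled = np.empty((output_size, output_size), dtype=feature_map.dtype)
    for i in range(output_size):
        for j in range(output_size):
            cell = region[row_edges[i]:row_edges[i + 1],
                          col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = cell.max()
    return pooled

# Stand-in for the CNN output: computed once, then reused by every proposal.
feature_map = np.arange(16).reshape(4, 4)
proposals = [(0, 0, 4, 4), (1, 1, 3, 3)]  # hypothetical region proposals
pooled = [roi_max_pool(feature_map, roi) for roi in proposals]
```

Each proposal costs only a cheap pooling step over the existing feature map, rather than a full pass through the network.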

Faster R-CNN

Similar to Fast R-CNN, the image is provided as input to a convolutional network, which produces a convolutional feature map. Instead of using the selective search algorithm to identify the region proposals, a separate network, the Region Proposal Network (RPN), is used to predict them.


Position-Sensitive Score Maps (Object Detection)

In contrast to previous region-based detectors such as Fast/Faster R-CNN, which apply a costly per-region subnetwork hundreds of times, R-FCN (Region-based Fully Convolutional Network) is fully convolutional, with almost all computation shared across the entire image. To achieve this, R-FCN proposes position-sensitive score maps, which address the dilemma between translation invariance in image classification and translation variance in object detection.
R-FCN still uses an RPN to obtain region proposals, but unlike the R-CNN series, the fully connected (FC) layers after ROI pooling are removed. Instead, all major complexity is moved before ROI pooling, where the score maps are generated.
After ROI pooling, all region proposals use the same set of score maps to perform average voting, which is a simple calculation. As a result, R-FCN is even faster than Faster R-CNN.
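
The pooling and voting step can be sketched roughly as follows. This is a simplified NumPy illustration, not the reference implementation: the function name `ps_roi_vote`, the layout of `score_maps` as (k*k, num_classes, H, W), and the use of average pooling inside each bin are assumptions made for the example.

```python
import numpy as np

def ps_roi_vote(score_maps, roi, k=3):
    """Position-sensitive ROI pooling followed by average voting.

    score_maps: array of shape (k*k, num_classes, H, W). Group i*k + j is
        assumed to respond to the (i, j) part of an object.
    roi: (x0, y0, x1, y1) in score-map coordinates, end-exclusive,
        assumed to span at least k cells in each dimension.
    Returns one score per class.
    """
    x0, y0, x1, y1 = roi
    row_edges = np.linspace(y0, y1, k + 1).astype(int)
    col_edges = np.linspace(x0, x1, k + 1).astype(int)
    num_classes = score_maps.shape[1]
    bin_scores = np.empty((k, k, num_classes))
    for i in range(k):
        for j in range(k):
            group = score_maps[i * k + j]  # maps dedicated to bin (i, j)
            cell = group[:, row_edges[i]:row_edges[i + 1],
                         col_edges[j]:col_edges[j + 1]]
            bin_scores[i, j] = cell.mean(axis=(1, 2))
    # Average voting: every bin contributes equally to the class score.
    return bin_scores.mean(axis=(0, 1))

# Toy score maps: every map for class 0 fires everywhere, class 1 never does.
score_maps = np.zeros((9, 2, 6, 6))
score_maps[:, 0] = 1.0
class_scores = ps_roi_vote(score_maps, (0, 0, 6, 6), k=3)
```

No learned layers run per region here; each proposal needs only slicing and averaging over maps that were computed once for the whole image.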

SSD - Single Shot MultiBox Detector

The tasks of object localization and classification are done in a single forward pass of the network.

SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes.
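
As a rough sketch of how such default boxes can be laid out, the snippet below places one box per aspect ratio at the center of every feature map cell and then combines two feature maps of different resolution. The function name `default_boxes`, the feature map sizes (4 and 2), the scales (0.2 and 0.5), and the aspect ratios are illustrative assumptions, not values taken from a trained SSD model.

```python
import itertools
import math

def default_boxes(feature_map_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) default boxes in normalized [0, 1] image units."""
    boxes = []
    for row, col in itertools.product(range(feature_map_size), repeat=2):
        # Center each box on its feature map cell.
        cx = (col + 0.5) / feature_map_size
        cy = (row + 0.5) / feature_map_size
        for ratio in aspect_ratios:
            # Same area for every ratio; only the shape changes.
            w = scale * math.sqrt(ratio)
            h = scale / math.sqrt(ratio)
            boxes.append((cx, cy, w, h))
    return boxes

# A finer map with small boxes plus a coarser map with large boxes.
fine = default_boxes(feature_map_size=4, scale=0.2)
coarse = default_boxes(feature_map_size=2, scale=0.5)
all_boxes = fine + coarse
```

At prediction time the network would output, for each of these boxes, one score per category and four offsets that adjust the box toward the object.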

MultiBox is the name of a technique for bounding box regression. The network is an object detector that also classifies those detected objects.

SSD attains a better balance between speed and precision. It runs a convolutional network on the input image only once and computes a feature map.

YOLO — You Only Look Once

The R-CNN family and R-FCN described above use region proposals to localize objects within the image.
You Only Look Once (YOLO) is an object detection algorithm that works very differently from these region-based algorithms.

In YOLO a single convolutional network predicts the bounding boxes and the class probabilities for these boxes.

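The limitation of the YOLO algorithm is that it struggles with small objects within the image; for example, it might have difficulty detecting a flock of birds.

This is due to the spatial constraints of the algorithm: in the original YOLO, each grid cell predicts only a small, fixed number of boxes, so small objects that appear close together compete for the same cell.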
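
The effect can be illustrated with a small sketch. It assumes the image is divided into a 7x7 grid and that each object is assigned to the cell containing its center, as in the original YOLO formulation; the helper names and the bird coordinates are hypothetical.

```python
from collections import Counter

def responsible_cell(cx, cy, grid_size=7):
    """Return the (row, col) grid cell containing a normalized box center."""
    col = min(int(cx * grid_size), grid_size - 1)
    row = min(int(cy * grid_size), grid_size - 1)
    return row, col

def objects_per_cell(centers, grid_size=7):
    """Count how many object centers fall into each grid cell."""
    return Counter(responsible_cell(cx, cy, grid_size) for cx, cy in centers)

# Three birds flying close together, plus one far away (normalized centers).
birds = [(0.50, 0.50), (0.52, 0.51), (0.54, 0.53), (0.90, 0.10)]
counts = objects_per_cell(birds)
crowded = {cell: n for cell, n in counts.items() if n > 1}
```

Here three birds land in the same cell. If that cell can predict only a couple of boxes, at least one of them has to be missed.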