Anyone who works with neural networks knows how complicated object detection techniques can be. It is no wonder there is no straightforward resource for training them. You are always required to convert your data into a COCO-like JSON or some other inconvenient format; it is never a plug-and-play experience. Moreover, no diagram explains Faster R-CNN or YOLO as thoroughly as the ones we have for U-Net or ResNet. There are just too many details.
While these models are quite messy, the explanation for their lack of simplicity is quite straightforward. It fits in a single sentence:
Neural Networks have fixed-sized outputs
In object detection, you can’t know a priori how many objects there are in a scene. There might be one, two, twelve, or none. The following images all have the same resolution but feature different numbers of objects.
The million-dollar question is: how can we build variable-sized outputs out of fixed-size networks? And how are we supposed to train over a variable number of answers and loss terms? How can we penalize wrong predictions?
Implementing Variable Sized Predictions
To create outputs that vary in size, two approaches dominate the literature: the “one size fits all” approach, in which the output is so broad that it suffices for all inputs, and the “look-ahead” idea, in which we first search for regions of interest and then classify them.
I just made up those terms 😄. In practice, they are known as “one-stage” and “two-stage” approaches, which is a tad less self-explanatory.
One Stage Approaches
Overfeat, YOLO, SSD, RetinaNet, etc.
If we can’t have variable-sized outputs, we shall return an output so large that it will always be larger than what we need, then we can prune the excess.
The whole idea is to take the greedy route. The original YOLO detector can output up to 98 bounding boxes for a 448×448 image (a 7×7 grid, with two boxes per cell). It sounds absurd, and it is. See for yourself:
This is a mess! Yet, you can see there is a percentage along with each box. This percentage is the “confidence” the algorithm has in the detection. If we threshold it at some value, such as 50%, we get the following:
Much better! This pretty much sums up the one-stage approach: generate a massive (but fixed) set of detections and prune away the clutter, typically by mixing a confidence threshold with Non-Maximum Suppression (NMS).
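As a sketch, the pruning step fits in a few lines. The `prune` function below is an illustrative implementation rather than the exact procedure of any particular detector: it drops low-confidence boxes, then runs a greedy NMS pass that suppresses boxes overlapping an already-kept one.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def prune(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Keep confident boxes, then greedily suppress overlapping duplicates."""
    order = np.argsort(scores)[::-1]                      # most confident first
    order = [i for i in order if scores[i] >= score_thresh]
    keep = []
    for i in order:
        # keep this box only if it does not overlap an already-kept box
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```

Real pipelines usually run NMS per class, but the core loop is the same.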
This approach is highly regarded for its speed: a single network processes the entire image and outputs all detections in one go. To this day, one-stage detectors are favored whenever speed is the primary concern.
The downsides are its high memory cost and lower detection accuracy. Each box consumes memory proportional to the number of classes, and the number of boxes grows quadratically with the input resolution, which can be quite costly when there are many classes and a high-resolution input. On top of that, the network has to locate and classify objects jointly, which hurts the performance of both tasks.
Two Stage Approaches
R-CNN, Fast R-CNN, Faster R-CNN, etc.
If we can’t have variable-sized outputs, let’s search for regions of interest and process each one on its own.
In other words, this approach decouples the bounding boxes from the classifications. In the first stage, the algorithm proposes regions. Then, we classify them with a dedicated network. The first stage looks like the following:
With our regions-of-interest ready, we can process them one by one, yielding their respective classes and a confidence score used for the final pruning. Here is the result:
Now we get excellent detections and almost no clutter. In comparison to one-stage approaches, this technique uses dedicated networks for region proposal and region classification. This allows both stages to be developed independently, and much work has gone into sharing computation from the first stage with the second for faster detection.
The obvious advantage of this approach is its accuracy. By decoupling localization from classification, each task is handled by a specialized network. The downside, on the other hand, is the speed: you need the intermediate region proposal step, and you need to run the classifier network for each proposal. Thus, the time taken is proportional to the number of proposals.
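The two-stage pipeline can be sketched as a plain loop. Here, `propose_regions` and `classify_region` are hypothetical stand-ins for the proposal and classification networks; real systems batch these calls and share features between the stages.

```python
def detect_two_stage(image, propose_regions, classify_region, score_thresh=0.5):
    """Two-stage detection sketch: propose regions, then classify each one.

    `propose_regions` and `classify_region` are placeholders for the two
    dedicated networks, not real library calls.
    """
    detections = []
    for box in propose_regions(image):              # stage 1: where?
        label, score = classify_region(image, box)  # stage 2: what?
        if score >= score_thresh:                   # final pruning
            detections.append((box, label, score))
    return detections
```

Note how the loop makes the cost explicit: one classifier pass per proposal.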
Training Variable-Sized Outputs
Now that we know how to handle the output-size problem, the final question is: how do we train such networks? Thankfully, in both cases, the procedure is roughly the same.
Training an object detection algorithm is like raising a child. You have to tell the kid what’s right and wrong. However, if you praise or scold too much, you will end up either spoiling or traumatizing the kid. In object detection terms, we shall praise only the best detections and punish only the worst mistakes, while saying nothing about the rest.
Considering the set of ground-truth objects, we shall praise detections with an Intersection-over-Union (IoU) above 0.7 with a ground-truth box and punish those below 0.3. This creates a gradient signal that focuses on the really good detections, downplays only the really wrong ones, and leaves the rest alone.
One simple refinement is to punish only detections whose IoU falls between 0.1 and 0.3, which is a bit less harsh. You can also cap how many boxes you consider positive and negative, balancing the contribution of the two kinds of samples.
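The praise/punish/ignore rule can be sketched as a labeling function. The thresholds mirror the numbers in the text; the function names are my own, not from any library.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def assign_labels(boxes, gt_boxes, pos_thresh=0.7, neg_range=(0.0, 0.3)):
    """Mark each predicted box as positive (1), negative (0), or ignored (-1).

    Use neg_range=(0.1, 0.3) for the less punitive variant that also
    ignores boxes with almost no overlap.
    """
    labels = []
    for box in boxes:
        best = max((iou(box, gt) for gt in gt_boxes), default=0.0)
        if best >= pos_thresh:
            labels.append(1)      # praise: clearly a hit
        elif neg_range[0] <= best < neg_range[1]:
            labels.append(0)      # punish: clearly a miss
        else:
            labels.append(-1)     # say nothing: ambiguous overlap
    return labels
```

Only boxes labeled 1 or 0 contribute to the loss; the ignored ones produce no gradient.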
A step further is to use some form of hard negative mining. The overall idea is to use the model’s loss to sort detections from worst to best. This way, we have a more principled way of selecting what to praise and what to punish. This paper is a useful reference on the matter.
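A minimal sketch of hard negative mining, assuming we already have a per-box loss: keep every positive, sort the negatives by loss, and keep only the hardest few. The 3:1 negative-to-positive ratio is a common choice (used, for instance, by SSD), not a universal rule.

```python
import numpy as np

def hard_negative_mining(losses, labels, neg_pos_ratio=3):
    """Select all positives and only the hardest negatives for the loss.

    `losses` holds the per-box loss values; `labels` marks boxes as
    1 (positive) or 0 (negative). Returns the indices to train on.
    """
    losses = np.asarray(losses)
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_keep = neg_pos_ratio * max(len(pos), 1)
    # sort negatives from worst (highest loss) to best, keep the worst ones
    hardest = neg[np.argsort(losses[neg])[::-1][:n_keep]]
    return np.concatenate([pos, hardest])
```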
Detection vs. Segmentation
So far, we have been dealing with object detection: finding bounding boxes for objects in a scene. For humans, this is an easy task: we can easily detect things, and we can quickly draw rectangles. A more challenging task is segmentation.
Image segmentation consists of drawing a mask that outlines objects. For instance, instead of a rectangle around a person, we need to draw the person’s full outline. This is more challenging for humans: outlines are harder to draw than rectangles, and objects can blend into their background.
For neural networks, however, this is easier. Instead of having a variable-sized output, we have to classify each pixel, thus making a mask. Therefore, we need one output pixel for each input pixel. Here is an example of one of the above scenes processed by a people segmentation tool:
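The fixed-size trick is easy to see in code: the output has exactly one score per input pixel, and a simple threshold turns scores into a mask. This is a toy sketch of the output stage, not a real segmentation network.

```python
import numpy as np

def logits_to_mask(pixel_logits, threshold=0.5):
    """Turn per-pixel scores into a binary segmentation mask.

    `pixel_logits` has the same spatial shape as the input image,
    which is exactly what makes the output fixed-size.
    """
    probs = 1.0 / (1.0 + np.exp(-np.asarray(pixel_logits)))  # sigmoid
    return probs > threshold
```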
It is not entirely perfect, but it does a lovely job. Conceptually, segmentation is a much harder problem. At the network-architecture level, however, it is far more manageable.
If we leverage both frameworks at once, we can quickly get what is called “instance segmentation”: the task of segmenting each object with its own separate mask, such as in the following:
The general idea is to segment the results of each bounding box. This way, the bounding boxes are the “instances,” and the segmentation does, well, the segmentation :). While this is simplified, this is the general idea behind the Mask R-CNN algorithm.
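The combination can be sketched as: detect boxes, then segment inside each one. `detect_boxes` and `segment_crop` are hypothetical stand-ins for the two networks, and real Mask R-CNN predicts masks from shared features rather than literal image crops.

```python
def instance_segmentation(image, detect_boxes, segment_crop):
    """Instance segmentation sketch: one mask per detected box.

    `detect_boxes` and `segment_crop` are placeholders for a detector
    and a segmentation network, not real library calls.
    """
    instances = []
    for box, label, score in detect_boxes(image):
        x1, y1, x2, y2 = box
        crop = image[y1:y2, x1:x2]        # look only inside the box
        mask = segment_crop(crop)         # per-pixel mask for this instance
        instances.append((box, label, score, mask))
    return instances
```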
In this article, I covered why object detection algorithms are so much more complicated than other networks and how authors have been dealing with the variable output size problem. Then, I briefly compared how image segmentation models look at this problem and how both approaches can be combined into an instance segmentation framework.
Overall, I didn’t discuss any architecture in particular. In fact, all presented concepts are simplifications for didactic purposes. Each model does it a little bit differently, introducing concepts such as anchors, smoothing, and novel losses. As said, it can get quite complicated. 😔
If you want to keep reading, this paper is a late 2019 survey on deep learning based object detection techniques.
Thanks for reading 🙂
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.