Computer Vision for tracking

Jeremy Cohen
10 min read · Apr 16, 2019


An updated version of this article has been published on https://www.thinkautonomous.ai/blog/?p=computer-vision-for-tracking

In Computer Vision, one of the most interesting areas of research is obstacle detection using Deep Neural Networks. Many papers have been published, all achieving state-of-the-art (SOTA) results in detecting obstacles with very high accuracy. The goal of these algorithms is to predict a list of bounding boxes from an input image. Machine Learning has become very good at localising and classifying obstacles in an image in real time. However, none of these algorithms includes the notion of time and continuity: when detecting an obstacle, they assume it's a new obstacle every time.

Here’s one of the most popular object detection algorithms, called YOLO (You Only Look Once).

YOLOv3 demo

I won’t go into the details of the algorithm here, but you can have a look at this video from Siraj Raval that explains it very well.

Taken from my Obstacle Tracking course

The output of the algorithm is a list of bounding boxes, each in the format [class, x, y, w, h, confidence]. The class is an id that maps to a line in a txt file (0 for car, 1 for pedestrian, …). x, y, w and h are the parameters of the bounding box: x and y are the coordinates of the center, while w and h are its size (width and height). The confidence is a number expressed as a percentage.
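
As a quick illustration (not part of YOLO itself, just a hypothetical container for one detection in this format):

```python
from dataclasses import dataclass

# Hypothetical container for one detection in the
# [class, x, y, w, h, confidence] format described above.
@dataclass
class Detection:
    class_id: int      # index into the class-names txt file (0 = car, 1 = pedestrian, ...)
    x: float           # center x of the bounding box (pixels)
    y: float           # center y of the bounding box (pixels)
    w: float           # box width (pixels)
    h: float           # box height (pixels)
    confidence: float  # detection confidence, 0.0 to 1.0

# Example: a car detected at (320, 240), 80x60 px, with 92% confidence
det = Detection(class_id=0, x=320, y=240, w=80, h=60, confidence=0.92)
```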

How can we use these bounding boxes?

So far, bounding boxes have been used to count the number of obstacles of the same class in a crowd, in self-driving cars, drones, surveillance cameras, autonomous robots, and all sorts of systems using Computer Vision. However, they all share the same limit: obstacles of the same class get the same color and cannot be told apart.

Deep SORT (Deep Simple Online Realtime Tracking)

So how could we treat these bounding boxes as independent objects, and how can we track them through time?

Deep SORT demo

This implementation uses an object detection algorithm, such as YOLOv3, and a system to track obstacles. As you can see, it works with occlusion as well. How? Thanks to the use of a Kalman Filter and the Hungarian Algorithm.

  • The Hungarian algorithm can tell whether an object in the current frame is the same as one in the previous frame. It will be used for association and id attribution.
  • A Kalman Filter is an algorithm that can predict future positions based on the current position. It can also estimate the current position better than what the sensor tells us. It will be used to improve the association.

📩 Before we start, I invite you to receive my daily emails on self-driving cars, Computer Vision, and Cutting-Edge Tech! You’ll receive career tips, technical content, and untold stories from the industry to help you build your dream job.

To join, it’s here!

The Hungarian Algorithm (Kuhn-Munkres)

The Hungarian algorithm, also known as the Kuhn-Munkres algorithm, can associate an obstacle in one frame with the same obstacle in another frame, based on a score. There are several scores we can think of:

  • IOU (Intersection Over Union): if the bounding box overlaps the previous one, it's probably the same obstacle (a minimal IoU sketch follows this list).
Taken from my Obstacle Tracking course
  • Shape score: if the shape or size didn't vary too much between two consecutive frames, the score increases.
  • Convolution cost: we could run a CNN (Convolutional Neural Network) on the bounding box and compare the result with the one from a frame ago. If the convolutional features are the same, then the objects look the same. In case of a partial occlusion, the convolutional features stay partly the same and the association remains.
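
Here is a minimal IoU sketch, assuming boxes in the (cx, cy, w, h) format described earlier:

```python
def iou(box_a, box_b):
    """Intersection over Union between two boxes given as (cx, cy, w, h)."""
    # Convert center/size to corner coordinates
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Intersection rectangle (zero if the boxes don't overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```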

How to do the association?

Taken from my obstacle tracking course

In this example, from frame a to frame b, we are tracking two obstacles (with ids 1 and 2), adding one new detection (4) and keeping a track (3) in case it's a false negative.

The process for obtaining this is the following:

  1. We have two lists of boxes from YOLO: a tracking list (t-1) and a detection list (t).
  2. Go through the tracking and detection lists and calculate the IOU, shape and convolution scores. For this example, let's consider IOUs only; we could also have a cost function giving a weight to each score. For convolutions, a cosine distance metric would be used. Store the IOU scores in a matrix.
Matrix of IOU Scores

3. In some cases of overlapping bounding boxes, we can have two or more matches for one candidate. In this case, we set the maximum IOU value to 1 and all the others to 0.

Taken from my obstacle tracking course
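
As a rough sketch of the first two steps (building the score matrix), reusing the iou() helper sketched earlier and two hypothetical lists of boxes:

```python
import numpy as np

# Hypothetical boxes in (cx, cy, w, h) format; rows = trackings, columns = detections.
trackings = [(100, 100, 40, 60), (300, 200, 50, 80)]   # boxes from frame t-1
detections = [(105, 102, 42, 58), (400, 220, 50, 80)]  # boxes from frame t

iou_matrix = np.zeros((len(trackings), len(detections)), dtype=np.float32)
for t, trk in enumerate(trackings):
    for d, det in enumerate(detections):
        iou_matrix[t, d] = iou(trk, det)   # iou() from the earlier sketch
```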

We now have a matrix that tells us the matching between detections and trackings. The next step is to call a sklearn function called linear_assignment() that implements the Hungarian algorithm. This algorithm uses a bipartite graph (graph theory) to find, for each detection, the lowest-cost tracking in the matrix. Since we have scores and not costs, we replace our 1s with -1s so that the minimum can be found. For our example, the Hungarian matrix will be:

Hungarian Matrix

We can then check which values are missing from our Hungarian matrix and consider them as unmatched detections or unmatched trackings. For detections, consider A, B, C to have ids 0, 1 and 2.

Here is the code to interpret this.

(source)

NB: In this code example, the IOU = 0 values are present in the Hungarian matrix but dismissed during the loop because they are below a threshold.

What we get from that is a matrix telling us which element in the detection list matches which element in the tracking list.

From that, we can output matched detections, unmatched detections and unmatched trackings.
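
The linked gist relies on sklearn's linear_assignment(), which has since been removed from scikit-learn; scipy.optimize.linear_sum_assignment is the usual stand-in. Here is a minimal sketch of the association step using it, on a toy score matrix and with an assumed IoU threshold:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy score matrix: 2 trackings (rows) x 3 detections (columns).
iou_matrix = np.array([[0.85, 0.0, 0.1],
                       [0.0, 0.72, 0.0]])

# linear_sum_assignment minimises cost, so negate the scores to maximise IoU.
track_idx, det_idx = linear_sum_assignment(-iou_matrix)

matched, unmatched_detections, unmatched_trackings = [], [], []
iou_threshold = 0.3  # assumed threshold; pairs below it are treated as non-matches

for t, d in zip(track_idx, det_idx):
    if iou_matrix[t, d] >= iou_threshold:
        matched.append((t, d))
    else:
        unmatched_trackings.append(t)
        unmatched_detections.append(d)

# Detections never assigned to any tracking are new objects;
# trackings never assigned to any detection may be occluded or gone.
unmatched_detections += [d for d in range(iou_matrix.shape[1]) if d not in det_idx]
unmatched_trackings += [t for t in range(iou_matrix.shape[0]) if t not in track_idx]
```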

The Kalman Filter

Kalman Filters are very popular for tracking obstacles and predicting current and future positions. They are used in all sorts of robots, drones, self-flying planes, self-driving cars, multi-sensor fusion, …

→ For an understanding of the logic behind Kalman Filters, check my Sensor Fusion article.

Taken from my obstacle tracking course

A Kalman Filter is used on every bounding box, so it comes after a box has been matched. When the association is made, the predict and update functions are called. These functions implement the math of the Kalman Filter, composed of formulas for determining the state mean and covariance.

State Mean and Covariance

Mean and covariance are what we want to estimate. The mean is the coordinates of the bounding box; the covariance is our uncertainty about the bounding box having these coordinates.

The mean (x) is a state vector. It is composed of the coordinates of the center of the bounding box (cx, cy), the size of the box (width, height), and the rate of change of each of these parameters, i.e. the velocities.

Mean for a bounding box

When we initialize this parameter, we set the velocities to 0. They will then be estimated by the Kalman Filter.

The covariance (P) is our uncertainty about the estimate. We will set it to an arbitrary number and tweak it to see the results. A larger number means a larger uncertainty.
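
Here is a minimal sketch of this initialisation, with arbitrary (to be tuned) uncertainty values and the state layout described above:

```python
import numpy as np

# Hypothetical initialisation for one matched bounding box.
# State x = [cx, cy, w, h, vx, vy, vw, vh]: box center, size, and their velocities.
cx, cy, w, h = 320.0, 240.0, 80.0, 60.0            # from the first detection of this track
x = np.array([cx, cy, w, h, 0.0, 0.0, 0.0, 0.0])   # velocities start at 0

# Covariance P: position and size are measured directly, so moderate uncertainty;
# velocities are completely unknown, so a much larger one (arbitrary values).
P = np.diag([10.0, 10.0, 10.0, 10.0, 1000.0, 1000.0, 1000.0, 1000.0])
```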

That’s all we need to estimate: a state and an uncertainty! There are two steps for a Kalman Filter to work: prediction and update. Prediction predicts future positions; update corrects them and improves the way we predict by adjusting the uncertainty. With time, a Kalman Filter converges and gets better and better.

How to use it?

Kalman Filter cycle

At time t=0, we have measurements of 3 bounding boxes. The Hungarian algorithm defines them as 3 new detections. We therefore only have 3 detections in our system. For each box, we initialize the Kalman matrices with the coordinates of the bounding box.

At time t=1, we have 3 new bounding boxes for the same objects. The Hungarian algorithm matches them with the 3 former boxes and we can start calling predict and update: we predict the bounding boxes at time t from the bounding boxes at time t-1, and then update our prediction with the measurement at time t.

Prediction

The prediction phase is a matrix multiplication that tells us the position of our bounding box at time t based on its position at time t-1.

Prediction equations

So imagine we have a function called predict() in a KalmanFilter class that implements this math. All we need to do is set up the matrices F, Q and u correctly. We will not use the u vector, as it is used to model external forces, which we can't really do easily here.

How to set up the matrices correctly?

State Transition: F

F is the core of what we will define. What we put here is important, because when we multiply x by F, we change our x and get a new x, called x'.

x Predict formula

As you can see, the F [8x8] matrix contains a time value: dt is the time difference between the current frame and the former frame. We end up having cx' = cx + dt*vx for the first line, cy' = cy + dt*vy, and so on…

In this case, it means that we consider the velocity to be constant: we set up F to implement a constant velocity model. There are many motion models we can choose from, depending on the problem we want to solve.

Process Covariance Matrix: Q

Q [8x8] is our process noise matrix. It expresses how much confidence we give to the system model. Its definition is very important and can change a lot of things. Q is added to our covariance and thus contributes to our global uncertainty. We can start with very small values (0.01) and tune them over time.

Based on these matrices and our measurement, we can now make a prediction that gives us x' and P'. This can be used to predict future or current positions. Having a covariance matrix is useful: the filter can model the uncertainty of the world and of the algorithm to get better results.
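
Under the constant velocity model described above, a minimal prediction sketch might look like this (dt and the Q values are arbitrary assumptions to be tuned):

```python
import numpy as np

dt = 1.0 / 30.0   # assumed time between frames (30 FPS)

# State transition F (8x8): constant velocity model,
# e.g. cx' = cx + dt * vx for each of the four box parameters.
F = np.eye(8)
for i in range(4):
    F[i, i + 4] = dt

# Process noise Q (8x8): small arbitrary values, to be tuned.
Q = np.eye(8) * 0.01

def predict(x, P):
    """Kalman prediction step: x' = F x, P' = F P F^T + Q (no control input u)."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred
```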

Update

The update phase is a correction step. It incorporates the new measurement (z) and improves our filter.

Update equations

The update process starts by measuring the error between the measurement (z) and the predicted mean. The second step is to calculate a Kalman Gain (K). The Kalman gain estimates how much importance to give to this error. It is used as a multiplication factor in the final formula to estimate the new x.

Again, how to set up the matrices correctly?

Measurement Vector: Z

z is the measurement at time t. We don't input velocities here, as they are not measured; only the measured values go in.

Measurement Vector

Measurement Matrix: H

H [4x8] is our measurement matrix; it simply makes the math work between our different matrices. We put the ones according to how we defined our state, and its dimensions depend heavily on how we defined that state.

Measurement Matrix

Measurement Noise: R

R is our measurement noise; it's the noise from the sensor. For a LiDAR or a RADAR, it's usually given by the manufacturer. Here, we need to define a noise for the YOLO algorithm, in pixels. It will be arbitrary: we can say that the noise on the center is about 1 or 2 pixels, while the noise on the width and height can be bigger, say 10 pixels.

Measurement Noise Matrix
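
Putting the update equations and these matrices together, here is a minimal sketch (with the pixel noise values suggested above taken as assumptions):

```python
import numpy as np

# Measurement matrix H (4x8): picks the measured part [cx, cy, w, h]
# out of the state [cx, cy, w, h, vx, vy, vw, vh].
H = np.zeros((4, 8))
H[:4, :4] = np.eye(4)

# Measurement noise R (4x4): assumed ~1 px on the center, ~10 px on the size (squared to variances).
R = np.diag([1.0, 1.0, 10.0, 10.0]) ** 2

def update(x, P, z):
    """Kalman update step: correct the prediction with the measurement z = [cx, cy, w, h]."""
    y = z - H @ x                      # innovation: error between measurement and prediction
    S = H @ P @ H.T + R                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)     # Kalman gain: how much to trust the error
    x_new = x + K @ y                  # corrected state
    P_new = (np.eye(8) - K @ H) @ P    # corrected covariance
    return x_new, P_new
```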

The matrix multiplications are done, and we have a prediction and update loop that gives us better results than the YOLO algorithm alone. The filter can also be used to predict the position at time t+1 from time t (a prediction with no update). For that, it needs to be good enough and have a low uncertainty.
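
As a short usage note, continuing the sketches above (x and P from the initialisation, predict() and update() as defined earlier):

```python
import numpy as np

# One tracking cycle for a matched box, reusing the earlier sketches.
z = np.array([322.0, 243.0, 81.0, 59.0])   # hypothetical YOLO measurement at time t

x, P = predict(x, P)       # where do we expect the box at time t?
x, P = update(x, P, z)     # correct that guess with the actual detection

# To anticipate the box at t+1 (e.g. during an occlusion), predict without updating:
x_next, P_next = predict(x, P)
```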

Conclusion

We now understand how to track an obstacle through time. The general process is to detect obstacles using an object detection algorithm, match these bounding boxes with former bounding boxes using the Hungarian algorithm, and then predict future or current bounding box positions using Kalman Filters. From there, we could use Machine Learning to predict future behavior or trajectories, or estimate what an obstacle has been doing for the last 10 seconds. This tool is powerful: tracking becomes not only possible, but also very accurate. Velocities can be estimated and a huge set of possibilities opens up.

Go Further

📩 Join the Private Mailing List and Receive daily content on Self-Driving Cars

Edit: After the success of this article, I decided to create an online course so you can build it! Visit the course here.

Photo by Samuele Errico Piccarini on Unsplash

Jeremy Cohen

AI & Self-Driving Car Engineer. I teach people how to join the Autonomous Tech world! https://www.thinkautonomous.ai