A few weeks ago, I created a Tensorflow model that would take an image of a street, break it down into regions, and - using a convolutional neural network - highlight areas of the image that contained vehicles, like cars and trucks.
I called it an image classification AI, but what I had really created was an object detection and location program, following a typical convolutional method that has been used for decades - as far back as the seventies.
By cutting up my input image into regions and passing each one into the network to get class predictions, I had created an algorithm that roughly locates classified objects. Other people have created programs that perform this operation frame by frame on real time video, allowing computer programs to draw boxes around recognised objects, understand what they are and track their motion through space.
In this article, I’ll give an interesting introduction to object detection in real-time video. I’ll explain why this kind of artificial intelligence is important, how it works, and how you can implement your own system with Yolo V3. From there, you can build a huge variety of real time object tracking programs, and I’ve included links to further material too.
Importance of real-time object detection
Object detection, location and classification has really become a massive field in machine learning in the past decade, with GPUs and faster computers appearing on the market, which has allowed more computationally expensive deep learning nets to run heavy operations. Real time object detection in video is one such AI, and it has been used for a wide variety of purposes over the past few years.
In surveillance, convolutional models have been trained on human facial data to recognise and identify faces. An AI can then analyse each frame of a video and locate recognised faces, classifying them with remarkable precision.
Real-time object detection has also been used to measure traffic levels on heavily frequented streets. AIs can identify cars and count the number of vehicles in a scene, and then track that number over time, providing crucial information about congested roads.
In wildlife, with enough training data a model can learn to spot and classify types of animals. For example, a great example was done with tracking racoons through a webcam, here. All you need is enough training images to build your own custom model, and such artificial intelligence programs are actively being used all around the world.
Background to Yolo V3
Until about ten years ago, the technology required to perform real-time object tracking was not available to the general public. Fortunately for us, in 2021 there are many machine learning libraries available and practically anyone can get started with these amazing programs.
Arguably the best object detection algorithms for amateurs - and often even professionals - is You Only Look Once, or YOLO. This collection of algorithms and datasets was created in the 2000s and became incredibly popular thanks to its impressive accuracy and speed, which lends it easily to live computer vision.
My method for object detection and recognition I mentioned at the start of this article happens to be a fairly established technique. Traditional object recognition would split up each frame of a video into “regions”, flatten them into strings of pixel values, and pass them through a deep learning neural network one by one. The algorithm would then output a 0 to 1 value indicating the chance that the specific region has a recognized object - or rather, a part of a recognized object - within its bounds.
Finally, the algorithm would output all the regions that were above a particular “certainty” threshold, and then it would compile adjacent regions into bounding boxes around recognized objects. Fairly straightforward, but when it comes down to the details, this algorithm isn’t exactly the best.
Yolo V3 uses a different method to identify objects in real time video, and it’s this algorithm that gives it its desirable balance between speed and accuracy - allowing it to fairly accurately detect objects and draw bounding boxes around them at about thirty frames per second.
Darknet-53 is Yolo’s latest Fully Convolutional Network, or FCN, and is packed with over a hundred convolutional layers. While traditional methods pass one region at a time through the algorithm, Darknet-53 takes the entire frame, flattening it before running the pixel values through 106 layers. They systematically split the image down into separate regions, predicting probability values for each one, before assembling connected regions to create “bounding boxes” around recognized objects.
Luckily for us there’s a really easy way we can implement YoloV3 in real time video simply with our webcams; effectively this program can be run on pretty much any computer with a webcam. You should note however that the library does prefer a fast computer to run at a good framerate. If you have a GPU it’s definitely worth using it!
The way we’ll use YoloV3 is through a library called ImageAI. This library provides a ton of machine learning resources for image and video recognition, including YoloV3, meaning all we have to do is download the pre-trained weights for the standard YoloV3 model and set it to work with ImageAI. You can download the YoloV3 model here. Place this in your working directory.
We’ll start with our imports as follows:
import numpy as np
from imageai import Detection
Of course, if you don’t have ImageAI, you can get it using “pip install imageai” on your command line or Python console, like normal. CV2 will be used to access your webcam and grab frames from it, so make sure any webcam settings on your device are set to default so access is allowed.
Next, we need to load the deep learning model. This is a pre-trained, pre-weighted Keras model that can classify objects into about a hundred different categories and draw accurate bounding boxes around them. As mentioned before, it uses the Darknet model. Let’s load it in:
modelpath = "path/yolo.h5"
yolo = Detection.ObjectDetection()
All we’re doing here is creating a model and loading in the Keras h5 file to get it started with the pre-built network - fairly self-explanatory.
Then, we’ll use CV2 to access the webcam as a camera object and define its parameters so we can get those frames that are needed for object detection:
cam = cv2.VideoCapture(0)
You’ll need to set the 0 in cv2.VideCapture(0) to 1 if you’re using a front webcam, or if your webcam isn’t showing up with 0 as the setting. Great, so we have imported everything, loaded in our model and set up a camera object with CV2. We now need to create a run loop:
ret, img = cam.read()
This will allow us to get the next immediate frame from the webcam as an image. Our program doesn’t run at a set framerate; it’ll go as fast as your processor/camera will allow.
Next, we need to get an output image with bounding boxes drawn around the detected and classified objects, and it’ll also be handy to get some print-out lines of what the model is seeing:
img, preds = yolo.detectCustomObjectsFromImage(input_image=img,
As you can see, we’re just using the model to predict the objects and output an annotated image. You can play around with the minimum_percentage_probability to see what margin of confidence you want the model to classify objects with, and if you want to see the confidence percentages on the screen, set display_percentage_probability to True.
To wrap the loop up, we’ll just show the annotated images, and close the program if the user wants to exit:
if (cv2.waitKey(1) & 0xFF == ord("q")) or (cv2.waitKey(1)==27):
Last thing we need to do outside the loop is to shut the camera object;
And that’s it! It’s really that simple to use real time object detection in video. If you run the program, you’ll see a window open that displays annotated frames from your webcam, with bounding boxes displayed around classified objects.
Obviously we’re using a pre-built model, but many applications make use of YoloV3’s standard classification network, and there are plenty of options with ImageAI to train the model on custom datasets so it can recognize objects outside of the standard categories. Thus, you’re not sacrificing much by using ImageAI.
Good luck with your projects if you choose to use this code!
Yolo V3 is a great algorithm for object detection that can detect a multitude of objects with impressive speed and accuracy, making it ideal for video feeds as we showed on the examples aboves.
Yolo v3 is important but it’s true power comes when combined with other algorithms that can help it process information faster, or even increasing the number of detected objects. Similar algorithms to these are used today in the industry and have been perfected over the years.
Today self-driving cars for example will use techniques similar to those described in this article, together with lane recognition algorithms and bird view to map the surroundings of a car and pass that information to the pilot system, which then will decide the best course of action.