Object detection is a large field in computer vision, and one of the more important applications of computer vision "in the wild". It isn't as standardized as image classification, mainly because most new developments are made by individual researchers, maintainers and developers rather than by large libraries and frameworks. It's difficult to package the necessary utility scripts in a framework like TensorFlow or PyTorch while maintaining the API guidelines that have guided their development so far. This makes object detection somewhat more complex, typically (but not always) more verbose, and less approachable than image classification.
Fortunately, Ultralytics has developed a simple, very powerful and beautiful object detection API around their YOLOv5, which has been extended by other research and development teams into newer versions, such as YOLOv7.
In this short guide, we'll be performing Object Detection in Python, with state-of-the-art YOLOv7.
YOLO Landscape and YOLOv7
YOLO (You Only Look Once) is a methodology, as well as a family of models built for object detection. Since its inception in 2015, YOLOv1, YOLOv2 (YOLO9000) and YOLOv3 have been proposed by the same author(s), and the deep learning community has continued with open-source advancements in the years since.
Ultralytics' YOLOv5 is the first large-scale implementation of YOLO in PyTorch, which made it more accessible than ever before, but the main reason YOLOv5 has gained such a foothold is the beautifully simple and powerful API built around it. The project abstracts away unnecessary details while allowing customizability, supports practically all usable export formats, and employs practices that make the entire project efficient.
YOLOv5 is still the staple project to build Object Detection models with, and many repositories that aim to advance the YOLO method start with YOLOv5 as a baseline and offer a similar API (or simply fork the project and build on top of it). Such is the case with YOLOR (You Only Learn One Representation) and YOLOv7, which was built on top of YOLOR (same author). YOLOv7 is the latest advancement in the YOLO methodology. Most notably, it provides new model heads that can output keypoints (skeletons) and perform instance segmentation in addition to bounding box regression, which wasn't standard with previous YOLO models.
This makes instance segmentation and keypoint detection faster than ever before!
<div class="alert alert-reference"> <div class="flex"> It was released alongside a paper named <a rel="nofollow noopener noreferrer" target="_blank" href="https://arxiv.org/abs/2207.02696">"<em>YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors"</em></a>, and the source code is available <a rel="nofollow noopener noreferrer" target="_blank" href="https://github.com/WongKinYiu/yolov7">on GitHub</a>. </div> </div>
In addition, YOLOv7 performs faster and to a higher degree of accuracy than previous models due to a reduced parameter count and higher computational efficiency.
The model itself was created through architectural changes, as well as optimizing aspects of training, dubbed "bag-of-freebies", which increased accuracy without increasing inference cost.
Installing and using YOLOv7 boils down to downloading the GitHub repository to your local machine and running the scripts that come packaged with it.
Note: Unfortunately, as of writing, YOLOv7 doesn't offer a clean programmatic API such as YOLOv5's, which is typically loaded with torch.hub.load(), passing the GitHub repository in. This appears to be a feature that should work but is currently failing. As it gets fixed, I'll update the guide or publish a new one on the programmatic API. For now, we'll focus on the inference scripts provided in the repository.
Even so, you can perform detection in real-time on videos, images, etc. and save the results easily. The project follows the same conventions as YOLOv5, which has extensive documentation, so you're likely to find answers to more niche questions in the YOLOv5 repository if you have some.
Let's download the repository and perform some inference:
! git clone https://github.com/WongKinYiu/yolov7.git
This creates a yolov7 directory in your current working directory, which houses the project. Let's move into that directory and take a look at the files:
%cd yolov7
!ls
/Users/macbookpro/jup/yolov7
LICENSE.md   detect.py    models             tools
README.md    export.py    paper              train.py
cfg          figure       requirements.txt   train_aux.py
data         hubconf.py   scripts            utils
deploy       inference    test.py            runs
Note: On a Google Colab Notebook, you'll have to run the magic %cd command in each cell in which you want to work from the yolov7 directory, since the next cell returns you to your original working directory. In local Jupyter Notebooks, changing the directory once keeps you in it, so there's no need to re-issue the command.
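If you'd rather not depend on notebook magics at all, the same "enter a directory, then come back" behavior can be sketched in plain Python. The `working_directory` helper below is a hypothetical name introduced for illustration, not part of the YOLOv7 repository; on Python 3.11+ the standard library's `contextlib.chdir` does the same job.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_directory(path):
    """Temporarily switch the current working directory, then restore it."""
    previous = Path.cwd()
    os.chdir(path)
    try:
        yield Path.cwd()
    finally:
        # Always return to the original directory, even if the body raises
        os.chdir(previous)


# Usage: run code from inside another directory (here, the parent of the current one)
start = Path.cwd()
with working_directory(start.parent) as inside:
    print("inside:", inside)
print("back in:", Path.cwd())
```

In practice you would pass the path of the cloned yolov7 directory instead of the parent directory used in this demo.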
- detect.py is the inference script that runs detections and saves the results under runs/detect/video_name, where you can specify the video_name while calling the detect.py script.
- export.py exports the model to various formats, such as ONNX, TFLite, etc.
- train.py can be used to train a custom YOLOv7 detector (the topic of another guide).
- test.py can be used to test a detector (loaded from a weights file).
Several additional directories hold the configurations (cfg), example data (inference), and data on constructing models and COCO configurations.
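Before running anything, it can be worth confirming that the clone actually contains the scripts described above. The following is a minimal sketch that assumes it is run from inside the repository; `missing_scripts` and `EXPECTED_SCRIPTS` are hypothetical names used only for this example.

```python
from pathlib import Path

# Script names taken from the directory listing in this guide
EXPECTED_SCRIPTS = ("detect.py", "export.py", "train.py", "test.py")


def missing_scripts(repo_dir):
    """Return the expected script names that are not present in repo_dir."""
    repo = Path(repo_dir)
    return [name for name in EXPECTED_SCRIPTS if not (repo / name).is_file()]


# Check the current working directory
missing = missing_scripts(".")
print("missing:", missing or "none")
```

An empty result means all four scripts were found; anything else usually means you're in the wrong directory or the clone was incomplete.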
YOLO-based models scale well, and are typically exported as smaller, less-accurate models, and larger, more-accurate models. These are then deployed to weaker or stronger devices respectively. YOLOv7 comes in several sizes, which the authors benchmarked against MS COCO:
| Model | Test Size | AP (test) | AP50 (test) | AP75 (test) | Batch 1 FPS | Batch 32 average time |
|---|---|---|---|---|---|---|
| YOLOv7 | 640 | 51.4% | 69.7% | 55.9% | 161 fps | 2.8 ms |
| YOLOv7-X | 640 | 53.1% | 71.2% | 57.8% | 114 fps | 4.3 ms |
| YOLOv7-W6 | 1280 | 54.9% | 72.6% | 60.1% | 84 fps | 7.6 ms |
| YOLOv7-E6 | 1280 | 56.0% | 73.5% | 61.2% | 56 fps | 12.3 ms |
| YOLOv7-D6 | 1280 | 56.6% | 74.0% | 61.8% | 44 fps | 15.0 ms |
| YOLOv7-E6E | 1280 | 56.8% | 74.4% | 62.1% | 36 fps | 18.7 ms |
Depending on the hardware you expect the model to run on and the accuracy you require, you can choose between them. The smallest model in the table, YOLOv7, reaches over 160 FPS on images of size 640 on a V100! You can expect satisfactory real-time performance on more common consumer GPUs as well.
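To make that trade-off concrete, here is a small sketch that picks the most accurate model that still meets a frame-rate target. The numbers are copied from the benchmark table (batch-1 FPS as reported on a V100), so they won't transfer directly to other hardware; `most_accurate_model` is a hypothetical helper name.

```python
# (name, test size, AP in %, batch-1 FPS on a V100), copied from the benchmark table
BENCHMARKS = [
    ("YOLOv7", 640, 51.4, 161),
    ("YOLOv7-X", 640, 53.1, 114),
    ("YOLOv7-W6", 1280, 54.9, 84),
    ("YOLOv7-E6", 1280, 56.0, 56),
    ("YOLOv7-D6", 1280, 56.6, 44),
    ("YOLOv7-E6E", 1280, 56.8, 36),
]


def most_accurate_model(min_fps):
    """Return the highest-AP model whose reported FPS is at least min_fps, or None."""
    candidates = [row for row in BENCHMARKS if row[3] >= min_fps]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[2])[0]


print(most_accurate_model(60))   # YOLOv7-W6
print(most_accurate_model(30))   # YOLOv7-E6E
print(most_accurate_model(200))  # None
```

On slower hardware you would substitute your own measured frame rates for the reported ones before making the choice.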
Video Inference with YOLOv7
Create an inference-data folder to store the images and/or videos you'd like to run detection on. Assuming it's in the same directory, we can run the detection script with:
! python3 detect.py --source inference-data/busy_street.mp4 --weights yolov7.pt --name video_1 --view-img
This will open a Qt-based window on your desktop in which you can see the live progress and inference, frame by frame, and will also print the status to standard output:
Namespace(weights=['yolov7.pt'], source='inference-data/busy_street.mp4', img_size=640, conf_thres=0.25, iou_thres=0.45, device='', view_img=True, save_txt=False, save_conf=False, nosave=False, classes=None, agnostic_nms=False, augment=False, update=False, project='runs/detect', name='video_1', exist_ok=False, no_trace=False)
YOLOR 🚀 v0.1-112-g55b90e1 torch 1.12.1 CPU
Downloading https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt to yolov7.pt...
100%|██████████████████████████████████████| 72.1M/72.1M [00:18<00:00, 4.02MB/s]
Fusing layers...
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
Model Summary: 306 layers, 36905341 parameters, 6652669 gradients
Convert model to Traced-model...
traced_script_module saved!
model is traced!
video 1/1 (1/402) /Users/macbookpro/jup/yolov7/inference-data/busy_street.mp4: 24 persons, 1 bicycle, 8 cars, 3 traffic lights, 2 backpacks, 2 handbags, Done. (1071.6ms) Inference, (2.4ms) NMS
video 1/1 (2/402) /Users/macbookpro/jup/yolov7/inference-data/busy_street.mp4: 24 persons, 1 bicycle, 8 cars, 3 traffic lights, 2 backpacks, 2 handbags, Done. (1070.8ms) Inference, (1.3ms) NMS
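Since there's no programmatic API to return the detections, one rough workaround is to parse the per-frame status lines the script prints. The sketch below is written against the exact line format shown in the output above; that format is not a stable interface and may differ between versions, and `parse_frame_line` is a hypothetical helper, not something the repository provides.

```python
import re

# Matches lines such as:
# video 1/1 (1/402) /path/to/video.mp4: 24 persons, 1 bicycle, Done. (1071.6ms) Inference, (2.4ms) NMS
LINE_RE = re.compile(
    r"\((?P<frame>\d+)/(?P<total>\d+)\)\s+(?P<path>.+?):\s+(?P<detections>.*?)"
    r"Done\. \((?P<inference_ms>[\d.]+)ms\) Inference, \((?P<nms_ms>[\d.]+)ms\) NMS"
)
# Matches each "<count> <label>," entry in the detections part
COUNT_RE = re.compile(r"(\d+) ([^,]+),")


def parse_frame_line(line):
    """Parse one per-frame status line into a dict, or return None if it doesn't match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    counts = {label: int(n) for n, label in COUNT_RE.findall(match["detections"])}
    return {
        "frame": int(match["frame"]),
        "total_frames": int(match["total"]),
        "counts": counts,
        "inference_ms": float(match["inference_ms"]),
        "nms_ms": float(match["nms_ms"]),
    }


sample = (
    "video 1/1 (1/402) /Users/macbookpro/jup/yolov7/inference-data/busy_street.mp4: "
    "24 persons, 1 bicycle, 8 cars, 3 traffic lights, 2 backpacks, 2 handbags, "
    "Done. (1071.6ms) Inference, (2.4ms) NMS"
)
print(parse_frame_line(sample))
```

Note that the labels are kept exactly as printed, including the plural forms, so "persons" and "person" would end up as different keys.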
Note that the project will run slowly on CPU-based machines (around 1000ms per inference step in the output above, run on an Intel-based 2017 MacBook Pro), and significantly faster on GPU-based machines (closer to ~5ms/frame on a V100). Even on CPU-based systems such as this one, yolov7-tiny.pt runs at 172ms/frame, which, while far from real-time, is still very decent for handling these operations on a CPU.
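To put the latencies quoted above in perspective, it helps to convert milliseconds per frame into frames per second. This is simple arithmetic rather than anything YOLOv7-specific, and it only accounts for the inference step, ignoring NMS and video decoding; `fps_from_latency` is a hypothetical helper name.

```python
def fps_from_latency(ms_per_frame):
    """Convert a per-frame latency in milliseconds to frames per second."""
    if ms_per_frame <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / ms_per_frame


# Latencies quoted in this guide
measurements = [
    ("yolov7.pt on CPU", 1071.6),
    ("yolov7-tiny.pt on CPU", 172.0),
    ("V100 (approximate)", 5.0),
]
for label, ms in measurements:
    print(f"{label}: {fps_from_latency(ms):.1f} FPS")
```

That works out to roughly 0.9 FPS for the full model on this CPU, about 5.8 FPS for the tiny model, and around 200 FPS at 5ms/frame.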
Once the run is done, you can find the resulting video under runs/detect/video_1 (the name we supplied in the detect.py call).
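The logged arguments show `project='runs/detect'` and `name='video_1'`, so the results land in a directory built from those two values. A minimal sketch for listing whatever was saved there, assuming it runs from inside the yolov7 directory; `list_run_outputs` is a hypothetical helper name.

```python
from pathlib import Path


def list_run_outputs(name, project="runs/detect"):
    """Return the files saved for a given run name, sorted by path."""
    run_dir = Path(project) / name
    if not run_dir.is_dir():
        # Nothing has been saved under this name (or we're in the wrong directory)
        return []
    return sorted(p for p in run_dir.rglob("*") if p.is_file())


for path in list_run_outputs("video_1"):
    print(path, f"({path.stat().st_size / 1e6:.1f} MB)")
```

An empty list is a quick hint that either the run hasn't finished or the working directory isn't the repository root.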
Inference on Images
Inference on images boils down to the same process: supplying the path to an image in the filesystem and calling detect.py:
! python3 detect.py --source inference-data/desk.jpg --weights yolov7.pt
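Both the video and image invocations have the same shape, so if you're calling the script from Python rather than a notebook cell, the command can be assembled as an argument list. This is a sketch under a few assumptions: `build_detect_command` is a hypothetical helper, the `--conf-thres` flag is inferred from the `conf_thres` entry in the logged arguments, and the command has to be run from inside the yolov7 directory.

```python
import shlex
import sys


def build_detect_command(source, weights="yolov7.pt", name=None, view_img=False, conf_thres=None):
    """Assemble the detect.py invocation as an argument list suitable for subprocess.run."""
    cmd = [sys.executable, "detect.py", "--source", str(source), "--weights", str(weights)]
    if name is not None:
        cmd += ["--name", name]
    if conf_thres is not None:
        # Assumed flag name, based on conf_thres in the logged Namespace
        cmd += ["--conf-thres", str(conf_thres)]
    if view_img:
        cmd.append("--view-img")
    return cmd


video_cmd = build_detect_command("inference-data/busy_street.mp4", name="video_1", view_img=True)
image_cmd = build_detect_command("inference-data/desk.jpg")

# Print the arguments after the interpreter path, for comparison with the shell commands
print(shlex.join(video_cmd[1:]))
print(shlex.join(image_cmd[1:]))

# To actually run one of them from inside the yolov7 directory:
# import subprocess; subprocess.run(video_cmd, check=True)
```

Passing a list instead of a single string avoids shell-quoting problems with file names that contain spaces.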
Note: As of writing, the output doesn't scale the labels to the image size, even if you set --img SIZE. This means that large images will have really thin bounding box lines and small labels.
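If you end up drawing boxes yourself, one way around this is to derive the line thickness from the image dimensions instead of using a fixed value. The following is only an illustrative heuristic: the `0.002` factor and the `scaled_line_thickness` name are assumptions made for this sketch, not something the YOLOv7 scripts apply by default.

```python
def scaled_line_thickness(height, width, factor=0.002, minimum=1):
    """Pick a box line thickness (in pixels) proportional to the average image side."""
    return max(minimum, round(factor * (height + width) / 2) + 1)


# A 640x640 frame versus a 4000x3000 photo
print(scaled_line_thickness(640, 640))
print(scaled_line_thickness(3000, 4000))
```

With these defaults a 640x640 image gets 2-pixel lines and a 4000x3000 image gets 8-pixel lines, which keeps the boxes visually comparable across sizes.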
In this short guide - we've taken a brief look at YOLOv7, the latest advancement in the YOLO family, which builds on top of YOLOR. We've taken a look at how to install the repository on your local machine and run object detection inference scripts with a pre-trained network on videos and images. In further guides, we'll be covering keypoint detection and instance segmentation.