Object detection is a large field in computer vision, and one of the more important applications of computer vision "in the wild". Keypoint detection (oftentimes used for pose estimation) grew out of it. Keypoints can be various points, such as parts of a face or limbs of a body. Pose estimation is a special case of keypoint detection, in which the points are parts of a human body.
Pose estimation is an amazing, extremely fun and practical usage of computer vision. With it, we can do away with the hardware used to estimate poses (motion capture suits), which is costly and unwieldy. Additionally, we can map the movement of humans to the movement of robots in Euclidean space, enabling fine-precision motor movement without using controllers, which usually don't allow for higher levels of precision. Keypoint estimation can be used to translate our movements to 3D models in AR and VR, and is increasingly being used to do so with just a webcam. Finally, pose estimation can help us in sports and security.
In this guide, we'll be performing real-time pose estimation from a video in Python, using the state-of-the-art YOLOv7 model.
Specifically, we'll be working with a video from the 2018 Winter Olympics, held in PyeongChang, South Korea. Aljona Savchenko and Bruno Massot delivered an amazing performance, including bodies overlapping from the camera's point of view, fast fluid movement and spinning in the air. It'll be a great opportunity to see how the model handles difficult-to-infer situations!
YOLO and Pose Estimation
YOLO (You Only Look Once) is a methodology, as well as a family of models built for object detection. Since its inception in 2015, YOLOv1, YOLOv2 (YOLO9000) and YOLOv3 have been proposed by the same author(s), and the deep learning community continued with open-sourced advancements in the following years.
Ultralytics' YOLOv5 is an industry-grade object detection repository, built on top of the YOLO method. It's implemented in PyTorch, as opposed to the C-based Darknet framework used by previous YOLO models, is fully open source, and has a beautifully simple and powerful API that lets you infer, train and customize the project flexibly. It's such a staple that most new attempts at improving the YOLO method build on top of it. This is how YOLOR (You Only Learn One Representation) and YOLOv7, which builds on top of YOLOR (same author), were created as well!
YOLOv7 isn't just an object detection architecture. It provides new model heads that can output keypoints (skeletons) and perform instance segmentation in addition to bounding box regression, which wasn't standard with previous YOLO models. This isn't surprising, since many object detection architectures were repurposed for instance segmentation and keypoint detection tasks earlier as well, due to the shared general architecture, with different outputs depending on the task.
<div class="alert alert-reference"> <div class="flex"> <strong>Advice:</strong> If you're interested in reading more about instance segmentation, read our "Instance Segmentation with YOLOv7 in Python"! </div> </div>
Even though it isn't surprising - supporting instance segmentation and keypoint detection will likely become the new standard for YOLO-based models, which began outperforming practically all two-stage detectors a couple of years ago in terms of both accuracy and speed.
This makes instance segmentation and keypoint detection faster to perform than ever before, with a simpler architecture than two-stage detectors.
<div class="alert alert-reference"> <div class="flex"> YOLOv7 was released alongside a paper named <a rel="nofollow noopener noreferrer" target="_blank" href="https://arxiv.org/abs/2207.02696">"<em>YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors</em>"</a>, and the source code is available <a rel="nofollow noopener noreferrer" target="_blank" href="https://github.com/WongKinYiu/yolov7">on GitHub</a>. </div> </div>
The model itself was created through architectural changes, as well as optimizing aspects of training, dubbed "bag-of-freebies", which increased accuracy without increasing inference cost.
Let's start by cloning the repository to get ahold of the source code:
! git clone https://github.com/WongKinYiu/yolov7.git
Now, let's move into the yolov7 directory, which contains the project, and take a look at the contents:
%cd yolov7
!ls
/content/yolov7
cfg        figure      output.mp4        test.py
data       hubconf.py  paper             tools
deploy     inference   README.md         train_aux.py
detect.py  LICENSE.md  requirements.txt  train.py
export.py  models      scripts           utils
Note: Calling !cd dirname moves you into a directory only for that command, since each ! command runs in its own temporary subshell. Calling %cd dirname moves you into a directory across the upcoming cells as well and keeps you there.
Now, YOLO is meant to be an object detector, and doesn't ship with pose estimation weights by default. We'll want to download the weights and load a concrete model instance from them. The weights are available in the same GitHub repository, and can easily be downloaded through the CLI as well:
! curl -L https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7-w6-pose.pt -o yolov7-w6-pose.pt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  153M  100  153M    0     0  23.4M      0  0:00:06  0:00:06 --:--:-- 32.3M
Once downloaded, we can import the libraries and helper methods we'll be using:
import torch
from torchvision import transforms

from utils.datasets import letterbox
from utils.general import non_max_suppression_kpt
from utils.plots import output_to_keypoint, plot_skeleton_kpts

import matplotlib.pyplot as plt
import cv2
import numpy as np
Great! Let's get on with loading the model and creating a script that lets you infer poses from videos with YOLOv7 and OpenCV.
Real-Time Pose Estimation with YOLOv7
Let's first create a method to load the model from the downloaded weights. We'll check what device we have available (CPU or GPU):
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def load_model():
    # Note: on PyTorch 2.6+, torch.load() defaults to weights_only=True, so this
    # full-model checkpoint may need weights_only=False (only for trusted files)
    model = torch.load('yolov7-w6-pose.pt', map_location=device)['model']
    # Put in inference mode
    model.float().eval()

    if torch.cuda.is_available():
        # half() converts the model's weights to float16,
        # which significantly lowers inference time
        model.half().to(device)
    return model

model = load_model()
Depending on whether we have a GPU or not, we'll turn half-precision on (using float16 instead of float32 in operations), which makes inference significantly faster. Note that it's highly encouraged to perform this on a GPU for real-time speeds, as CPUs will likely lack the power to do so unless running on small videos.
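To get a feel for what half precision trades away, here's a small standard-library sketch (not part of the YOLOv7 code) that round-trips a number through a 16-bit and a 32-bit float using the struct module's 'e' and 'f' formats. The helper name round_trip is made up for this illustration:

```python
import struct

def round_trip(value, fmt):
    """Pack a Python float into the given IEEE 754 format and read it back."""
    packed = struct.pack(fmt, value)
    return struct.unpack(fmt, packed)[0], len(packed)

# 'e' is IEEE 754 half precision (float16), 'f' is single precision (float32)
half_value, half_bytes = round_trip(0.1, "<e")
single_value, single_bytes = round_trip(0.1, "<f")

print(f"float16: {half_value!r} stored in {half_bytes} bytes")
print(f"float32: {single_value!r} stored in {single_bytes} bytes")
```

Each value takes half the memory in float16, at the cost of a few significant digits, which is usually an acceptable trade for inference.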
Let's write a convenience method for running inference. We'll accept images as NumPy arrays (as that's how we'll be passing them later while reading the video). First, using the letterbox() function, we'll resize and pad each frame to a shape that the model can work with. This doesn't need to be and won't be the shape (resolution) of the resulting video!
Then, we'll apply the transforms, convert the image to half precision (if a GPU is available), batch it and run it through the model:
def run_inference(image):
    # Resize and pad image; letterbox() returns (image, ratio, padding)
    image = letterbox(image, 960, stride=64, auto=True)[0]  # shape: (576, 960, 3) for a 1920x1080 frame
    # Apply transforms
    image = transforms.ToTensor()(image)  # torch.Size([3, 576, 960])
    if torch.cuda.is_available():
        image = image.half().to(device)
    # Turn image into batch
    image = image.unsqueeze(0)  # torch.Size([1, 3, 576, 960])
    with torch.no_grad():
        output, _ = model(image)
    return output, image
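To see where the letterboxed shape comes from, here's a minimal pure-Python sketch of the size arithmetic only. It assumes the behavior we'd expect from letterbox() with auto=True: scale the frame to fit inside a 960x960 square while keeping the aspect ratio, then pad just enough to reach a multiple of the stride. The function name letterbox_shape is hypothetical and isn't part of the repository:

```python
def letterbox_shape(height, width, new_size=960, stride=64):
    """Return the (height, width) of a frame after an aspect-preserving resize
    and minimal padding up to a multiple of `stride`."""
    # Scale so the whole frame fits inside a new_size x new_size square
    ratio = min(new_size / height, new_size / width)
    resized_h = int(round(height * ratio))
    resized_w = int(round(width * ratio))
    # Pad only as much as needed to reach a multiple of the stride
    pad_h = (new_size - resized_h) % stride
    pad_w = (new_size - resized_w) % stride
    return resized_h + pad_h, resized_w + pad_w

# Shape the model would see for a full HD frame (height, width)
print(letterbox_shape(1080, 1920))
```

Padding to a multiple of the stride rather than to the full square keeps the input as small as possible, which matters when every frame of a video goes through the model.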
We'll return the predictions of the model, as well as the image as a tensor. These are "rough" predictions - they contain many overlapping candidate detections, and we'll want to "clean them up" using Non-Max Suppression, and plot the predicted skeletons over the image itself:
def draw_keypoints(output, image):
    output = non_max_suppression_kpt(output,
                                     0.25,  # Confidence Threshold
                                     0.65,  # IoU Threshold
                                     nc=model.yaml['nc'],  # Number of Classes
                                     nkpt=model.yaml['nkpt'],  # Number of Keypoints
                                     kpt_label=True)
    with torch.no_grad():
        output = output_to_keypoint(output)
    # Drop the batch dimension and convert CHW in [0, 1] to HWC in [0, 255]
    nimg = image[0].permute(1, 2, 0) * 255
    nimg = nimg.cpu().numpy().astype(np.uint8)
    nimg = cv2.cvtColor(nimg, cv2.COLOR_RGB2BGR)
    # One row per detected person; columns 7 onward hold the keypoints
    for idx in range(output.shape[0]):
        plot_skeleton_kpts(nimg, output[idx, 7:].T, 3)

    return nimg
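To make the suppression step more concrete, here's a minimal pure-Python sketch of greedy Non-Max Suppression on plain bounding boxes. It is not the repository's non_max_suppression_kpt() implementation (which also carries keypoints along and runs on tensors); the function names, example boxes and scores are made up for illustration, and only the two thresholds are taken from the call above:

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, conf_thres=0.25, iou_thres=0.65):
    """Return the indices of the boxes that survive, highest score first."""
    # Drop low-confidence candidates, then visit the rest by descending score
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_thres),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in order:
        # Keep a box only if it doesn't overlap too much with one already kept
        if all(iou(boxes[i], boxes[k]) <= iou_thres for k in kept):
            kept.append(i)
    return kept

# Two near-duplicate boxes, one separate box and one low-confidence box
boxes = [(10, 10, 110, 210), (12, 14, 112, 214), (300, 50, 380, 220), (0, 0, 5, 5)]
scores = [0.90, 0.75, 0.60, 0.10]
print(greedy_nms(boxes, scores))
```

Raising the IoU threshold keeps more overlapping detections, which can help when people genuinely overlap in the frame, at the cost of more duplicate skeletons.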
With these in place, our general flow will look like:
img = read_img()  # placeholder for however you obtain a frame
output, img = run_inference(img)
keypoint_img = draw_keypoints(output, img)
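The slicing output[idx, 7:] and the step of 3 passed to plot_skeleton_kpts() suggest that each detection row holds seven leading values followed by flattened (x, y, confidence) triples. Under that assumption, and assuming the 17 keypoints of the COCO pose format, a row can be unpacked like this. The helper name and the synthetic row are purely illustrative:

```python
def split_keypoints(row, num_leading=7, steps=3):
    """Split one detection row into its leading fields and (x, y, conf) keypoints."""
    leading = list(row[:num_leading])
    flat = list(row[num_leading:])
    keypoints = [tuple(flat[i:i + steps]) for i in range(0, len(flat), steps)]
    return leading, keypoints

# Synthetic row: 7 leading values followed by 17 keypoints x 3 values each
row = [0, 0, 480.0, 288.0, 120.0, 300.0, 0.9] + [float(v) for v in range(17 * 3)]
leading, keypoints = split_keypoints(row)
print(len(keypoints), keypoints[0])
```

Having the keypoints as separate triples is handy if you'd like to do more than draw them, such as computing joint angles or filtering out low-confidence points.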
To translate that to a real-time video setting - we'll use OpenCV to read a video, and run this process for every frame. On each frame, we'll also write the frame into a new file, encoded as a video. This will necessarily slow down the process as we're running the inference, displaying it and writing - so you can speed up the inference and display by avoiding the creation of a new file and writing to it in the loop:
def pose_estimation_video(filename):
    cap = cv2.VideoCapture(filename)
    # Use the width and height of the source video for the output
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    # VideoWriter for saving the video
    fourcc = cv2.VideoWriter_fourcc(*'MP4V')
    out = cv2.VideoWriter('ice_skating_output.mp4', fourcc, 30.0, (width, height))
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        output, frame = run_inference(frame)
        frame = draw_keypoints(output, frame)
        frame = cv2.resize(frame, (width, height))
        out.write(frame)
        cv2.imshow('Pose estimation', frame)

        if cv2.waitKey(10) & 0xFF == ord('q'):
            break

    cap.release()
    out.release()
    cv2.destroyAllWindows()
VideoWriter accepts several parameters - the output filename, the FourCC (a four-character code denoting the codec used to encode the video), the framerate and the resolution as a tuple. To avoid guessing or resizing the video, we've used the width and height of the original video, obtained through the VideoCapture instance that contains data about the video itself, such as the width, height, total number of frames, etc.
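A FourCC is simply four characters packed into one 32-bit integer, with the first character in the lowest byte. The following pure-Python sketch reproduces that packing to show what the codec identifier looks like as a number. The function here is a stand-in written for illustration, not OpenCV's own code:

```python
def fourcc(code):
    """Pack a four-character codec code into a 32-bit integer,
    with the first character in the lowest byte."""
    if len(code) != 4:
        raise ValueError("A FourCC must be exactly four characters long")
    value = 0
    for position, char in enumerate(code):
        value |= (ord(char) & 0xFF) << (8 * position)
    return value

print(hex(fourcc("MP4V")))
```

Since the characters are compared byte by byte, the code is case-sensitive: 'MP4V' and 'mp4v' produce different integers, and some OpenCV builds may accept one spelling more readily than the other.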
Now, we can call the method on any input video by passing its path to pose_estimation_video().
This will open up an OpenCV window, displaying the inference in real-time. It'll also write a video file, ice_skating_output.mp4, in the yolov7 directory (since we've cd'd into it).
Note: If your GPU is struggling, or if you want to embed the results of a model like this into an application that has latency as a crucial aspect of the workflow - make the video smaller and work on smaller frames. This is a full HD 1920x1080 video, and should be able to run fast on most home systems, but if it doesn't work as well on your system, make the image(s) smaller.
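One way to pick a smaller frame size without distorting the picture is to cap the width and scale the height by the same factor. Here's a minimal sketch of that calculation. The helper name scaled_size and the 1280-pixel cap are arbitrary choices for illustration:

```python
def scaled_size(width, height, max_width=1280):
    """Return (width, height) shrunk to fit max_width, keeping the aspect ratio."""
    if width <= max_width:
        return width, height
    scale = max_width / width
    new_height = int(round(height * scale))
    # Many video codecs expect even dimensions, so round odd heights down
    new_height -= new_height % 2
    return max_width, new_height

print(scaled_size(1920, 1080))
```

The resulting tuple is in (width, height) order, so it can be passed to cv2.resize() on each frame before inference. If you also save the video, the size given to VideoWriter should match the size of the frames you write.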
In this guide, we've taken a look at the YOLO method, YOLOv7 and the relationship between YOLO and object detection, pose estimation and instance segmentation. We've then taken a look at how you can easily install and work with YOLOv7 using the programmatic API, and created several convenience methods to make inference and displaying results easier. Finally, we've opened a video using OpenCV, run inference with YOLOv7, and made a function for performing pose estimation in real-time, saving the resulting video in full resolution and at 30 FPS on your local disk.