GPU-Accelerated Vision for Robots: Improving System Throughput Using OpenCV and CUDA

OpenCV is an open source computer vision and machine learning library for C/C++/Python available for Windows, Linux, macOS, and Android platforms. It contains low-level image processing functions as well as high-level algorithms such as object identification, face recognition, and action classification in videos. OpenCV has become very popular, with more than 47,000 people in its user community and 18 million downloads (see https://opencv.org/about/). Under a Berkeley Software Distribution (BSD) license, it can be used for both academic and commercial applications.

A significant part of computer vision is image processing, which involves massively parallel computations. Modern graphics processing units (GPUs) are highly parallel, multicore systems, powerful enough to perform general-purpose computations on large blocks of data. It is therefore challenging, yet potentially very rewarding, to accelerate OpenCV on graphics processors.

Background
CUDA is a parallel computing architecture created by Nvidia that makes it possible to use the many computing cores in a GPU to perform general-purpose mathematical calculations [1]. However, it only works on Nvidia cards.
OpenCV and CUDA have been available for more than 10 years [2], and their use has increased significantly; however, their combined application is not so widespread. Considering that a GPU/CUDA module for OpenCV has been available since 2010, the number of works published in IEEE Xplore using both libraries is relatively small and has grown very slowly. Figure 1 depicts the number of references in IEEE Xplore citing CUDA, OpenCV, or both. In 2018, only seven references for both are found, compared with 240 for CUDA and 180 for OpenCV.

The aim of this article is to describe how the CUDA module for OpenCV works, with some examples of well-known vision problems documented with source code, in order to encourage more robotics researchers to migrate their applications toward GPU computation.
The usefulness of CUDA in robotics and vision has been successfully demonstrated with significant speedups in many applications [3]-[6]. However, it introduces an overhead due to the need to transfer data between the CPU and GPU spaces, because most GPU processors work in a dedicated memory, independent from the system memory of the CPU. Consequently, image data need to be moved back and forth between the different types of memory for processing in the GPU. The processing flow consists of the following steps:
1) Upload data from main memory to GPU memory.
2) Initiate the GPU computing kernel.
3) Perform parallel computation in the GPU's cores.
4) Download the resulting data from GPU memory to main memory.

The OpenCV CUDA Module
In the following example, we assume that the reader is familiar with OpenCV and C++ programming (for novices, an introduction is provided in [7]). Unless otherwise stated, the code snippets are based on OpenCV 3.4.0, but they can be easily adapted to earlier (2.4) or later (4.x) versions.
In the OpenCV library, all the classes and functions are defined in the namespace cv. The main object is the class cv::Mat, which is essentially a matrix holding the pixel values of an image. The GPU modules in OpenCV define a class cv::cuda::GpuMat, which is a container for image data kept in GPU memory, with a very similar interface to its CPU counterpart.
Let's consider a quick example in which a color image is converted into gray and binarized with a fixed threshold. In the CPU version, the source image src is first converted to an intermediate gray image src_gray, which is then thresholded into the resulting image dst. We need to define the variables (line 1) and call the OpenCV functions cv::cvtColor and cv::threshold (lines 2-3) to execute the task. This processing flow is depicted in Figure 2, where all the data are stored in the CPU memory and all the operations are performed by the CPU.
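The CPU listing referred to above appears to have been dropped during extraction; a plausible reconstruction, with line numbers matching the references in the text (the 128/255 threshold values are taken from the GPU listing below):

```cpp
1 cv::Mat src, dst, src_gray;
2 cv::cvtColor( src, src_gray, cv::COLOR_BGR2GRAY );
3 cv::threshold( src_gray, dst, 128, 255, cv::THRESH_BINARY );
```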
In the GPU version, in addition to the variables for the initial and destination images (line 1), we need some new variables for processing the data in the GPU memory (line 2); the intermediate image src_gray is also stored in the GPU memory to minimize data transfers. The processing task is performed by the equivalent functions of the OpenCV CUDA module, cv::cuda::cvtColor and cv::cuda::threshold:

```cpp
1 cv::Mat src, dst;
2 cv::cuda::GpuMat gpu_src, gpu_dst, src_gray;
3 gpu_src.upload( src );
4 cv::cuda::cvtColor( gpu_src, src_gray, cv::COLOR_BGR2GRAY );
5 cv::cuda::threshold( src_gray, gpu_dst, 128, 255, cv::THRESH_BINARY );
6 gpu_dst.download( dst );
```

First, the image is transferred from CPU to GPU memory (line 3); then, the processing steps are executed (lines 4-5); and finally, the resulting image is transferred from the GPU back to CPU memory (line 6). The processing flow is depicted in Figure 3, where the CPU and GPU memory spaces and the different processing steps are represented.
There is an inherent overhead in the GPU processing flow due to the transfer of the images between the CPU and GPU memories. Such overhead can be minimized if all the processing operations are performed in the GPU and only the initial and final images are transferred.
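A sketch of such a fully GPU-resident pipeline (a hypothetical example, not the article's listing: it extends the thresholding code above with an extra GPU-side Gaussian blur, assuming the cv::cuda::createGaussianFilter API of the CUDA filters module):

```cpp
// All intermediate results stay in GPU memory; only src and dst cross the bus.
cv::Mat src, dst;
cv::cuda::GpuMat gpu_src, gpu_gray, gpu_blur, gpu_dst;
gpu_src.upload( src );                                        // single upload
cv::cuda::cvtColor( gpu_src, gpu_gray, cv::COLOR_BGR2GRAY );
cv::Ptr<cv::cuda::Filter> gauss =
    cv::cuda::createGaussianFilter( CV_8UC1, CV_8UC1, cv::Size(5, 5), 1.5 );
gauss->apply( gpu_gray, gpu_blur );
cv::cuda::threshold( gpu_blur, gpu_dst, 128, 255, cv::THRESH_BINARY );
gpu_dst.download( dst );                                      // single download
```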
Let's define T_CPU and T_GPU as the computation times of the image processing operations (cvtColor, threshold) on the CPU and GPU, respectively, and T_transfer as the overhead of transferring the images between the two memories. A speed gain will be obtained if and only if

T_GPU + T_transfer < T_CPU.

These computation times depend mainly on two factors:
• Hardware technology of the respective boards: OpenCV is highly optimized for CPUs with multiple cores and vector instructions.
• Degree of parallelization of the processing algorithms: some vision operations may benefit more than others from the use of multiple cores in the GPU.

Image Processing Applications
In the following, we elaborate on four examples of image processing applications [edge detection, feature extraction, optical flow, and object detection with deep neural networks (DNNs)] that use OpenCV with a CPU and GPU in different hardware configurations.

The first example is a simple edge detection application with the well-known Canny algorithm [8]. In the CPU version, besides the initial and final images defined in line 1, three more variables are created in line 5 for storing the intermediate images. The algorithm parameters are defined in lines 2-4, and the processing steps (converting to gray, blurring, and computing the edges) are executed in lines 6-8. Finally, the edges are used as a pixel mask for copying the original image to the destination image in line 9.
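The CPU listing described above seems to have been lost during extraction; a plausible reconstruction, with line numbers matching the references in the text (the threshold and kernel values are assumptions, not taken from the original):

```cpp
1 cv::Mat src, dst;
2 int lowThreshold = 50;
3 int ratio = 3;
4 int kernel_size = 3;
5 cv::Mat src_gray, detected_edges, mask;
6 cv::cvtColor( src, src_gray, cv::COLOR_BGR2GRAY );
7 cv::blur( src_gray, detected_edges, cv::Size(3, 3) );
8 cv::Canny( detected_edges, mask, lowThreshold, lowThreshold * ratio, kernel_size );
9 src.copyTo( dst, mask );
```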
To measure the average computing times of the algorithm, we processed the frames of a benchmarking video on three different hardware configurations of a CPU and GPU:
• Desktop PC, with an Intel Core i7-6700 CPU at 3.4 GHz and a GeForce GTX 1080 GPU
• Laptop PC, with an Intel Core i7-8550U CPU at 3.3 GHz and a GeForce GTX 1050 GPU
• Embedded PC, an NVIDIA Jetson Nano with an ARM Cortex-A57 processor and an integrated GPU.
The video consists of 50 s of footage from a car on a highway, available at Udacity's Advanced Lane Finding Project (https://github.com/udacity/CarND-Advanced-Lane-Lines), recorded at 25 Hz with a resolution of 1,280 × 720 pixels in 24-b red, green, and blue. The main specifications of the GPUs for the three systems are presented in Table 1. The desktop PC features the most powerful GPU, both in terms of processing cores and transfer speed, but it also requires more energy compared to the laptop and embedded PCs, which are adequate for mounting on a small robotic platform.
The source code, with instructions for compilation and execution, is publicly available (https://github.com/RobInLabUJI/opencv-cuda). For the sake of reproducibility, we use Docker (https://www.docker.com), a Linux container technology that offers some advantages for easy replication of code: encapsulation, isolation, portability, and control. In addition, containers have less overhead than virtual machines (usually less than 1% and hardly noticeable), and they can access the GPU transparently. As a downside, the GPU-enabled version of Docker (nvidia-docker) does not yet support Windows or macOS.
The code can also be compiled and executed natively on a Linux computer (as long as all the requirements, basically OpenCV and CUDA, are previously installed) with the typical building commands:

```shell
mkdir -p build
cd build
cmake ..
make
```

The results are shown in Table 2. They measure the mean and standard deviation of the execution time for 1,200 frames of the video (the initial 60 frames are skipped to avoid initialization delays). The execution time is measured from the first call to the processing functions until the final result is returned, averaged with a moving window of 30 frames. For the GPU cases, the measured time includes the uploading of the initial image to GPU memory and the downloading of the result image back to CPU memory. Visual information [edges, oriented FAST and rotated BRIEF (ORB) keypoints, and optical flow] is included in the measured code for clarity and debugging purposes, although in a real setup it could be removed to increase the throughput.
It is worth noting that, for this application, the CPUs are faster than the GPUs in all three systems; edge detection is a relatively simple computation, and its execution time is small compared with the overhead of transferring the images into the GPU memory. The OpenCV functions cv::getTickCount() and cv::getTickFrequency() are used to obtain the number of ticks before and after the processing work, translated into seconds. A Boolean variable indicates whether to use the CPU or GPU; the value of this variable can be toggled with a keystroke.

In the second example, ORB features are detected and extracted from the image. Such features are very important in robotics applications, such as visual SLAM [9]. In the CPU version, we first define the necessary variables for storing the original and final images, the intermediate gray image, and the structures for storing the keypoints and descriptors of the ORB features (lines 1-4). Second, the feature detector is initialized with default parameters in line 4. Finally, the processing steps are performed in lines 5-7: the original color image is converted into a gray image, the keypoints are detected, and their descriptors are computed. In line 8, the keypoints are drawn into the destination image for visualization.
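The ORB listings referred to in this and the following paragraphs appear to have been dropped during extraction; a plausible reconstruction, with line numbers matching the references in the text (CPU version first):

```cpp
1 cv::Mat src, dst, src_gray;
2 std::vector<cv::KeyPoint> keypoints;
3 cv::Mat descriptors;
4 cv::Ptr<cv::ORB> orb = cv::ORB::create();
5 cv::cvtColor( src, src_gray, cv::COLOR_BGR2GRAY );
6 orb->detect( src_gray, keypoints );
7 orb->compute( src_gray, keypoints, descriptors );
8 cv::drawKeypoints( src, keypoints, dst );
```

And a sketch of the GPU version, assuming the cv::cuda::ORB class of the CUDA features module:

```cpp
1 cv::Mat src, dst;
2 cv::cuda::GpuMat gpu_src, src_gray;
3 std::vector<cv::KeyPoint> keypoints;
4 cv::cuda::GpuMat descriptors;
5 gpu_src.upload( src );
6 cv::Ptr<cv::cuda::ORB> orb = cv::cuda::ORB::create();
7 cv::cuda::cvtColor( gpu_src, src_gray, cv::COLOR_BGR2GRAY );
8 orb->detect( src_gray, keypoints );
9 orb->compute( src_gray, keypoints, descriptors );
10 cv::drawKeypoints( src, keypoints, dst );
```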
One should note that the processing code is essentially the same in both versions; lines 4-7 of the CPU code and lines 6-9 of the GPU code differ only in the use of the namespace cv::cuda instead of cv for the class ORB (line 4/6) and the functions ORB::create and cvtColor (lines 4/6 and 5/7).
Finally, drawing the keypoints is done in exactly the same way (line 10 of the GPU code is the same as line 8 of the CPU code). The output of the ORB detector is shown in Figure 5.
For debugging purposes, the code examples include visualization, and the corresponding function calls have been included in the benchmarking. Since the visualization process uses the same function call in both the CPU and GPU versions, it should not affect the difference between them in terms of performance.
The results are shown in Table 3. In this case, the GPUs are faster than the CPUs due to the increased computational workload demanded by the ORB algorithm. For simplicity, this example does not compute the matching of ORB features, but it is possible to use either the CPU or the GPU for that purpose with the classes cv::DescriptorMatcher and cv::cuda::DescriptorMatcher, respectively.

In the third example, we compute the dense optical flow with the Farneback algorithm [10]. The source code for the CPU version is as follows:

```cpp
1 cv::Mat src, dst;
2 cv::Mat prev, next;
3 cv::Mat flow(prev.size(), CV_32FC2);
4 cv::cvtColor(src, next, cv::COLOR_BGR2GRAY);
5 cv::calcOpticalFlowFarneback(prev, next, flow, 0.5, 3, 15, 3, 5, 1.2, 0);
```

Since the optical flow is computed from the difference between the current and previous frames, we need to define additional variables in line 2 to store the frames. We also define a matrix of floating-point numbers, flow, for the result [in line 3, CV_32FC2 denotes a two-channel floating-point array]. This flow matrix contains the gradient of the movement between two frames; for each pixel location in the original frame, the channels contain dx and dy, so that prev_x + dx = next_x and prev_y + dy = next_y.
The computation steps are quite simple: the color image is converted into a gray image (line 4), and the optical flow algorithm is executed (line 5). For the sake of simplicity, we have omitted additional instructions for displaying the result and storing the frames.
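The GPU version can be sketched as follows (a hedged sketch, not the article's verbatim listing; it assumes the cv::cuda::FarnebackOpticalFlow class of the CUDA optical-flow module, with the parameters mapped from the CPU call above):

```cpp
// GPU Farneback sketch: parameters map to the CPU call above
// (3 levels, pyramid scale 0.5, window 15, 3 iterations, polyN 5, polySigma 1.2).
cv::cuda::GpuMat gpu_prev, gpu_next, gpu_flow;
gpu_prev.upload( prev );
gpu_next.upload( next );
cv::Ptr<cv::cuda::FarnebackOpticalFlow> farn =
    cv::cuda::FarnebackOpticalFlow::create( 3, 0.5, false, 15, 3, 5, 1.2, 0 );
farn->calc( gpu_prev, gpu_next, gpu_flow );
gpu_flow.download( flow );  // back to the cv::Mat flow used above
```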
The output of the optical flow algorithm is displayed in Figure 6. The hue of each pixel block represents the orientation of the optical flow vector at that point, and the intensity is proportional to the magnitude of the flow. The results are shown in Table 4. As in the previous example, the execution times for the GPUs are lower than for the CPUs, since computing dense optical flow is a demanding operation.
Finally, we test the DNN module of OpenCV. Since version 3.1, the library has included a DNN module that implements the forward pass (inference) with networks pretrained using popular deep-learning frameworks such as Caffe [11] or TensorFlow [12]. A CUDA backend was added in OpenCV 4.2.0. In this example, we use the YOLO v3 network [13], a state-of-the-art, real-time object-detection system.
While the details of the OpenCV DNN module are beyond the scope of this article, its design is based on a single interface that runs on different backends and computation devices (CPU, OpenCL, and CUDA). Consequently, the source code is exactly the same regardless of whether the CPU or GPU is used, except for the parameters that select the appropriate backend and computation target. The values for using the CPU are as follows:

```cpp
net.setPreferableBackend(cv::dnn::DNN_BACKEND_OPENCV);
net.setPreferableTarget(cv::dnn::DNN_TARGET_CPU);
```

The GPU can be selected with

```cpp
net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);
```

A typical output image from the DNN module is shown in Figure 7, where several cars are correctly identified in the input image. The frame rates for the CPU and GPU versions running on the three types of computers used in the tests are shown in Table 5.
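For context, the surrounding inference code can be sketched as follows (a hedged sketch: the file names are placeholders for the standard YOLO v3 release files, frame is the input image, and the 416 × 416 input size is the network's usual default):

```cpp
// Load the pretrained network (placeholder file names) and select the GPU.
cv::dnn::Net net = cv::dnn::readNetFromDarknet("yolov3.cfg", "yolov3.weights");
net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);
// Preprocess the frame into a blob and run the forward pass.
cv::Mat blob = cv::dnn::blobFromImage(frame, 1 / 255.0, cv::Size(416, 416),
                                      cv::Scalar(), true, false);
net.setInput(blob);
std::vector<cv::Mat> outs;
net.forward(outs, net.getUnconnectedOutLayersNames());
```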
In addition to absolute timings, it is illustrative to calculate the speedup of the GPU with respect to the CPU: in other words, how much faster than the CPU is the GPU for a given application. The speedup results are shown in Figure 8, which displays the values for each application (edge detection, ORB features, optical flow, and DNNs) on each platform (desktop, laptop, and embedded PC).
The overhead penalty can be noticed in the edge detection application on every platform. On the other hand, the speedup is higher in the other applications, skyrocketing in the last example (DNN for object recognition). This result is not surprising, since CUDA is used intensively by the deep-learning community. But the benefits of using the GPU with other OpenCV functions should not be overlooked: speedups of 580% and 290% are obtained for the computation of the optical flow on the desktop and laptop PCs, respectively.

OpenCV in ROS Projects
OpenCV is a widely used library in robotics projects; consequently, there is a precompiled module for the Robot Operating System (ROS) [14] (http://wiki.ros.org/opencv3), which, unfortunately, does not include CUDA support. However, GPU acceleration can still be used by replacing the standard module with a CUDA-enabled version of the OpenCV library. This is done by installing the library and setting the appropriate path in the file CMakeLists.txt of any ROS module using OpenCV. In addition, all other ROS packages that depend on OpenCV (cv_bridge, image_pipeline, image_transport, etc.) must be rebuilt; that is, their source code must be downloaded into an ROS workspace and compiled with catkin_make. A complete example of a simple subscriber is presented in the "ros" branch of the source code repository of this article at https://github.com/RobInLabUJI/opencv-cuda/tree/ros.
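The CMakeLists.txt change mentioned above can be sketched as follows (the install prefix is an assumption; adjust it to wherever the CUDA-enabled OpenCV was installed, and my_node is a placeholder target name):

```cmake
# Point CMake at the CUDA-enabled OpenCV build before find_package.
set(OpenCV_DIR /usr/local/share/OpenCV)
find_package(OpenCV REQUIRED)
include_directories(${OpenCV_INCLUDE_DIRS})
target_link_libraries(my_node ${OpenCV_LIBRARIES})
```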
Converting an ROS topic image to a CUDA image is straightforward; the topic message is converted to an OpenCV image, and this image is uploaded to the GPU:

```cpp
cv_ptr = cv_bridge::toCvCopy(msg, sensor_msgs::image_encodings::BGR8);
gpuInImage.upload(cv_ptr->image);
```

Once the image is uploaded, the processing can be done as usual, and the result can be downloaded and converted into an ROS message.
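The reverse path can be sketched like this (a hedged sketch: gpuOutImage, cv_ptr, and the publisher pub are assumed to exist in the subscriber node):

```cpp
// Download the processed image from GPU memory and publish it as an ROS message.
cv::Mat result;
gpuOutImage.download(result);
sensor_msgs::ImagePtr out_msg =
    cv_bridge::CvImage(cv_ptr->header, sensor_msgs::image_encodings::BGR8,
                       result).toImageMsg();
pub.publish(out_msg);
```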

Conclusions
CUDA for OpenCV is an easy solution for accelerating vision applications in robotics on systems equipped with a CUDA-enabled GPU. The migration of code from CPU-based to GPU-based is simple and relatively straightforward, even trivial in some cases. The speedup that can be achieved is system- and problem-dependent. For simple vision algorithms, modern CPUs can be faster; for complex problems involving a sequence of operations on the image, parallelization in the GPU leads to better performance; and for deep-learning applications, the improvement is significant. We provided some examples with well-known algorithms that are widely used by the robotics community, with the aim of encouraging researchers to improve the throughput of their systems by squeezing all the computing power out of their hardware. CUDA and other computing frameworks (DirectCompute [15] and OpenCL [16]) have become programming standards for parallel computing, and their inclusion in popular libraries like OpenCV is an opportunity for developers to benefit from parallelization without a significant investment in learning specific parallel programming techniques.
An advantage of an open framework such as OpenCL over CUDA is that it is supported by both AMD and Nvidia cards. The interested reader can refer to [17] for details about using OpenCL in OpenCV.