On the benefits of the remote GPU virtualization mechanism: The rCUDA case

Graphics processing units (GPUs) are being adopted in many computing facilities given their extraordinary computing power, which makes it possible to accelerate many general-purpose applications from different domains. However, GPUs also present several side effects, such as increased acquisition costs and larger space requirements. They also require more powerful energy supplies. Furthermore, GPUs still consume some amount of energy while idle, and their utilization is usually low for most workloads. In a similar way to virtual machines, the use of virtual GPUs may address the aforementioned concerns. In this regard, the remote GPU virtualization mechanism allows an application being executed in a node of the cluster to transparently use the GPUs installed at other nodes. Moreover, this technique makes it possible to share the GPUs present in the computing facility among the applications being executed in the cluster. In this way, several applications being executed in different (or the same) cluster nodes can share 1 or more GPUs located in other nodes of the cluster. Sharing GPUs should increase overall GPU utilization, thus reducing the negative impact of the side effects mentioned before. Reducing the total amount of GPUs installed in the cluster may also be possible. In this paper, we explore some of the benefits that remote GPU virtualization brings to clusters. For instance, this mechanism allows an application to use all the GPUs present in the computing facility. Another benefit of this technique is that cluster throughput, measured as jobs completed per time unit, is noticeably increased when this technique is used. In this regard, cluster throughput can be doubled for some workloads. Furthermore, in addition to increasing overall GPU utilization, total energy consumption can be reduced by up to 40%. This may be key in the context of exascale computing facilities, which present an important energy constraint. Other benefits are related to the cloud computing domain, where a GPU can be easily shared among several virtual machines. Finally, GPU migration (and therefore server consolidation) is one more benefit of this novel technique.

FIGURE 1 Example of a graphics processing unit (GPU)-accelerated cluster. CPU indicates central processing unit
FIGURE 2 Logical configuration of a cluster when the remote graphics processing unit (GPU) virtualization technique is used. CPU indicates central processing unit
Figure 1 shows an example of a GPU-accelerated cluster, where each node could include, for instance, Xeon processors and 1 NVIDIA Tesla GPU. Additionally, the interconnection network could be an FDR InfiniBand fabric. Notice, however, that using GPUs in such a configuration is not exempt from side effects. For example, let us consider the execution of a distributed Message Passing Interface (MPI) application, which does not require the use of GPUs. Typically, this application will spread across several nodes of the cluster, thus flooding the CPU cores available in them. In this scenario, the GPUs in the nodes involved in the execution of such an MPI application would become unavailable for other applications because all the CPU cores in those nodes would be devoted to the nonaccelerated MPI application. This would cause those GPUs to remain idle for some periods of time, thus reducing overall GPU utilization and making the initial economic investment made during cluster acquisition take longer to amortize.
Another example of the concerns associated with the use of GPUs in clusters is related to the way that job schedulers such as Slurm 8 perform the accounting of resources in a cluster. These job schedulers use a fine granularity for resources such as CPUs or memory but not for GPUs. For instance, job schedulers can assign CPU resources on a per-core basis, thus being able to share the CPU sockets present in a server among several applications. In the case of memory, job schedulers can also assign, in a shared approach, the memory present in a given node to the several applications that will be concurrently executed in that server. However, in the case of GPUs, job schedulers use a per-GPU granularity. In this regard, GPUs are assigned to applications in an exclusive way. Hence, a GPU cannot be shared among several applications even when it has enough resources to allow their concurrent execution, causing overall GPU utilization to be, in general, low. This fact not only reduces the effective computing power of clusters but also causes a nonnegligible amount of energy to be wasted, both aspects being key concerns in the context of exascale computing.
To address some of the side effects related to the use of GPUs, the remote GPU virtualization mechanism could be used. This software mechanism allows an application being executed in a computer that does not own a GPU to transparently make use of accelerators installed in other nodes of the cluster. The remote GPU virtualization technique allows physical GPUs to be logically detached from nodes, so that the decoupled GPUs can be concurrently shared by all the nodes of the computing facility in a way that is transparent to applications. Figure 2 shows the cluster configuration envisioned after applying the remote GPU virtualization mechanism. In the new cluster configuration, GPUs are logically detached from nodes and a pool of GPUs is created. Graphics processing units in this pool can be accessed from any node in the cluster. Furthermore, a given GPU may concurrently serve more than 1 application. This sharing of GPUs not only increases overall GPU utilization but also makes it possible to create cluster configurations where not all the nodes in the cluster own a GPU while all the nodes can still execute GPU-accelerated applications. This cluster configuration would reduce the costs associated with the acquisition and later use of GPUs. In this regard, the total energy required to operate a computing facility may be decreased, thus easing the serious energy concerns of future exascale computing installations.
In this paper, we explore some of the benefits that the remote GPU virtualization mechanism provides to clusters. We present this exploration in the context of the Remote CUDA (rCUDA) middleware, given that it is the most modern remote GPU virtualization solution and also the one that provides the best performance, as will be shown later in the paper. The rest of the paper is organized in the following way: Section 2 presents a review of the remote GPU virtualization technique. Later, Section 3 introduces the rCUDA technology in more detail, given that it will be the one used in this work to quantify the benefits of the remote GPU virtualization mechanism. Next, Section 4 introduces 6 of the benefits of this virtualization technique. Finally, Section 5 concludes the paper. Notice that this paper is an extension of a previous workshop paper. 9

REMOTE GPU VIRTUALIZATION SOLUTIONS
Frameworks such as CUDA 1 assist programmers in using GPUs for general-purpose computing. Several remote GPU virtualization solutions exist for this framework, such as GridCuda, 10 DS-CUDA, 11 gVirtuS, 12 vCUDA, 13 GViM, 14 and rCUDA. 15 Basically, these middleware proposals share a GPU by virtualizing it. In this way, these middleware solutions provide applications with virtual instances of the real device, which can therefore be concurrently shared. Usually, these GPU sharing solutions place the virtualization boundary at the application program interface (API) level (CUDA in the case of NVIDIA GPUs). In general, CUDA-based virtualization solutions aim to offer the same API as the NVIDIA CUDA Runtime API 16 does. Figure 3 depicts the architecture underlying most of these virtualization solutions, which follow a client-server distributed approach. The client part of the middleware is installed in the cluster node executing the application requesting GPU services, whereas the server side runs in the computer owning the actual GPU. In this way, the client receives a CUDA request from the accelerated application and appropriately processes and forwards it to the remote server. In the server node, the middleware receives the request and interprets and forwards it to the GPU, which completes the execution of the request and provides the execution results to the server middleware. In turn, the server sends back the results to the client middleware, which forwards them to the original application, which is not aware that its request has been served by a remote GPU instead of a local one.
FIGURE 3 Organization of remote graphics processing unit (GPU) virtualization frameworks. API indicates application program interface; CUDA, Compute Unified Device Architecture
CUDA-based GPU virtualization solutions may be classified into 2 types: (1) those intended to be used in the context of virtual machines (VMs) and (2) those devised as general-purpose virtualization solutions, to be used in native domains (notice that these latter solutions may also be used within VMs). Frameworks in the first category usually make use of shared-memory mechanisms to transfer data from the main memory inside the VM to the GPU in the native domain, whereas the general-purpose virtualization solutions in the second type make use of the network fabric in the cluster to transfer data from the main memory in the client side to the remote GPU located in the server. This is why these latter solutions are commonly known as remote GPU virtualization solutions.
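To make the client-server organization just described more concrete, the following minimal sketch shows how a single CUDA call could be marshaled by the client side and serviced by the server side. It is only an illustration under assumed names (RpcRequest, RpcReply, and remote_cudaMalloc are hypothetical), it uses a local socket pair to stand in for the cluster network, and it simulates the GPU side so that it runs anywhere; it is not the actual rCUDA wire protocol, which is not public.

    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/socket.h>

    /* Hypothetical on-the-wire format: one record per forwarded CUDA call. */
    typedef struct {
        uint32_t op;    /* which API call: 0 = cudaMalloc */
        uint64_t size;  /* argument of the call */
    } RpcRequest;

    typedef struct {
        uint32_t status; /* cudaError_t returned by the real call */
        uint64_t devPtr; /* opaque device pointer living on the server */
    } RpcReply;

    /* Client side: what a wrapper for cudaMalloc() could look like. */
    static RpcReply remote_cudaMalloc(int sock, uint64_t size) {
        RpcRequest req = { 0, size };
        RpcReply rep;
        write(sock, &req, sizeof req);   /* forward request to the server */
        read(sock, &rep, sizeof rep);    /* wait for the result           */
        return rep;
    }

    /* Server side: receive, execute on the real GPU, send the result back.
       Here the real cudaMalloc() is simulated so the sketch runs anywhere. */
    static void serve_one_request(int sock) {
        RpcRequest req;
        read(sock, &req, sizeof req);
        RpcReply rep = { 0 /* cudaSuccess */, 0xdeadbeef /* fake dev ptr */ };
        /* A real server would call cudaMalloc((void **)&rep.devPtr, req.size). */
        write(sock, &rep, sizeof rep);
    }

    int main(void) {
        int sv[2];
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv); /* stands in for TCP or IB */
        if (fork() == 0) { serve_one_request(sv[1]); _exit(0); }
        RpcReply rep = remote_cudaMalloc(sv[0], 1 << 20);
        printf("status=%u remote devPtr=0x%llx\n", (unsigned)rep.status,
               (unsigned long long)rep.devPtr);
        return 0;
    }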
Regarding the first type of GPU virtualization solutions mentioned above, several solutions have been developed to be specifically used within VMs, such as, for example, vCUDA, GViM, gVirtuS, and Shadowfax. 17 In the second type of virtualization solutions mentioned above, which provide general-purpose GPU virtualization, one can find rCUDA, GridCuda, DS-CUDA, and Shadowfax II. 18 To provide a comprehensive comparison among the different GPU virtualization solutions described in this section, Figure 4 presents a performance analysis of 3 publicly available GPU virtualization solutions: DS-CUDA, rCUDA, and gVirtuS. This figure also shows the performance of CUDA as the baseline reference. The widely used bandwidthTest benchmark from the NVIDIA CUDA Samples 19 has been used.
FIGURE 4 Performance comparison among 3 publicly available CUDA GPU virtualization solutions: gVirtuS, DS-CUDA, and rCUDA. The comparison is performed in attained bandwidth. The performance of CUDA is also depicted for comparison purposes. CUDA indicates Compute Unified Device Architecture; GPU, graphics processing unit
The test bed used for conducting the performance experiments is based on 2 Supermicro servers like the ones described in Section 4. The bandwidth test (along with the client side of the different frameworks) was run in 1 of the computers, whereas the server side of the middleware solutions was executed in another computer, which owns a Tesla K20 GPU. The InfiniBand FDR network technology was used to connect both computers. Therefore, both the rCUDA and DS-CUDA solutions made use of the InfiniBand Verbs API. In the case of gVirtuS, given that it is not able to take advantage of the InfiniBand Verbs API, TCP/IP over InfiniBand was used. Results in Figure 4 deserve some discussion. First, it can be seen that CUDA achieves the highest performance when pinned memory is used (Figure 4A,B), attaining a bandwidth of around 6000 MB/s. Notice that this bandwidth is reduced to half for copies using pageable memory (Figure 4C,D). Second, Figure 4 shows that rCUDA outperforms the other 2 remote GPU virtualization solutions. Actually, for copies using pageable memory, rCUDA also performs better than CUDA. This is a well-known effect thoroughly described in previous works on rCUDA 15 and is due to the use of an efficient pipelined communication based on the use of internal preallocated pinned memory buffers. On the other hand, notice that both rCUDA and DS-CUDA make use of the InfiniBand Verbs API, thus having access to the large bandwidth available in this interconnect. However, although rCUDA is able to attain an important fraction of the available bandwidth, DS-CUDA presents a relatively poor performance. Therefore, it must be assumed that the difference in bandwidth is due to the different way that both GPU virtualization solutions manage the InfiniBand interconnect. Also, notice that DS-CUDA supports neither memory copies larger than 32 MB nor the use of pinned memory. On the other hand, notice that the performance of gVirtuS is extremely low. One may think that this is due to the fact that gVirtuS is using TCP/IP over InfiniBand, which clearly achieves lower performance than the InfiniBand Verbs API. However, according to our measurements with the iperf tool, 20 InfiniBand FDR provides around 1190 MB/s when TCP/IP over InfiniBand is used. This bandwidth is noticeably larger than the one attained by gVirtuS. Hence, the low performance of this middleware is not due to the use of TCP/IP over InfiniBand but to the way it internally manages communications.
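The pinned versus pageable memory gap discussed above is easy to reproduce with a few lines of CUDA code. The sketch below is modeled on the idea of the bandwidthTest benchmark (it is not the NVIDIA sample itself) and times a host-to-device copy from both kinds of host buffers; error checking is omitted for brevity.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Time a host-to-device copy of 'bytes' from 'src' using CUDA events. */
    static float copy_time_ms(void *dst, const void *src, size_t bytes) {
        cudaEvent_t start, stop;
        float ms = 0.0f;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    int main(void) {
        const size_t bytes = 64UL << 20;      /* 64 MB transfer */
        void *dev, *pageable, *pinned;
        cudaMalloc(&dev, bytes);
        pageable = malloc(bytes);             /* ordinary host memory */
        cudaMallocHost(&pinned, bytes);       /* page-locked host memory */

        float t_page = copy_time_ms(dev, pageable, bytes);
        float t_pin  = copy_time_ms(dev, pinned, bytes);
        /* MB/s = bytes / (ms * 1e3), with 1 MB = 1e6 bytes */
        printf("pageable: %.1f MB/s\n", bytes / (t_page * 1e3));
        printf("pinned:   %.1f MB/s\n", bytes / (t_pin  * 1e3));

        cudaFreeHost(pinned);
        free(pageable);
        cudaFree(dev);
        return 0;
    }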

As a final consideration for this review section, it is important to remark that remote GPU virtualization has traditionally introduced a nonnegligible overhead: applications do not access GPUs attached to the local Peripheral Component Interconnect Express (PCIe) link but rather access devices installed in other nodes of the cluster, traversing a network fabric with a lower bandwidth. However, this performance overhead has been significantly reduced thanks to recent advances in networking technologies as well as a careful design of the remote virtualization solution, as shown in Figure 4 for the rCUDA framework. The reader may refer to Reaño et al 21 for a deeper analysis.

RCUDA: REMOTE CUDA
As already mentioned in Section 1, in this study we use the rCUDA middleware, given that it is the most up-to-date solution and also the one providing the best performance among the different publicly available GPU virtualization solutions, as shown in the previous section. Furthermore, it was the only framework able to run the applications analyzed in this paper.
In this section, we present rCUDA in more detail. Figure 5 depicts a detailed view of the architecture of the rCUDA middleware.
The rCUDA middleware supports version 7.5 of CUDA, being binary compatible with it, which means that CUDA programs do not need to be modified in order to use rCUDA. The middleware is configured by means of environment variables. The first one, RCUDA_DEVICE_COUNT, indicates how many GPUs are assigned to the application. For instance, if 2 GPUs are assigned to the application, then the command "export RCUDA_DEVICE_COUNT=2" should be executed. The second environment variable, RCUDA_DEVICE_j, indicates, for each of the n GPUs assigned to the application, in which cluster node the GPU with identifier j is located. For instance, in the previous example, the commands "export RCUDA_DEVICE_0=192.168.0.1" and "export RCUDA_DEVICE_1=192.168.0.2" should be executed. Finally, the RCUDAPROTO environment variable sets the communication module to be used during the execution of the application. For instance, the command "export RCUDAPROTO=IB" should be used to leverage the InfiniBand Verbs API. In case of using the TCP/IP communication module, the command "export RCUDAPROTO=TCP" should be executed.
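Binary compatibility means that an unmodified CUDA program simply observes whatever GPUs the middleware exposes. The following minimal sketch illustrates this point; with the example configuration above (RCUDA_DEVICE_COUNT=2), it would report 2 devices, exactly as if they were installed locally.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        /* Under rCUDA this reflects RCUDA_DEVICE_COUNT; under plain CUDA,
           the GPUs physically installed in the node. */
        cudaGetDeviceCount(&count);
        printf("%d GPU(s) visible\n", count);
        for (int i = 0; i < count; i++) {
            struct cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("device %d: %s, %zu MB\n", i, prop.name,
                   prop.totalGlobalMem >> 20);
        }
        return 0;
    }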

BENEFITS OF USING REMOTE GPU VIRTUALIZATION
In this section, we introduce 6 of the benefits that the remote GPU virtualization mechanism presents to clusters and applications. Namely, these benefits, which will be further described and analyzed in the next subsections, are the following ones:
1. More GPUs are available for a single application.
2. Busy CPU sockets in a server do not hinder the use of the GPUs at that server.
3. Cluster throughput is increased at the same time that energy consumption is reduced. Overall GPU utilization is also increased.
4. Cluster upgrades are made easier and cheaper just by attaching GPU servers to a non-GPU cluster.
5. Several VMs can concurrently access the same GPU in a shared manner.
6. Graphics processing unit jobs can be easily migrated across the cluster to consolidate them into fewer servers.
The next subsections describe and analyze these benefits. A performance evaluation is included for most of them. To that end, the test bed leveraged is based on the use of 1027GR-TRF Supermicro servers, each of them including 2 Intel Xeon E5-2620 v2 processors, 1 NVIDIA Tesla K20 GPU, and 1 FDR InfiniBand network adapter.

Benefit no. 1: more GPUs available for a single application
When using CUDA, an MPI application can be distributed across several nodes in the cluster to make use of the GPUs installed in those nodes. However, a parallel shared-memory application based on the use of threads can only run in a single node, and therefore, it can only benefit from the GPUs installed in that node. On the contrary, when rCUDA is leveraged, an application being executed in a single node can use all the GPUs in the cluster, thus boosting its performance.
In this case, the only limitation to increasing application performance would be the ability of the programmer to code the application in the proper way so that it takes advantage of as many GPUs as are available. Figure 6 shows the performance of the MontecarloMultiGPU Sample by NVIDIA when executed in a single node owning 4 GPUs with CUDA and also when executed in a cluster making use of up to 14 GPUs with rCUDA. The CUDA executions have been performed in a node based on the Supermicro SYS7047GR-TRF server, populated with 4 NVIDIA Tesla K20 GPUs. Given that CUDA can only use the GPUs installed in the same node that is executing the application, only up to the 4 GPUs inside the Supermicro SYS7047GR-TRF server can be used for the CUDA executions. On the contrary, when rCUDA is used, many additional GPUs can be provided to the application. Figure 6 shows how the use of a larger amount of GPUs contributes to reduce total execution time. Notice also that for 1 and 2 GPUs, execution time with rCUDA is slightly lower than with CUDA. This is mainly due to the higher bandwidth attained by rCUDA for moving data to/from the GPU, as shown in Section 2, as well as the faster synchronization of rCUDA with respect to CUDA, as shown in 1 study. 22 On the other hand, Figure 7 shows a screenshot of the deviceQuery Sample by NVIDIA when used with rCUDA after 64 GPUs have been assigned to a single application.
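The following sketch illustrates, under simplifying assumptions (uniform independent work per device, no error checking), the coding pattern that lets a single-node application exploit every visible GPU: with CUDA the loop only reaches the GPUs installed in the node, whereas under rCUDA the very same unmodified loop reaches all the GPUs exposed by the middleware.

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void scale(float *v, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= a;
    }

    int main(void) {
        const int n = 1 << 20;
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);  /* local GPUs with CUDA; all GPUs
                                        exposed by rCUDA otherwise */
        if (ngpus > 64) ngpus = 64;  /* static table for this sketch  */
        float *dev[64];

        /* Launch one independent chunk of work per visible GPU. */
        for (int g = 0; g < ngpus; g++) {
            cudaSetDevice(g);
            cudaMalloc(&dev[g], n * sizeof(float));
            cudaMemset(dev[g], 0, n * sizeof(float));
            scale<<<(n + 255) / 256, 256>>>(dev[g], 2.0f, n);
        }
        /* Wait for every device to finish, then release the memory. */
        for (int g = 0; g < ngpus; g++) {
            cudaSetDevice(g);
            cudaDeviceSynchronize();
            cudaFree(dev[g]);
        }
        printf("work completed on %d GPU(s)\n", ngpus);
        return 0;
    }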

Benefit no. 2: Busy CPUs in a server do not hinder the use of the GPUs at that server
Users of a cluster tend to require as many computing resources as possible for executing their applications to reduce application execution time. This may happen in several ways. For instance, it is quite common that users submitting a non-GPU-accelerated shared-memory parallel application to the job scheduler queues in the system request for their application as many CPU cores as are available in a node. In practice, this requirement translates into the application using all the CPU cores of the cluster node where the application has been launched. In this way, during the execution of such an application, no other application can be executed in that node because of the lack of CPU cores.
In a similar way, users may also submit to the job scheduler queues requests for executing nonaccelerated hybrid MPI shared-memory applications. These applications span over several nodes of the cluster, usually flooding all the CPU cores present at each of the nodes. As in the previous example, during the time that 1 of these hybrid applications is being executed, no other application can fit into the nodes that the former application is using because there is no CPU core available.
Although the execution of the mentioned applications may lead to a reduced application execution time and therefore an overall high CPU utilization, when the nodes involved in their execution include 1 or more GPUs, these accelerators will remain idle while these nonaccelerated applications are being executed. The accelerators in those nodes become unavailable for other applications because, to use them, it is required to launch an application in those nodes. However, this is not possible because that application would require at least 1 available CPU core, but all the CPU cores have been allocated to the nonaccelerated application, and therefore, the job scheduler will not forward any application to that node. This condition is depicted in Figure 8.
FIGURE 7 Screenshot of the deviceQuery Sample by NVIDIA when used with rCUDA after assigning 64 graphics processing units to an application
FIGURE 8 Graphics processing units in nodes 1 and 2 are not available because all the central processing unit cores at those nodes are busy with the execution of non-accelerated applications
FIGURE 9 The remote graphics processing unit (GPU) virtualization mechanism allows GPUs in nodes with busy central processing unit cores to be used by applications being executed in other nodes of the cluster
The remote GPU virtualization mechanism may be useful in the previous scenarios, as shown in Figure 9. When nonaccelerated applications block the use of the GPUs in 1 or several nodes of the cluster, frameworks such as rCUDA may allow the blocked GPUs to be used by allocating them to applications being executed in other nodes of the cluster.
In this way, the free CPU cores that were missing in the previous scenarios will now be located in other nodes. The net result is that, in addition to increasing overall CPU utilization, GPU utilization is also increased. On the other hand, remember that the rCUDA framework makes use of the rCUDA server to provide access to remote GPUs. This server, which is run as a daemon, must be executed in 1 of the CPU cores of the node owning the GPU. Given that, in the scenarios considered above, all the CPU cores in the nodes with blocked GPUs are being used for the execution of the nonaccelerated application, one may wonder whether the execution of the rCUDA server would introduce an important overhead, which in turn would penalize the execution time of the nonaccelerated application. Nevertheless, such an analysis is beyond the scope of this paper and, additionally, has already been addressed in 1 study. 15

Benefit no. 3: increased cluster throughput
When the remote GPU virtualization mechanism is used in a cluster, GPUs can be concurrently shared among several applications as long as there are enough memory resources available in the GPUs for the applications being executed. Additionally, given that a GPU can be used by applications being executed in a node other than the one where the GPU is installed, when all the CPU cores in the node owning the GPU are busy with a nonaccelerated application, the GPU can still be used from another cluster node, as described in benefit no. 2. These features contribute to a higher GPU utilization, which translates into an increased cluster throughput (measured in jobs per time unit) and a reduced energy consumption.
To quantify the benefits of these features, in this subsection, we study the impact that using the remote GPU virtualization mechanism has on the performance of a small cluster. To that end, we have executed several workloads in the cluster by submitting a series of randomly selected jobs to the Slurm queues. After job submission, several parameters have been measured, such as total execution time of the workloads, energy required to execute them, and GPU utilization. We have considered 2 different scenarios for executing the workloads. In the first one, the cluster uses CUDA, and therefore, applications can only use those GPUs installed in the same node where the application is being executed. In this scenario, an unmodified version of Slurm has been used. In the second scenario, we have made use of rCUDA, and therefore, an application being executed in a given node can use any of the GPUs available in the cluster. Moreover, we have modified Slurm 23 so that it is possible to schedule the use of remote GPUs. These 2 scenarios will allow us to compare the performance of a cluster using CUDA with that of a cluster using rCUDA. A 16-node cluster has been used for executing the workloads. The characteristics of the nodes are the same as the ones mentioned before (2 Xeon E5-2620 v2 sockets with 1 NVIDIA Tesla K20 GPU and 1 FDR InfiniBand adapter). One additional node (the 17th node) has been leveraged to execute the Slurm controller daemon responsible for scheduling jobs (the slurmctld process).
Several workloads have been considered to provide a more representative range of results. The workloads are composed of the following applications (see Table 1): GPU-BLAST, 24 LAMMPS, 25 mCUDA-MEME, 26 GROMACS, 27 BarraCUDA, 28 MUMmerGPU, 29 GPU-LIBSVM, 30 and NAMD. 31 Some of these applications are executed by spawning several processes, each of them creating several threads; during execution, each of these threads will use a different CPU core. In particular, the NAMD application will be distributed across 4 different nodes of the cluster (4 processes), and 12 threads will be launched at each node. Therefore, the NAMD application will make use of 4 entire nodes. In a similar way, the GROMACS application will keep busy 2 entire nodes while being executed. Furthermore, as both the NAMD and GROMACS applications do not make use of GPUs, the concern mentioned in benefit no. 2 about the use of the accelerators will appear. Regarding execution time, GPU-BLAST, LAMMPS, mCUDA-MEME, and GROMACS require less than 170 seconds to complete execution (they are "short" applications) whereas BarraCUDA, MUMmerGPU, GPU-LIBSVM, and NAMD require more than 240 seconds to be executed ("long" applications).
In addition to execution time, Table 1 also shows the GPU memory required by each application. For those applications composed of several processes, the amount of GPU memory depicted in Table 1 refers to the individual needs of each particular process. Notice that the amount of GPU memory is not specified for the GROMACS and NAMD applications because we are using nonaccelerated versions of these applications. The reason for this choice is simply to increase the heterogeneity degree of the workloads by using some CPU-only applications, as could be the case in many data centers.
In summary, the 8 applications used in this study present different characteristics: they differ not only in the amount of processes and threads used by each of them and in their execution time, but they also present different GPU usage patterns, which include both memory copies to/from GPUs and also kernel executions. Therefore, although the set of applications considered is finite, it may provide a representative sample of a workload typically found in current data centers. The workloads used in the experiments are shown in Table 3. The workload labeled as "Set 1" is composed of 400 instances randomly selected from applications GPU-Blast, LAMMPS, mCUDA-MEME, and GROMACS. The exact amount of instances for each application is shown in the table. Additionally, the exact sequence of the applications within the workload is also randomly set. In a similar way, the workload labeled as "Set 2" is composed of 400 instances of applications BarraCUDA, MUMmerGPU, GPU-LIBSVM, and NAMD. Finally, a third workload, referred to as "Set 1+2," has been created with instances from all the applications. Figure 10 shows the performance results. Remember that a small cluster composed of 16 nodes with 1 GPU at each node is being used.
The figure shows, for each of the workloads depicted in Table 3, the performance when CUDA is used along with the original Slurm job scheduler (results labeled as "CUDA") as well as the performance when rCUDA is used in combination with the modified version of Slurm (label "rCUDA"). Figure 10A shows total execution time for each of the workloads. Figure 10B depicts the averaged GPU utilization for all the 16 GPUs in the cluster. Data for GPU utilization have been gathered by polling each of the GPUs in the cluster once every second and afterwards averaging all the samples after completing workload execution.
An in-house Python script based on the pyNVML library was used for polling the GPUs. In a similar way, Figure 10C shows the total energy required for completing workload execution. Energy has been measured by polling once every second the power distribution units (PDUs) present in the cluster. The PDUs used are APC AP8653 units, which provide individual energy measurements for each of the servers connected to them. After workload completion, the energy required by all servers was aggregated to provide the measurements in Figure 10C.
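The in-house script itself is not reproduced here, but an equivalent sampling loop can be written directly against NVML, the C library that pyNVML wraps. The following sketch polls the utilization of every GPU in a node once per second (10 samples are taken purely for illustration; link with -lnvidia-ml).

    #include <stdio.h>
    #include <unistd.h>
    #include <nvml.h>

    /* Sample the utilization of every GPU in the node once per second. */
    int main(void) {
        unsigned int count = 0;
        if (nvmlInit() != NVML_SUCCESS) return 1;
        nvmlDeviceGetCount(&count);
        for (int s = 0; s < 10; s++) {       /* 10 one-second samples */
            for (unsigned int i = 0; i < count; i++) {
                nvmlDevice_t dev;
                nvmlUtilization_t util;
                nvmlDeviceGetHandleByIndex(i, &dev);
                nvmlDeviceGetUtilizationRates(dev, &util);
                printf("gpu %u: %u%% cores, %u%% memory\n",
                       i, util.gpu, util.memory);
            }
            sleep(1);
        }
        nvmlShutdown();
        return 0;
    }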
As can be seen in Figure 10A, workload "Set 1" presents the smallest execution time, given that it is composed of the shortest applications.
Furthermore, using rCUDA reduces execution time for the 3 workloads.
In this regard, execution time is reduced by 48%, 37%, and 27% for workloads "Set 1," "Set 2," and "Set 1+2," respectively. Regarding GPU utilization, Figure 10B shows that the use of remote GPUs helps to increase overall GPU utilization. Actually, when rCUDA is used with "Set 1" and "Set 1+2," average GPU utilization is doubled with respect to the use of CUDA. Finally, total energy consumption is reduced accordingly, as shown in Figure 10C, by 40%, 25%, and 15% for workloads "Set 1," "Set 2," and "Set 1+2," respectively. These results about reducing energy are very important given that energy consumption is an important concern in current computing facilities and will be key in future exascale systems.
There are several reasons for the benefits obtained when GPUs are shared across the cluster. First, as already mentioned, the execution of the nonaccelerated applications causes the GPUs in the nodes executing them to remain idle when CUDA is used. This is the case for the GROMACS and NAMD applications, which span over 2 and 4 nodes, respectively, hindering the use of the GPUs at those nodes. On the contrary, when rCUDA is leveraged, these GPUs can be used by applications being executed in other nodes of the cluster. Notice that this remote usage of GPUs belonging to nodes with busy CPUs will be more frequent as cluster size increases because more GPUs will be blocked by nonaccelerated applications (also depending on the exact workload).
FIGURE 10 Performance results from the 16-node 16-GPU cluster. CUDA indicates Compute Unified Device Architecture; GPU, graphics processing unit
Another example is the execution of LAMMPS and mCUDA-MEME, which require 4 nodes with 1 GPU. While these applications are being executed with CUDA, those 4 nodes cannot be used by any other application from Table 1: on the one hand, the other accelerated applications cannot access the GPUs in those nodes because they are busy and, on the other hand, the non-GPU applications (GROMACS and NAMD) cannot use those nodes because they require all the CPU cores and LAMMPS and mCUDA-MEME already took 1 core. However, when GPUs are shared among several applications, GPUs assigned to LAMMPS and mCUDA-MEME can concurrently be assigned to other applications that will run in any available CPU in the cluster, thus increasing overall throughput. This concurrent usage of the GPUs leads to a second cause for the improvements shown in Figure 10, as explained next.
The second reason for the improvements shown in Figure 10 is related to the usage that applications make of GPUs. As Table 1 showed, some applications do not completely exhaust GPU memory resources.
For instance, applications mCUDA-MEME and GPU-LIBSVM only use about 3% of the memory present in the NVIDIA Tesla K20 GPU. However, the original version of Slurm (combined with CUDA) will allocate the entire GPU for executing each of these applications, thus causing almost 100% of the GPU memory to be wasted during application execution. This concern is also present for other applications in Table 1. Moreover, if NVIDIA Tesla K40 GPUs, which feature a larger memory, were used instead of the NVIDIA Tesla K20 GPUs considered in this study, the amount of wasted GPU memory would be even larger. On the contrary, when rCUDA is used, GPUs can be shared among several applications provided that there is enough memory for all of them.
Obviously, GPU cores will have to be multiplexed among all those applications, which will cause all of them to execute more slowly. In this regard, Figure 11 presents the execution times for the GPU-accelerated applications in Table 1 when several instances of the same application are concurrently executed in a GPU*. Executions in Figure 11 were performed in the test bed described at the beginning of this section.
* It is also possible to analyze concurrent executions when the applications concurrently using the GPU are different. However, using several instances of the same application generates a higher pressure on the system because all the instances will try to synchronously perform the same operations.
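The condition that there must be enough GPU memory for all the co-located applications can be checked at run time with the standard cudaMemGetInfo call. The sketch below is only an illustration: the gpu_fits helper and the 160 MB requirement are hypothetical values standing in for the per-application memory needs listed in Table 1.

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Return 1 if the (possibly shared) GPU still has 'needed' bytes free.
       'needed' would come from the job description, as in Table 1. */
    static int gpu_fits(size_t needed) {
        size_t freeB = 0, totalB = 0;
        cudaMemGetInfo(&freeB, &totalB);
        printf("GPU memory: %zu MB free of %zu MB\n",
               freeB >> 20, totalB >> 20);
        return freeB >= needed;
    }

    int main(void) {
        /* Hypothetical requirement: a small job, a few percent of a K20. */
        if (gpu_fits(160UL << 20))
            printf("job fits: it can share this GPU\n");
        else
            printf("job does not fit on this GPU right now\n");
        return 0;
    }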

Benefit no. 4: cheaper cluster upgrade
The use of GPUs in a cluster usually puts several burdens on the physical configuration of the nodes in the cluster. For instance, nodes owning a GPU need to include larger power supplies able to provide the energy required by the accelerators. Also, GPUs are not small devices, and therefore, they require a nonnegligible amount of space in the nodes where they are installed. These requirements mean that installing GPUs in a cluster that did not initially include them is sometimes expensive (power supplies need to be upgraded) or simply impossible (nodes do not have enough physical space for the GPUs). However, the workload in some data centers may evolve towards the use of GPUs. At that point, the concern is how to address the introduction of GPUs in a computing facility that did not include accelerators at acquisition time.
One possible solution to the concern above is acquiring some amount of servers populated with GPUs and diverting the execution of accelerated applications to those nodes. The Slurm workload manager would automatically take care of dispatching the GPU-accelerated applications to the new servers. However, although this approach is feasible, it presents the limitation that GPU jobs will probably have to wait for a long time until 1 of the GPU-enabled servers is available, even though GPU utilization is usually low. Another concern is that accelerated MPI applications will only be able to span to as many nodes as GPU-enabled servers were acquired. Given these concerns, a better approach would be to acquire some amount of servers populated with GPUs and use rCUDA to execute accelerated applications at any of the nodes in the cluster while using the GPUs in the new servers. This solution would not only increase overall GPU utilization with respect to the use of CUDA in the previous scenario but also allow MPI applications to span to as many nodes as required because MPI processes would be able to remotely access GPUs thanks to rCUDA. In summary, the remote GPU virtualization mechanism allows clusters that did not initially include GPUs to be easily and cheaply upgraded for using GPUs by attaching 1 or more computers containing GPUs to them. In this way, the original nodes will make use of the GPUs installed in the new nodes, which will become GPU servers. The modified version of Slurm would be used to schedule the use of the GPUs in the new servers.
To analyze the performance of these 2 possible solutions, we have substituted 1 of the nodes in the test bed cluster by a node containing 4 GPUs. This node is based on the Supermicro SYS7047GR-TRF server, populated with 4 NVIDIA Tesla K20 GPUs and 1 FDR InfiniBand network adapter. Furthermore, to additionally consider the use of parallel shared-memory applications and thus increase the heterogeneity of the workloads, we have modified the workloads used in the previous experiments by modeling shared-memory applications with 2 and 4 threads that require 2 and 4 GPUs, respectively. To that end, 2 different flavors of the LAMMPS and mCUDA-MEME applications have been used, as shown in Table 4: (1) "LAMMPS long 2p" and "mCUDA-MEME long 2p" consist of 2 single-threaded processes that are forced to be executed in the same node; these instances of the applications model the use of 2-thread shared-memory applications; (2) "LAMMPS long 4p" and "mCUDA-MEME long 4p" consist of 4 single-threaded processes that are forced to execute in the same node; they model the use of 4-thread shared-memory applications. One additional flavor of these applications models single-thread shared-memory applications. This additional flavor is composed of the "LAMMPS short" and "mCUDA-MEME short" cases shown in Table 4, which make use of 1 single-threaded process. Furthermore, small input data sets are used for the "LAMMPS short" and "mCUDA-MEME short" cases whereas the multithreaded flavors use a large input data set to lengthen their execution time.

Benefit no. 5: VMs can easily access GPUs
Providing CUDA acceleration to VMs is usually accomplished by making use of the PCI passthrough (PCI PT) technique. 33,34 This mechanism is based on the use of the virtualization extensions widely available in current high performance computing servers, which allow assigning a GPU, in an exclusive way, to 1 of the VMs running at the host. Moreover, when making use of this mechanism, the performance attained by accelerators is very close to that obtained when using the GPU in a native domain. Figure 13 depicts the typical configuration of a Xen-based system. Above the hypervisor, we can find the VMs (Dom0 and DomUi). Notice that the Dom0 VM is a predefined VM using the Xen Linux kernel and behaves as the configuration and management interface to the hypervisor. The rest of the VMs (from DomU1 to DomUn) are unprivileged VMs that can be provided to users. Figure 13 shows how the Ethernet adapter and the GPU are provided to VMs. On the one hand, the Ethernet adapter is owned by the Dom0 VM, which provides connectivity to the rest of the VMs by using a software Ethernet switch, thus creating a virtual network among the VMs. On the other hand, the GPU is assigned to 1 of the VMs by making use of the PCI PT mechanism. For other hypervisors, such as the KVM one, the overall deployment is similar although the exact configuration details differ. The reader may refer to 1 study 35 for a complete discussion on the KVM case.
FIGURE 12 Performance results when a server with 4 GPUs is attached to a 15-node cluster without GPUs. CUDA indicates Compute Unified Device Architecture; GPU, graphics processing unit
FIGURE 13 Typical configuration of a Xen-based system showing how the Ethernet adapter and the GPU available in the host are provided to VMs. GPU indicates graphics processing unit; PCI PT, PCI passthrough; VM, virtual machine
Unfortunately, the PCI PT approach assigns GPUs to VMs in an exclusive way, and therefore, it does not allow simultaneously sharing GPUs among the several VMs being concurrently executed at the same host.
In the case of Figure 13, VM DomU1 is the only one that may access the GPU. The rest of the VMs hosted in that computer cannot make use of the accelerator until it is detached from DomU1. Moreover, it is important to remark that, at that point, only 1 of the other VMs will be able to use the GPU. This configuration therefore provides very little flexibility regarding the use of GPUs, given that only 1 of the VMs can access the GPU at any given time.
To address the concern about the exclusive assignment nature of the PCI PT mechanism, there have been several attempts, like the one proposed in another study, 36 which dynamically changes, on demand, the assignment of GPUs to VMs. However, these techniques present a high time overhead given that, in the best case, 2 seconds are required to change the assignment between GPUs and VMs. This issue constrains the use of GPUs in the cloud computing domain.
With the remote GPU virtualization mechanism, it is possible to concurrently assign a given GPU to several VMs, so that the applications being executed inside them can share the GPU resources. 37 Two different scenarios can be considered: one where VMs access a GPU located at the same host executing the VMs and another one where the InfiniBand fabric is already present in the cluster and therefore VMs access a GPU installed in another cluster node. Figure 14A depicts the first scenario whereas Figure 14B presents the second one.
FIGURE 14 Test beds used in the experiments presented in this subsection, which make use of rCUDA to provide GPU access to VMs. A, In a single-node test bed, VMs use the virtual network to access the rCUDA server by means of the TCP/IP protocol stack. B, When an InfiniBand fabric is available, VMs use such interconnect to access a remote rCUDA server. GPU indicates graphics processing unit; rCUDA, Remote CUDA; TCP/IP, Transmission Control Protocol/Internet Protocol; VM, virtual machine
In the first scenario, 1 of the VMs will have exclusive access to the GPU by making use of the PCI PT mechanism. This VM will grant GPU access to the other VMs by using the rCUDA middleware: the rCUDA server will be executed in the VM owning the GPU whereas the other VMs will use the rCUDA client to access the GPU across the Xen virtual network. Transmission Control Protocol/Internet Protocol-based communications will be used in this scenario to communicate the rCUDA clients with the rCUDA server. Accordingly, VMs running the rCUDA client will have 1 or several virtual instances (vGPU) of the real GPU, which is physically connected to the VM DomU1. Moreover, the VM DomU1 will be able to use either the real GPU or its virtual instances. Notice that the rCUDA server can only be installed in the DomUi VMs given that NVIDIA does not provide support for the Xen Linux kernel used in the Dom0 VM.
Regarding the second scenario, shown in Figure 14B, which uses the InfiniBand fabric already present in the cluster to access a GPU in another node, the firmware in the InfiniBand adapter must be changed, according to the directions in the Mellanox User's Guide, 38 to provide several virtual instances (virtual functions, VFs) of the InfiniBand adapter, in addition to the real instance (physical function, PF). Each of these virtual functions will be provided, in an exclusive way, to a Xen VM by using the PCI PT mechanism. Moreover, given that an InfiniBand network is available, communication between the rCUDA clients in the VMs and the remote rCUDA server will be based on the use of the high performance InfiniBand Verbs API. Notice that, in the later experiments involving the InfiniBand fabric, the remote GPU server is executed in a remote computer, which has not been virtualized and whose InfiniBand network adapter makes use of the original firmware, which does not provide virtualization features. Similar to the scenario shown in Figure 14A, VMs will have 1 or several virtual instances of the real GPU, which is physically located in the remote node. Finally, it is important to remark that, although in this discussion we only consider sharing a single GPU, the rCUDA middleware also allows sharing multiple GPUs.
The test bed used in this subsection to explore the use of remote GPU virtualization inside Xen VMs is composed of 3 1027GR-TRF Supermicro nodes like the ones mentioned before. One of them will host the Xen VMs whereas the other 2 nodes will not make use of VMs.
In one of the native domains, we will execute the rCUDA server, as shown in Figure 14B. Four GPU-accelerated applications from the NVIDIA Applications Catalog 32 have been used. Figure 15 shows the performance of these 4 applications when executed in the following scenarios:
FIGURE 15 Execution time of several applications when executed in different local and remote scenarios. Execution time is broken down into 3 components: GPU computation, GPU data transfer, and Other. GPU indicates graphics processing unit
• Execution with CUDA with a local GPU in a native domain. Results for this scenario are referred to as "CUDA non-VM."
• When CUDA is used in DomU1 by using the PCI PT mechanism (rCUDA is not used), the label "CUDA VM PT" is used. In this case, the Xen VM will access the GPU in the host by making use of PCI PT.
• The label "rCUDA non-VM" refers to the performance of the rCUDA middleware when used between native domains (no Xen VM involved) making use of the InfiniBand network.
• When Xen VMs are involved in the tests, the performance of applications using rCUDA in the scenario depicted in Figure 14B is denoted by the label "rCUDA VM IB."
• When using rCUDA in the scenario shown in Figure 14A, the performance of applications will be labeled as "rCUDA VM Local."
Every experiment has been performed 10 times, so that Figure 15 shows the averaged results. Furthermore, the plots in Figure 15 also break down execution time into 3 components: GPU computation, GPU data transfer, and Other. In general, the fact that the overhead of rCUDA is mainly due to data transfers between main memory and GPU memory was expected because, once data are in the GPU memory, GPU computations require the same amount of time to be completed as in a native environment. On average, in the experiments, the overhead of running GPU-accelerated applications in a Xen VM with respect to a native domain is 2%, 2.8%, and 5.8% when using PCI PT, rCUDA over an InfiniBand fabric, and rCUDA over the Xen virtual network, respectively.

Benefit no. 6: GPU Migration: towards server consolidation
Maximizing resource utilization is one of the goals pursued when running a data center. By maximizing the utilization of the different resources in the computing facility, a larger revenue can be achieved, thus causing a faster amortization of the initial acquisition costs as well as making a larger profit afterwards.
Resource utilization in data centers evolves over time, depending on the exact workload applied at every moment. Therefore, at some point in time, the utilization of the GPUs in the cluster may be similar to that depicted in Figure 16A. This figure shows a small cluster composed of 14 nodes, each of them including 1 GPU. Next to each node, the utilization of its GPU is displayed. It can be seen that some nodes present a high GPU utilization. For instance, nodes 9 and 12 are using their GPUs at approximately 90%. On the contrary, some other nodes present a very low GPU utilization, such as nodes 6 or 11, whose GPUs are barely used.
In this scenario, where some nodes of the cluster present a very low GPU utilization, it would be useful to gather the GPU jobs being executed in those nodes into other nodes. That is, it would be useful to consolidate the GPU jobs into a smaller number of servers, so that those nodes that become free can be switched off, thus reducing the energy consumption of the data center. Figure 16B depicts this consolidation of GPU jobs, where jobs generating a lower GPU utilization, such as the ones in nodes 2, 4, 5, 6, 8, 10, and 11 have been migrated to other nodes.
After job migration, the nodes sourcing the movement of jobs have been switched off, thus consuming a negligible amount of energy.
Conducting the migration of jobs using GPUs requires migrating both the process being executed on the CPU and the GPU part of the application. Migrating the CPU part of an application has been achieved in the past by many different frameworks. However, migrating the GPU part of a CUDA application is much more complex for 2 reasons: (1) kernels being executed in the GPU run asynchronously with the CPU, and therefore, when migration is triggered, the kernel in the GPU could be under execution and (2) the GPU memory allocated to the application, which is not tracked by the operating system, must be copied to the destination GPU.
Addressing (1) above is easy. Once migration has been triggered, the migration framework could just execute a synchronization call (such as the cudaDeviceSynchronize function) to wait for the completion of the kernels being executed in the GPU. However, addressing (2) is not so easy given that the map of memory used by the application at the GPU is not included in the system tables stored by the operating system for this process. Therefore, unless some additional support is implemented, it is not possible to retrieve which memory regions are used by the application at the GPU.
To implement this additional support, the GPU memory management calls executed by the application (such as the cudaMalloc and cudaFree functions) could be intercepted. By intercepting them, it is possible to gather the information required for retrieving the memory regions used at the GPU. This is the usual approach followed by other frameworks that provide support for migrating applications that make use of GPUs, such as CheCUDA †. 40 Nevertheless, although obtaining the required information about memory regions used at the GPU makes it possible to migrate CUDA applications from 1 node to another, it is still possible to make this migration even more effective when the remote GPU virtualization mechanism is being leveraged. In this regard, when remote GPU virtualization is not used, migrating a CUDA application to another cluster node means that the destination node has (1) enough CPU cores available for the application being migrated, (2) enough main RAM memory for hosting the application data, and (3) enough GPU memory to hold the data stored at the source GPU. Finding a target node within the cluster that complies with these 3 requirements may not be difficult. However, when the remote GPU virtualization technique is used, another approach could be followed to make the migration process more efficient. This new approach is based on the fact that the remote GPU virtualization mechanism detaches GPUs from nodes, from a logical point of view, and therefore, the CPU and GPU parts of the application may be migrated independently from each other to different destination nodes. In this manner, it would not be necessary for the 3 requirements described above to be satisfied by a single node; instead, these requirements could be split into 2 sets: finding a node that satisfies requirements (1) and (2) and finding another node that complies with requirement (3). The first set of requirements is intended for the migration of the CPU part of the application whereas the second set is devoted to the GPU part.
FIGURE 16 Usage of rCUDA server migration in a cluster to consolidate graphics processing unit (GPU) jobs and reduce energy
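The interception idea can be prototyped with a standard LD_PRELOAD shim. The sketch below is a simplified illustration, not rCUDA's actual implementation: it interposes cudaMalloc and cudaFree through dlsym to maintain the table of live GPU memory regions that a migration would have to copy. It works for applications dynamically linked against the CUDA runtime, and the minimal typedef stands in for the real cudaError_t enum so that no CUDA headers are required to build the shim.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stddef.h>
    #include <dlfcn.h>

    /* Build as a shared library and load it ahead of the CUDA runtime:
         gcc -shared -fPIC shim.c -o shim.so -ldl
         LD_PRELOAD=./shim.so ./app                                     */

    typedef int cudaError_t;   /* same ABI as the runtime's enum */

    static struct { void *ptr; size_t size; } regions[4096];
    static int nregions;

    cudaError_t cudaMalloc(void **devPtr, size_t size) {
        /* Locate the real cudaMalloc provided by libcudart. */
        cudaError_t (*real)(void **, size_t) =
            (cudaError_t (*)(void **, size_t))dlsym(RTLD_NEXT, "cudaMalloc");
        cudaError_t err = real(devPtr, size);
        if (err == 0 && nregions < 4096) {   /* record the live region */
            regions[nregions].ptr  = *devPtr;
            regions[nregions].size = size;
            nregions++;
        }
        fprintf(stderr, "[shim] tracking %d GPU region(s)\n", nregions);
        return err;
    }

    cudaError_t cudaFree(void *devPtr) {
        cudaError_t (*real)(void *) =
            (cudaError_t (*)(void *))dlsym(RTLD_NEXT, "cudaFree");
        for (int i = 0; i < nregions; i++)   /* drop it from the table */
            if (regions[i].ptr == devPtr) {
                regions[i] = regions[--nregions];
                break;
            }
        return real(devPtr);
    }

When migration is triggered, a framework built on this idea would walk the regions table, allocate matching regions at the destination GPU, and copy their contents, which is precisely the information the operating system cannot provide by itself.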
The new proposed migration approach would make it easier to find better node candidates than when the 3 requirements must be satisfied by the same node. Furthermore, given that the GPU part of the application was probably running in a node different from the CPU part, it would even be possible that only 1 of the parts (CPU or GPU) needs to be migrated. In this context, selecting whether to migrate the CPU part or the GPU part of an application would be based on the current cluster status and optimization policies. In any case, by using the remote GPU virtualization mechanism, it would be possible to consolidate both CPU and GPU servers at the same time that (1) migration is conducted faster because the amount of data to be moved is probably smaller (likely only 1 of the CPU or GPU parts is migrated) and (2) given that the target node only has to comply with a subset of the conditions above, finding a better destination candidate should be much easier than in the original approach.
The rCUDA middleware supports the migration of the GPU part of CUDA applications. In this regard, with the rCUDA framework, it is possible to select 1 of the multiple jobs using a given GPU and move it to another GPU in the same or in another node of the cluster. This process is transparent to the application using the GPU, which is not aware of the migration. Figure 17 shows the evolution of the execution time of 2 applications when up to 5 migrations are forced during their execution. Only the GPU part of the applications is migrated. Figure 17A depicts such evolution for a synthetic application whereas Figure 17B shows the evolution for the GPU-BLAST 24 application. Several nodes are used in these experiments. The characteristics of these nodes are the same as the ones mentioned before: 2 Xeon E5-2620 v2 sockets with 1 NVIDIA Tesla K20 GPU and 1 FDR InfiniBand adapter. EDR InfiniBand has also been considered. Given that source and destination GPUs are located at different cluster nodes, several interconnects and communication protocols are considered: RDMA over EDR InfiniBand, RDMA over FDR InfiniBand, TCP/IP over InfiniBand, and TCP/IP over 1 Gb Ethernet.
On the other hand, label "Reference" in Figure 17 refers to the execution time of the applications when CUDA is used (rCUDA is not used for executing the applications in this case).
The synthetic application used in Figure 17A performs the multiplication of a vector by a scalar. To that end, it initially allocates GPU memory for 1000 randomly sized arrays and fills them by copying data from host memory to GPU memory. Then the application launches the necessary kernels to apply the multiplication to the 1000 vectors, and finally, results are copied back from GPU to host memory and GPU memory is then released. The aggregated volume of memory used at the GPU for the 1000 arrays is 700 MB. When migration is triggered, the rCUDA framework performs 1000 allocations of GPU memory at the destination GPU, performs 1000 memory copies between source and destination GPUs, and then conducts 1000 memory releases at the source GPU, which is freed and thus no longer related to the execution of the application. It can be seen in the figure that, as expected, the use of RDMA over InfiniBand provides the smallest migration overhead given the superior features of this communication mechanism. Figure 17B shows a similar study for the GPU-BLAST application.
FIGURE 17 Evolution of application execution time when rCUDA is leveraged for executing the applications using a remote graphics processing unit and an increasing amount of migrations are forced during application execution
In this case, the application holds 1300 MB of data in 9 regions of GPU memory. Therefore, every time the application is migrated, the rCUDA framework must allocate 9 memory regions in the destination GPU, must copy the 9 regions from source to destination GPUs, and finally must release the 9 regions at the source GPU. It can be seen that migration overhead is negligible when RDMA is used.
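For illustration purposes, the synthetic application described above can be sketched as follows. The code reproduces its structure (1000 randomly sized arrays copied to the GPU, 1 scaling kernel per array, results copied back, memory released); the concrete sizes are arbitrary stand-ins rather than the exact ones that added up to 700 MB in the experiments. Note that nothing migration-specific appears in the code: rCUDA recreates the live memory regions at the destination GPU transparently to the application.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void scale(float *v, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= a;
    }

    #define NARRAYS 1000

    int main(void) {
        float *dev[NARRAYS];
        float *host[NARRAYS];
        int len[NARRAYS];

        /* Allocate 1000 randomly sized arrays and fill them from the host. */
        for (int i = 0; i < NARRAYS; i++) {
            len[i] = 1024 + rand() % (128 * 1024);
            host[i] = (float *)malloc(len[i] * sizeof(float));
            for (int j = 0; j < len[i]; j++) host[i][j] = (float)j;
            cudaMalloc(&dev[i], len[i] * sizeof(float));
            cudaMemcpy(dev[i], host[i], len[i] * sizeof(float),
                       cudaMemcpyHostToDevice);
        }
        /* One kernel per vector; any point here is a valid migration
           trigger, handled entirely by the rCUDA framework. */
        for (int i = 0; i < NARRAYS; i++)
            scale<<<(len[i] + 255) / 256, 256>>>(dev[i], 3.0f, len[i]);
        /* Copy results back and release GPU memory. */
        for (int i = 0; i < NARRAYS; i++) {
            cudaMemcpy(host[i], dev[i], len[i] * sizeof(float),
                       cudaMemcpyDeviceToHost);
            cudaFree(dev[i]);
            free(host[i]);
        }
        printf("done\n");
        return 0;
    }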

CONCLUSIONS
In this paper, it has been shown that the use of the remote GPU virtualization technique provides several benefits to computing facilities.
For instance, the improvements attained in execution time for a batch of jobs have been quantified. The associated reduction in energy consumption has also been presented. These features may be interesting in the context of exascale computing facilities given that 1 of the walls in this area is the hard power consumption limitation.
Other benefits of this novel virtualization mechanism have also been explored. Perhaps the most significant one is GPU migration. In this regard, we have shown that migrating GPU jobs from 1 GPU server to another is quite complex to perform in an efficient way when the remote GPU virtualization mechanism is not being used. On the contrary, GPU job migration is very simple when the rCUDA technology is used, owing to the fact that rCUDA intercepts all the CUDA calls and tracks the state of the memory areas used by the application in the GPU. Migrating GPU jobs would be an inexpensive and efficient way of consolidating GPU servers, so that as many GPU jobs as possible are packed together, switching off those GPU servers not required. This would be a means of further reducing the total energy consumed in exascale computing facilities.