Improving the management efficiency of GPU workloads in data centers through GPU virtualization

Graphics processing units (GPUs) are currently used in data centers to reduce the execution time of compute-intensive applications. However, the use of GPUs presents several side effects, such as increased acquisition costs and larger space requirements. Furthermore, GPUs require a nonnegligible amount of energy even while idle. Additionally, GPU utilization is usually low for most applications. In a similar way to the use of virtual machines, using virtual GPUs may address the concerns associated with the use of these devices. In this regard, the remote GPU virtualization mechanism could be leveraged to share the GPUs present in the computing facility among the nodes of the cluster. This would increase overall GPU utilization, thus reducing the negative impact of the increased costs mentioned before. Reducing the number of GPUs installed in the cluster could also become possible. However, in the same way as job schedulers map GPU resources to applications, virtual GPUs should also be scheduled before job execution. Nevertheless, current job schedulers are not able to deal with virtual GPUs. In this paper, we analyze the performance attained by a cluster using the remote Compute Unified Device Architecture middleware and a modified version of the Slurm scheduler, which is now able to assign remote GPUs to jobs. Results show that cluster throughput, measured as jobs completed per time unit, is doubled, while total energy consumption is reduced by up to 40%. GPU utilization is also increased.


INTRODUCTION
The use of graphics processing units (GPUs) has become a widely accepted way of reducing the execution time of applications. The massively parallel capabilities of these devices are leveraged to accelerate specific parts of applications. Programmers exploit GPU resources by off-loading the computationally intensive parts of applications to them. In this regard, although programmers must specify which parts of the application are executed on the CPU and which parts are off-loaded to the GPU, the existence of libraries and programming models such as CUDA 1 or Open Computing Language 2 noticeably eases this task. The net result is that these accelerators are used to significantly reduce the execution time of applications from domains as different as data analysis (big data), 3 chemical physics, 4 computational algebra, 5 image analysis, 6 finance, 7 and biology, 8 to name only a few.
Many current data centers leveraging GPUs typically include one or more of these accelerators in every node of the cluster. Figure 1 shows an example of such a deployment, composed of n nodes, each of them containing two CPU sockets and one GPU. The example in Figure 1 might be a representation of a typical cluster configuration composed of n SYS1027-TRF Supermicro servers interconnected by an FDR InfiniBand network.
FIGURE 1 Example of a GPU-accelerated cluster

Each of the servers in Figure 1 may include, for instance, two Xeon E5-2620 v2 processors and one NVIDIA Tesla K20 GPU. However, the use of GPUs in such a deployment is not exempt from side effects. For instance, let us consider the execution of a message passing interface (MPI) application that does not require the use of GPUs. Typically, this application will spread across several nodes of the cluster, flooding the CPU cores available in them. In this scenario, the GPUs in the nodes involved in the execution of such an MPI application would become unavailable for other applications because all the CPU cores in those nodes would be busy. In other words, the execution of nonaccelerated applications in some nodes may prevent other applications from making use of the accelerators installed in those nodes, forcing those GPUs to remain idle for some periods. The consequence is that the initial hardware investment will require more time to be amortized, whereas some amount of energy is wasted because idle GPUs still consume some power. †

Another important concern associated with the use of GPUs in clusters is related to the way that workload managers such as Slurm 9 perform the accounting of resources in a cluster. These workload managers use a fine granularity for resources such as CPUs or memory but not for GPUs. For instance, workload managers can assign CPU resources on a per-core basis, thus being able to manage a shared usage of the CPU sockets present in a server among several applications. This per-core assignment increases overall CPU utilization, speeding up the amortization of the initial investment in hardware. In the case of memory, workload managers can also assign, in a shared approach, the memory present in a given node to the several applications that will be concurrently executed in that server. However, in the case of GPUs, workload managers use a per-GPU granularity.
In this regard, GPUs are assigned to applications in an exclusive way. Therefore, a given GPU cannot be shared among several applications even in the case that this GPU has enough resources to allow their concurrent execution. This per-GPU assignment causes overall GPU utilization to be low in general because few applications present enough computational concurrency to keep GPUs in use all the time.
To address these concerns, the remote GPU virtualization mechanism could be used. This software mechanism allows an application being executed in a computer that does not own a GPU to transparently make use of accelerators installed in other nodes of the cluster. In other words, the remote GPU virtualization technique allows physical GPUs to be logically detached from nodes, so that the decoupled (or virtual) GPUs can be concurrently shared by all the nodes of the computing facility. Furthermore, given that the remote GPU virtualization mechanism allows GPUs to be transparently used from any node in the cluster, it is possible to create cluster configurations where not all the nodes own a GPU. This feature would not only reduce the costs associated with the acquisition and later use of GPUs but would also increase the overall utilization of such accelerators because workload managers could assign the acquired GPUs concurrently to several applications as long as the GPUs present enough resources for all of them. In general, remote GPU virtualization presents many benefits. 10 Notice, however, that workload managers need to be enhanced to manage virtual GPUs. This enhancement would basically consist in replacing the current per-GPU granularity with a finer granularity that allows GPUs to be concurrently shared among several applications. Once this enhancement is performed, overall cluster performance is expected to increase because the concerns previously mentioned would be mitigated. It would also become possible to attach a server owning several GPUs to a cluster that does not include any, so that upgrading a nonaccelerated cluster to use GPUs would become an easy and inexpensive process.
In this paper, we present a study of the performance of a cluster that makes use of the remote GPU virtualization mechanism along with an enhanced workload manager able to assign virtual GPUs to waiting jobs. To that end, we have made use of the remote CUDA (rCUDA) 11 remote GPU virtualization middleware along with a modified version 12 of the Slurm Workload Manager, which is now able to dispatch GPU-accelerated applications to nodes not owning GPUs while assigning to such applications as many GPUs from other nodes as they require. A preliminary version of this work was already presented. 13 Notice, however, that in this paper, we further investigate the performance results attained by the integration of rCUDA and Slurm.
This paper is laid out as follows. Section 2 presents the basic background on the rCUDA remote GPU virtualization framework and on the modified Slurm Workload Manager required to understand the rest of the paper. Later, Section 3 presents a thorough performance study of a cluster using rCUDA and the modified version of Slurm. Section 4 presents a review of the state of the art on the remote GPU virtualization and workload manager areas. Finally, Section 5 concludes this paper.

† Although GPUs present a favorable performance/power ratio while being used, they still require nonnegligible amounts of energy while idle. For instance, idle NVIDIA Tesla K20 and K40 GPUs require, respectively, 25 W and 20 W. On the contrary, new NVIDIA Tesla K80 GPUs have significantly reduced the amount of energy consumed in the idle state, although they still present a nonnegligible power consumption while being active without performing computations.


BACKGROUND ON rCUDA AND SLURM
The purpose of this section is twofold. First, this section presents the required background on the rCUDA remote GPU virtualization framework so that the reader can understand the performance evaluation presented in Section 3. Notice that this section is focused on the rCUDA middleware.
A complete discussion on the remote GPU virtualization mechanism as well as a description of the available frameworks can be found in Section 4. Second, this section also presents the required background on the Slurm Workload Manager to follow Section 3. A summary of the main changes applied to Slurm so that virtual GPUs provided by rCUDA can be managed by Slurm is also provided. A detailed description of the applied changes can be found in the work of Iserte et al. 12 Additionally, a complete discussion about workload managers can be found in Section 4.

Background on rCUDA
Frameworks such as CUDA 1 assist programmers in using GPUs for general-purpose computing. In addition, several remote GPU virtualization solutions exist for this framework, such as GridCuda, 14 DS-CUDA, 15 gVirtuS, 16 vCUDA, 17 GViM, 18 and rCUDA. 11 Current virtualization frameworks provide different features. This section focuses on describing the main characteristics of the rCUDA framework; a complete discussion on the state of the art on remote GPU virtualization frameworks can be found in Section 4. Figure 2 depicts the architecture of the rCUDA framework, which is similar to that of most of these virtualization solutions, as shown in Figure 11. The rCUDA framework follows a client-server distributed approach. The client part of the middleware is installed in the cluster node executing the application requesting GPU services, whereas the server side runs in the computer owning the actual GPU. The client middleware offers the same application programming interface (API) as the NVIDIA CUDA Runtime 19 and Driver 20 APIs (except for graphics functions).
It is binary compatible with CUDA 9.0 and also provides support for the libraries included within CUDA (cuBLAS, cuFFT, cuDNN, etc). Every time the accelerated application performs a CUDA call, the client side of rCUDA receives the request from the application and appropriately processes and forwards it to the remote server. In the server node, the middleware receives the request and interprets and forwards it to the GPU, which completes the execution of the request and provides the execution results to the server middleware. In turn, the server sends back the results to the client middleware, which forwards them to the initial application, which is not aware that its request has been served by a remote GPU instead of a local one.
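Conceptually, this client-server interaction is a remote procedure call. The following toy sketch (pure Python, with an in-memory stand-in for the network and simplified stand-in functions; it is not the actual rCUDA implementation) illustrates only the intercept-forward-execute-return flow described above:

```python
import pickle

# Server side: a registry standing in for the CUDA runtime on the GPU node.
# The two entries are toy stand-ins, not real CUDA semantics.
SERVER_API = {
    "cudaMemcpy": lambda dst, src: list(src),
    "launchKernel": lambda xs: [x * 2 for x in xs],
}

def server_handle(raw_request):
    """Unpack a forwarded call, execute it on the 'GPU', pack the result."""
    name, args = pickle.loads(raw_request)
    return pickle.dumps(SERVER_API[name](*args))

# Client side: intercepts the application's call and forwards it.
def remote_call(name, *args, transport=server_handle):
    raw = pickle.dumps((name, args))          # serialize the request
    return pickle.loads(transport(raw))       # the application sees a
                                              # normal, local-looking result

result = remote_call("launchKernel", [1, 2, 3])
```

In the real middleware, `transport` would be a TCP/IP or InfiniBand Verbs channel rather than a direct function call, but the application remains equally unaware that the request was served remotely.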
The rCUDA framework supports several underlying interconnection technologies by making use of network-specific communication modules.
Currently, three communication modules are available: Transmission Control Protocol/Internet Protocol (TCP/IP), InfiniBand, and RoCE. The first one can be used in any TCP/IP-compatible network, whereas the latter two make use of the high-performance InfiniBand Verbs API available in InfiniBand and RoCE network adapters. To maximize performance, rCUDA has been carefully tuned to the InfiniBand Verbs API. 21 Furthermore, as shown by Reaño and Silla, 22 rCUDA outperforms the rest of the available remote GPU virtualization solutions. Additionally, rCUDA has been applied to different areas with very good results. 23-26 rCUDA can also be used in cloud computing scenarios. 27,28 For these reasons, we use this middleware in our study.
Using rCUDA requires setting three environment variables prior to application execution: RCUDA_DEVICE_COUNT, RCUDA_DEVICE_j, and RCUDA_NETWORK. The first variable indicates the number of remote virtual GPUs accessible to the application. For example, if two remote GPUs are assigned to the application, then the command ''export RCUDA_DEVICE_COUNT=2'' should be executed. The second environment variable, RCUDA_DEVICE_j, indicates, for each of the n remote GPUs assigned to the application, in which cluster node the GPU with identifier j is located. For instance, in the previous example, the commands ''export RCUDA_DEVICE_0=192.168.0.1'' and ''export RCUDA_DEVICE_1=192.168.0.2'' should be executed to inform the rCUDA client about the location of the virtual GPUs assigned to the application. Finally, the third environment variable, RCUDA_NETWORK, indicates which communication module should be used (for instance, to leverage the InfiniBand Verbs API).

FIGURE 2 Architecture of the rCUDA remote GPU virtualization middleware
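For illustration, this per-job environment setup could be scripted as follows; the node addresses are the placeholders from the example above, and the value assigned to RCUDA_NETWORK is an assumed example rather than a documented rCUDA setting:

```python
import os

def configure_rcuda_env(gpu_locations, network="IB"):
    """Build and export the rCUDA client environment for a job.

    gpu_locations -- one IP address (or hostname) per virtual GPU
                     assigned to the application.
    network       -- communication module; "IB" is only an assumed
                     example value for the InfiniBand Verbs module.
    """
    env = {"RCUDA_DEVICE_COUNT": str(len(gpu_locations)),
           "RCUDA_NETWORK": network}
    for j, node in enumerate(gpu_locations):
        env["RCUDA_DEVICE_%d" % j] = node
    os.environ.update(env)  # the launched application inherits these
    return env

# Two virtual GPUs located in two different nodes, as in the example above
env = configure_rcuda_env(["192.168.0.1", "192.168.0.2"])
```

This is exactly the bookkeeping that the modified Slurm scheduler performs automatically before launching each job (see Section 2.2).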

Background on Slurm
Due to their novelty, current workload managers do not support the virtual GPUs provided by frameworks such as rCUDA; they are only able to deal with real GPUs. Therefore, when a job includes within its computing requirements one or more GPUs per node, current workload managers will try to map that job to nodes owning the requested amount of real GPUs. Nevertheless, it is possible to enhance current workload managers so that they become aware of virtual GPUs. This would make the assignment of GPUs more flexible because any available GPU across the cluster might be assigned to a job, regardless of the exact GPU and job locations, hence allowing the scheduling process to enjoy a larger degree of freedom.
In this paper, we make use of an extended version of the Slurm Workload Manager, 12 which supports the rCUDA middleware. We selected Slurm among the many available job schedulers because of its open-source nature and because it has proven to be portable and interconnect independent, thus making it suitable for many different cluster architectures. A complete discussion on workload managers can be found in Section 4.
Next, we present the six main modifications applied to the Slurm Workload Manager to make it virtual-GPU aware (the reader may refer to the work of Iserte et al 12 for a thorough description of the modifications done to Slurm).
1. The GRes module, which manages the allocation and deallocation of consumable generic resources such as GPUs, has been augmented so that all GPUs in the cluster can be accessed from all the nodes. Additionally, GPUs in the cluster can be shared among different jobs.
2. Two new plug-ins have been implemented. On the one hand, the new GRes plug-in ''gres/rgpu'' is responsible for the declaration of the remote GPUs as a generic resource, which will be referred to as rGPU. On the other hand, the select plug-in ''select/cons_rgpu'' will perform tasks related to selection and scheduling of the new rGPUs.
3. Several internal data structures within Slurm have been modified with new attributes to maintain the required information about the new rGPUs.
4. The RPC packages within Slurm have been augmented with additional fields intended to carry the rGPU information required by Slurm.
5. The job submission commands within Slurm have been modified so that they accept the new parameters related to the use of the new rGPU resources.
6. To connect the scheduling process with the rCUDA middleware, the Slurm scheduler has to set the three rCUDA environment variables mentioned in the previous section.
In addition to the previous modifications, some policy must be followed during the scheduling process to select one GPU or another from the many GPUs available in the cluster. In this work, we have followed a round-robin approach while giving a higher priority to those rGPUs located in the same node that will execute the application. Other selection policies were also considered, although performance results did not vary significantly.
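A minimal sketch of such a selection policy is shown below; the GPU records and the memory-based admission check are illustrative simplifications, not Slurm's actual internal data structures:

```python
class RoundRobinSelector:
    """Round-robin rGPU selection with priority for GPUs located in the
    node that will execute the application, as described above."""

    def __init__(self, rgpus):
        self.rgpus = rgpus  # list of {"node": str, "free_mem": int (MB)}
        self.next = 0       # round-robin cursor over the whole cluster

    def select(self, job_node, mem_needed):
        # First pass: prefer a GPU local to the executing node.
        for gpu in self.rgpus:
            if gpu["node"] == job_node and gpu["free_mem"] >= mem_needed:
                gpu["free_mem"] -= mem_needed
                return gpu
        # Second pass: round-robin over all GPUs, resuming where we left off.
        n = len(self.rgpus)
        for k in range(n):
            gpu = self.rgpus[(self.next + k) % n]
            if gpu["free_mem"] >= mem_needed:
                self.next = (self.next + k + 1) % n
                gpu["free_mem"] -= mem_needed
                return gpu
        return None  # no rGPU can host the job yet; it keeps waiting
```

The memory check mirrors the per-job GPU-memory declaration used in the shared mode; in the real scheduler, the admission decision also depends on whether the job requested exclusive or shared rGPUs.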
Once these changes are implemented, Slurm users are able to submit jobs to the system queues in three different modes.

1. CUDA: No change is required to the original way of launching jobs.
2. rCUDA shared: The job will use the new rGPU resources, which will be shared with other jobs. The required amount of GPU memory should be specified as a parameter to the job submission command. For instance, ''srun --rcuda-mode=shar --gres=rgpu:4:100M job.sh'' will submit a job named ''job.sh'' requesting four virtual GPUs, each of them with at least 100 MB of available memory. These GPUs may be shared with other jobs.
3. rCUDA exclusive: The job will use the new rGPU resources but will not share them with other jobs. For instance, ''srun --rcuda-mode=excl --gres=rgpu:4 job.sh'' will submit a job named ''job.sh'' requesting the exclusive use of four virtual GPUs.

PERFORMANCE ANALYSIS
In this section, we study the impact that using the remote GPU virtualization mechanism in combination with Slurm has on the performance of a data center. To that end, we have executed several workloads in a cluster by submitting a series of job requests to the Slurm queues. After job submission, we have measured several parameters, such as the total execution time of the workloads, the energy required to execute them, and GPU utilization. We have considered two different scenarios for workload execution. In the first one, the cluster uses CUDA, and therefore, applications can only use those GPUs installed in the same node where the application is being executed. In this scenario, an unmodified version of Slurm has been used. In the second scenario, we have made use of rCUDA, and therefore, an application being executed in a given node can use any of the GPUs available in the cluster. Moreover, the modified version of Slurm has been used so that it is possible to schedule the use of remote GPUs. These two scenarios allow us to compare the performance of a cluster using CUDA with the throughput of a cluster using rCUDA.
In the following subsections, we present the performance analysis. In this regard, we first present the cluster configuration and the workloads used in the experiments. After that, we analyze the performance of combining rCUDA with Slurm in different cluster configurations.

Cluster test bed
The test bed used in this study is comprised of 1027GR-TRF Supermicro servers. Each of the servers includes two Intel Xeon E5-2620 v2 processors (six cores with Ivy Bridge architecture) operating at 2.1 GHz and 32 GB of DDR3 SDRAM memory at 1600 MHz. They also have a Mellanox ConnectX-3 VPI single-port FDR InfiniBand adapter connected to a Mellanox Switch SX6025 (FDR InfiniBand compatible) to exchange data at a maximum rate of 56 Gb/s. Furthermore, an NVIDIA Tesla K20 GPU is installed in each node.
To analyze how the obtained performance results depend on cluster size, we have considered three cluster sizes for the experiments: 4 nodes, 8 nodes, and 16 nodes. Obviously, the cluster configuration composed of 16 nodes is the most representative one (although it is still smaller than most data centers). However, these three sizes will allow us to study the different trends of the performance metrics. In all the three cluster sizes mentioned, one additional node has been leveraged. This additional node, which does not include a GPU, will be used as the Slurm management node and will execute the central Slurm daemon responsible for scheduling jobs (the slurmctld process).
Regarding the software configuration of the cluster, Linux CentOS 6.4 was used along with Mellanox OFED 2.4-1.0.4 (InfiniBand drivers and administrative tools). Slurm version 14.11.0 was used, with the modifications described in Section 2.2 applied. It was configured to use the backfill scheduling policy, so that jobs can overtake earlier jobs that are still waiting for resources. Finally, version 2.0b of the MVAPICH2 implementation of MPI, specifically tuned for InfiniBand, was used for those applications requiring the MPI library.

Workloads
Several workloads have been considered to provide a more representative range of results. The workloads are composed of applications (see Table 1) selected because of their different characteristics from the list of NVIDIA's Popular GPU-Accelerated Applications Catalog. 29
• GPU-BLAST 30 has been designed to accelerate the gapped and ungapped protein sequence alignment algorithms of the NCBI-BLAST ‡ implementation using GPUs.
• LAMMPS 31 is a classic molecular dynamics simulator that can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, mesoscopic, or continuum scale.
• mCUDA-MEME 32 is a parallel CUDA implementation of the MEME algorithm, used for discovering motifs in a group of related DNA or protein sequences.
• GROMACS 33 is a molecular dynamics simulator, similar to LAMMPS. Although this package can use GPUs, in this study, we will use a nonaccelerated version to achieve a higher degree of heterogeneity in our experimental workloads.
• BarraCUDA 34 is a sequence mapping software program that uses GPUs to accelerate the inexact alignment of short sequence reads to a particular location on a reference genome.
• MUMmerGPU is the GPU implementation of the MUMmer 35 sequence alignment software.
• GPU-LIBSVM is a modification of the original LIBSVM 36 algorithm that exploits the CUDA framework to significantly reduce processing time. LIBSVM is an integrated software package for support vector classification, regression, and distribution estimation.
• NAMD 37 is a parallel molecular dynamics simulator designed for high-performance simulation of large biomolecular systems. Although this application is able to use GPUs, the version of NAMD used in our study does not make use of them and is therefore intended to contribute to a higher degree of heterogeneity of the workloads.

Table 1 provides additional information about the applications used in this work, such as the exact execution configuration used for each of them, showing the number of processes and threads used in each case. It can be seen in the table that LAMMPS, mCUDA-MEME, GROMACS, and NAMD are MPI applications that will spread across several nodes in the cluster. On the contrary, the other four applications will execute in a single node. Additionally, some of the applications also make use of threads. For instance, the GPU-BLAST application uses a single process composed of six threads. During execution, each of these threads will use a different CPU core, although all of them will make use of the same GPU. In a similar way, the NAMD application will be distributed across four different nodes of the cluster (four processes), and 12 threads will be launched at each node. Therefore, the NAMD application will make use of four entire nodes. Likewise, the GROMACS application will keep two entire nodes busy while being executed. Notice that we have configured the execution of the considered applications as shown in Table 1 according to a previous scalability analysis (not shown), which was carried out in advance to find out the best configuration for each application. Table 1 also shows the execution time for each application, which ranges from 15 s for LAMMPS up to 763 s for BarraCUDA.
Applications can be classified according to their execution time. In this regard, GPU-BLAST, LAMMPS, mCUDA-MEME, and GROMACS require less than 170 s to complete execution (they are ''short'' applications), whereas BarraCUDA, MUMmerGPU, GPU-LIBSVM, and NAMD require more than 240 s to be executed (''long'' applications). In addition to execution time, Table 1 also shows the GPU memory required by each application. For those applications composed of several processes, the amount of GPU memory depicted in Table 1 refers to the individual needs of each process. Notice that the amount of GPU memory is not specified for the GROMACS and NAMD applications because we are using nonaccelerated versions of these applications.
In summary, the eight applications used not only present different characteristics regarding the number of processes and threads used by each of them and their execution time but also present different GPU usage patterns, which include both memory copies to/from GPUs and kernel executions. Therefore, although the set of applications considered is finite, it may provide a representative sample of a workload typically found in current data centers.

Table 2 displays the Slurm parameters used for launching each of the applications with CUDA and rCUDA. In the first case, CUDA will be used (column labeled ''Launch with CUDA''). In the second case, remote GPUs can be used either in an exclusive or in a shared way. In the first approach, the column labeled ''Launch with rCUDA exclusive'' shows that the rcuda-mode parameter is set to excl and no GPU memory is declared.
The previous applications have been combined to create three different workloads as shown in Table 3. Workload labeled as ''Set 1'' is composed of 400 instances from applications GPU-BLAST, LAMMPS, mCUDA-MEME, and GROMACS. Notice that these applications are the ones with the shortest execution times. The exact amount of instances for each application is shown in the table. In a similar way, workload labeled as ''Set 2'' is composed of 400 instances of applications BarraCUDA, MUMmerGPU, GPU-LIBSVM, and NAMD (these applications are the ''long'' applications). Finally, a third workload, referred to as ''Set 1+2,'' has been created with instances from all the applications. Notice that, for each of the workloads, the instances from different applications as well as the exact sequence of instances within the workload are randomly set.
However, once workloads are set, they remain constant across the different experiments presented in this section. This means that the amount of instances of each application and the exact sequence of these instances is not modified across experiments.
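This construction (random composition, fixed across experiments) can be reproduced with a seeded shuffle; the per-application instance counts below are placeholders standing in for the actual counts listed in Table 3:

```python
import random

def build_workload(instance_counts, seed=42):
    """instance_counts: dict mapping application name -> number of
    instances. Returns a randomized job sequence that is reproducible:
    the same seed yields the same ordering in every experiment."""
    jobs = [app for app, n in instance_counts.items() for _ in range(n)]
    rng = random.Random(seed)   # fixed seed -> constant workload
    rng.shuffle(jobs)
    return jobs

# Placeholder counts summing to 400 jobs, as in the ''Set 1'' workload
workload = build_workload({"GPU-BLAST": 100, "LAMMPS": 100,
                           "mCUDA-MEME": 100, "GROMACS": 100})
```

Fixing the seed is what guarantees that the job mix and its exact arrival order stay identical across the CUDA, rCUDAex, and rCUDAsh experiments.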

Initial performance analysis: n nodes with one GPU each
This first experiment considers the simplest scenario: a cluster with n nodes, each of them owning one GPU. § The three cluster sizes mentioned in Section 3.1 were used. Figure 3 shows the performance results for the 16-node case. The other two cluster sizes provided similar trends. The figure shows, for each of the workloads depicted in Table 3, the performance when CUDA is used along with the original Slurm Workload Manager (results labeled as ''CUDA'') as well as the performance when rCUDA is used in combination with the modified version of Slurm. In this case, label ''rCUDAex'' refers to the results when remote GPUs are used in an exclusive way by applications, whereas label ''rCUDAsh'' refers to the case when remote GPUs are shared among several applications. Of these two rCUDA modes, the shared approach is the most interesting option; the exclusive case is considered in this paper only for comparison purposes. Figure 3A shows the total execution time for each of the workloads. Figure 3B depicts the averaged GPU utilization for all the 16 GPUs in the cluster. Data for GPU utilization have been gathered by polling each of the GPUs in the cluster once every second and afterwards averaging all the samples after completing workload execution. The nvidia-smi command was used for polling the GPUs. In a similar way, Figure 3C shows the total energy required for completing workload execution. Energy has been measured by polling once every second the power distribution units (PDUs) present in the cluster. The units used are APC AP8653 PDUs, which provide individual energy measurements for each of the servers connected to them. After workload completion, the energy required by all servers was aggregated to provide the measurements in Figure 3C.

§ In our cluster test bed, there is only one GPU per node.
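The post-processing of the once-per-second samples described above can be summarized as follows; the sample values are synthetic, and energy is approximated as each power reading multiplied by the 1-s sampling period:

```python
def average_gpu_utilization(samples_per_gpu):
    """samples_per_gpu: list (one entry per GPU) of per-second
    utilization samples in percent, as reported by nvidia-smi.
    Returns the cluster-wide average over all GPUs and all samples."""
    all_samples = [s for gpu in samples_per_gpu for s in gpu]
    return sum(all_samples) / len(all_samples)

def total_energy_joules(power_samples_per_server, period_s=1.0):
    """power_samples_per_server: list (one entry per server) of
    per-second power readings in watts, as reported by the PDU.
    Energy is approximated as the sum of P * sampling_period."""
    return sum(p * period_s
               for server in power_samples_per_server for p in server)

# Synthetic example: two GPUs polled for 4 s, two servers likewise.
util = average_gpu_utilization([[40, 60, 80, 100], [0, 20, 40, 60]])
energy = total_energy_joules([[300, 310, 290, 300], [250, 250, 250, 250]])
```

With a 1-Hz sampling rate, this rectangle-rule approximation of the energy integral is accurate enough for workloads lasting minutes to hours.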


FIGURE 4 Normalized execution time when several concurrent instances of the same application are executed with CUDA

As shown in Figure 3A, workload ''Set 1'' presents the smallest execution time, given that it is composed of the applications with the smallest execution times. Furthermore, using rCUDA in a shared way reduces execution time for the three workloads. In this regard, execution time is reduced by 48%, 37%, and 27% for workloads ''Set 1,'' ''Set 2,'' and ''Set 1+2,'' respectively. Notice also that the use of remote GPUs in an exclusive way reduces execution time as well. In the case of ''Set 2,'' this reduction is more noticeable because, when CUDA is used, the NAMD application (with 101 instances in the workload) spans four complete nodes, thus blocking the GPUs in those nodes, which cannot be used by any accelerated application during the entire execution time of NAMD (241 s). On the contrary, when ''rCUDAex'' is leveraged, the GPUs in those four nodes are accessible from other nodes; therefore, they can be used by applications being executed at other nodes. Regarding GPU utilization, Figure 3B shows that the use of remote GPUs helps to increase overall GPU utilization. Actually, when ''rCUDAsh'' is used with ''Set 1'' and ''Set 1+2,'' average GPU utilization is doubled with respect to the use of CUDA. Finally, the total energy consumption is reduced accordingly, as shown in Figure 3C, by 40%, 25%, and 15% for workloads ''Set 1,'' ''Set 2,'' and ''Set 1+2,'' respectively.
There are several reasons for the benefits obtained when GPUs are shared across the cluster. First, as already mentioned, the execution of nonaccelerated applications causes the GPUs in the nodes executing them to remain idle when CUDA is used. On the contrary, when rCUDA is leveraged, these GPUs can be used by applications being executed in other nodes of the cluster. Notice that this remote usage of GPUs belonging to nodes with busy CPUs will become more frequent as cluster size increases because more GPUs will be blocked by nonaccelerated applications (also depending on the exact workload). Another example is the execution of LAMMPS and mCUDA-MEME, which require four nodes with one GPU each.
While these applications are being executed with CUDA, those four nodes cannot be used by any other application from Table 1: On the one hand, the other accelerated applications cannot access the GPUs in those nodes because they are busy, and on the other hand, the non-GPU applications (GROMACS and NAMD) cannot use those nodes because they require all the CPU cores in a node, whereas LAMMPS and mCUDA-MEME have already taken one core in each of them. However, when GPUs are shared among several applications, the GPUs assigned to LAMMPS and mCUDA-MEME can also be assigned to other applications that will run in any available CPU in the cluster, thus increasing overall throughput. This concurrent usage of the GPUs leads to a second cause for the improvements shown in Figure 3.
The second reason for the improvements shown in Figure 3 is related to the usage that applications make of GPUs. As Table 1 showed, some applications are far from exhausting GPU memory resources. For instance, the mCUDA-MEME and GPU-LIBSVM applications only use about 3% of the memory present in the NVIDIA Tesla K20 GPU. However, the unmodified version of Slurm (combined with CUDA) will allocate the entire GPU for executing each of these applications, thus causing almost 100% of the GPU memory to be wasted during application execution.
This concern is also present for other applications in Table 1. Moreover, if NVIDIA Tesla K40 GPUs were used instead of the NVIDIA Tesla K20 devices employed in this study, then this memory underutilization would be worse because the K40 model features 12 GB of memory instead of the 5 GB available in the Tesla K20 device. On the contrary, when rCUDA is used in a shared way, GPUs can be shared among several applications provided that there is enough memory for all of them. Obviously, GPU cores will have to be multiplexed among all those applications, which will cause each of them to execute more slowly. In this regard, Figure 4 presents execution times for the applications in Table 1 when several instances of the same application are concurrently executed in a GPU. ¶ Executions in Figure 4 have been manually constrained to a single node using CUDA without the use of Slurm. For some of the applications, only two concurrent instances were executed due to their larger memory requirements. In a similar way, BarraCUDA does not allow the concurrent execution of other instances due to its high memory requirements. As shown, executing several instances of the same application yields an overall speedup in all cases: LAMMPS achieves the smallest one, whereas GPU-LIBSVM obtains significant benefits. In summary, sharing a GPU among several applications reduces total execution time. This reduction means that combining rCUDA with the modified version of Slurm results in important reductions in the time required to complete workload execution.
A possible objection to sharing GPUs among applications is that all the applications sharing a GPU execute slower because they have to share the GPU cores. However, despite the slower execution of each individual application, the entire workload is completed earlier, as shown in Figure 3. This means that the time spent by applications waiting in the Slurm queues is reduced and the execution of each individual application is completed earlier. As a consequence, data center users increase their satisfaction with the service received.

¶ It is also possible to analyze concurrent executions when the applications concurrently using the GPU are different. However, using several instances of the same application generates a higher pressure on the system because all the instances will try to synchronously perform the same operations.

One additional metric that can be analyzed is the time that GPUs remain allocated to applications. Figure 5 presents the time that any GPU in the cluster is assigned to an application and compares that time with the total execution time of the workload. It can be seen that the use of ''rCUDAex'' increases the percentage of time that GPUs are assigned to applications up to 96%, whereas this percentage is reduced for ''rCUDAsh'' to 59%, 83%, and 74% for ''Set 1,'' ''Set 2,'' and ''Set 1+2,'' respectively. These lower percentages indicate that, when rCUDAsh is leveraged, the execution time of workloads is dominated by the non-GPU applications because accelerated ones take advantage of the available remote GPU resources to complete their execution before all nonaccelerated applications have finished.
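The allocation-time percentage reported in Figure 5 can be computed from a trace of (start, end) allocation intervals per GPU, dividing the union of the busy intervals by the workload makespan. A minimal sketch follows; the interval format and the numbers in the example are hypothetical, not the paper's measurements.

```python
def gpu_allocation_pct(intervals, makespan):
    """Percentage of the workload makespan during which this GPU was
    allocated to some job. intervals = [(start, end), ...] in seconds;
    overlapping intervals are merged so time is not counted twice."""
    busy, cur_end = 0.0, 0.0
    for start, end in sorted(intervals):
        start = max(start, cur_end)   # skip the already-counted overlap
        if end > start:
            busy += end - start
            cur_end = end
    return 100.0 * busy / makespan

# Toy trace: two overlapping allocations and a short later one,
# over a 100-second workload.
print(gpu_allocation_pct([(0, 40), (30, 70), (80, 90)], 100))  # 80.0
```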
Finally, as shown in Figure 4, the BarraCUDA application presents high GPU memory requirements, which reduce the opportunity of sharing its GPU with other applications. Thus, executions of BarraCUDA behave in a similar way to the ''rCUDAex'' mode. To analyze the stability of the results when this behavior is reduced, two new workloads have been created, shown in Table 4. It can be seen that the presence of the heavy BarraCUDA application has been noticeably reduced while the presence of lighter applications such as GPU-LIBSVM has been increased. Figure 6 shows the performance results for these new workloads. Although the workloads have been noticeably modified, the trends of the results are similar to those in Figure 3. Results for cluster sizes of 4 and 8 nodes also showed similar trends. Moreover, Figure 7 depicts the GPU allocation time with the new workloads, which again follows the same trend as with the previous workloads. This suggests that the performance improvements shown in this section can be expected for many other workloads and cluster sizes. Finally, given that the rCUDA exclusive mode performs better than CUDA but worse than the rCUDA shared mode, the rest of the performance study only considers rCUDA shared.

Introducing additional heterogeneity in the cluster
Current data centers execute a large variety of applications. Some of them address highly parallel problems that can benefit from the use of several GPUs. In these cases, the MPI library can be used to distribute the application processes across several nodes in the cluster so that each process makes use of the GPU installed in the node executing that process. The net result is that the application concurrently makes use of several GPUs. However, using the MPI library may not be the best option for some applications. For instance, if communications among processes are too intense, then the use of a messaging library would noticeably reduce overall performance. In these cases, instead of following a distributed approach for designing the application, it could be programmed by leveraging the shared-memory paradigm. In this context, the application would be divided into threads, and each thread would be responsible for submitting its kernels to the GPU. However, to efficiently execute such a shared-memory parallel application, as many GPUs as threads should be available in the computer executing the application. In our cluster example, for instance, where only one GPU is available in each node, it would not be possible to execute this kind of application. One possible solution could be to augment the cluster with one or more servers featuring several GPUs. This would create a nonuniform heterogeneous cluster.
To model this scenario, we have replaced one of the nodes in our cluster with a node containing four GPUs. This node is based on the Supermicro SYS7047GR-TRF server, populated with four NVIDIA Tesla K20 GPUs and one FDR InfiniBand network adapter. Additionally, to model the use of these parallel shared-memory applications, we have modified the workloads used in the experiments by modeling applications with two and four threads that require two and four GPUs, respectively. To that end, different flavors of the LAMMPS and mCUDA-MEME applications have been used, as shown in Table 5. First, ''LAMMPS long 2p'' and ''mCUDA-MEME long 2p'' consist of two single-threaded processes that are forced to execute in the same node by using the launching parameters -N1 -n2 -c1. These instances model the use of two-thread shared-memory applications. Second, ''LAMMPS long 4p'' and ''mCUDA-MEME long 4p'' consist of four single-threaded processes that execute in the same node by using the launching parameters -N1 -n4 -c1. They model the use of four-thread shared-memory applications. One additional flavor models single-threaded applications: the ''LAMMPS short'' and ''mCUDA-MEME short'' cases shown in Table 5, which make use of one single-threaded process with launching parameters -N1 -n1 -c1. Furthermore, small input data sets are used for the ''LAMMPS short'' and ''mCUDA-MEME short'' cases, whereas the multithreaded flavors use a large input data set to lengthen their execution time.

FIGURE 8 Performance results from a nonuniform cluster with 15 1-GPU nodes and one 4-GPU node. A, Total execution time of the workloads; B, Average GPU utilization; C, Total energy consumed during the execution of the workloads

FIGURE 9 GPU allocation time in a nonuniform cluster with 15 1-GPU nodes and one 4-GPU node
The idea behind the ''2p'' and ''4p'' flavors of LAMMPS and mCUDA-MEME is that they necessarily require the 4-GPU server for their execution when a cluster not using rCUDA is employed.
In contrast, when rCUDA is used, all the threads are still placed in the same node, but they are able to use any of the GPUs in the cluster, thus relaxing the initial limitation of having to wait for the 4-GPU node and speeding up the execution of the workload. Figure 8 depicts the performance results when the workloads in Table 5 are executed in a nonhomogeneous cluster composed of 15 nodes owning one GPU and one additional node populated with four GPUs. It can be seen that decoupling GPUs from nodes with rCUDA provides large benefits because applications requiring two (or four) GPUs can start execution as soon as enough resources are available in remote GPUs across the cluster. On the contrary, when CUDA is used, applications spend a lot of time waiting for the 4-GPU node. This waiting is reflected in Figure 9, which shows the small percentage of GPU allocation time when CUDA is used for these workloads; this small allocation time indicates that applications spend most of their time waiting for resources.
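The core scheduling difference can be captured in a few lines: without GPU virtualization a multi-GPU job must find all its GPUs in a single node, whereas with decoupled (remote) GPUs any free GPUs in the cluster suffice. The following toy sketch is ours, not Slurm's actual logic, and the cluster state in the example is hypothetical.

```python
def can_start(job_gpus, node_free_gpus, decoupled):
    """Can a job needing `job_gpus` GPUs start now?
    node_free_gpus: number of free GPUs in each node.
    decoupled=False (plain CUDA): all GPUs must come from one node.
    decoupled=True (remote GPUs): any free GPUs in the cluster will do."""
    if decoupled:
        return sum(node_free_gpus) >= job_gpus
    return any(free >= job_gpus for free in node_free_gpus)

# 12 idle 1-GPU nodes, 3 busy 1-GPU nodes, and a fully busy 4-GPU node:
cluster = [1] * 12 + [0] * 3 + [0]
print(can_start(4, cluster, decoupled=False))  # False: must wait for 4-GPU node
print(can_start(4, cluster, decoupled=True))   # True: 12 GPUs free anywhere
```

This is exactly the situation in Figure 8: the 4-GPU jobs serialize on the single multi-GPU server under CUDA, while rCUDA lets them start as soon as any four GPUs are free.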

Attaching GPUs to a non-GPU cluster
The use of GPUs in a cluster usually places several burdens on the physical configuration of the nodes. For instance, nodes owning a GPU need to include larger power supplies able to provide the energy required by the accelerators. Also, GPUs are not small devices; therefore, they require a nonnegligible amount of space in the nodes where they are installed. These requirements make installing GPUs in a cluster that did not initially include them sometimes expensive (power supplies need to be upgraded) or simply impossible (nodes do not have enough space for the GPUs). However, the workload in some data centers may evolve towards the use of GPUs. At that point, the concern is how to address the introduction of GPUs in the computing facility.
One possible solution to the concern above is acquiring a number of servers populated with GPUs and diverting the execution of accelerated applications to those nodes. The Slurm workload manager would automatically take care of dispatching the GPU-accelerated applications to the new servers. However, although this approach is feasible, it presents the limitation that GPU jobs will probably have to wait for a long time until one of the GPU-enabled servers becomes available, even though GPU utilization is usually low. Another concern is that accelerated MPI applications will only be able to span as many nodes as GPU-enabled servers were acquired. Given these concerns, a better approach would be to acquire a number of servers populated with GPUs and use rCUDA to execute accelerated applications on any of the nodes in the cluster while using the GPUs in the new servers. This solution would not only increase overall GPU utilization with respect to the use of CUDA in the previous scenario but would also allow MPI applications to span as many nodes as required because MPI processes would be able to remotely access GPUs thanks to rCUDA. In summary, the remote GPU virtualization mechanism allows clusters that did not initially include GPUs to be easily and cheaply upgraded by attaching one or more computers containing GPUs. In this way, the original nodes will make use of the GPUs installed in the new nodes, which become GPU servers. Slurm would be used to manage the use of the GPUs in the new servers.
FIGURE 10 Performance results when a server with four GPUs is attached to a non-GPU cluster. A, Total execution time of the workloads; B, Average GPU utilization; C, Total energy consumed during the execution of the workloads

Figure 10 shows the performance results when a server with four GPUs has been attached to a cluster without GPUs. The original cluster is composed of 15 nodes (the same as in the previous section, but with the GPUs removed). The 4-GPU server is the same as in the previous section.
Results are similar to those presented in the previous section and show that decoupling GPUs from nodes with rCUDA allows applications to make a much more flexible use of the resources in the cluster, thereby reducing both execution time and energy consumption.

RELATED WORK
In this section, an overview of the state of the art is provided. We first review other studies related to GPU virtualization and then review workload managers.

GPU virtualization
Sharing accelerators among several computers has been addressed with both hardware and software approaches. On the hardware side, maybe the most prominent solution was NextIO's N2800-ICA, 38 based on PCI Express (PCIe) virtualization. 39 This solution allowed a GPU to be shared among eight different servers in a rack within a 2-m distance. Nevertheless, it lacked the required flexibility because a GPU could only be used by a single server at a time, thus preventing the concurrent sharing of GPUs. Furthermore, this solution was expensive, which may be one of the reasons why NextIO went out of business in August 2013.
As a flexible alternative to hardware approaches, several software-based GPU sharing mechanisms have appeared, such as V-GPU, DS-CUDA, rCUDA, vCUDA, and GridCuda. Basically, these software proposals share a GPU by virtualizing it, so that they provide applications with virtual instances of the real device, which can therefore be concurrently shared. Usually, these GPU sharing solutions place the virtualization boundary at the API level, thus offering the same API as the NVIDIA CUDA Runtime API. 19 Figure 11 depicts the architecture usually deployed by these virtualization solutions, which follows a distributed client-server approach and is very similar to the architecture depicted for the rCUDA middleware in Figure 2.
CUDA-based GPU virtualization frameworks may be classified into two types: (1) those intended to be used in the context of virtual machines and (2) those devised as general-purpose virtualization frameworks to be used in native domains, although the client part of these latter solutions may also be used within a virtual machine. Frameworks in the first category usually make use of shared-memory mechanisms to transfer data from main memory inside the virtual machine to the GPU in the native domain, whereas the general-purpose virtualization frameworks in the second type make use of the network fabric in the cluster to transfer data from the main memory in the client side to the remote GPU located in the server. This is why these latter solutions are commonly known as remote GPU virtualization frameworks.

FIGURE 11 Architecture usually deployed by GPU virtualization frameworks

FIGURE 12 Comparison between the theoretical bandwidth of different versions of PCI Express x16 and those of commercialized InfiniBand fabrics and network adapters
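The API-level interposition underlying these frameworks can be illustrated with a toy, single-process sketch of the client-server split in Figure 11. The class and method names below are ours, not rCUDA's or any real framework's internals; `cudaMalloc` here only mimics the name of the API entry point, and in a real framework the forwarding would be an RPC over TCP/IP or InfiniBand rather than a direct call.

```python
class GPUServer:
    """Toy stand-in for the daemon that owns the physical GPU."""
    def __init__(self, mem_mb=5120):
        self.free_mb = mem_mb
        self.next_handle = 1

    def malloc(self, size_mb):
        """Reserve device memory; return an opaque handle, or None if
        the request does not fit (i.e., the allocation fails)."""
        if size_mb > self.free_mb:
            return None
        self.free_mb -= size_mb
        handle, self.next_handle = self.next_handle, self.next_handle + 1
        return handle

class RemoteRuntime:
    """Client-side library exposing the same entry points the application
    already calls; each call is forwarded to the server that hosts the GPU,
    so the application runs unmodified."""
    def __init__(self, server):
        self._server = server

    def cudaMalloc(self, size_mb):
        return self._server.malloc(size_mb)

server = GPUServer()
rt = RemoteRuntime(server)
buf = rt.cudaMalloc(1024)      # the application only sees an opaque handle
print(buf, server.free_mb)     # 1 4096
```

Because the handle is minted by the server, several clients can hold handles into the same physical GPU at once, which is precisely what enables the concurrent sharing discussed above.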
Regarding the first type of GPU virtualization frameworks mentioned above, several solutions have been developed to be specifically used within virtual machines, such as vCUDA, 17 GViM, 18 gVirtuS, 16 and Shadowfax. 40 Although gVirtuS is mainly intended to be used in virtual machines, granting them access to the real GPU located in the same node, it also provides TCP/IP communications for remote GPU virtualization, thus allowing applications in a nonvirtualized environment to access GPUs located in other nodes.
Regarding Shadowfax, this solution allows Xen virtual machines to access the GPUs located at the same node, although it may also be used to access GPUs at other nodes of the cluster. It supports the obsolete CUDA version 1.1, and additionally, neither the source code nor the binaries are available to evaluate its performance.
In the second type of virtualization frameworks mentioned above, which provide general-purpose GPU virtualization, one can find rCUDA, 11 V-GPU, 41 GridCuda, 14 DS-CUDA, 15 and Shadowfax II. 42 rCUDA was already described in Section 2.1. V-GPU is a recent tool supporting CUDA 4.0. Unfortunately, the information provided by the V-GPU authors is fuzzy, and there is no publicly available version that can be used for testing and comparison. GridCuda also offers access to remote GPUs in a cluster but supports an old CUDA version (v2.3). Moreover, there is currently no publicly available version of GridCuda that can be used for testing. Regarding DS-CUDA, it integrates a more recent version of CUDA (4.1) and includes specific communication support for InfiniBand. However, DS-CUDA presents several strong limitations, such as not allowing data transfers with pinned memory. Finally, Shadowfax II is still under development: it does not present a stable version yet, and its public information is not updated to reflect the current code status.
It is important to notice that remote GPU virtualization has traditionally introduced a nonnegligible overhead, given that applications do not access GPUs attached to the local PCIe link but rather devices installed in other nodes of the cluster, traversing a network fabric with a lower bandwidth. However, this overhead has been significantly reduced by recent advances in networking technologies.
For example, as depicted in Figure 12, the theoretical bandwidth of the InfiniBand network is 12.5 GB/s when using the most recent Mellanox adapters. This has turned remote GPU virtualization frameworks into an appealing option.
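The figures behind this comparison follow directly from the public link specifications. A quick back-of-envelope check of the theoretical unidirectional bandwidths, encoding overheads included (these are link parameters from the standards, not measurements from this paper):

```python
# PCIe 3.0 x16: 8 GT/s per lane, 128b/130b encoding, 16 lanes (bits -> bytes)
pcie3_x16 = 8e9 * (128 / 130) * 16 / 8 / 1e9

# InfiniBand EDR 4x: 25.78125 Gb/s signaling per lane, 64b/66b encoding, 4 lanes
ib_edr_4x = 25.78125e9 * (64 / 66) * 4 / 8 / 1e9

print(f"PCIe 3.0 x16:      {pcie3_x16:.2f} GB/s")  # 15.75 GB/s
print(f"InfiniBand EDR 4x: {ib_edr_4x:.2f} GB/s")  # 12.50 GB/s
```

The remote GPU is thus reachable at roughly 80% of the bandwidth of a local PCIe 3.0 x16 slot, which explains why the virtualization overhead has become small for many applications.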

Workload managers and job schedulers
In addition to Slurm, there is a myriad of job schedulers, such as PBSPro, 43 LoadLeveler, 44 Condor, 45 MOAB, 46 Load Sharing Facility, 47 DPCS, 48 Quadrics RMS, 49 BPROC, 50 TORQUE, 51 OAR, 52 MAUI, 53 Sun Grid Engine, 54 etc. An in-depth analysis can be found in the work of Georgiou. 55 In the following, we briefly review the most important ones to support our choice of Slurm.
The Portable Batch System (PBS) 43 is a commercial scheduler originally developed at NASA. Recently, support for GPU scheduling was introduced in two different flavors: (1) a simple approach in which just a single GPU job can run at a time in a node and (2) an advanced approach that allows several GPU jobs to run concurrently in a node. In any case, sharing a GPU among several jobs is not allowed. Although PBS is portable, its main concern is that it is single threaded and hence exhibits poor performance on large clusters.
LoadLeveler 44 is a commercial tool by IBM and supports very few non-IBM systems, thus reducing its portability to general clusters.
Furthermore, it presents very poor scalability, requiring around 20 min to execute a trivial 8000-task, 500-node assignment. 9

Condor 45 is a parallel job manager developed at the University of Wisconsin. It was the basis for LoadLeveler and presents an elaborate checkpoint/restart feature and an interesting advertising system that allows servers to announce their available resources and consumers to disclose their requirements, so that a broker can later perform matches among them. Although the source code of Condor is available, thus making it a good candidate for introducing virtual GPU awareness, this batch system is less used than Slurm, which is deployed in several of the systems in the TOP500 list. 56

MOAB 46 is a commercial scheduler derived from PBS. It provides the usual features of this kind of system (backfilling, first come first served, preemption, advance reservation, etc), but given that it is only a scheduler, it must be complemented with a resource manager.

The Distributed Production Control System 48 was developed at Lawrence Livermore National Laboratory and provides support only for a few computing systems. Furthermore, it requires an underlying infrastructure for parallel job management.
Quadrics RMS 49 is intended for Unix systems leveraging the Quadrics Elan interconnect, which makes its usability quite limited. Moreover, it is proprietary, making it impossible to use for the purposes of the work presented in this paper.
The Beowulf Distributed Process Space 50 is a suite of libraries and utilities that allow processes to be started in a Beowulf-style cluster. However, achieving scalability with this tool can be difficult.
The rest of the schedulers mentioned above present similar concerns. In contrast, Slurm is a simple, highly scalable, and portable resource management system, which is additionally open source and widely used in the high-performance computing community. These characteristics led us to select Slurm as the target for the modifications to achieve virtual GPU awareness. Indeed, Slurm has previously been extended with new features. For example, in the work of Soner and Özturan, 57 an integer programming-based heterogeneous CPU-GPU cluster scheduler was proposed for Slurm. However, this work did not consider the use of virtual GPUs. Also, in the work of Soner et al, 58 the use of GPU ranges is proposed. Such a feature can be very useful to runtime autotuning applications and systems that can make use of a variable number of GPUs. However, this work does not consider the use of virtual GPUs that are decoupled from the CPU cores. Furthermore, in the work of Sabin and Sadayappan, 59 the reliability of job schedulers, with a focus on Slurm, is analyzed and two proposals are made. Finally, in the work of Balle and Palermo, 60 the features provided by Slurm are enhanced with multicore/multithreaded support. As can be seen, Slurm has been extended many times to accommodate new technology trends.

CONCLUSIONS
In this paper, we have carried out a thorough performance evaluation of a cluster using a modified version of Slurm that is able to schedule the use of the virtual GPUs provided by the rCUDA middleware. The main idea is that the rCUDA middleware decouples GPUs from the nodes where they are installed, therefore making the scheduling process much more flexible while achieving a better usage of resources.
Results from the execution of seven different workloads composed of eight applications in three different cluster configurations suggest that cluster performance can be noticeably increased just by modifying the Slurm scheduler and introducing rCUDA in the cluster. It is also expected that, as GPUs feature larger memory sizes, the benefits presented in this work will become even larger.