A complete and efficient CUDA-sharing solution for HPC clusters
Authors: Peña Monferrer, Antonio J.; Reaño, Carlos; Silla, Federico; Mayo, Rafael; Quintana-Orti, Enrique S.; Duato, José
Access to this resource is restricted.
DOI: http://dx.doi.org/10.1016/j.parco.2014.09.011
Metadata
Title: A complete and efficient CUDA-sharing solution for HPC clusters
Date of publication: 2014
Publisher: Elsevier
ISSN: 0167-8191
Document type: info:eu-repo/semantics/article
Publisher's version: http://www.sciencedirect.com/science/article/pii/S0167819114001227
Abstract
In this paper we detail the key features, architectural design, and implementation of rCUDA,
an advanced framework to enable remote and transparent GPGPU acceleration in HPC
clusters. rCUDA allows decoupling GPUs from nodes, forming pools of shared accelerators,
which brings enhanced flexibility to cluster configurations. This opens the door to configurations
with fewer accelerators than nodes, as well as permits a single node to exploit the
whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly
interact with any GPU in the cluster, independently of its physical location. Thus,
GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU
servers, depending on the cluster administrator’s policy. This proposal leads to savings not
only in space but also in energy, acquisition, and maintenance costs. The performance evaluation
in this paper with a series of benchmarks and a production application clearly demonstrates
the viability of this proposal. Concretely, experiments with the matrix–matrix
product reveal excellent performance compared with regular executions on the local
GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up
to 11x speedup employing 8 remote accelerators from a single node with respect to a
12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration
in dedicated GPGPU servers, and data transfer performance of similar GPU virtualization
frameworks are also evaluated.
Published in: Parallel Computing, Volume 40, Issue 10, December 2014
Access rights:
© 2014 Elsevier B.V. All rights reserved.
http://rightsstatements.org/vocab/InC/1.0/
info:eu-repo/semantics/restrictedAccess
Appears in collections:
- ICC_Articles [413]