Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks
Authors
Castelló, Adrián; Catalán Carbó, Mar; Dolz, Manuel F.; Quintana-Orti, Enrique S.; Duato, José
Title
Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks
Publication date
2022-01-10
Publisher
Springer
Bibliographic citation
Castelló, A., Catalán, M., Dolz, M.F. et al. Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks. Computing, 105, 1101–1119 (2023). https://doi.org/10.1007/s00607-021-01029-2
Document type
info:eu-repo/semantics/article
Version
info:eu-repo/semantics/publishedVersion
Abstract
For many distributed applications, data communication poses an important bottleneck
from the points of view of performance and energy consumption. As more cores
are integrated per node, in general the global performance of the system increases
yet eventually becomes limited by the interconnection network. This is the case for
distributed data-parallel training of convolutional neural networks (CNNs), which
usually proceeds on a cluster with a small to moderate number of nodes. In this paper,
we analyze the performance of the Allreduce collective communication primitive, a
key to the efficient data-parallel distributed training of CNNs. Our study targets the
distinct realizations of this primitive in three high performance instances of Message
Passing Interface (MPI), namely MPICH, OpenMPI, and IntelMPI, and employs a
cluster equipped with state-of-the-art processor and network technologies. In addition,
we apply the insights gained from the experimental analysis to the optimization of the
TensorFlow framework when running on top of Horovod. Our study reveals that a
careful selection of the most convenient MPI library and Allreduce (ARD) realization
accelerates the training throughput by a factor of 1.2× compared with the default
algorithm in the same MPI library, and up to 2.8× when comparing distinct MPI
libraries in a number of relevant combinations of CNN model+dataset.
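Since the abstract centers on the distinct realizations of the Allreduce primitive, a sketch of one well-known realization may help illustrate what the MPI libraries are choosing between. The following is a minimal, pure-Python simulation of the ring-allreduce algorithm (the bandwidth-saving realization popularized in deep learning by Horovod); it is an illustrative assumption for intuition, not code from the paper or from any MPI library.

```python
# Illustrative simulation of the ring-allreduce algorithm, one common
# realization of MPI's Allreduce. Ranks are simulated as entries of a
# Python list; this is a sketch, not code from the paper.

def ring_allreduce(buffers):
    """Sum-allreduce over P simulated ranks.

    buffers: list of P equal-length lists, one per rank (modified in place).
    Returns buffers; afterwards every rank holds the element-wise sum.
    """
    P = len(buffers)
    n = len(buffers[0])
    # Partition indices [0, n) into P contiguous chunks.
    bounds = [i * n // P for i in range(P + 1)]

    def chunk(c):
        c %= P
        return bounds[c], bounds[c + 1]

    # Phase 1: reduce-scatter (P - 1 steps). At step s, rank r sends
    # chunk (r - s) mod P to rank (r + 1) mod P, which accumulates it.
    # Afterwards, rank r holds the full sum of chunk (r + 1) mod P.
    for s in range(P - 1):
        for r in range(P):
            lo, hi = chunk(r - s)
            src, dst = buffers[r], buffers[(r + 1) % P]
            for i in range(lo, hi):
                dst[i] += src[i]

    # Phase 2: allgather (P - 1 steps). At step s, rank r forwards its
    # fully reduced chunk (r + 1 - s) mod P; the neighbour overwrites.
    for s in range(P - 1):
        for r in range(P):
            lo, hi = chunk(r + 1 - s)
            buffers[(r + 1) % P][lo:hi] = buffers[r][lo:hi]
    return buffers

# Three simulated ranks, each contributing a vector of six values;
# after the call every rank holds the element-wise sum.
result = ring_allreduce([[1.0] * 6, [2.0] * 6, [3.0] * 6])
```

Each of the 2(P−1) steps moves only about n/P elements per rank, which is why the ring algorithm is favored for large messages; MPI libraries such as MPICH, OpenMPI, and IntelMPI switch among several Allreduce algorithms depending on message size and rank count, and that choice is precisely what the paper's analysis exploits.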
Published in
Computing (2023)
Funding entities
Ministerio de Ciencia, Innovación y Universidades (Spain) | Generalitat Valenciana
Project or grant codes
TIN2017-82972-R | Prometeo/2019/109 | CDEIGENT/2018/014 | FJC2019-039222-I
Access rights
© The Author(s), under exclusive licence to Springer-Verlag GmbH Austria, part of Springer Nature 2021
http://rightsstatements.org/vocab/InC/1.0/
info:eu-repo/semantics/openAccess
Appears in collections
- ICC_Articles [417]