Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks
Authors: Castelló, Adrián; Catalán Carbó, Mar; Dolz, Manuel F.; Quintana-Orti, Enrique S.; Duato, José
Metadata
Title: Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks
Publication date: 2022-01-10
Publisher: Springer
Bibliographic citation: Castelló, A., Catalán, M., Dolz, M.F. et al. Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks. Computing, 105, 1101–1119 (2023). https://doi.org/10.1007/s00607-021-01029-2
Document type: info:eu-repo/semantics/article
Version: info:eu-repo/semantics/publishedVersion
Abstract:
For many distributed applications, data communication poses an important bottleneck from the points of view of performance and energy consumption. As more cores are integrated per node, in general the global performance of the system increases yet eventually becomes limited by the interconnection network. This is the case for distributed data-parallel training of convolutional neural networks (CNNs), which usually proceeds on a cluster with a small to moderate number of nodes. In this paper, we analyze the performance of the Allreduce collective communication primitive, a key to the efficient data-parallel distributed training of CNNs. Our study targets the distinct realizations of this primitive in three high performance instances of Message Passing Interface (MPI), namely MPICH, OpenMPI, and IntelMPI, and employs a cluster equipped with state-of-the-art processor and network technologies. In addition, we apply the insights gained from the experimental analysis to the optimization of the TensorFlow framework when running on top of Horovod. Our study reveals that a careful selection of the most convenient MPI library and Allreduce (ARD) realization accelerates the training throughput by a factor of 1.2× compared with the default algorithm in the same MPI library, and up to 2.8× when comparing distinct MPI libraries in a number of relevant combinations of CNN model+dataset.
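
To make the role of this primitive concrete, the following is a minimal sketch (not taken from the paper) of data-parallel gradient aggregation with MPI_Allreduce: each rank computes a gradient over its local mini-batch, the collective sums the per-rank contributions, and every rank averages the result before the weight update. The buffer names and model size are illustrative assumptions.

```c
/* Minimal sketch of data-parallel gradient aggregation with
 * MPI_Allreduce. NUM_PARAMS and the gradient buffers are
 * illustrative, not taken from the paper. */
#include <mpi.h>
#include <stdlib.h>

#define NUM_PARAMS (1 << 20)   /* illustrative model size: 1M weights */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank holds the gradient computed on its local mini-batch. */
    float *local_grad  = malloc(NUM_PARAMS * sizeof(float));
    float *global_grad = malloc(NUM_PARAMS * sizeof(float));
    for (int i = 0; i < NUM_PARAMS; i++)
        local_grad[i] = (float)(rank + 1);  /* placeholder for backprop output */

    /* Sum the per-rank gradients; every rank receives the result. */
    MPI_Allreduce(local_grad, global_grad, NUM_PARAMS,
                  MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    /* Average before the weight update (data-parallel SGD). */
    for (int i = 0; i < NUM_PARAMS; i++)
        global_grad[i] /= (float)size;

    free(local_grad);
    free(global_grad);
    MPI_Finalize();
    return 0;
}
```

Which internal algorithm services this call (ring, recursive doubling, and so on) is decided inside the MPI library and can typically be overridden through library-specific controls, for example Intel MPI's I_MPI_ADJUST_ALLREDUCE environment variable or Open MPI's coll_tuned_allreduce_algorithm MCA parameter; this per-library, per-algorithm choice is what the paper's experiments compare.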
Published in: Computing (2023)
Funding entities: Ministerio de Ciencia, Innovación y Universidades (Spain) | Generalitat Valenciana
Project or grant code: TIN2017-82972-R | Prometeo/2019/109 | CDEIGENT/2018/014 | FJC2019-039222-I
Access rights: © The Author(s), under exclusive licence to Springer-Verlag GmbH Austria, part of Springer Nature 2021
http://rightsstatements.org/vocab/InC/1.0/
info:eu-repo/semantics/openAccess
Appears in collections:
- ICC_Articles [423]