Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks
Author(s)
Castelló, Adrián; Catalán Carbó, Mar; Dolz, Manuel F.; Quintana-Orti, Enrique S.; Duato, José
Date
2022-01-10
Publisher
Springer
Bibliographic citation
Castelló, A., Catalán, M., Dolz, M.F. et al. Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks. Computing, 105, 1101–1119 (2023). https://doi.org/10.1007/s00607-021-01029-2
Type
info:eu-repo/semantics/article
Version
info:eu-repo/semantics/publishedVersion
Abstract
For many distributed applications, data communication poses an important bottleneck
from the points of view of performance and energy consumption. As more cores
are integrated per node, in general the global performance of the system increases
yet eventually becomes limited by the interconnection network. This is the case for
distributed data-parallel training of convolutional neural networks (CNNs), which
usually proceeds on a cluster with a small to moderate number of nodes. In this paper,
we analyze the performance of the Allreduce collective communication primitive, a
key to the efficient data-parallel distributed training of CNNs. Our study targets the
distinct realizations of this primitive in three high performance instances of Message
Passing Interface (MPI), namely MPICH, OpenMPI, and IntelMPI, and employs a
cluster equipped with state-of-the-art processor and network technologies. In addition,
we apply the insights gained from the experimental analysis to the optimization of the
TensorFlow framework when running on top of Horovod. Our study reveals that a
careful selection of the most convenient MPI library and Allreduce (ARD) realization
accelerates the training throughput by a factor of 1.2× compared with the default
algorithm in the same MPI library, and up to 2.8× when comparing distinct MPI
libraries in a number of relevant combinations of CNN model+dataset.
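To illustrate the role the abstract assigns to Allreduce in data-parallel training, the sketch below simulates its semantics in plain Python, without any real MPI library: each simulated rank contributes a local gradient buffer, and after the reduction every rank holds the identical global sum, which it averages before updating its model replica. The function name `allreduce_sum` and the toy gradient values are purely illustrative and not taken from the paper.

```python
# Illustrative sketch (not real MPI): the semantics of Allreduce with a SUM
# operation, as used in data-parallel CNN training. Each "rank" holds a local
# gradient buffer; after the collective, every rank holds the same global sum.

def allreduce_sum(local_buffers):
    """Simulate Allreduce(SUM) over a list of equally sized per-rank buffers."""
    n = len(local_buffers[0])
    # Element-wise sum across all ranks' buffers.
    total = [sum(buf[i] for buf in local_buffers) for i in range(n)]
    # Every rank receives its own copy of the same reduced result.
    return [total[:] for _ in local_buffers]

# Four simulated ranks, each with a local gradient vector.
local_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
reduced = allreduce_sum(local_grads)
# Each rank divides by the number of workers to obtain the averaged gradient.
avg_grads = [[g / len(local_grads) for g in buf] for buf in reduced]
```

In a real MPI program this step is a single `MPI_Allreduce` call, and the paper's point is that its internal algorithm (ring, recursive doubling, etc.) and the MPI library chosen can change end-to-end training throughput substantially.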
Is part of
Computing (2023)
Funder Name
Ministerio de Ciencia, Innovación y Universidades (Spain) | Generalitat Valenciana
Project code
TIN2017-82972-R | Prometeo/2019/109 | CDEIGENT/2018/014 | FJC2019-039222-I
Rights
© The Author(s), under exclusive licence to Springer-Verlag GmbH Austria, part of Springer Nature 2021
http://rightsstatements.org/vocab/InC/1.0/
info:eu-repo/semantics/openAccess
This item appears in the following collection(s)
- ICC_Articles [419]