Show simple item record

dc.contributor.author       Castelló, Adrián
dc.contributor.author       Catalán Carbó, Mar
dc.contributor.author       Dolz, Manuel F.
dc.contributor.author       Quintana-Orti, Enrique S.
dc.contributor.author       Duato, José
dc.date.accessioned         2022-02-16T12:12:36Z
dc.date.available           2022-02-16T12:12:36Z
dc.date.issued              2022-01-10
dc.identifier.citation      Castelló, A., Catalán, M., Dolz, M.F. et al. Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks. Computing, 105, 1101–1119 (2023). https://doi.org/10.1007/s00607-021-01029-2   ca_CA
dc.identifier.uri           http://hdl.handle.net/10234/196782
dc.description.abstract     For many distributed applications, data communication poses an important bottleneck from the points of view of performance and energy consumption. As more cores are integrated per node, in general the global performance of the system increases yet eventually becomes limited by the interconnection network. This is the case for distributed data-parallel training of convolutional neural networks (CNNs), which usually proceeds on a cluster with a small to moderate number of nodes. In this paper, we analyze the performance of the Allreduce collective communication primitive, a key to the efficient data-parallel distributed training of CNNs. Our study targets the distinct realizations of this primitive in three high performance instances of Message Passing Interface (MPI), namely MPICH, OpenMPI, and IntelMPI, and employs a cluster equipped with state-of-the-art processor and network technologies. In addition, we apply the insights gained from the experimental analysis to the optimization of the TensorFlow framework when running on top of Horovod. Our study reveals that a careful selection of the most convenient MPI library and Allreduce (ARD) realization accelerates the training throughput by a factor of 1.2× compared with the default algorithm in the same MPI library, and up to 2.8× when comparing distinct MPI libraries in a number of relevant combinations of CNN model+dataset.   ca_CA
dc.format.extent            19 p.   ca_CA
dc.format.mimetype          application/pdf   ca_CA
dc.language.iso             eng   ca_CA
dc.publisher                Springer   ca_CA
dc.relation.isPartOf        Computing (2023)   ca_CA
dc.rights                   © The Author(s), under exclusive licence to Springer-Verlag GmbH Austria, part of Springer Nature 2021   ca_CA
dc.rights.uri               http://rightsstatements.org/vocab/InC/1.0/   ca_CA
dc.subject                  message passing interface (MPI)   ca_CA
dc.subject                  collective communication primitives   ca_CA
dc.subject                  Allreduce   ca_CA
dc.subject                  deep learning   ca_CA
dc.subject                  distributed training   ca_CA
dc.title                    Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks   ca_CA
dc.type                     info:eu-repo/semantics/article   ca_CA
dc.identifier.doi           https://doi.org/10.1007/s00607-021-01029-2
dc.rights.accessRights      info:eu-repo/semantics/openAccess   ca_CA
dc.type.version             info:eu-repo/semantics/publishedVersion   ca_CA
project.funder.name         Ministerio de Ciencia, Innovación y Universidades (Spain)   ca_CA
project.funder.name         Generalitat Valenciana   ca_CA
oaire.awardNumber           TIN2017-82972-R   ca_CA
oaire.awardNumber           Prometeo/2019/109   ca_CA
oaire.awardNumber           CDEIGENT/2018/014   ca_CA
oaire.awardNumber           FJC2019-039222-I   ca_CA
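
For context, the sketch below (not part of the record or of the paper itself) illustrates the MPI_Allreduce primitive that the abstract analyzes: in data-parallel training, each rank sums its local gradient vector with those of every other rank and receives the global result, which Horovod then uses to update the model replica on each worker. Buffer length and gradient values are placeholders.

    /* Minimal sketch (assumed example, not the paper's code): gradient
     * aggregation with MPI_Allreduce as used in data-parallel training. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        const int n = 1 << 20;                       /* placeholder gradient length */
        float *grad = malloc(n * sizeof(float));     /* local gradients of this rank */
        float *sum  = malloc(n * sizeof(float));     /* globally reduced gradients   */
        for (int i = 0; i < n; i++) grad[i] = 1.0f;  /* placeholder values           */

        /* Sum the local gradient vectors across all ranks; every rank
         * receives the reduced result. */
        MPI_Allreduce(grad, sum, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        free(grad);
        free(sum);
        MPI_Finalize();
        return 0;
    }

Each MPI library studied in the paper (MPICH, OpenMPI, IntelMPI) provides several internal algorithms for this single call, which is why the choice of library and Allreduce realization affects training throughput.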


Files in this item


This item appears in the following collection(s)
