Show simple item record

dc.contributor.author: Barrachina Mir, Sergio
dc.contributor.author: Castelló, Adrián
dc.contributor.author: Catalán Carbó, Mar
dc.contributor.author: Dolz, Manuel F.
dc.contributor.author: Mestre Miravet, Jose Ignacio
dc.date.accessioned: 2021-10-14T12:05:23Z
dc.date.available: 2021-10-14T12:05:23Z
dc.date.issued: 2021-08-30
dc.identifier.citation: Barrachina, S., Castelló, A., Catalán, M. et al. Using machine learning to model the training scalability of convolutional neural networks on clusters of GPUs. Computing 105, 915–934 (2023). https://doi.org/10.1007/s00607-021-00997-9
dc.identifier.uri: http://hdl.handle.net/10234/195009
dc.description.abstract: In this work, we build a general piece-wise model to analyze data-parallel (DP) training costs of convolutional neural networks (CNNs) on clusters of GPUs. This general model is based on i) multi-layer perceptrons (MLPs) in charge of modeling the NVIDIA cuDNN/cuBLAS library kernels involved in the training of some of the state-of-the-art CNNs; and ii) an analytical model in charge of modeling the NVIDIA NCCL Allreduce collective primitive using the Ring algorithm. The CNN training scalability study performed using this model in combination with the Roofline technique on varying batch sizes, node (floating-point) arithmetic performance, node memory bandwidth, network link bandwidth, and cluster dimension unveils some crucial bottlenecks at both the GPU and cluster levels. To provide evidence of this analysis, we validate the accuracy of the proposed model against a Python library for distributed deep learning training.
dc.description.sponsorShip: Funding for open access charge: CRUE-Universitat Jaume I
dc.format.extent: 20 p.
dc.format.mimetype: application/pdf
dc.language.iso: eng
dc.publisher: Springer
dc.relation: Open Access funding provided
dc.relation.isPartOf: Computing 105, 915–934 (2023)
dc.rights: © The Author(s) 2021
dc.rights.uri: http://creativecommons.org/licenses/by-sa/4.0/
dc.subject: deep neural networks (DNNs)
dc.subject: distributed training
dc.subject: multi-layer perceptron (MLP) based modeling
dc.subject: analytical modeling
dc.subject: clusters
dc.subject: GPUs
dc.title: Using machine learning to model the training scalability of convolutional neural networks on clusters of GPUs
dc.type: info:eu-repo/semantics/article
dc.identifier.doi: https://doi.org/10.1007/s00607-021-00997-9
dc.rights.accessRights: info:eu-repo/semantics/openAccess
dc.type.version: info:eu-repo/semantics/publishedVersion
project.funder.name: CRUE-CSIC agreement with Springer Nature
project.funder.name: Ministerio de Ciencia, Innovación y Universidades (Spain)
project.funder.name: Generalitat Valenciana
oaire.awardNumber: TIN2017-82972-R
oaire.awardNumber: Prometeo/2019/109
oaire.awardNumber: Plan GenT project CDEIGENT/2018/014
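
Note: as a rough illustration of the analytical component mentioned in the abstract (the NCCL Ring Allreduce model), the commonly used alpha-beta cost model for a Ring Allreduce of a message of n bytes across p GPUs, with per-step latency \alpha and link bandwidth \beta, is the standard textbook formulation and not necessarily the exact expression derived in the paper:

T_{\mathrm{ring}}(n, p) \approx 2(p-1)\,\alpha + \frac{2(p-1)}{p} \cdot \frac{n}{\beta}

The first term accounts for the 2(p-1) communication steps of the reduce-scatter and allgather phases, and the second for the n/p bytes exchanged per step over a link of bandwidth \beta.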


Files in this item


This item appears in the following collection(s)


© The Author(s) 2021
Except where otherwise noted, this item's license is described as: © The Author(s) 2021