Show simple item record

dc.contributor.author: Barrachina Mir, Sergio
dc.contributor.author: Castelló, Adrián
dc.contributor.author: Catalán Carbó, Mar
dc.contributor.author: Dolz, Manuel F.
dc.contributor.author: Mestre Miravet, Jose Ignacio
dc.date.accessioned: 2021-10-14T12:05:23Z
dc.date.available: 2021-10-14T12:05:23Z
dc.date.issued: 2021-08-30
dc.identifier.citation: Barrachina, S., Castelló, A., Catalán, M. et al. Using machine learning to model the training scalability of convolutional neural networks on clusters of GPUs. Computing 105, 915–934 (2023). https://doi.org/10.1007/s00607-021-00997-9
dc.identifier.uri: http://hdl.handle.net/10234/195009
dc.description.abstract: In this work, we build a general piece-wise model to analyze data-parallel (DP) training costs of convolutional neural networks (CNNs) on clusters of GPUs. This general model is based on i) multi-layer perceptrons (MLPs) in charge of modeling the NVIDIA cuDNN/cuBLAS library kernels involved in the training of some of the state-of-the-art CNNs; and ii) an analytical model in charge of modeling the NVIDIA NCCL Allreduce collective primitive using the Ring algorithm. The CNN training scalability study performed using this model in combination with the Roofline technique on varying batch sizes, node (floating-point) arithmetic performance, node memory bandwidth, network link bandwidth, and cluster dimension unveils some crucial bottlenecks at both the GPU and cluster levels. To provide evidence of this analysis, we validate the accuracy of the proposed model against a Python library for distributed deep learning training.
dc.description.sponsorShip: Funding for open access charge: CRUE-Universitat Jaume I
dc.format.extent: 20 p.
dc.format.mimetype: application/pdf
dc.language.iso: eng
dc.publisher: Springer
dc.relation: Open Access funding provided
dc.relation.isPartOf: Computing 105, 915–934 (2023)
dc.rights: © The Author(s) 2021
dc.rights.uri: http://creativecommons.org/licenses/by-sa/4.0/
dc.subject: deep neural networks (DNNs)
dc.subject: distributed training
dc.subject: multi-layer perceptron (MLP) based modeling
dc.subject: analytical modeling
dc.subject: clusters
dc.subject: GPUs
dc.title: Using machine learning to model the training scalability of convolutional neural networks on clusters of GPUs
dc.type: info:eu-repo/semantics/article
dc.identifier.doi: https://doi.org/10.1007/s00607-021-00997-9
dc.rights.accessRights: info:eu-repo/semantics/openAccess
dc.type.version: info:eu-repo/semantics/publishedVersion
project.funder.name: CRUE-CSIC agreement with Springer Nature
project.funder.name: Ministerio de Ciencia, Innovación y Universidades (Spain)
project.funder.name: Generalitat Valenciana
oaire.awardNumber: TIN2017-82972-R
oaire.awardNumber: Prometeo/2019/109
oaire.awardNumber: Plan GenT project CDEIGENT/2018/014
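
Note: as a rough illustration of the analytical component mentioned in the abstract (the NCCL Ring Allreduce model), the commonly used alpha-beta cost model for a Ring Allreduce of a message of n bytes across p GPUs, with per-step latency \alpha and link bandwidth \beta, is the standard textbook formulation and not necessarily the exact expression derived in the paper:

T_{\mathrm{ring}}(n, p) \approx 2(p-1)\,\alpha + \frac{2(p-1)}{p} \cdot \frac{n}{\beta}

The first term accounts for the 2(p-1) communication steps of the reduce-scatter and allgather phases, and the second for the n/p bytes exchanged per step over a link of bandwidth \beta.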


Files in this item


This item appears in the following collection(s)


© The Author(s) 2021
Except where otherwise noted, this item's license is described as: © The Author(s) 2021