Using machine learning to model the training scalability of convolutional neural networks on clusters of GPUs
Other documents by the authors: Barrachina Mir, Sergio; Castelló, Adrián; Catalán Carbó, Mar; Dolz, Manuel F.; Mestre Miravet, Jose Ignacio
Metadata
Title
Using machine learning to model the training scalability of convolutional neural networks on clusters of GPUs
Publication date
2021-08-30
Publisher
Springer
Bibliographic citation
Barrachina, S., Castelló, A., Catalán, M. et al. Using machine learning to model the training scalability of convolutional neural networks on clusters of GPUs. Computing 105, 915–934 (2023). https://doi.org/10.1007/s00607-021-00997-9
Document type
info:eu-repo/semantics/article
Version
info:eu-repo/semantics/publishedVersion
Abstract
In this work, we build a general piece-wise model to analyze data-parallel (DP) training costs of convolutional neural networks (CNNs) on clusters of GPUs. This general model is based on i) multi-layer perceptrons (MLPs) in charge of modeling the NVIDIA cuDNN/cuBLAS library kernels involved in the training of some of the state-of-the-art CNNs; and ii) an analytical model in charge of modeling the NVIDIA NCCL Allreduce collective primitive using the Ring algorithm. The CNN training scalability study performed using this model in combination with the Roofline technique on varying batch sizes, node (floating-point) arithmetic performance, node memory bandwidth, network link bandwidth, and cluster dimension unveils some crucial bottlenecks at both GPU and cluster level. To provide evidence of this analysis, we validate the accuracy of the proposed model against a Python library for distributed deep learning training.
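Illustrative note: this record does not reproduce the paper's equations, so the Python sketch below is not the authors' model. It only shows the two standard building blocks the abstract names, under stated assumptions: the textbook latency-bandwidth cost of a ring Allreduce (a reduce-scatter followed by an allgather, each taking p - 1 steps) and the basic Roofline bound on kernel time. The function names, parameter names, and sample numbers are hypothetical.

```python
def ring_allreduce_time(n_bytes, p, alpha, bandwidth):
    """Textbook cost of a ring Allreduce over p nodes (assumed model).

    n_bytes   : size of the buffer to reduce, in bytes
    p         : number of nodes (GPUs) in the ring
    alpha     : per-message latency, in seconds
    bandwidth : link bandwidth, in bytes per second
    """
    if p < 2:
        return 0.0  # nothing to communicate with a single node
    steps = 2 * (p - 1)          # reduce-scatter + allgather
    chunk = n_bytes / p          # each step moves one chunk of the buffer
    return steps * alpha + steps * chunk / bandwidth


def roofline_time(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Roofline lower bound on kernel time: the slower of compute and memory."""
    return max(flops / peak_flops, bytes_moved / mem_bandwidth)


if __name__ == "__main__":
    # Hypothetical numbers: 100 MB of gradients, 8 GPUs, 5 us latency, 10 GB/s links.
    t_comm = ring_allreduce_time(100e6, 8, 5e-6, 10e9)
    # Hypothetical kernel: 2 GFLOP and 100 MB of traffic on a 10 TFLOP/s, 1 TB/s GPU.
    t_comp = roofline_time(2e9, 1e8, 1e13, 1e12)
    print(f"allreduce: {t_comm:.6f} s, kernel bound: {t_comp:.6f} s")
```

In this simplified form the bandwidth term approaches 2 * n_bytes / bandwidth as p grows, while the latency term grows linearly with p, which is one way a cluster-level bottleneck of the kind the abstract mentions can appear.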
Published in
Computing 105, 915–934 (2023)
Funding entity
CRUE-CSIC agreement with Springer Nature | Ministerio de Ciencia, Innovación y Universidades (Spain) | Generalitat Valenciana
Project or grant code
TIN2017-82972-R | Prometeo/2019/109 | Plan GenT project CDEIGENT/2018/014
Project or grant title
Open Access funding provided
Access rights
© The Author(s) 2021
info:eu-repo/semantics/openAccess
Appears in collections
- ICC_Articles [427]