    • openAccess   Analysis of Threading Libraries for High Performance Computing 

      Castelló, Adrián; Mayo, Rafael; Seo, Sangmin; Balaji, Pavan; Quintana-Orti, Enrique S.; Peña Monferrer, Antonio J. IEEE (2020-01-30)
      With the appearance of multi- and many-core machines, applications and runtime systems evolved in order to exploit the new on-node concurrency that brought new software paradigms. POSIX threads (Pthreads) was widely adopted for ...
    • openAccess   Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks 

      Castelló, Adrián; Catalán Carbó, Mar; Dolz, Manuel F.; Quintana-Orti, Enrique S.; Duato, José Springer (2022-01-10)
      For many distributed applications, data communication poses an important bottleneck from the points of view of performance and energy consumption. As more cores are integrated per node, in general the global performance ...
    • openAccess   Argobots: A Lightweight Low-Level Threading and Tasking Framework 

      Seo, Sangmin; Amer, Abdelhalim; Balaji, Pavan; Bordage, Cyril; Bosilca, George; Brooks, Alex; Carns, Philip; Castelló, Adrián; Genet, Damien; Herault, Thomas; Iwasaki, Shintaro; Jindal, Prateek; Kalé, Laxmikant V.; Krishnamoorthy, Sriram; Lifflander, Jonathan; Lu, Huiwei; Meneses, Esteban; Snir, Marc; Sun, Yanhua; Taura, Kenjiro; Beckman, Pete IEEE (2017-10)
      In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current ...
    • openAccess   BestOf: an online implementation selector for the training and inference of deep neural networks 

      Barrachina Mir, Sergio; Castelló, Adrián; Dolz, Manuel F.; Tomás, Andrés E. Springer (2022-05-20)
      Tuning and optimising the operations executed in deep learning frameworks is a fundamental task in accelerating the processing of deep neural networks (DNNs). However, this optimisation usually requires extensive manual ...
    • openAccess   Efficient and portable Winograd convolutions for multi-core processors 

      Dolz, Manuel F.; Martínez, Héctor; Castelló, Adrián; Alonso-Jordá, Pedro; Quintana-Orti, Enrique S. Springer (2023-02-12)
      We take a step forward towards developing high-performance codes for the convolution operator, based on the Winograd algorithm, that are easy to customise for general-purpose processor architectures. In our approach, ...
    • openAccess   Exploring the interoperability of remote GPGPU virtualization using rCUDA and directive-based programming models 

      Castelló, Adrián; Pena, Antonio J.; Mayo, Rafael; Planas, Judit; Quintana-Orti, Enrique S.; Balaji, Pavan Springer (2016-06-21)
      Directive-based programming models, such as OpenMP, OpenACC, and OmpSs, enable users to accelerate applications by using coprocessors with little effort. These devices offer significant computing power, but their use can ...
    • openAccess   High performance and energy efficient inference for deep learning on multicore ARM processors using general optimization techniques and BLIS 

      Castelló, Adrián; Barrachina Mir, Sergio; Dolz, Manuel F.; Quintana-Orti, Enrique S.; San Juan, Pau; Tomás Domínguez, Andrés Enrique Elsevier (2022-03-22)
      We evolve PyDTNN, a framework for distributed parallel training of Deep Neural Networks (DNNs), into an efficient inference tool for convolutional neural networks. Our optimization process on multicore ARM processors ...
    • closedAccess   Improving the user experience of the rCUDA remote GPU virtualization framework 

      Reaño, Carlos; Silla, Federico; Castelló, Adrián; Peña Monferrer, Antonio J.; Mayo, Rafael; Quintana-Orti, Enrique S. Wiley (2014-10)
      Graphics processing units (GPUs) are being increasingly embraced by the high-performance computing community as an effective way to reduce execution time by accelerating parts of their applications. remote CUDA (rCUDA) was ...
    • closedAccess   On the adequacy of lightweight thread approaches for high-level parallel programming models 

      Castelló, Adrián; Mayo, Rafael; Sala, Kevin; Beltran Querol, Vicenç; Balaji, Pavan; Peña Monferrer, Antonio J. Elsevier (2018-07)
      High-level parallel programming models (PMs) are becoming crucial in order to extract the computational power of current on-node multi-threaded parallelism. The most popular PMs, such as OpenMP or OmpSs, are directive-based: ...
    • openAccess   Performance–energy trade-offs of deep learning convolution algorithms on ARM processors 

      Dolz, Manuel F.; Barrachina Mir, Sergio; Martínez, Héctor; Castelló, Adrián; Maciá, Antonio; Fabregat Llueca, German; Tomás, Andrés E. Springer (2023)
      In this work, we assess the performance and energy efficiency of high-performance codes for the convolution operator, based on the direct, explicit/implicit lowering and Winograd algorithms used for deep learning (DL) ...
    • openAccess   Programming parallel dense matrix factorizations with look-ahead and OpenMP 

      Catalán, Sandra; Castelló, Adrián; Igual, Francisco; Rodríguez Sánchez, Rafael; Quintana-Orti, Enrique S. Springer (2019)
      We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multi-threaded ...
    • openAccess   PyDTNN: A user-friendly and extensible framework for distributed deep learning 

      Barrachina Mir, Sergio; Castelló, Adrián; Catalán Carbó, Mar; Dolz, Manuel F.; Mestre Miravet, Jose Ignacio Springer (2021-02-22)
      We introduce a framework for training deep neural networks on clusters of computers with the following appealing properties: (1) It is developed in Python, exposing an amiable interface that provides an accessible entry ...
    • openAccess   Reformulating the direct convolution for high-performance deep learning inference on ARM processors 

      Barrachina Mir, Sergio; Castelló, Adrián; Dolz, Manuel F.; Low, Tze Meng; Martinez, Hector; Quintana-Orti, Enrique S.; Upasana, Sridhar; Tomás Domínguez, Andrés Enrique Elsevier (2022-12-20)
      We present two high-performance implementations of the convolution operator via the direct algorithm that outperform the so-called lowering approach based on the im2col transform plus the gemm kernel on an ARMv8-based ...
    • openAccess   Using machine learning to model the training scalability of convolutional neural networks on clusters of GPUs 

      Barrachina Mir, Sergio; Castelló, Adrián; Catalán Carbó, Mar; Dolz, Manuel F.; Mestre Miravet, Jose Ignacio Springer (2021-08-30)
      In this work, we build a general piece-wise model to analyze data-parallel (DP) training costs of convolutional neural networks (CNNs) on clusters of GPUs. This general model is based on i) multi-layer perceptrons (MLPs) ...