Browsing ICC_Articles by author "46ea45b0-d88f-4f00-86fa-bae29663d7e1"

Analysis of Threading Libraries for High Performance Computing

Castelló, Adrián; Mayo, Rafael; Seo, Sangmin; Balaji, Pavan; Quintana-Orti, Enrique S.; Peña Monferrer, Antonio J. IEEE (2020-01-30)

With the appearance of multi-many core machines, applications and runtime systems evolved in order to exploit the new on-node concurrency that brought new software paradigms. POSIX threads (Pthreads) was widely-adopted for ...

Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks

Castelló, Adrián; Catalán Carbó, Mar; Dolz, Manuel F.; Quintana-Orti, Enrique S.; Duato, José Springer (2022-01-10)

For many distributed applications, data communication poses an important bottleneck from the points of view of performance and energy consumption. As more cores are integrated per node, in general the global performance ...

Argobots: A Lightweight Low-Level Threading and Tasking Framework

Seo, Sangmin; Amer, Abdelhalim; Balaji, Pavan; Bordage, Cyril; Bosilca, George; Brooks, Alex; Carns, Philip; Castelló, Adrián; Genet, Damien; Herault, Thomas; Iwasaki, Shintaro; Jindal, Prateek; Kalé, Laxmikant V.; Krishnamoorthy, Sriram; Lifflander, Jonathan; Lu, Huiwei; Meneses, Esteban; Snir, Marc; Sun, Yanhua; Taura, Kenjiro; Beckman, Pete IEEE (2017-10)

In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current ...

BestOf: an online implementation selector for the training and inference of deep neural networks

Barrachina Mir, Sergio; Castelló, Adrián; Dolz, Manuel F.; Tomás, Andrés E. Springer (2022-05-20)

Tuning and optimising the operations executed in deep learning frameworks is a fundamental task in accelerating the processing of deep neural networks (DNNs). However, this optimisation usually requires extensive manual ...

Efficient and portable Winograd convolutions for multi-core processors

Dolz, Manuel F.; Martínez, Héctor; Castelló, Adrián; Alonso-Jordá, Pedro; Quintana-Orti, Enrique S. Springer (2023-02-12)

We take a step forward towards developing high-performance codes for the convolution operator, based on the Winograd algorithm, that are easy to customise for general-purpose processor architectures. In our approach, ...

Exploring the interoperability of remote GPGPU virtualization using rCUDA and directive-based programming models

Castelló, Adrián; Pena, Antonio J.; Mayo, Rafael; Planas, Judit; Quintana-Orti, Enrique S.; Balaji, Pavan Springer (2016-06-21)

Directive-based programming models, such as OpenMP, OpenACC, and OmpSs, enable users to accelerate applications by using coprocessors with little effort. These devices offer significant computing power, but their use can ...

High performance and energy efficient inference for deep learning on multicore ARM processors using general optimization techniques and BLIS

Castelló, Adrián; Barrachina Mir, Sergio; Dolz, Manuel F.; Quintana-Orti, Enrique S.; San Juan, Pau; Tomás Domínguez, Andrés Enrique Elsevier (2022-03-22)

We evolve PyDTNN, a framework for distributed parallel training of Deep Neural Networks (DNNs), into an efficient inference tool for convolutional neural networks. Our optimization process on multicore ARM processors ...

Improving the user experience of the rCUDA remote GPU virtualization framework

Reaño, Carlos; Silla, Federico; Castelló, Adrián; Peña Monferrer, Antonio J.; Mayo, Rafael; Quintana-Orti, Enrique S. Wiley (2014-10)

Graphics processing units (GPUs) are being increasingly embraced by the high-performance computing community as an effective way to reduce execution time by accelerating parts of their applications. remote CUDA (rCUDA) was ...

On the adequacy of lightweight thread approaches for high-level parallel programming models

Castelló, Adrián; Mayo, Rafael; Sala, Kevin; Beltran Querol, Vicenç; Balaji, Pavan; Peña Monferrer, Antonio J. Elsevier (2018-07)

High-level parallel programming models (PMs) are becoming crucial in order to extract the computational power of current on-node multi-threaded parallelism. The most popular PMs, such as OpenMP or OmpSs, are directive-based: ...

Performance–energy trade-offs of deep learning convolution algorithms on ARM processors

Dolz, Manuel F.; Barrachina Mir, Sergio; Martínez, Héctor; Castelló, Adrián; Maciá, Antonio; Fabregat Llueca, German; Tomás, Andrés E. Springer (2023)

In this work, we assess the performance and energy efficiency of high-performance codes for the convolution operator, based on the direct, explicit/implicit lowering and Winograd algorithms used for deep learning (DL) ...

Programming parallel dense matrix factorizations with look-ahead and OpenMP

Catalán, Sandra; Castelló, Adrián; Igual, Francisco; Rodríguez Sánchez, Rafael; Quintana-Orti, Enrique S. Springer (2019)

We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multi-threaded ...

PyDTNN: A user-friendly and extensible framework for distributed deep learning

Barrachina Mir, Sergio; Castelló, Adrián; Catalán Carbó, Mar; Dolz, Manuel F.; Mestre Miravet, Jose Ignacio Springer (2021-02-22)

We introduce a framework for training deep neural networks on clusters of computers with the following appealing properties: (1) It is developed in Python, exposing an amiable interface that provides an accessible entry ...

Reformulating the direct convolution for high-performance deep learning inference on ARM processors

Barrachina Mir, Sergio; Castelló, Adrián; Dolz, Manuel F.; Low, Tze Meng; Martinez, Hector; Quintana-Orti, Enrique S.; Upasana, Sridhar; Tomás Domínguez, Andrés Enrique Elsevier (2022-12-20)

We present two high-performance implementations of the convolution operator via the direct algorithm that outperform the so-called lowering approach based on the im2col transform plus the gemm kernel on an ARMv8-based ...

Using machine learning to model the training scalability of convolutional neural networks on clusters of GPUs

Barrachina Mir, Sergio; Castelló, Adrián; Catalán Carbó, Mar; Dolz, Manuel F.; Mestre Miravet, Jose Ignacio Springer (2021-08-30)

In this work, we build a general piece-wise model to analyze data-parallel (DP) training costs of convolutional neural networks (CNNs) on clusters of GPUs. This general model is based on i) multi-layer perceptrons (MLPs) ...