Browse by subject "high performance"
Showing items 1-9 of 9
Automatic generation of ARM NEON micro‑kernels for matrix multiplication
Springer (2024-03-12) General matrix multiplication (gemm) is a fundamental kernel in scientific computing and current frameworks for deep learning. Modern realisations of gemm are mostly written in C, on top of a small, highly tuned micro-kernel ...
Convolution Operators for Deep Learning Inference on the Fujitsu A64FX Processor
IEEE (2022) The convolution operator is a crucial kernel for many computer vision and signal processing applications that rely on deep learning (DL) technologies. As such, the efficient implementation of this operator has received ...
Efficient and portable GEMM-based convolution operators for deep neural network training on multicore processors
Elsevier (2022-05-30) Convolutional Neural Networks (CNNs) play a crucial role in many image recognition and classification tasks, recommender systems, brain-computer interfaces, etc. As a consequence, there is a notable interest in developing ...
Efficient and portable Winograd convolutions for multi-core processors
Springer (2023-02-12) We take a step forward towards developing high-performance codes for the convolution operator, based on the Winograd algorithm, that are easy to customise for general-purpose processor architectures. In our approach, ...
Exploiting the capabilities of modern GPUs for dense matrix computations
John Wiley & Sons (2009) We present several algorithms to compute the solution of a linear system of equations on a graphics processor (GPU), as well as general techniques to improve their performance, such as padding and hybrid GPU-CPU computation. ...
Exploring the performance–power–energy balance of low-power multicore and manycore architectures for anomaly detection in remote sensing
Springer Verlag (2015) In this paper, we perform an experimental study of the interactions between execution time (i.e., performance), power, and energy that occur in modern low-power architectures when executing the RX algorithm for detecting ...
randUTV: A Blocked Randomized Algorithm for Computing a Rank-Revealing UTV Factorization
Association for Computing Machinery (2019-03) A randomized algorithm for computing a so-called UTV factorization efficiently is presented. Given a matrix A, the algorithm “randUTV” computes a factorization A = UTV*, where U and V have orthonormal columns, and T is triangular ...
Reformulating the direct convolution for high-performance deep learning inference on ARM processors
Elsevier (2022-12-20) We present two high-performance implementations of the convolution operator via the direct algorithm that outperform the so-called lowering approach based on the im2col transform plus the gemm kernel on an ARMv8-based ...
Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processors
Springer (2020-01-24) We present a novel method for the QR factorization of large tall-and-skinny matrices that introduces an approximation technique for computing the Householder vectors. This approach is very competitive on a hybrid platform ...