Automatic generation of ARM NEON micro‑kernels for matrix multiplication
Metadata
Title
Automatic generation of ARM NEON micro‑kernels for matrix multiplication
Author(s)
Alaejos, G.; Martínez, H.; Castelló, A.; et al.
Date
2024-03-12
Publisher
Springer
ISSN
0920-8542; 1573-0484
Bibliographic citation
Alaejos, G., Martínez, H., Castelló, A. et al. Automatic generation of ARM NEON micro-kernels for matrix multiplication. J Supercomput (2024). https://doi.org/10.1007/s11227-024-05955-8
Type
info:eu-repo/semantics/article
Version
info:eu-repo/semantics/publishedVersion
Abstract
General matrix multiplication (gemm) is a fundamental kernel in scientific computing and in current frameworks for deep learning. Modern realisations of gemm are mostly written in C, on top of a small, highly tuned micro-kernel that is usually encoded in assembly. High-performance realisations of gemm in linear algebra libraries generally include a single micro-kernel per architecture, usually implemented by an expert. In this paper, we explore two paths to automatically generate gemm micro-kernels: C++ templates with vector intrinsics, or high-level Python scripts that directly produce assembly code. Both solutions can integrate high-performance software techniques, such as loop unrolling and software pipelining, accommodate any data type, and easily generate micro-kernels of any requested dimension. The performance of this solution is tested on three ARM-based cores and compared with state-of-the-art libraries for these processors: BLIS, OpenBLAS and ArmPL. The experimental results show that the auto-generation approach is highly competitive, m ...
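To give a rough idea of what script-driven micro-kernel generation can look like, the following is a minimal sketch and not the authors' generator. The abstract describes Python scripts that emit assembly directly; for brevity this sketch emits C source with ARM NEON intrinsics instead. The function name `generate_microkernel`, the emitted kernel name `ukernel_MRxNR`, the float32 data type, the packed layouts of A and B, and the column-major layout of C are all assumptions made for illustration. The loop over the tile is fully unrolled at generation time, so any tile size that is a multiple of 4 in both dimensions can be requested.

```python
def generate_microkernel(mr: int, nr: int) -> str:
    """Return C source for an mr x nr float32 micro-kernel (hypothetical layout).

    Assumed conventions (not taken from the paper):
      - A is packed so that each k step holds mr contiguous floats,
      - B is packed so that each k step holds nr contiguous floats,
      - C is column-major with leading dimension ldc.
    """
    if mr <= 0 or nr <= 0 or mr % 4 or nr % 4:
        raise ValueError("mr and nr must be positive multiples of 4")

    va, vb = mr // 4, nr // 4  # 128-bit vectors per column of A / row of B
    out = [
        "#include <arm_neon.h>",
        "",
        f"// C[{mr}x{nr}] += A[{mr}xkc] * B[kcx{nr}]; A and B packed, C column-major.",
        f"void ukernel_{mr}x{nr}(int kc, const float *A, const float *B,",
        "                   float *C, int ldc) {",
    ]

    # Load the C tile into vector registers, one column at a time.
    for j in range(nr):
        for i in range(va):
            out.append(f"  float32x4_t c{i}_{j} = vld1q_f32(C + {j} * ldc + {4 * i});")

    # One rank-1 update per iteration of the k loop, unrolled over the tile.
    out.append("  for (int k = 0; k < kc; ++k) {")
    for i in range(va):
        out.append(f"    float32x4_t a{i} = vld1q_f32(A + {4 * i});")
    for j in range(vb):
        out.append(f"    float32x4_t b{j} = vld1q_f32(B + {4 * j});")
    for j in range(nr):
        for i in range(va):
            # c += a * b[lane]: fused multiply-add with a broadcast lane of B.
            out.append(
                f"    c{i}_{j} = vfmaq_laneq_f32(c{i}_{j}, a{i}, b{j // 4}, {j % 4});"
            )
    out.append(f"    A += {mr};")
    out.append(f"    B += {nr};")
    out.append("  }")

    # Write the updated tile back to C.
    for j in range(nr):
        for i in range(va):
            out.append(f"  vst1q_f32(C + {j} * ldc + {4 * i}, c{i}_{j});")
    out.append("}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(generate_microkernel(4, 4))
```

Calling `generate_microkernel(8, 12)` returns the text of one kernel with 24 accumulator vectors. Whether a given tile shape fits the register file and performs well on a particular core is exactly the kind of question a generator makes cheap to explore, since changing the shape only means calling the script with different arguments.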
Funder Name
CRUE-CSIC agreement with Springer Nature | European Commission, European Union | Junta de Andalucía | Agencia Estatal de Investigación | Generalitat Valenciana
Project code
95555 | POSTDOC_21_00025 | FJC2019-039222 | PID2020-113656R | PID2021-12657NB-I00 | CIDEXG/2022/013 | PROMETEO 2023-CIPROM/2022/20
Rights
© The Author(s) 2024
info:eu-repo/semantics/openAccess
This item appears in the following collection(s)
- ICC_Articles [427]