Automatic generation of ARM NEON micro‑kernels for matrix multiplication

Alaejos, Guillermo; Martínez, Héctor; Castelló, Adrián; Dolz, Manuel F.; Igual, Francisco; Alonso-Jordá, Pedro; Quintana-Orti, Enrique S.

dc.contributor.author	Alaejos, Guillermo
dc.contributor.author	Martínez, Héctor
dc.contributor.author	Castelló, Adrián
dc.contributor.author	Dolz, Manuel F.
dc.contributor.author	Igual, Francisco
dc.contributor.author	Alonso-Jordá, Pedro
dc.contributor.author	Quintana-Orti, Enrique S.
dc.date.accessioned	2024-05-15T07:45:24Z
dc.date.available	2024-05-15T07:45:24Z
dc.date.issued	2024-03-12
dc.identifier.citation	Alaejos, G., Martínez, H., Castelló, A. et al. Automatic generation of ARM NEON micro-kernels for matrix multiplication. J Supercomput (2024). https://doi.org/10.1007/s11227-024-05955-8	ca_CA
dc.identifier.issn	0920-8542
dc.identifier.issn	1573-0484
dc.identifier.uri	http://hdl.handle.net/10234/207342
dc.description.abstract	General matrix multiplication (gemm) is a fundamental kernel in scientific computing and current frameworks for deep learning. Modern realisations of gemm are mostly written in C, on top of a small, highly tuned micro-kernel that is usually encoded in assembly. High-performance realisations of gemm in linear algebra libraries generally include a single micro-kernel per architecture, usually implemented by an expert. In this paper, we explore two paths to automatically generate gemm micro-kernels, using either C++ templates with vector intrinsics or high-level Python scripts that directly produce assembly code. Both solutions can integrate high-performance software techniques, such as loop unrolling and software pipelining, accommodate any data type, and easily generate micro-kernels of any requested dimension. The performance of these solutions is tested on three ARM-based cores and compared with state-of-the-art libraries for these processors: BLIS, OpenBLAS and ArmPL. The experimental results show that the auto-generation approach is highly competitive, m	ca_CA
dc.description.sponsorShip	Funding for open access charge: CRUE-Universitat Jaume I
dc.format.extent	27 p.	ca_CA
dc.format.mimetype	application/pdf	ca_CA
dc.language.iso	eng	ca_CA
dc.publisher	Springer	ca_CA
dc.rights	© The Author(s) 2024	ca_CA
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	ca_CA
dc.subject	matrix multiplication	ca_CA
dc.subject	ARM NEON	ca_CA
dc.subject	SIMD arithmetic units	ca_CA
dc.subject	high performance	ca_CA
dc.title	Automatic generation of ARM NEON micro‑kernels for matrix multiplication	ca_CA
dc.type	info:eu-repo/semantics/article	ca_CA
dc.identifier.doi	https://doi.org/10.1007/s11227-024-05955-8
dc.rights.accessRights	info:eu-repo/semantics/openAccess	ca_CA
dc.type.version	info:eu-repo/semantics/publishedVersion	ca_CA
project.funder.name	CRUE-CSIC agreement with Springer Nature	ca_CA
project.funder.name	European Commission, European Union	ca_CA
project.funder.name	Junta de Andalucía	ca_CA
project.funder.name	Agencia Estatal de Investigación	ca_CA
project.funder.name	Generalitat Valenciana	ca_CA
oaire.awardNumber	95555	ca_CA
oaire.awardNumber	POSTDOC_21_00025	ca_CA
oaire.awardNumber	FJC2019-039222	ca_CA
oaire.awardNumber	PID2020-113656R	ca_CA
oaire.awardNumber	PID2021-12657NB-I00	ca_CA
oaire.awardNumber	CIDEXG/2022/013	ca_CA
oaire.awardNumber	PROMETEO 2023-CIPROM/2022/20	ca_CA
dc.subject.ods	9. Industry, innovation and infrastructure	ca_CA

Files in this item

Name: Alaejos_2024_automatic.pdf
Size: 3.016Mb
Format: PDF


This item appears in the following collection(s)

ICC_Articles [427]
