Automatic generation of ARM NEON micro‑kernels for matrix multiplication
Other documents by the authors: Alaejos, Guillermo; Martínez, Héctor; Castelló, Adrián; Dolz, Manuel F.; Igual, Francisco; Alonso-Jordá, Pedro; Quintana-Orti, Enrique S.
Metadata
Title
Automatic generation of ARM NEON micro‑kernels for matrix multiplication
Publication date
2024-03-12
Publisher
Springer
ISSN
0920-8542; 1573-0484
Bibliographic citation
Alaejos, G., Martínez, H., Castelló, A. et al. Automatic generation of ARM NEON micro-kernels for matrix multiplication. J Supercomput (2024). https://doi.org/10.1007/s11227-024-05955-8
Document type
info:eu-repo/semantics/article
Version
info:eu-repo/semantics/publishedVersion
Abstract
General matrix multiplication (gemm) is a fundamental kernel in scientific computing and in current frameworks for deep learning. Modern realisations of gemm are mostly written in C, on top of a small, highly tuned micro-kernel that is usually encoded in assembly. High-performance realisations of gemm in linear algebra libraries generally include a single micro-kernel per architecture, usually implemented by an expert. In this paper, we explore a couple of paths to automatically generate gemm micro-kernels, either using C++ templates with vector intrinsics or high-level Python scripts that directly produce assembly code. Both solutions can integrate high-performance software techniques, such as loop unrolling and software pipelining, accommodate any data type, and easily generate micro-kernels of any requested dimension. The performance of this solution is tested on three ARM-based cores and compared with state-of-the-art libraries for these processors: BLIS, OpenBLAS and ArmPL. The experimental results show that the auto-generation approach is highly competitive, m [...]
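
To make the idea of generating a micro-kernel of any requested dimension concrete, here is a minimal sketch. It is not the authors' generator: it emits plain Python source rather than ARM NEON assembly or C++ intrinsics, and it shows only loop unrolling, not software pipelining. All names (generate_microkernel_source, build_microkernel, the packed layouts of a and b) are assumptions made for illustration.

```python
def generate_microkernel_source(mr, nr, unroll=1):
    """Return Python source for an mr x nr micro-kernel (hypothetical layout).

    Assumed layouts: a is a packed mr x kc panel stored column by column,
    b is a packed kc x nr panel stored row by row, c is a row-major mr x nr tile.
    """
    cells = [(i, j) for i in range(mr) for j in range(nr)]
    lines = [f"def microkernel_{mr}x{nr}(kc, a, b, c):"]
    # Load the C tile into one named accumulator per entry ("registers").
    for i, j in cells:
        lines.append(f"    c{i}_{j} = c[{i * nr + j}]")
    lines.append("    k = 0")
    # Main loop: 'unroll' rank-1 updates per iteration, written out explicitly.
    lines.append(f"    while k + {unroll} <= kc:")
    for u in range(unroll):
        for i, j in cells:
            lines.append(
                f"        c{i}_{j} += a[(k + {u}) * {mr} + {i}]"
                f" * b[(k + {u}) * {nr} + {j}]"
            )
    lines.append(f"        k += {unroll}")
    # Remainder loop for the case where kc is not a multiple of 'unroll'.
    lines.append("    while k < kc:")
    for i, j in cells:
        lines.append(f"        c{i}_{j} += a[k * {mr} + {i}] * b[k * {nr} + {j}]")
    lines.append("        k += 1")
    # Store the accumulators back into the C tile.
    for i, j in cells:
        lines.append(f"    c[{i * nr + j}] = c{i}_{j}")
    return "\n".join(lines) + "\n"


def build_microkernel(mr, nr, unroll=1):
    """Generate the source and compile it into a callable."""
    namespace = {}
    exec(generate_microkernel_source(mr, nr, unroll), namespace)
    return namespace[f"microkernel_{mr}x{nr}"]
```

Calling build_microkernel(4, 4, unroll=2) would return a function that updates a 4 x 4 tile of C in place. Changing mr, nr or unroll regenerates the kernel without hand-editing any code, which is the property the abstract attributes to both generation paths.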
Funding entity
CRUE-CSIC agreement with Springer Nature | European Commission, European Union | Junta de Andalucía | Agencia Estatal de Investigación | Generalitat Valenciana
Project or grant code
95555 | POSTDOC_21_00025 | FJC2019-039222 | PID2020-113656R | PID2021-12657NB-I00 | CIDEXG/2022/013 | PROMETEO 2023-CIPROM/2022/20
Access rights
© The Author(s) 2024
info:eu-repo/semantics/openAccess
Appears in collections
- ICC_Articles [418]