Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Authors: Catalán, Sandra; Igual, Francisco D.; Mayo, Rafael; Rodríguez Sánchez, Rafael; Quintana-Ortí, Enrique S.
Title
Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Publication date
2016-09
Publisher
Springer US
ISSN
1386-7857; 1573-7543
Bibliographic citation
CATALÁN, Sandra, et al. Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Document type
info:eu-repo/semantics/article
Publisher's version
http://link.springer.com/article/10.1007/s10586-016-0611-8
Abstract
Asymmetric multicore processors (AMPs) have recently emerged as an appealing
technology for severely energy-constrained environments, especially
in mobile appliances where heterogeneity in applications is mainstream. In
addition, given the growing interest for low-power high performance computing,
this type of architectures is also being investigated as a means to
improve the throughput-per-Watt of complex scientific applications.
In this paper, we design and embed several architecture-aware optimizations
into a multi-threaded general matrix multiplication (gemm), a key
operation of the BLAS, in order to obtain a high performance implementation
for ARM big.LITTLE AMPs. Our solution is based on the reference
implementation of gemm in the BLIS library, and integrates a cache-aware
configuration as well as asymmetric-static and dynamic scheduling strategies
that carefully tune and distribute the operation's micro-kernels among
the big and LITTLE cores of the target processor. The experimental results
on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and
Cortex-A7 clusters that implements the big.LITTLE model, expose that our
cache-aware versions of gemm with asymmetric scheduling attain important
gains in performance with respect to its architecture-oblivious counterparts
while exploiting all the resources of the AMP to deliver considerable energy
efficiency.
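
The idea of a static asymmetric schedule mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the BLIS API: the function name, the `big_speedup` ratio, and the 4+4 core counts in the usage lines are assumptions made for illustration only. The sketch splits the iterations of one gemm loop between the two clusters in proportion to their assumed aggregate throughput.

```python
def asymmetric_static_split(n_iters, big_cores, little_cores, big_speedup):
    """Split n_iters loop iterations between a big and a LITTLE cluster.

    big_speedup is a hypothetical ratio (not taken from the paper): how many
    iterations one big core finishes while one LITTLE core finishes one.
    Returns (iterations for the big cluster, iterations for the LITTLE cluster).
    """
    if n_iters < 0 or big_cores < 0 or little_cores < 0 or big_speedup <= 0:
        raise ValueError("invalid argument")

    # Aggregate throughput of each cluster, in LITTLE-core units.
    big_weight = big_cores * big_speedup
    total_weight = big_weight + little_cores
    if total_weight == 0:
        raise ValueError("at least one core is required")

    # The big cluster gets a share proportional to its weight;
    # the LITTLE cluster gets whatever remains, so nothing is lost to rounding.
    big_share = round(n_iters * big_weight / total_weight)
    return big_share, n_iters - big_share


# Assumed setup: 4 big + 4 LITTLE cores, big cores 3x faster (illustrative ratio).
big_part, little_part = asymmetric_static_split(1000, 4, 4, 3.0)
print(big_part, little_part)
```

A dynamic schedule would instead let each cluster fetch the next chunk of iterations when it becomes idle, which avoids having to fix the ratio in advance.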
Published in
Cluster Computing, 2016, vol. 19, no. 3
Access rights
© Springer Science+Business Media New York 2016
http://rightsstatements.org/vocab/InC/1.0/
info:eu-repo/semantics/openAccess
Appears in collections
- ICC_Articles [413]