Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Other documents by the authors: Catalán, Sandra; Igual, Francisco; Mayo, Rafael; Rodríguez Sánchez, Rafael; Quintana-Orti, Enrique S.
Metadata
Title
Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Authors
Catalán, Sandra; Igual, Francisco; Mayo, Rafael; Rodríguez Sánchez, Rafael; Quintana-Orti, Enrique S.
Publication date
2016-09
Publisher
Springer US
ISSN
1386-7857; 1573-7543
Bibliographic citation
CATALÁN, Sandra, et al. Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Document type
info:eu-repo/semantics/article
Publisher's version
http://link.springer.com/article/10.1007/s10586-016-0611-8
Abstract
Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications.
In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric-static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, show that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to their architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency.
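As a rough illustration of what an asymmetric static schedule could look like, the sketch below splits a loop's iteration space between a big and a LITTLE cluster in proportion to an assumed per-core speed ratio, then divides each share evenly among the cores of that cluster. It is not taken from the paper or from BLIS: the function names, the core counts (4 + 4) and the ratio of 3.0 are hypothetical values chosen only for the example.

```python
def static_asymmetric_split(n_iters, n_big, n_little, big_to_little_ratio):
    """Split n_iters loop iterations between the big and LITTLE clusters.

    big_to_little_ratio is a hypothetical tuning knob: how many iterations
    one big core receives for each iteration given to one LITTLE core.
    """
    big_weight = n_big * big_to_little_ratio
    total_weight = big_weight + n_little
    big_share = round(n_iters * big_weight / total_weight)
    return big_share, n_iters - big_share


def per_core_ranges(start, count, n_cores):
    """Divide [start, start + count) into n_cores contiguous, near-equal ranges."""
    base, extra = divmod(count, n_cores)
    ranges = []
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# Illustrative values only: 1000 iterations, 4 big + 4 LITTLE cores, ratio 3.0.
big_iters, little_iters = static_asymmetric_split(1000, 4, 4, 3.0)
big_ranges = per_core_ranges(0, big_iters, 4)
little_ranges = per_core_ranges(big_iters, little_iters, 4)
```

A dynamic variant would instead let each core fetch the next chunk of iterations from a shared counter when it becomes idle, so the split adapts at run time rather than being fixed in advance.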
Published in
Cluster Computing, 2016, vol. 19, no. 3
Access rights
© Springer Science+Business Media New York 2016
http://rightsstatements.org/vocab/InC/1.0/
info:eu-repo/semantics/openAccess
Appears in collections
- ICC_Articles [417]