Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Metadata
Title
Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Author (s)
Date
2016-09
Publisher
Springer US
ISSN
1386-7857; 1573-7543
Bibliographic citation
CATALÁN, Sandra, et al. Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Type
info:eu-repo/semantics/article
Publisher version
http://link.springer.com/article/10.1007/s10586-016-0611-8
Subject
Abstract
Asymmetric multicore processors (AMPs) have recently emerged as an appealing
technology for severely energy-constrained environments, especially
in mobile appliances where heterogeneity in applications is mainstream. In
addition, given the growing interest in low-power high-performance computing,
this type of architecture is also being investigated as a means to
improve the throughput-per-Watt of complex scientific applications.
In this paper, we design and embed several architecture-aware optimizations
into a multi-threaded general matrix multiplication (gemm), a key
operation of the BLAS, in order to obtain a high-performance implementation
for ARM big.LITTLE AMPs. Our solution is based on the reference
implementation of gemm in the BLIS library, and integrates a cache-aware
configuration as well as asymmetric-static and dynamic scheduling strategies
that carefully tune and distribute the operation's micro-kernels among
the big and LITTLE cores of the target processor. The experimental results
on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and
Cortex-A7 clusters that implements the big.LITTLE model, show that our
cache-aware versions of gemm with asymmetric scheduling attain important
gains in performance with respect to their architecture-oblivious counterparts
while exploiting all the resources of the AMP to deliver considerable energy
efficiency.
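As a rough illustration of the two scheduling ideas named in the abstract, the sketch below splits the iterations of one gemm loop between a fast and a slow cluster. It is not the paper's BLIS implementation: the `Cluster` type, the functions `static_split` and `dynamic_schedule`, and the 3:1 per-core speed ratio are all hypothetical and chosen only for illustration. The static variant fixes the split up front in proportion to assumed throughput; the dynamic variant is a small simulation in which whichever cluster becomes idle first takes the next chunk.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    cores: int
    speed: float  # ASSUMED relative per-core throughput, not a measured value


def static_split(n, clusters, block):
    """Partition iterations [0, n) among clusters in proportion to
    cores * speed. Each internal boundary is rounded down to a multiple
    of `block`; the last cluster absorbs the remainder."""
    total = sum(c.cores * c.speed for c in clusters)
    ranges, start, acc = {}, 0, 0.0
    for i, c in enumerate(clusters):
        acc += c.cores * c.speed
        if i == len(clusters) - 1:
            end = n
        else:
            end = min(n, int(n * acc / total) // block * block)
        ranges[c.name] = (start, end)
        start = end
    return ranges


def dynamic_schedule(n, clusters, chunk):
    """Simulate dynamic scheduling: the cluster that becomes idle first
    takes the next chunk. Returns the iteration count per cluster."""
    done = {c.name: 0 for c in clusters}
    # Heap entries: (time at which the cluster is idle, cluster index).
    idle = [(0.0, i) for i in range(len(clusters))]
    heapq.heapify(idle)
    next_iter = 0
    while next_iter < n:
        t, i = heapq.heappop(idle)
        c = clusters[i]
        size = min(chunk, n - next_iter)
        done[c.name] += size
        next_iter += size
        # Simulated processing time of this chunk on the whole cluster.
        heapq.heappush(idle, (t + size / (c.cores * c.speed), i))
    return done


if __name__ == "__main__":
    # Four big and four LITTLE cores, as on the Exynos 5422; the 3:1
    # speed ratio is a placeholder assumption.
    clusters = [Cluster("big", 4, 3.0), Cluster("LITTLE", 4, 1.0)]
    print(static_split(4000, clusters, block=8))
    print(dynamic_schedule(4800, clusters, chunk=120))
```

The static split needs a good estimate of the speed ratio but has no runtime coordination cost, while the dynamic variant adapts to the actual pace of each cluster at the price of a shared work counter.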
Is part of
Cluster Computing, 2016, vol. 19, no. 3
Rights
© Springer Science+Business Media New York 2016
http://rightsstatements.org/vocab/InC/1.0/
info:eu-repo/semantics/openAccess
This item appears in the following collection(s)
- ICC_Articles [425]