Leveraging task-parallelism in message-passing dense matrix factorizations using SMPSs
(Elsevier, 2014)
In this paper, we investigate how to exploit task-parallelism during the execution of the
Cholesky factorization on clusters of multicore processors with the SMPSs programming
model. Our analysis reveals that the major ...
Analytical Modeling is Enough for High Performance BLIS
(ACM, 2016-09)
We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine tuning ...
Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance
(ACM Digital Library, 2014-04)
We show how both the tridiagonal and bidiagonal QR algorithms can be restructured so that they become
rich in operations that can achieve near-peak performance on a modern processor. The key is a
novel, cache-friendly ...
Time and energy modeling of a high-performance multi-threaded Cholesky factorization
(Springer, 2016-02-05)
We present accurate time and energy piece-wise models of high-performance multi-threaded implementations for the general matrix multiplication, triangular system solve with multiple right-hand sides, and symmetric rank-k ...
Revisiting conventional task schedulers to exploit asymmetry in multi-core architectures for dense linear algebra operations
(Elsevier, 2017)
Dealing with asymmetry in the architecture opens a plethora of questions related to
the performance- and energy-efficient scheduling of task-parallel applications. While there
exist early attempts to tackle this problem, ...
Time and energy modeling of high-performance Level-3 BLAS on x86 architectures
(Elsevier, 2015-06)
We present accurate piece-wise models for the time and energy costs of high performance implementations of both the matrix multiplication (gemm) and the triangular system solve with multiple right-hand sides (trsm) on x86 ...
Programming matrix algorithms-by-blocks for thread-level parallelism
(Association for Computing Machinery, 2009-07)
With the emergence of thread-level parallelism as the primary means for continued improvement of performance, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that ...
The libflame library for dense matrix computations
(IEEE Computer Society, 2009-11)
Researchers from the Formal Linear Algebra Method Environment (Flame) project have developed new methodologies for analyzing, designing, and implementing linear algebra libraries. These solutions, which have culminated in ...
Attaining High Performance in General-Purpose Computations on Current Graphics Processors
(Departament d'Enginyeria i Ciència dels Computadors, Universitat Jaume I, 2008-01)
The increase in performance of the last generations of graphics processors
(GPUs) has made this class of hardware a coprocessing platform of remarkable
success in certain types of operations. In this paper we evaluate ...
Evaluation and Tuning of the Level 3 CUBLAS for Graphics Processors
(Departament d'Enginyeria i Ciència dels Computadors, Universitat Jaume I, 2008-01)
The increase in performance of the last generations of graphics processors (GPUs) has made this class of platform a
coprocessing tool with remarkable success in certain types of operations. In this paper we evaluate the ...