    • closedAccess   A Runtime System for Programming Out-of-Core Matrix Algorithms-by-Tiles on Multithreaded Architectures 

      Quintana-Ortí, Gregorio; Igual, Francisco D.; Marqués-Andrés, Mercedes; Quintana-Orti, Enrique S.; Van de Geijn, Robert A. ACM (2012-08)
      Out-of-core implementations of algorithms for dense matrix computations have traditionally focused on optimal use of memory so as to minimize I/O, often trading programmability for performance. In this article we show how ...
    • openAccess   An Algorithm-by-Blocks for SuperMatrix Band Cholesky Factorization 

      Quintana-Ortí, Gregorio; Quintana-Orti, Enrique S.; Remón Gómez, Alfredo; Van de Geijn, Robert A. Springer Verlag (2008)
      We pursue the scalable parallel implementation of the factorization of band matrices with medium to large bandwidth targeting SMP and multi-core architectures. Our approach decomposes the computation into a large ...
    • closedAccess   Families of Algorithms for Reducing a Matrix to Condensed Form 

      Van Zee, Field G.; Van de Geijn, Robert A.; Quintana-Ortí, Gregorio; Elizondo, G. Joseph ACM (2012-11)
      In a recent paper it was shown how memory traffic can be diminished by reformulating the classic algorithm for reducing a matrix to bidiagonal form, a preprocess when computing the singular values of a dense matrix. The ...
    • openAccess   Programming matrix algorithms-by-blocks for thread-level parallelism 

      Quintana-Ortí, Gregorio; Quintana-Orti, Enrique S.; Van de Geijn, Robert A.; Van Zee, Field G.; Chan, Ernie Association for Computing Machinery (2009-07)
      With the emergence of thread-level parallelism as the primary means for continued improvement of performance, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that ...
    • openAccess   Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance 

      Van Zee, Field G.; Van de Geijn, Robert A.; Quintana-Ortí, Gregorio ACM Digital Library (2014-04)
      We show how both the tridiagonal and bidiagonal QR algorithms can be restructured so that they become rich in operations that can achieve near-peak performance on a modern processor. The key is a novel, cache-friendly ...