Householder QR Factorization with Randomization for Column Pivoting (HQRRP). FLAME Working Note #78

A fundamental problem when adding column pivoting to the Householder QR factorization is that only about half of the computation can be cast in terms of high performing matrix-matrix multiplications, which greatly limits the benefits that can be derived from so-called blocking of algorithms. This paper describes a technique for selecting groups of pivot vectors by means of randomized projections. It is demonstrated that the asymptotic flop count for the proposed method is $2mn^2 - (2/3)n^3$ for an $m\times n$ matrix, identical to that of the best classical unblocked Householder QR factorization algorithm (with or without pivoting). Experiments demonstrate acceleration in speed of close to an order of magnitude relative to the {\sc geqp3} function in LAPACK, when executed on a modern CPU with multiple cores. Further, experiments demonstrate that the quality of the randomized pivot selection strategy is roughly the same as that of classical column pivoting. The described algorithm is made available under Open Source license and can be used with LAPACK or libflame.

1. Introduction. The QR factorization is a staple of linear algebra, with applications ranging from Linear Least-Squares solution of overdetermined systems to the identification of low rank approximation via the determination of an approximate orthonormal basis for the column space. Standard algorithms for computing the QR factorization include Gram-Schmidt orthogonalization and those based on Householder transformations (reflectors). When it is desirable for the QR factorization to also reveal the approximate rank of the original matrix, it is important that the elements of the diagonal of R be ordered with larger elements in magnitude appearing earlier. In this case, column pivoting (swapping) is employed during the QR factorization, yielding QR factorization with column pivoting (QRP). It is well-known that the Householder QR factorization (HQR) yields columns of Q that are orthogonal to a high degree of precision, making these algorithms the weapon of choice in many situations. Pivoting can be added to HQR to yield HQR with column pivoting (HQRP). This topic is covered by standard texts on numerical linear algebra [13].
To achieve high performance for dense linear algebra algorithms, so-called blocked algorithms are employed that cast most computation in terms of matrix-matrix operations supported by the widely used level-3 Basic Linear Algebra Subprograms (BLAS) [7,8] because such operations can be implemented to achieve very high performance on modern processors via a combination of careful reuse of data in the caches and low level implementation in terms of assembly code or intrinsic vector operations. Widely used current implementations of the level-3 BLAS are based on techniques exposed by Goto [15,14] and available in open source libraries including the OpenBLAS [39] (a fork of the GotoBLAS) and BLIS [36], as well as vendor implementions including AMD's ACML [2], Intel's MKL [20], and IBM's ESSL [19] libraries.
The fundamental problem with the classical approach to HQRP is that only half of the computation can be cast in terms of gemm, as described in the paper [31] that underlies LAPACK's geqp3 routine [3]. This means that blocking can only improve performance by, at best, a factor two, which is inherent from the fact that it must be known how remaining columns will be updated in order to compute the 2-norms of remaining columns. Bischof and Quintana-Ortí describe in a pair of papers [5,4] an attempt to overcome this problem by using so called "window pivoting" in combination with HQR. While much faster than geqp3, this approach is more complicated than the method proposed in this paper and never made it into LAPACK.
The present paper proposes to solve the problem by means of randomized projections. To describe the idea, suppose that we seek to determine a set of b good pivot columns in an m × n matrix A. We then draw a Gaussian random matrix G of size b × m and form a b × n sampling matrix Y = GA. Once Y is available, we execute QRP on this matrix to find the b pivot columns. This computation is efficient since Y is small compared to A (it has only b rows), and results in good pivot choices since the random projection produces a matrix Y that has approximately the same linear dependencies between its columns as does A. * With this observation, it becomes easy to block the Householder QR factorization with column pivoting. At each iteration of the blocked algorithm, we use the randomized sampling approach to identify a set of b columns that are then moved to the front of the actual matrix, at which point a regular step of HQR can be used to move the computation forward, optionally with additional column pivoting only within a narrow panel of the matrix. Importantly, the sampling matrix can be cheaply downdated rather than recomputed at each step, allowing the performance of the proposed algorithm to asymptotically approach that of a standard blocked HQR implementation that does not pivot columns. Fig. 1.1 illustrates the dramatic performance improvements that are realized.
The idea to use randomized sampling to pick blocks of pivot vectors was first published by Martinsson on ArXiv in May 2015 [28]. A very similar technique was published by Duersch and Gu in September 2015 [10], also on ArXiv. The observation that downdating of the sampling matrix enables the randomized scheme to attain the same asymptotic flop count as classical HQRP was discovered independently by the two groups and was first published in [10]. More broadly, the idea that one can select a subset of columns of a matrix that forms a good approximate basis for the column space of the matrix by performing QRP on a small matrix whose rows are random linear combinations of the rows of the original matrix was first described in [26,Sec. 4.1] and later elaborated in [23,18,27]. This problem is closely related to the problem of finding a set of columns of maximal spanning volume [16], and to the problem of finding so called CUR and interpolative decompositions [37]. These ideas tie in to a larger literature on randomized techniques for computing low-rank approximations of matrices that includes [12,9,24,25]. This paper describes a practical implementation of the proposed method that can be incorporated in libraries like LAPACK and libflame [34,35]. Implementation details that are important for attaining high practical performance are described to enable readers to reproduce and extend the ideas. The paper provides a cost analysis that shows that asymptotically the number of floating point operations approaches that of HQR without pivoting while most computation is cast in terms of matrixmatrix multiplication like the corresponding blocked HQR without pivoting. It reports unprecedented performance for pivoted QR factorization on current architectures and provides empirical quality results. Importantly, the implementation is made available for use by the computational science community under an Open Source license. The conclusion discusses how these results pave the way for future opportunities.
The paper is organized as follows: Section 2 lists some standard facts about Householder reflectors and pivoted QR factorizations that we need in the presentation. Section 3 describes how the classical Householder QR factorization algorithm can be blocked by using randomization in the pivot selection step. Section 4 reports the results from numerical experiments investigating the speed of the algorithm and the quality of the pivoting selection strategy. Sections 5 summarizes the key results and discusses future work. Section 6 describes publicly available software that implements the techniques presented.
2. Householder QR Factorization. In this section, we briefly review the state-of-the art regarding Householder factorization based on Householder transformations (HQR). Throughout, we use the FLAME notation for representing dense linear algebra algorithms [30,17]. In particular, for any matrix X, we let m(X) and n(X) denote the number of rows and columns of X, respectively.

Householder transformations (reflectors).
A standard topic in numerical linear algebra is the concept of a reflector, also known as a Householder transformation [13]. The review in this subsection follows [21] in which a similar notation is also employed.
Given a nonzero vector u ∈ C n , the matrix H(u) = I − 1 τ uu H with τ = u H u 2 has the property that it reflects a vector to which it is applied with respect to the subspace orthogonal to u. Given a vector x, the vector u and scalar τ can be chosen so that H(u)x equals a multiple of e 0 , the first column of the identiy matrix. Furthermore, u can be normalized so that its first element equals one.
In our discussions, given a vector x = χ 1 x 2 , the function ρ u 2 , τ := Housev χ 1 x 2 computes the vector u = 1 u 2 and τ = u H u 2 so that H(u)x = ρe 0 , 2.2. Unblocked Householder QR factorization. A standard unblocked algorithm for HQR of a given matrix A ∈ C m×n , typeset using the FLAME notation, is given in Fig. 2.1 (left). The body of the loop computes which overwrites a 11 with ρ 11 and a 21 with u 21 , after which the remainder of A is updated by Upon completion, the (Householder) vectors that define the Householder transformations have overwritten the elements in which they introduced zeroes, and the upper triangular part of A contains R. How the matrix T fits into the picture will become clear next.
2.3. The UT transform: Accumulating Householder transformations. Given A ∈ C n×b , let U contain the Householder vectors computed during the HQR of A. Let us assume that H(u b−1 ) · · · H(u 1 )H(u 0 )A = R. Then there exists an upper triangular matrix so that I − U T −H U H = H(u b−1 ) · · · H(u 1 )H(u 0 ). The desired matrix T equals the strictly upper triangular part of U H U with the diagonal elements equal to τ 0 , . . . , τ b−1 . The matrix T can be computed during the unblocked HQR, as indicated in Fig. 2.1 (left). In [21], the transformation I − U T −1 U H that equals the accumulated Householder transformations is called the UT transform. The UT transform is conceptually related to the more familiar WY transform [6] and compact WY transform [32], see [21] for details on how the different representations relate to one another.

A blocked QR Householder factorization algorithm.
A blocked algorithm for HQR that exploits the insights that resulted in the UT transform can now be described as follows. Partition We can use the unblocked algorithm in Fig. 2  (rank-b update). Such matrix-matrix multiplications can attain high performance by amortizing data movement between memory layers.

Continue with
These insights form the basis for the LAPACK routine geqrf (except that it uses a compact WY transform instead of the UT transform).
2.5. Householder QR factorization with column pivoting. An unblocked (rank-revealing) Householder QR factorization with column pivoting (HQRP) swaps the column of A BR with largest 2-norm with the first column of that matrix at the top of the loop body. As a result, the diagonal elements of matrix R are ordered from largest to smallest in magnitude, which, for example, allows the resulting QR factorization to be used to identify a high quality approximate low-rank orthonormal basis for the column space of A. (To be precise, column pivoted QR returns a high quality basis in most cases but may produce strongly sub-optimal results in rare situations. For details, see [22,16], and the description of "Matrix 4" in Section 4.2.) The fundamental problem with the best known algorithm for HQRP, which underlies LAPACK's routine geqp3, is that it only casts half of the computation in terms of matrix-matrix multiplication [31]. The unblocked algorithm called from the blocked algorithm operates on the entire "remaining matrix" (A BR in the blocked algorithm), computes b more Householder transforms and b more rows of R, computes the matrix W 2 , and returns the information about how columns were swapped. In the blocked algorithm itself, only the update A 22 − A 21 W 2 remains to be performed. When only half the computation can be cast in terms of matrix-matrix multiplication, the resulting blocked algorithm is only about twice as fast as the unblocked algorithm.
3. Randomization to the Rescue. This section describes a computationally efficient technique for picking a selection of b columns from a given n × n matrix A that form good choices for the first b pivots in a blocked HQRP algorithm. Observe that this task is closely related to the problem of finding an index set s of length b such that the columns in A(:, s) form a good approximate basis for the column space of A. Another way of expressing this problem is that we are looking for a collection of b columns whose spanning volume in C n is close to maximal. To find the absolutely optimal choice here is a hard problem [16], but luckily, for pivoting purposes it is sufficient to find a choice that is "good enough." 3.1. Randomized pivot selection. The strategy that we propose is simple. The idea is to perform classical QR factorization with column pivoting (QRP) on a matrix Y that is much smaller than A, so that performing QRP with that matrix constitutes a lower order cost. As a bonus, it may fit in fast cache memory. This smaller matrix can be constructed by forming random linear combinations of the rows of A as follows: 1. Fix an over-sampling parameter p. Setting p = 5 or p = 10 are good choices. 2. Form a random matrix G of size (b + p) × n whose entries are drawn independently from a normalized Gaussian distribution. 3. Form the (b + p) × n sampling matrix Y = GA. The sampling matrix Y has as many columns as A, but many fewer rows. Now execute b steps of a column pivoted QR factorization to determine an integer vector with b elements that capture how columns need to be pivoted: In other words, the columns Y (:, s) are good pivot columns for Y . Our claim is that due to the way Y is constructed, the columns A(:, s) are then also good choices for pivot columns of A. This claim is supported by extensive numerical experiments, some of which are given in Section 4.2. There is theory supporting the claim that these b columns form a good approximate basis for the column space of A, see, e.g. [18,Sec. 5.2] and [23,37], but it has not been directly proven that they form good choices as pivots in a QR factorization. This should not be surprising given that it is known that even classical column pivoting can result in poor choices [22]. Known algorithms that are provably good are all far more complex to implement [16]. Notice that there are many choices of algorithms that can be employed to determine the pivots. For example, since high numerical accuracy is not necessary, the classical Modified Gram-Schmidt (MGS) algorithm with column pivoting is a simple yet effective choice.

Continue with
The randomized strategy described here for determining a block of pivot vectors is inspired by a technique published in [26,Sec. 4.1] for computing a low-rank approximation to a matrix, and later elaborated in [23,18,27].
Remark 1 (Choice of over-sampling parameter p). The reliability of the procedures described in this section depends on the choice of the over-sampling parameter p. It is well understood how large p needs to be in order to determine a high-quality approximate basis for the column space of A with extremely high reliability: the choice p = 5 is very good, p = 10 is excellent, and p = b is almost always over-kill [18]. The pivot selection problem is less well studied, but is more forgiving. (The choice of pivots does not necessarily have to be particularly optimal.) Numerical experiments indicate that even setting p = 0 typically results in good choices. However, the choices p = 5 or p = 10 appear to be good generic values that have resulted in excellent choices in every experiment we have run.
Remark 2 (Intuition of random projections). To understand why the pivot columns selected by processing the small matrix Y also form good choices for the original matrix A, it might be helpful to observe that for a Gaussian random matrix G of size × n, it is the case that for any x ∈ R n , we have E Gx 2 = x 2 , where E denotes expectation. Moreover, as the number of rows grows, the probability distribution of Gx concentrates tightly around its expected value, see, e.g., [38,Sec. 2.4] and the references therin. This means that for any pair of indices i, j ∈ {1, 2, . . . , n} we have E Y (:, i) − Y (:, j) 2 = A(:, i) − A(:, j) 2 . This simple observation does not in any way provide a proof that the randomized strategy we propose works, but might help understand the underlying intuition.
3.2. Efficient downdating of the sampling matrix Y . For the QRP factorization algorithm, it is well known that one does not need to recompute the column norms of the remainder matrix after each step. Instead, these can be cheaply downdated, as described, e.g., in [33, Ch.5, Sec. 2.1]. In terms of asymptotic flop counts, this observation makes the cost of pivoting become a lower order term, and consequently both unpivoted and pivoted Householder QR algorithms have the same leading order term (4/3)n 3 in their asymptotic flop † counts for n × n matrices. In this section, we describe an analogous technique for the randomized sampling strategy described in Section 3.1. This downdating strategy was discovered by one of the authors; a closely related technique was discovered independently and published to arXiv by Duersch and Gu [10] in September 2015.
First observe that if the randomized sampling technique described in Section 3.1 is used in the obvious fashion, then each step of the iteration requires the generation of a Gaussian random matrix G and a matrix-matrix multiply involving the remaining portion of A in the lower right corner to form the sampling matrix Y . The number of flops required by the matrix-matrix multiplications add up to an O(n 3 ) term for n × n matrices. However, it turns out to be possible to avoid computing a sampling matrix Y from scratch at every step. The idea is that if we select the randomizing matrix G in a particular way in every step beyond the first, then the corresponding sampling † We use the standard convention of counting one multiply and one add as one flop, regardless of whether a complex or real operation is performed. matrix Y can inexpensively be computed by downdating the sampling matrix from the previous step.
To illustrate, suppose that we start with an n × n original matrix A = A (0) . In the first blocked step, we draw a (b + p) × n randomizing matrix G (1) and form the (b + p) × n sampling matrix Using the information in Y (1) , we identify the b pivot vectors and form the corresponding permutation matrix P (1) . Then the matrix Q (1) representing the b Householder reflectors dictated by the b pivot columns is formed. Applying these transforms to the right and the left of A (0) , we obtain the matrix To select the pivots in the next step, we need to form a randomizing matrix G (2) and a sampling matrix Y (2) that are related through where R (1) holds the top b rows of A (1) so that The key idea is now to choose the randomizing matrix G (2) according to the formula

Inserting (3.4) into (3.3), we now find that the sampling matrix is
Evaluating formula (3.5) is inexpensive since the first term is a permutation of the columns of the sampling matrix Y (1) and the second term is a product of thin matrices (recall that Q (1) is a product of b Householder reflectors). Remark 3. Since the probability distribution for Gaussian random matrices is invariant under unitary maps, the formula (3.4) appears quite safe. After all, G (1) is Gaussian, and Q (1) is just a sequence of reflections, so it might be tempting to conclude that the new randomizing matrix must be Gaussian too. However, the matrix Q (1) unfortunately depends on the draw of G (1) , so this argument does not work. Nevertheless, the dependence of Q (1) on G (1) is very subtle since this Q (1) is dictated primarily by the directions of the good pivot columns. Extensive practical experiments (see, e.g., Section 4.2) indicate that the pivoting strategy described in this section based on downdating is just as good as the one that uses "pure" Gaussian matrices that was described in Section 3.1.

3.3.
Detailed description of the downdating procedure. Having described the downdating procedure informally in Section 3.2, we in this section provide a detailed description using the notation for HQR that we used in Section 3. First, let us assume that one iteration of the blocked algorithm has completed, so that, at the bottom of the loop body, the matrix A contains Here A denotes the original contents of AP 1 , where P 1 captures how columns have been swapped so far. Hence, cf. (3.2), Now, let G 2 be the next sampling matrix and Y 2 = G 2 A 22 . In order to show how this new sampling matrix can be computed by downdating the last sampling matrix, consider that for some matrix G 1 and that The choice of randomizing matrix analogous to (3.4) is now Inserting the choice (3.7) into (3.6), we obtain which can then be used to downdate the sampling matrix Y .
3.4. The blocked algorithm. In Fig. 3.1, we give the blocked algorithm that results when the randomized pivot selection strategy described in Section 3.1 is combined with the downdating techniques described in Section 3.3. In that figure, there is a call to a function "HRQP UNB" which is an unblocked HQR algorithm with column pivoting. The purpose of this call is to factor the current column panel so that the diagonal elements within blocks on the diagonal of R are ordered from largest to smallest in magnitude. Moreover, the call to "UpdatePivotInfo" takes the pivoting that occurred within the current panel (to ensure strictly decreasing diagonal elements within the current diagonal block of R 11 ) and merges this with the pivot information that occurred when determining the columns to be moved into that current panel.

Asymptotic cost analysis.
In analyzing the asymptotic complexity of the method, we consider a matrix of size m × n, with m ≥ n. We assume that the block size b and the over-sampling parameter p are kept fixed as m and n grow. We first note that all steps in Fig. 3.1 highlighted in grey are part of the blocked HQR algorithm, which is known to have an asymptotic cost of 2mn 2 − 2/3n 3 flops. This leaves us to discuss the overhead related to the other operations.

Experiments.
This section describes the results from two sets of experiments. Section 4.1 compares the computational speed of the proposed scheme to existing state-of-the-art methods for computing column pivoted QR factorizations. Section 4.2 investigates how well the proposed randomized technique works at selecting pivot columns. Specifically, we investigate how well the rank-k truncated QR factorization approximates the original matrix and compare the results to those obtained by classical column pivoting.

Performance experiments.
We have implemented the proposed HQRRP algorithm using the libflame [35,34] library that allows our implementations to closely resemble the algorithms as presented in the paper. Implementations. We report performance for four implementations: dgeqrf. The implementation of blocked HQR that is part of the netlib implemenation of LAPACK, modified so that the block size can be controlled. dgeqp3. The implementation of blocked HQRP that is part of the netlib implemenation of LAPACK, based on [31], modified so that the block size can be controlled. HQRRPbasic. Our implementation of HQRRP that computes new matrices G and Y in every iteration. This implementation deviates from the algorithm in Fig. 3.1 in that it also incorporates additional column pivoting within the call to HQRRP unb. HQRRP. The implementation of HQRRP that downdates Y . (It also includes pivoting within HQRRP unb). dgeqpx. An implementation of HQRP with window pivoting [5,4], briefly mentioned in the introduction. This algorithm consists of two stages: The first stage is a QR with window pivoting. An incremental condition estimator is employed to accept/reject the columns within a window. The size of the window is about about twice the block size used by the algorithm to maintain locality. If all the columns in the window are unacceptable, all of them are rejected and fresh ones are brought into the window. The second stage is a postprocessing stage to validate the rank, that is, to check that all good columns are in the front, and all the bad columns are in the rear. This step is required because the window pivoting could fail to reveal the rank due to its shortsightedness (having only checked the window and having employed a cheap to compute condition estimator.) Sometimes, some columns must be moved between R 11 that has been computed and matrices R 12 and R 22 , and then retriangularization must be performed with Givens rotations. In all cases, we used algorithmic block sizes of b = 64 and 128. While likely not optimal for all problem sizes, these blocks sizes yield near best performance and, regardless, it allows us to easily compare and contrast the performance of the different implementations.
Results. As is customary in these kinds of performance comparisons, we compute the achieved performance as 4/3n 3 time (in sec.) × 10 −9 GFLOPS.
Thus, even for the implementations that perform more computations, we only count the floating point operations performed by an unblocked HQR without pivoting. Figure 4.1 reports performance on 1, 4, 8, and 14 cores. We see that HQRRP handily outperforms dgeqp3 and to a lesser degree also outperforms dgeqpx. Moreover, the asymptotic performance of HQRRP appears to approach that of dgeqrf, in particular for the single core case. We also see that while the relative performances of all 5 methods remain qualitatively the same across the four graphs, it is clear that as the number of cores grows, the speed advantage of HQRRP over dgeqp3 becomes even further pronounced, cf.    Figure 4.2. We see that the choice b = 64 tends to lead to slightly faster execution, but the key take-away from this comparison is that the speed is relatively insensitive to the precise choice of block size (within reason, of course).

Quality experiments.
In this section, we describe the results of numerical experiments that were conducted to compare the quality of the pivot choices made by our randomized algorithm HQRRP to those resulting from classical column pivoting. Specifically, we compared how well partial factorizations reveal the numerical ranks of four different test matrices: • Matrix 1 (fast decay): This is an n × n matrix of the form A = U DV * where U and V are randomly drawn matrices with orthonormal columns (obtained by performing QR on a random Gaussian matrix), and where D is diagonal with entries given by d j = β (j−1)/(n−1) with β = 10 −5 . To be precise, we discretized the so called "single layer" operator associated with the Laplace equation using a 6 th order quadrature rule designed by Alpert [1]. This is a well-known ill-conditioned operator for which column pivoting is essential in order to stably solve the corresponding linear system. • Matrix 4 (Kahan): This is a variation of the "Kahan counter-example" [22] which is designed to trip up classical column pivoting. The matrix is formed as A = SK where: for some positive parameters ζ and φ such that ζ 2 + φ 2 = 1. In our experiments, we chose ζ = 0.99999.
For each test matrix, we computed QR factorizations AP = QR (4.1) using three different techniques: • HQRP: The standard QR factorization qr built in to Matlab R2015a.
• HQRRPbasic: The randomized algorithm described in Figure 3.1, but without the updating strategy for computing the sample matrix Y . • HQRRP: The randomized algorithm described in Figure 3.1. Our implementations of both HQRRPbasic and HQRRP deviate from what is shown in Figure 3.1 in that they also incorporate column pivoting within the call to HQRRP unb.
In all experiments, we used test matrices of size 4 000 × 4 000, a block size of b = 100, and an over-sampling parameter of p = 5.
As a quality measure of the pivoting strategy, we computed the errors e k incurred when the factorization is truncated to its first k components. To be precise, these residual errors are defined via e k = AP − Q(:, 1 : k)R(1 : k, :) = R((k + 1) : n, (k + 1) : n) . (4.2) The results are shown in Figures 4.3 -4.6, for the four different test matrices. The black lines in the graphs show the theoretically minimal errors incurred by a rank-k approximation. These are provided by the Eckart-Young theorem [11] which states that, with {σ j } n j=1 denoting the singular values of A: e k ≥ σ k+1 when errors are measured in the spectral norm, and when errors are measured in the Frobenius norm. We observe in all cases that the quality of the pivots chosen by the randomized method very closely matches those resulting from classical column pivoting. The one exception is the Kahan counter-example ("Matrix 4"), where the randomized algorithm performs much better. (The importance of the last point should not be overemphasized since this example is designed specifically to be adversarial for classical column pivoting.) When classical column pivoting is used, the factorization (4.1) produced has the property that the diagonal entries of R are strictly decaying in magnitude When the randomized pivoting strategies are used, this property is not enforced. To illustrate this point, we show in Figure 4.7 the values of the diagonal entries obtained by the randomized strategies versus what is obtained with classical column pivoting.

5.
Conclusions and future work. We have described the algorithm HQRRP which is a blocked version of Householder QR with column pivoting. The main innovation compared to earlier work is that pivots are determined in groups using a technique based on randomized projections. We demonstrated that the quality of the chosen pivots is for practical purposes indistinguishable from traditional column pivoting (cf. Figures 4.3 -4.5), and that the dominant term in the asymptotic flop count equals that of non-pivoted QR. Importantly, we also demonstrated through numerical experiments that HQRRP is very fast, in fact almost as fast as unpivoted HQR.
The technique described opens up several potential avenues for future research. The speed gains we demonstrate on single core and shared memory multicore machines is due to the reduction in data movement. Equivalently, data moved between memory layers is carefully amortized. We expect the technique described to have an even more pronounced advantage over traditional column pivoted QR when implemented in more severely communication constrained environments such as a matrix processed on a GPU or a distributed memory parallel machine, or stored out-of-core.
The randomized sampling techniques we describe can also be used to construct very close to optimal rank-revealing factorizations. To describe the idea, we note that a column pivoted QR factorization of a given matrix A can be written as where Q is orthonormal, R is upper triangular, and P is a permutation matrix. In this manuscript, we used randomized sampling to determine the permutation matrix P . It turns out that for a modest additional cost, one can build a factorization that takes the form (5.1), but with both Q and P built as products of Householder reflectors. This generalization allows us to bring R not only to upper triangular form, but very close to being diagonal, with accurate approximations to the singular values of A on its diagonal. This discovery was reported in [28], and is a subject of ongoing research. How to scale the presented algorithm to very large numbers of cores is an open research question. Techniques such as "compute ahead" will have to be employed to ensure that the factorization of the current panel (A 11 and A 21 ) and downdate of Y do not start dominating the parts of the computation that can be cast in terms of gemm. Included are implementations that directly link to LAPACK [3] as well as implementations that use the libflame [30,17] library. For those who use the LAPACK routine dgeqp3 routine, a plug compatible routine dgeqp4 is provided.
A distributed memory implementation of the algorithm has been incorporated into the Elemental software package by Jack Poulson et al [29], available at: https://github.com/elemental/Elemental For each of the four test matrices described in Section 4.2, we show the magnitudes of the diagonal entries in the "R"-factor in a column pivoted QR factorization. We compare classical column pivoting (red) with the two randomized techniques proposed here (blue and green). We also show the singular values of each matrix (black) for reference.