Look-ahead in the two-sided reduction to compact band forms for symmetric eigenvalue problems and the SVD

We address the reduction to compact band forms, via unitary similarity transformations, for the solution of symmetric eigenvalue problems and the computation of the singular value decomposition (SVD). Concretely, in the first case, we revisit the reduction to symmetric band form, while, for the second case, we propose a similar alternative, which transforms the original matrix to (unsymmetric) band form, replacing the conventional reduction method that produces a triangular–band output. In both cases, we describe algorithmic variants of the standard Level 3 Basic Linear Algebra Subroutines (BLAS)-based procedures, enhanced with look-ahead, to overcome the performance bottleneck imposed by the panel factorization. Furthermore, our solutions employ an algorithmic block size that differs from the target bandwidth, illustrating the important performance benefits of this decision. Finally, we show that our alternative compact band form for the SVD is key to introducing an effective look-ahead strategy into the corresponding reduction procedure.


Introduction
The reduction to tridiagonal form is a crucial operation for the computation of the eigenvalues of a dense symmetric matrix when a significant part of the spectrum is required [15]. Similarly, the reduction to bidiagonal form is the preferred option to obtain (all) the singular values of a dense matrix via the singular value decomposition (SVD) [15]. The standard algorithms for these two reductions in the legacy implementation of LAPACK (Linear Algebra PACKage) [2] compute these reduced forms via two-sided, fine-grained unitary transformations. Unfortunately, these routines are rich in Level-2 BLAS (Basic Linear Algebra Subroutines) [13], which are memory-bounded kernels and, therefore, deliver only a small fraction of the peak (computational) performance of recent computer architectures.
An alternative approach replaces these low-performance standard routines with two-sided reduction (TSR) algorithms that consist of two stages [5]. The idea is to initially transform the dense matrix into a compact band form (first stage) to next operate on this by-product in order to yield the desired tridiagonal/bidiagonal form (second stage). For symmetric eigenvalue problems (SEVP), the compact by-product is a symmetric band matrix with upper and lower bandwidth w. For the SVD, the compact representation corresponds, by convention, to an upper triangular–band matrix with upper bandwidth w. The appealing property of the TSR algorithms is that the initial reduction mainly consists of high performance, compute-bounded Level-3 BLAS [12], which explains the renewed interest in developing two-stage TSR algorithms for multicore processors as well as manycore accelerators [4,9,18,19]. To conclude this brief review of two-stage TSR algorithms, we note that in case w is small compared with the problem dimension, the computational cost of performing the reduction of the band matrix to tridiagonal/bidiagonal form is comparable with that of the initial reduction from the original dense matrix to the compact representation [16,10]. Furthermore, except for a few special problems, computing the eigenvalues/singular values of the tridiagonal/bidiagonal matrices contributes a minor factor to the global cost of the procedure [11,21,14,17]. In contrast, when the associated eigenvectors/singular vectors are to be computed, the cost of accumulating the unitary transformations during the second stage can be significant, even if the bandwidth is small compared with the problem dimension [4].
High-performance routines for the solution of linear systems (e.g., via the LU and QR factorizations [15]) as well as the initial stage of TSR algorithms usually implement a right-looking (RL) procedure that, at each iteration, factorizes the current panel of the matrix, and then applies the transformations that realized this reduction to update the trailing submatrix. On today's multicore platforms, the factorization of the panel is a performance bottleneck because this operation is mostly memory-bounded and, moreover, exhibits a complex set of fine-grain data dependencies. Fortunately, there exist three (to a certain extent complementary) techniques to tackle the constraint imposed by the panel factorization: T1) exploit the fine-grain parallelism within the panel itself [7]; T2) divide the factorization of the current panel into multiple operations whose execution can then be overlapped with certain parts of the trailing update, yielding the so-called algorithms-by-blocks or tile algorithms [16,6,22]; and T3) overlap the trailing update with the factorization of the "next" panel(s) [23].
Note the distinction between T2), which aims to exploit the parallelism among operations (tasks) in the same iteration of the RL algorithm; and T3), which exploits the parallelism among operations belonging to two (or more) consecutive iterations of the RL algorithm. Here it is worth pointing out that recent developments on the semi-automatic task-parallelization of dense linear algebra operations with the support of a "runtime" (such as SuperMatrix, Quark, OmpSs, StarPU, OpenMP, etc.) have partially blurred the frontier between T2) and T3). In particular, when this type of task-parallelization is applied to an algorithm-by-blocks for the solution of linear systems in order to realize T2), the result is that T3) is often obtained for free. For some TSR algorithms though, as we will discuss in the paper, this may be more difficult or even impossible.
In this paper we focus on T3), which is usually known as look-ahead [23]. Here, as we do not rely on a runtime to exploit "inter-iteration" parallelism, we can refer more precisely to this strategy as "static" look-ahead. While this technique has long been known and exploited for the solution of linear systems via the LU and QR factorizations, its application to TSR algorithms has not been explicitly discussed in full. In this paper we show that look-ahead can be introduced into the sophisticated TSR algorithms for SEVP and the SVD, delivering remarkable performance benefits. In particular, our paper makes the following contributions: • We explore the integration of look-ahead into the reduction of symmetric matrices to band form for SEVP via two-sided unitary transformations. In this line, we propose two variants of the reduction algorithm, enhanced with look-ahead, with distinct performance behaviour depending on the ratio between the algorithmic block size b (which dictates the number of columns in the panel) and the target matrix bandwidth w. While LAPACK (version 3.7.1) and MAGMA (version 2.1.0) both include routines for SEVP that reduce the symmetric input matrix to band form, those implementations impose the restriction that the algorithmic block size must equal the bandwidth, limiting performance. Furthermore, the LAPACK routine for this reduction does not integrate look-ahead. The SBR (Successive Band Reduction) package [5] was a pioneering work that decoupled the bandwidth from the algorithmic block size in this type of reduction, but did not integrate look-ahead either.
• We extend our analysis of look-ahead to the reduction of general matrices to band form for the SVD via two-sided unitary transformations. Here we depart from the conventional TSR to triangular–band form, which imposes certain restrictions on the application of look-ahead, to advocate for the reduction to a band form with equal lower and upper bandwidths. This change, in turn, yields two variants of the reduction algorithm for the SVD which are analogous to those identified for SEVP.
• We demonstrate the performance benefits of static look-ahead, using the reduction to band forms for SEVP and the SVD, on an Intel-based platform equipped with 8 Haswell cores. Our experimental analysis of the optimal block size clearly shows the importance of decoupling the algorithmic block size from the bandwidth, as well as the advantages of each variant.
The introduction of look-ahead paves the way toward overlapping the panel factorization on a CPU with the execution of the (rich in Level-3 BLAS) trailing update on an accelerator (e.g., a GPU). Furthermore, on a multicore architecture, an algorithm that explicitly decomposes the TSR to expose look-ahead can apply this technique with variable depth, using the support of a runtime such as OpenMP, OmpSs or StarPU. In both cases, we can expect a notable increase in performance, as i) the panel factorization is potentially removed from the critical path of the algorithm; and ii) the algorithmic block size is decoupled from the bandwidth. The rest of the paper is structured as follows. In Sections 2 and 3 we describe the introduction of look-ahead in the first stage of the TSR algorithms for SEVP and the SVD, respectively. In Section 4 we assess the benefits of a flexible implementation of this technique by experimentally demonstrating its effects for the TSR of dense matrices to the selected band forms for SEVP and the SVD. Finally, in Section 5 we close our paper with a few concluding remarks and a discussion of future work.
To close this introduction, we note that the mathematical equations, algorithms, and the evaluation in the remainder of the paper are all formulated for problems with real data entries, using orthogonal transformations, but their extension to the Hermitian case, involving unitary transformations, is straightforward.

Basic algorithm
Let us first describe the algorithm that reduces a dense symmetric matrix A ∈ R n×n to symmetric band form, with bandwidth w, via orthogonal similarity transformations. This procedure is numerically stable and, moreover, preserves the eigenvalues of the matrix [15]. Suppose that the first k − 1 rows/columns of A have been already reduced to band form; the algorithmic block size satisfies b ≤ w; and assume for simplicity that k + w + b − 1 ≤ n; see Figure 1.

Figure 1: Partitioning of the matrix during one iteration of the reduction to symmetric band form for the solution of SEVP.

At this point we note the key roles of the bandwidth w and the block size b, and their interaction. The optimal bandwidth itself depends on the efficiency of the second stage of the reduction and, therefore, it cannot be chosen independently. To complicate things a bit further, the optimal bandwidth also depends on the problem dimensions and the selected block size. The take-away lesson of this short discussion is that the best combination of bandwidth and block size depends on several factors, some of which are external to the implementation of the first stage.
During the current iteration of the reduction procedure, b new rows/columns of the band matrix are computed as follows:

1. Panel Factorization. Compute the QR factorization

A 0 = Q R, (2)

where A 0 ∈ R j×b, with j = n − (k + w) + 1; R ∈ R j×b is upper triangular; and the orthogonal matrix Q is implicitly assembled using the WY representation [15] as Q = I j + W Y T, where W, Y ∈ R j×b and I j denotes the square identity matrix of order j.
2. Trailing Update. Apply the orthogonal matrix Q to A 1 ∈ R j×(w−b) from the left:

A 1 := Q T A 1 = (I j + Y W T ) A 1, (3)

and to A 2 ∈ R j×j from both left and right: A 2 := Q T A 2 Q. During this last operation only the lower (or upper) triangular part of A 2 is updated, via the following sequence of Level-3 BLAS operations:

X := A 2 W, (4)
S := W T X, (5)
V := X + (1/2) Y S, (6)
A 2 := A 2 + V Y T + Y V T. (7)

Provided b and w are both small compared with n, the global cost of the reduction of a full matrix to band form is 4n 3 /3 floating-point arithmetic operations (flops). Furthermore, the bulk of the computation is performed in terms of the Level-3 BLAS operations in (4) and (7).
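To make the iteration concrete, the following numpy sketch builds the WY factors of the panel with unblocked Householder reflectors and then applies the symmetric two-sided update through a SYMM/SYR2K-style sequence. The helper names (`wy_qr`, `symmetric_two_sided_update`) and the exact kernel split are illustrative assumptions, not the SBR code.

```python
import numpy as np

def wy_qr(A0):
    """QR of the panel A0 via Householder reflectors; returns R and the
    WY factors W, Y such that Q = I + W @ Y.T (unblocked, for illustration)."""
    j, b = A0.shape
    R = A0.copy()
    W = np.zeros((j, b))
    Y = np.zeros((j, b))
    for i in range(b):
        x = R[i:, i]
        v = x.copy()
        v[0] += (1.0 if x[0] >= 0 else -1.0) * np.linalg.norm(x)
        v /= np.linalg.norm(v)
        R[i:, i:] -= 2.0 * np.outer(v, v @ R[i:, i:])
        y = np.zeros(j)
        y[i:] = v
        # Append the reflector: Q := Q (I - 2 y y^T), kept in WY form
        W[:, i] = -2.0 * (y + W[:, :i] @ (Y[:, :i].T @ y))
        Y[:, i] = y
    return R, W, Y

def symmetric_two_sided_update(A2, W, Y):
    """A2 := Q^T A2 Q with Q = I + W Y^T, as a Level-3 sequence:
    a SYMM (X), two small products (S, V), and a SYR2K-style update."""
    X = A2 @ W                     # SYMM with the symmetric A2
    S = W.T @ X                    # small b-by-b block
    V = X + 0.5 * (Y @ S)
    return A2 + V @ Y.T + Y @ V.T  # SYR2K
```

On a random symmetric block one can check that Q is orthogonal and that the fused sequence matches the explicit two-sided product Q^T A2 Q.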
The problem with this basic algorithm is that the panel factorization in (2) is mainly memory-bounded (at least, for the usual values of b) and features some complex dependencies so that, as the number of cores performing the factorization is increased, the panel operation rapidly becomes a performance bottleneck. We next describe how to solve this problem via two algorithmic variants that implement a static look-ahead in order to overlap in time (i.e., run concurrently) the execution of the trailing update for the current iteration with the factorization of the next panel.

Introducing look-ahead
Consider the blocks A 0, A 1, A 2 involved in iteration k of the basic algorithm (see Figure 1); and let us refer to the panel that will be factorized in the subsequent iteration k̄ = k + b − 1 as Ā 0. The key to formulating a variant of the basic algorithm enhanced with look-ahead lies in: 1. identifying the parts of the trailing submatrix [A 1, A 2] that will become Ā 0 during the next iteration; 2. isolating the updates corresponding to the application of the orthogonal transformations in (3) that affect Ā 0 from those which modify the parts of [A 1, A 2] that do not overlap with Ā 0; and 3. during iteration k, overlapping the factorization of the subsequent panel Ā 0 (look-ahead factorization) with the updates corresponding to this iteration.
At this point we distinguish two cases, leading to two variants of the TSR algorithm with look-ahead, depending on the relation between b and w: • Variant V1: 2b ≤ w. In this case, Ā 0 lies entirely within A 1, as the number of columns in the latter satisfies w − b ≥ b. We then define the following partitioning of the trailing submatrix, where A L 1 consists of b columns, A TL 1 is b × b, and we (use the red color to) distinguish those blocks that overlap in the column range of Ā 0. During iteration k, we can then perform the following three groups of operations (left, middle and right) concurrently:

These groups comprise the sequential panel factorization (left) and the parallel remainder update. Note that, with this partitioning, Ā 0 is generally small compared with A 2. Therefore, we can expect that the factorization of the subsequent panel Ā 0 can be overlapped with the update of A 2 on the right, eliminating the former from the critical path of the reduction. On a multicore architecture, we can achieve this by dedicating a few threads/cores to the panel factorization while the remaining ones compute the trailing update. On a CPU-GPU system, the CPU can take care of the panel factorization while the GPU updates the trailing submatrix. Hereafter, we will refer to these (two groups of) computational resources, few threads/CPU and many threads/GPU, as T S (for sequential) and T P (for parallel), respectively.
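As a minimal sketch of the T S / T P split on a multicore CPU, the fragment below overlaps a panel-factorization callable with the trailing update using a dedicated worker thread, synchronizing at the end of the iteration. The callables are hypothetical placeholders, not the paper's kernels; in a real implementation both sides would invoke BLAS, whose calls release Python's GIL and therefore genuinely run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def iteration_with_lookahead(factor_panel, update_trailing, next_panel, trailing):
    """Run the look-ahead factorization of the next panel (T_S) concurrently
    with the large trailing update (T_P); synchronize at the iteration end."""
    with ThreadPoolExecutor(max_workers=1) as ts:   # T_S: one dedicated worker
        pending = ts.submit(factor_panel, next_panel)
        update_trailing(trailing)                   # T_P: calling thread(s)
        return pending.result()                     # barrier before next iteration
```

The `result()` call acts as the per-iteration synchronization point: the factored panel must be available before the next iteration's updates can start.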
There exists a direct dependency between the two operations in the left-hand side group, which we can denote as A L 1 → Ā 0. Here, the update of A L 1 is a Level-3 BLAS operation that, in general, will offer low performance, as the width of the panel is usually small. Due to this dependency and the low performance, the update of A L 1 will be performed by T S. As for the update of A R 1, in the middle "group", in case this is also a narrow column panel (i.e., w − 2b is small), we can expect low performance from it, so that it should join the group of "sequential" operations on the left (red group), to be performed by T S. Otherwise, it can be merged with the "parallel" group on the right, to be computed by T P.
• Variant V2: 2b > w. In this case, Ā 0 expands beyond the columns of A 1 to partially overlap with A 2. Let us consider a partitioning of A 2 into a left block A L 2 and a right block A R 2, analogous to that of A 1 in Variant V1. During iteration k, we initially compute the operations which correspond to the update of A 1 and part of the computations necessary for the update of A 2 (the intermediate results X 1, X 2, X 3); see (4)–(7). After this is completed, we can concurrently perform two groups of operations: the sequential panel factorization and the parallel remainder update.

Figure 2: Dependencies among operations appearing in Variant V2 of the initial TSR to symmetric band form. For simplicity, each operation is identified by its output operand.
As in the previous variant, this pursues the goal of overlapping the factorization of the next panel Ā 0 with a sufficiently large Level-3 BLAS. In general, Ā 0 is small compared with the trailing submatrix A R 2, so we can expect this to be the case. To close the discussion of Variant V2, we note the collection of dependencies appearing among the operations identified in this case; see Figure 2. For these operations, the panel width determines whether they involve narrow panels and, therefore, can be considered memory-bounded, low-performance kernels. Thus, together with the dependencies, this property will ultimately decide whether they are moved to the groups of either sequential or parallel kernels, to be tackled by T S or T P, respectively. For example, one possibility is to update A 1 on T S, while X 1, X 2, X 3 are being computed by T P; when all these operations are completed, we can continue with the update of A L 2 and the factorization of Ā 0 on T S, while A R 2 is being updated by T P.
At this point, it is fair to ask what is the value of explicitly exposing static look-ahead if the same effect could be obtained, in principle, with the combination of an algorithm-by-blocks and the support of a task-parallelizing runtime. Armed with the previous discussion of look-ahead, we can now offer several arguments in response to this question: 1. Exposing the look-ahead variant provides a better understanding of the algorithms.
2. Static look-ahead can be as efficient as, or even outperform, a runtime-assisted dynamic solution [8]. The reason is that, for regular dense linear algebra operations such as those in the Level-3 BLAS, dividing these kernels into fine-grain operations incurs some packing/unpacking overheads. In addition, the use of a runtime promotes the exploitation of task-parallelism at the cost of a suboptimal use of the cache hierarchy.
3. As discussed in the next section, for the tile algorithm proposed for the reduction to triangular-band form, the application of a runtime may not allow, per se, the exploitation of task-parallelism among operations belonging to different iterations.
4. The look-ahead variants do not require the implementation of tuned kernels to factorize blocks with the special structures that appear in the algorithm-by-blocks, nor to apply the corresponding transformations. Moreover, they do not incur the overhead of operating with these kernels and do not have an internal block size that needs to be tuned [6,22]. 5. For CPU-GPU platforms, static look-ahead can be the only practical means to eliminate the panel factorization from the critical path of the algorithm.

Triangular-band form
The conventional algorithm for the first stage of the TSR algorithm for the SVD computes an upper triangular matrix with upper bandwidth w. To describe this procedure, consider a matrix A ∈ R m×n, where the first k rows/columns have already been reduced to the desired triangular-band form; and assume that k + w + b − 1 ≤ m, n, with m ≥ n. Furthermore, let us consider initially the simpler case with w = b; see Figure 3. During the current iteration, the following computations thus advance the reduction by w additional rows/columns:

1. Left Panel Factorization. Compute the QR factorization (14) of the current column panel B, where the orthogonal factor U is implicitly assembled using the WY representation.

2. Left Trailing Update. Apply U, from the left, to the part of the trailing submatrix to the right of the panel (15).

3. Right Panel Factorization. Compute the LQ factorization [15] (16), where C ∈ R w×j; L ∈ R w×j is lower triangular; and the orthogonal factor V is implicitly assembled using the WY representation.

Figure 3: Matrix partitioning (left) and dependencies (right) for the reduction to triangular-band form for the SVD (w = b).

4. Right Trailing Update. Apply V to D ∈ R (i−w)×j, from the right (17).

Assuming w ≪ m, n, this algorithm requires 4(mn 2 − n 3 /3) flops, and the major part of these operations is concentrated in the trailing updates (15), (17), which correspond to high-performance Level-3 BLAS.

Triangular-band form and look-ahead
Unfortunately, the two panel factorizations in (14), (16) impose the same bottleneck as that discussed for the reduction to the symmetric band form. Furthermore, in the reduction to triangular-band form, overcoming this problem via a look-ahead strategy enforces certain constraints on the relation between w and b that may impair performance. Let us explain this in detail via three cases, where the first one corresponds to the simple scenario with w = b, and the remaining two decouple the block size from the bandwidth so that b ≤ w.
First case: w = b. Consider the scenario illustrated in Figure 4, where the superscripts (as, e.g., in the factorization B 0 = U 0 R 0) indicate the iteration count (starting at 0), and the subscripts specify the index of the block being updated by the corresponding transformations (either from the left, as in U 0 1, or from the right, as in V 0 1). The arrows correspond to data dependencies and thus define a partial ordering for the execution of the operations. For simplicity, let us aggregate the updates U k k+1, U k k+2, U k k+3, . . . into a single macro-update U k and, similarly, V k k+1, V k k+2, V k k+3, . . . into the macro-update V k. Then, it is easy to verify that the existing dependencies enforce a strict, fully sequential ordering of the factorizations and macro-updates, revealing that it is not possible to exploit look-ahead in this case.
Second case: w = 2b. The new situation is displayed in Figure 5, which will be leveraged to expose that the dependency problem identified in the previous case partially remains. In particular, let us now aggregate the updates U k k+2, U k k+3, U k k+4, . . . into a single macro-update Ū k and, similarly, V k k+2, V k k+3, V k k+4, . . . into the macro-update V̄ k.

Figure 5: Matrix partitioning (left) and dependencies (right) for the reduction to triangular-band form for the SVD (w = 2b).

Figure 6: Matrix partitioning (left) and dependencies (right) for the reduction to triangular-band form for the SVD (w = 3b).
Moreover, for simplicity, let us consider that the updates of the form U k k+1 and V k k+1 respectively occur inside the factorizations U k+1 R k+1 and L k+1 V k+1. Then, we can initially compute U 0 R 0; followed by the overlapped execution of the factorization U 1 R 1 with the macro-update Ū 0; and the factorization L 1 V 1 next. At this point, we would like to overlap U 2 R 2 with Ū 1 and L 1 V 1 with V̄ 0. However, because of the dependencies, we can exploit one of the overlappings, but not both. To see this, assume our goal is to encode the first overlapping. Then, V̄ 0 must be available (green lines), but L 1 V 1 cannot be computed yet (black lines). Therefore, the second overlapping is not possible. Conversely, assume that we intend to encode the second overlapping. Then, Ū 1 must be available (black lines), but L 1 V 1 cannot be computed yet (green lines), and the first overlapping is not possible.
Third case: w = 3b. Consider next the scenario in Figure 6. Let us use the macro-update Ū k to stand now for U k k+3, U k k+4, U k k+5, . . .; and V̄ k for V k k+3, V k k+4, V k k+5, . . .. Also, assume for simplicity that the updates of the form U k k+1, U k k+2 and V k k+1, V k k+2 respectively occur as part of the factorizations U k+1 R k+1 and L k+1 V k+1. As in the previous case, we can initially compute U 0 R 0; followed by the overlapped execution of the factorization U 1 R 1 with the macro-update Ū 0; and the factorization L 1 V 1 next. However, because of the distinct dependencies that are present in this third case, nothing prevents us in the following steps from overlapping U 2 R 2 with Ū 1; L 1 V 1 with V̄ 0; U 3 R 3 with Ū 2; and so on.
The conclusion from this study is that, in the reduction to triangular-band form, applying look-ahead for both the left and right panel factorizations requires w ≥ 3b. While this is doable, it has some practical implications on the relation between the practical values of w and b. On the one hand, w should be kept small to moderate, because selecting a large value shifts much of the computational cost to the second stage (reduction from triangular-band to bidiagonal form), which is realized via slower Level-2 BLAS. On the other hand, b needs to be set to a large value, as otherwise the updates will not fully benefit from the performance of the Level-3 BLAS. The practical consequence is that the constraint w ≥ 3b in this type of decomposition can exert a strong negative impact on performance.
Pipelining the factorizations. Consider again the simple case w = b and the operations (14)–(17) to be computed in a given iteration. A different (but related) possibility to attain an overlapped execution is to decompose the left panel factorization in this iteration into several column micro-panels, of width b l < w, and then overlap the factorization of the micro-panels with their application to the remaining part(s) of the matrix (i.e., within the same panel B as well as to E). When this is completed, the algorithm proceeds to the right panel factorization in the same iteration, and basically applies the same idea using row micro-panels of height b r. With this strategy we can choose values for the micro-panels that simply satisfy w = 2b l = 2b r. However, note that, with this approach, the first micro-panel of both the left and right panel factorizations (at each iteration) cannot be overlapped, with a strong negative impact on performance. This forces us to select w ≥ 3b l, 3b r, with the same consequences as those discussed in the previous paragraph. Even worse, the first left and right panel factorizations of each iteration cannot be overlapped with each other.
Other implementations. The discussion of the reduction to triangular-band reveals a strong limitation when aiming to exploit task-parallelism among operations belonging to different iterations. We note that this algorithm is precisely the selection that was made for the message-passing implementation of TSR to triangular-band form in [16]. It is also the choice for the tile algorithms in PLASMA that perform this reduction on multicore platforms in [18,19]. In addition, all of these implementations couple the algorithmic block size to the bandwidth, so that b = w, with a potential negative effect on performance.

Band form and look-ahead
In [16] the authors explored the triangular-band reduction as well as an alternative algorithm that reduces the dense matrix to a band form with the same upper and lower bandwidth. However, the latter algorithm was abandoned in that work, as it did not offer any special advantage. In this subsection we show that this approach is actually the key to obtaining two variants for the reduction to band form for the SVD, enhanced with look-ahead, which are analogous to those already presented for SEVP in subsection 2.2. Importantly, this approach does not enforce that w ≥ 3b, as was the case for the reduction to triangular-band form. Before we review this algorithm, we note that the cost of applying this procedure to reduce a dense matrix to band form, with upper and lower bandwidth w = w′/2, is about the same as that of reducing the matrix to triangular-band form with bandwidth w′. The basic algorithm (i.e., without look-ahead) is very similar to the reduction to symmetric band form, with the differences stemming from the fact that A is now an unsymmetric matrix, which requires separate left and right factorizations. As usual, consider that the first k − 1 rows/columns of A have already been reduced to band form; select b ≤ w; and assume for simplicity that k + w + b − 1 ≤ m, n; see Figure 7.
During the current iteration of the reduction procedure, b new rows/columns of the band matrix are computed as follows (for brevity, we do not state explicitly the dimensions and properties of the matrix blocks/factors in the following, as they can be easily derived from the context and Figure 7):

1. Left Panel Factorization. Compute the QR factorization of the current column panel, with the orthogonal factor U implicitly assembled in WY form (in terms of W U, Y U).
2. Left Trailing Update. Apply U to the trailing submatrix from the left.

3. Right Panel Factorization. Compute the LQ factorization of the current row panel, with the orthogonal factor V implicitly assembled in WY form (in terms of W V, Y V).
4. Right Trailing Update. Apply V to the trailing submatrix from the right.

From these expressions, let us now re-consider the two cases leading to Variants V1 and V2 of the look-ahead strategy: • Variant V1: 2b ≤ w. The next panels B̄ 0 and C̄ 0 lie entirely within B 1 and C 1, respectively. Therefore, the update and factorization of these panels can be overlapped with the updates performed on D from the left and right, respectively.
• Variant V2: 2b > w. Now both B̄ 0 and C̄ 0 extend to overlap with D. The key to introducing look-ahead is that the left and right updates of D can be performed "simultaneously", as in [16], in terms of the intermediate results Z L, Z R, X. Therefore, we can initially perform the updates of B 1, C 1 and compute Z L, Z R, X. Next, we partition the update of D to expose those parts of the result that overlap with B̄ 0 and/or C̄ 0.
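The fused left/right update of D can be sketched as follows. The factor conventions (left transform I + Y_U W_U^T, right transform I + W_V Y_V^T) and the precise roles of Z_L, Z_R, X are our assumptions for illustration, not taken verbatim from [16]; the point is that Z_L and Z_R are mutually independent products that can be formed before D itself is touched.

```python
import numpy as np

def fused_two_sided_update(D, WU, YU, WV, YV):
    """D := (I + YU WU^T) D (I + WV YV^T) in one pass.
    ZL and ZR are independent of each other (X reuses ZL), which is what
    allows Variant V2 to schedule them concurrently before updating D."""
    ZL = WU.T @ D          # contribution of the left transform
    ZR = D @ WV            # contribution of the right transform
    X = ZL @ WV            # cross term WU^T D WV
    return D + YU @ ZL + ZR @ YV.T + YU @ X @ YV.T
```

Expanding the two-sided product term by term shows the four contributions above, so the fused form matches applying the left and right updates sequentially.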

Experimental Evaluation
In this section, we analyze in detail the performance benefits obtained by introducing the look-ahead strategies formulated in this paper, as well as by decoupling the algorithmic block size from the bandwidth in the TSR algorithms for SEVP and the SVD. All experiments were performed using IEEE double-precision arithmetic on an Intel Xeon E5-2630 v3 processor (8 cores running at a nominal frequency of 2.4 GHz). The implementations were linked with BLIS (version 0.1.8) [24].
In the experiments, we employed square symmetric matrices for SEVP, and both square and rectangular matrices for the SVD, with random entries uniformly distributed in (0, 1), and dimensions of up to 10000 in steps of 500. We reiterate that the optimal bandwidth w depends not only on the implementation of the first stage, but also on that of the second stage, for which there exist multiple algorithms and tuned implementations, depending on the target architecture [9,18,19,10], the problem size, etc. For this reason, we decided to test the algorithms using six bandwidths: w = {32, 64, 96, 128, 192, 256}. For these cases, the block size b was then tuned using values ranging from 16 up to w/2 for Variant V1 and up to w for V2, in steps of 16. We employed one thread per core in all executions. For the look-ahead versions, we set T S with 1 core and T P with the remaining 7 cores; for the reference implementations without look-ahead, there is no separation of the threads into groups so that all of them participate in the execution of each BLAS.
In all cases, we use the nominal flop count to compute the GFLOPS rate (billions of flops per second). For example, for the reduction in the SEVP, we employ 4n 3 /3 flops independently of the target bandwidth w. Since the comparison between algorithms/variants is performed in a scenario with fixed w, this is a reasonable approach to obtain a scaled version of the execution time, with the differences being more visible for smaller problem sizes than they would have been using the execution time itself.
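For reference, the GFLOPS metric is just the fixed nominal flop count divided by the measured time; a small helper (ours, for illustration) makes the scaling explicit:

```python
def gflops_sevp(n, seconds):
    """Nominal GFLOPS rate for the first-stage SEVP reduction: the flop
    count 4n^3/3 is used regardless of the bandwidth w, so for fixed w
    the metric is simply a scaled inverse of the execution time."""
    return (4.0 * n ** 3 / 3.0) / seconds / 1.0e9
```

Because the numerator is fixed for a given n, halving the execution time exactly doubles the reported GFLOPS, which is what makes the metric a fair proxy when comparing variants at the same bandwidth.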

Reference implementation
Our implementation of (the first stage of) the TSR algorithms for SEVP is based on the codes in the SBR package [5]. We applied two relevant optimizations to the original version of these codes: • We replaced the routine for the panel factorization in the SBR package, based on Level-2 BLAS, with an alternative that factorizes this panel using a blocked left-looking (LL) algorithmic variant that, furthermore, relies on Level-3 BLAS. The inner block size for this routine was set to 16, with this value being determined (i.e., tuned) in an independent experiment. The LL variant was selected because it offered higher performance than its RL counterpart in our experiments.
• The routine that builds the matrices W, Y in SBR, which define the orthogonal factors, was modified to assemble W as the product of Y and the triangular b×b matrix T that is obtained from the alternative compact WY representation [15]. This modification considerably reduced the cost of building W as this can then be based entirely on Level-3 BLAS.
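The compact WY trick can be sketched as follows: given the Householder vectors Y (unit-norm columns here, for simplicity) and their scalar coefficients tau, the triangular factor T is built column by column via the standard recurrence, and W is then obtained with a single Level-3 product W = Y T. The coefficient convention (Q = I + Y T Y^T, reflectors H = I + tau y y^T with tau = -2 for unit y) is an assumption of this sketch; LAPACK's dlarft uses a different sign convention.

```python
import numpy as np

def build_T(Y, tau):
    """Triangular factor of the compact WY representation Q = I + Y T Y^T,
    for Q = H_1 H_2 ... H_b with H_i = I + tau[i] * y_i y_i^T."""
    b = Y.shape[1]
    T = np.zeros((b, b))
    for i in range(b):
        # New column: t = tau_i * T[:i,:i] (Y[:,:i]^T y_i); diagonal = tau_i
        T[:i, i] = tau[i] * (T[:i, :i] @ (Y[:, :i].T @ Y[:, i]))
        T[i, i] = tau[i]
    return T

def build_W(Y, tau):
    """W = Y T, so that Q = I + W Y^T; a single Level-3 (TRMM-like) product."""
    return Y @ build_T(Y, tau)
```

This is what makes the optimization effective: assembling W costs one triangular matrix product instead of a sequence of Level-2 accumulations.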
The implementations of the TSR algorithms for the SVD re-utilized as much as possible of these building blocks, including the ideas underlying the previous two code optimizations. The legacy LAPACK (version 3.7.1) includes routine dsytrd_sy2sb for the reduction of a symmetric matrix to symmetric band form, which also features these optimizations (except for the use of the left-looking factorization). However, the LAPACK routine does not include look-ahead and, furthermore, it imposes the restriction that b = w. As our experimental evaluation of the optimal block size will show, this limitation severely impairs performance.

The role of the block size
In practice, the algorithmic block size b has an important impact on performance. For the particular case of (the first stage of) both TSR algorithms, the block size should balance two criteria:
1. Deliver high performance for the Level-3 BLAS kernels that compose the trailing update. Small values of b turn W, Y, W_U, Y_U, W_V, Y_V into narrow column panels, affecting the performance of the Level-3 BLAS, since the amount of data reuse is greatly reduced so that, eventually, the kernels become memory-bounded.
2. Reduce the amount of flops performed in the panel factorization. Small values of b reduce the number of operations performed in these intrinsically sequential operations.
Figure 8 illustrates the interplay between the block size and the bandwidth, using the reduction of symmetric matrices to band form (first stage of SEVP). The figure includes the reference implementation (hereafter labeled as "Reference SEVP") as well as the two look-ahead variants introduced in this work.
This experiment reveals that, for the small problem (two top plots in Figure 8), the optimal performance is attained for small values of the block size, since they balance the execution times of the panel factorization and the trailing update. Roughly speaking, for small problem dimensions, if the block size is too large, the panel factorization requires more time to complete than the trailing update. As a consequence, many computing resources (i.e., threads/cores) become idle during the iteration, degrading performance. At this point, we note that the degree of resource concurrency also affects the workload balance, as this factor can shift the problem size threshold above which the trailing update becomes more expensive than the panel factorization.
In addition, the plots show that, regardless of the implementation, the lowest performance for the large problem (two bottom plots in Figure 8) is observed for the smallest and the largest block sizes, with the optimal choice residing in the middle range. For the smallest block sizes, the Level-3 BLAS kernels invoked from the trailing update cannot efficiently exploit all the computational potential of the platform. For the largest block sizes, too many flops are devoted to the panel factorization.
The previous experimental analysis confirms that in [8] for the LU factorization, and exposes that choosing the optimal block size is a non-trivial task, since this parameter depends on many other factors such as the problem dimension n, the bandwidth w, and the degree of parallelism. To further complicate the selection, as the reduction progresses, the operation is decomposed into sub-problems of dimension n − b, n − 2b, . . .. This implies that the optimal block size for the initial sub-problem(s) may not be optimal for the subsequent ones. Fortunately, this effect is partially compensated because the computational cost of the sub-problems rapidly decreases with their dimension.
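The decreasing sequence of sub-problem sizes, and the fact that the later (smaller) sub-problems contribute little to the total cost, can be made concrete with a toy model. This is our own illustration, assuming a trailing-update cost proportional to k^2·b flops for a sub-problem of dimension k (not the paper's cost model):

```python
def subproblem_sizes(n, b):
    """Dimensions of the successive trailing sub-problems when the
    reduction advances by panels of b columns: n, n-b, n-2b, ..."""
    return list(range(n, 0, -b))

def trailing_cost_share(n, b, first=4):
    """Toy estimate of the fraction of the total trailing-update flops
    spent in the first `first` sub-problems, assuming a cost
    proportional to k**2 * b for a sub-problem of dimension k."""
    costs = [k * k * b for k in subproblem_sizes(n, b)]
    return sum(costs[:first]) / sum(costs)
```

Under this model, the first few (largest) sub-problems account for a disproportionate share of the work, which is why tuning the block size for the initial sub-problems matters most.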
These experiments clearly show that coupling the block size to the bandwidth, so that b = w, in general results in suboptimal performance. Taking into account the discussion and results in this subsection, in the remainder of this paper we perform an extensive tuning of the block size for each problem dimension and bandwidth.

Performance of TSR to symmetric band form for SEVP
In this section, we analyze in detail the performance behavior of the multi-threaded variants with look-ahead aimed at enhancing the computational throughput of the algorithms for the first stage of SEVP. Specifically, the following implementations are compared:
• Reference SEVP: Reference implementation from SBR with the optimizations described in subsection 4.1.
• Variant V1: Look-ahead variant for problems with 2b ≤ w. Two different mappings of the update of A_1^R to the threads/cores were considered, depending on whether this operation is performed by either T_S or T_P. In both cases, the factorization of Ā_0 and the update of A_1^L are performed by T_S; and the update of A_2 is done by T_P. Our experiments with these mappings demonstrated that the first option, which updates the full A_1 using T_S, always delivers equal or lower performance than the alternative mapping for this variant. (For small bandwidths, A_1^R is consequently small, and its execution time does not affect the overall execution time of the reduction; for large bandwidths, the multi-threaded execution of the update of A_1^R is the preferred choice.) Therefore, for clarity, we removed the first mapping from the following plots.
• Variant V2: Look-ahead variant for problems with 2b > w. Two different mappings are possible, depending on which threads update A_1, X_1, X_2, X_3:
– A_1 on T_S (while X_1, X_2, and X_3 are computed concurrently on T_P).
– A_1 on T_S + T_P (after which X_1, X_2, X_3 are updated by the same T_S + T_P threads).
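The schedule shared by both look-ahead variants can be sketched as follows. This is a toy skeleton of our own: the callbacks stand in for the panel factorization (on the team T_S) and the trailing update (on the team T_P), and the sketch omits the partial update that the look-ahead panel requires before it can be factorized.

```python
import concurrent.futures as cf

def reduction_with_lookahead(num_iters, factor_panel, update_trailing):
    """Static look-ahead skeleton: while one thread team ("T_P")
    performs the trailing update of iteration k, the other team
    ("T_S") factorizes the panel of iteration k+1, hiding the
    sequential cost of the panel factorization."""
    with cf.ThreadPoolExecutor(max_workers=2) as ex:
        panel = ex.submit(factor_panel, 0)              # first panel: no overlap
        for k in range(num_iters):
            panel.result()                              # panel k must be ready
            upd = ex.submit(update_trailing, k)         # "T_P": trailing update k
            if k + 1 < num_iters:
                panel = ex.submit(factor_panel, k + 1)  # "T_S": look-ahead panel
            upd.result()                                # close iteration k
```

In the real codes, the two teams are sets of cores driving multi-threaded BLAS kernels; here two Python threads merely reproduce the dependency structure.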
The plots in the left-hand side of Figure 9 and Figure 10 report the GFLOPS rates attained by the configurations (namely algorithms/variants/mappings) for several bandwidths in the range 32-256. The right-hand side plots in both figures illustrate the optimal block size for each problem dimension and configuration, showing the crucial role of this parameter.
The first conclusion that can be extracted from the plots is that the performance of the two Variants V1 and V2 enhanced with look-ahead depends on the ratio between the algorithmic block size b (equivalent to the number of columns in the panel), the target matrix bandwidth w, and the problem dimension n. Focusing on Variant V1, we identify a drawback due to the limitation imposed on the block size by the condition 2b ≤ w. This implies that, for small bandwidths, Variant V1 can only employ very small block sizes. In consequence, the invocations of the Level-3 BLAS that perform the trailing update cannot efficiently exploit all resources of the processor. To illustrate this behavior, let us focus on the experiments with w = 64 in Figure 9. The right-hand side plot there shows that, for Variant V1, the block size is always 32, which is smaller than those selected for Variant V2 and the reference implementation. In addition, the left-hand side plot shows that, even though Variant V1 integrates look-ahead, its performance is inferior to that of the reference implementation for large problem dimensions, as the overall execution time in those cases is dominated by the trailing update. In contrast, the plots for larger bandwidths (i.e., w = 128 and 256) show that the looser constraint enables the selection of larger block sizes, considerably improving the throughput of Variant V1, which then outperforms all other configurations.
Our implementations of Variant V2 always outperform the reference routine, and also Variant V1 for small bandwidths as well as for medium-size bandwidths combined with small problem dimensions. Indeed, as there are no strict restrictions on the block size for Variant V2 (other than b ≤ w), large values can be selected for b, and the Level-3 BLAS kernels in the trailing update tend to deliver a good fraction of the peak performance even for small bandwidths. However, the drawback of Variant V2 lies in the parts of the algorithm that may be overlapped as part of the look-ahead strategy. In particular, as 2b can be larger than w, the factorization of Ā_0 cannot start until the updates of A_1 and X_3 are completed (requiring a synchronization point after them). This implies that only the factorization of Ā_0 and the update of A_2^L can be overlapped with the update of A_2^R. In contrast, for Variant V1, the factorization of Ā_0 only requires that the update of A_1^L is completed, therefore removing this synchronization point; in consequence, the update of A_1^L and the factorization of Ā_0 can be overlapped with the update of A_2.
To close the experiments in this subsection, we evaluate the impact that the TSR algorithms for SEVP make on the overall computation of the eigenvalue decomposition. Recall that, in the two-stage reduction to tridiagonal form, the matrix is reduced to a symmetric band form employing one of the TSR algorithms presented earlier in this subsection (the SBR-based Reference, Variant V1 or Variant V2), and this banded matrix is next reduced to tridiagonal form. For the second stage, our evaluation employs routine SBRDT from the SBR package. After that, the eigenvalues are obtained using routine DSTERF from LAPACK. Alternatively, when using the traditional solver for SEVP in LAPACK, the input matrix is directly reduced to tridiagonal form, using routine DSYTRD from LAPACK, after which routine DSTERF is applied to obtain the eigenvalues. Routines SBRDT, DSTERF and DSYTRD are mostly composed of calls to Level-1 and Level-2 BLAS kernels. Due to the sequential implementation and lack of optimization of these kernels in BLIS, in our experiments we linked these routines to Intel MKL. (The initial reduction to band form via the TSR algorithms was still performed using the kernels from BLIS.) Table 1 reports the execution time of the different stages that are present in the solution of SEVP, as well as the acceleration with respect to the single-stage reduction to tridiagonal form, for three problem sizes and several bandwidths. These results show that, for the smallest problem, as the symmetric matrix fits in the on-chip L3 cache, the best option is to employ the conventional solver in LAPACK, based on routines DSYTRD+DSTERF. On the other hand, for the larger two problems, the best option corresponds to the two-stage reduction to tridiagonal form, using Variant V2 with a narrow bandwidth w = 64 in both cases.
In particular, the speedups with respect to LAPACK's solver with a single-stage reduction are 1.96 and 2.78 when the complete process is considered. Focusing on the two-stage approach, the results also expose the need to limit the bandwidth of the compact form, as the cost of routine SBRDT rapidly grows with w. Finally, the table reveals that the speedups observed for Variant V2 with respect to the reference implementation vary between 1.16 and 1.19 for the two largest problem sizes and w = 64. At this point, we note that the contribution of the new TSR to band form to the total cost of the eigenvalue computation largely depends on the implementation and efficiency of the subsequent stages. (Thus, for example, the results can be significantly different if one employs a solver that directly obtains the eigenvalues from the band form, without requiring the reduction to tridiagonal form [20], or just applies an iterative solver on the band matrix to obtain a few selected eigenvalues [1].) However, the acceleration factors observed for Variants V1 and V2 with respect to the reference implementation in the first stage remain unaffected by these choices.
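As a point of reference for the alternative mentioned above, eigenvalues can indeed be obtained directly from the band form produced by the first stage. The following is a minimal SciPy illustration of our own, unrelated to the SBRDT/DSTERF pipeline used in the experiments:

```python
import numpy as np
from scipy.linalg import eigvals_banded

# Build a small symmetric matrix with bandwidth w, mimicking the
# compact by-product of the first stage of the TSR algorithm.
n, w = 6, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
A = A + A.T
A = np.triu(np.tril(A, w), -w)      # enforce upper/lower bandwidth w

# Pack the upper band in LAPACK-style band storage ((w+1) x n array).
ab = np.zeros((w + 1, n))
for d in range(w + 1):
    ab[w - d, d:] = np.diag(A, d)

vals = eigvals_banded(ab)           # eigenvalues, in ascending order
```

This avoids the explicit reduction to tridiagonal form, at the cost of using a band solver whose performance characteristics differ from those of DSTERF.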

Performance of TSR to band form for the SVD
We next analyze the performance behavior of the multi-threaded variants with look-ahead aimed at enhancing the computational throughput of the TSR algorithms for the first stage of the SVD. Specifically, the following implementations are compared:
• Reference implementations: These routines differ from the one presented for SEVP in that the matrix is not symmetric and, therefore, distinct left and right panel factorizations are required. In addition, two different implementations are possible for the SVD, depending on how the trailing update is performed:
– Reference SVD. This case adheres to the formulation in equations (18)–(23). At each iteration, this implementation first computes the QR factorization; then applies the resulting orthogonal matrix U to the trailing submatrix; next computes the subsequent LQ factorization; and finally applies the resulting V to the trailing submatrix.
– Reference SVD Simultaneous. This implementation performs the update of the trailing submatrix as in (27). That is, at each iteration of the algorithm, the QR and LQ factorizations are performed first; and then the updates of the trailing submatrix with U and V are fused into a single "step".
• Variant V1: Look-ahead variant for problems with 2b ≤ w. As for SEVP, different mappings of the updates to the threads/cores are possible (involving B_1^R; C_1^T and B_1^B). However, for similar reasons, in the experiments we only consider the case with B_1^R and C_1^B updated by T_P.
• Variant V2: Look-ahead variant of the "Reference SVD Simultaneous" implementation for problems with 2b > w. Two different mappings are possible, depending on which threads update B_1, C_1, Z_L, Z_R, X:
– B_1 and C_1 on T_S (while Z_L, Z_R, and X are computed concurrently on T_P).
– B_1 and C_1 on T_S + T_P (after which Z_L, Z_R, and X are updated by the same T_S + T_P threads).
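The alternating QR/LQ sweep that underlies the reference formulation can be sketched as follows. This is a toy, unblocked NumPy version of the conventional reduction to upper triangular–band form with bandwidth w; the actual routines use blocked Householder panels, Level-3 BLAS, and, in our alternative, a band form with both upper and lower bandwidths.

```python
import numpy as np

def reduce_to_triangular_band(A, w):
    """Toy reduction of a dense m x n matrix to upper triangular-band
    form with upper bandwidth w, via alternating QR (from the left)
    and LQ (from the right) factorizations of panels."""
    B = A.astype(float).copy()
    m, n = B.shape
    for k in range(0, min(m, n), w):
        # QR of the column panel annihilates entries below the diagonal.
        Q, _ = np.linalg.qr(B[k:, k:k + w], mode="complete")
        B[k:, k:] = Q.T @ B[k:, k:]
        # LQ (as QR of the transpose) of the row panel annihilates
        # entries beyond the target bandwidth, to the right.
        if k + w < n:
            Q2, _ = np.linalg.qr(B[k:k + w, k + w:].T, mode="complete")
            B[:, k + w:] = B[:, k + w:] @ Q2
    return B
```

Since only orthogonal transformations are applied from the left and the right, the singular values of the band by-product coincide with those of the original matrix.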
Following the optimizations presented earlier for the SBR routine, all routines for the SVD perform the factorization of the panels via Level-3 BLAS procedures, and compute the matrices W_U, W_V from the product of the corresponding compact WY factors T_U, T_V and Y_U, Y_V. Figure 11 reports the GFLOPS rates attained by the configurations for bandwidths ranging from 32 to 256 and square matrices. For brevity, the analysis of the optimal block size is not presented, as it revealed a behavior similar to that observed for SEVP. Let us focus first on the implementations without look-ahead. From the plots, it is clear that the "Reference SVD" implementation outperforms its "Reference SVD Simultaneous" counterpart. At this point, we remark that the second implementation underlies Variant V2 of the SVD algorithm with look-ahead.
Focusing on Variant V1, we detect the same drawback as that identified in Variant V1 for SEVP in that, for small bandwidths, the block size is strongly constrained. In contrast, large performance improvements are reported, compared with all other implementations, for medium and large bandwidths.
A less pleasant scenario is encountered for Variant V2, which is unable to significantly improve the performance of the "Reference SVD" implementation for any bandwidth or problem dimension, though it outperforms its baseline "Reference SVD Simultaneous" implementation. The reason for this result is that, for this implementation, the update of D_22 cannot be fully overlapped with the execution of the next panel factorizations (both left and right). For Variant V1, the execution of the panel factorizations is overlapped with the updates of B_1^R, C_1^B, and D; but due to the data dependencies in Variant V2, we can only overlap the execution of the panel factorizations with the update of D_22, which involves a considerably smaller number of flops. Figure 12 displays the GFLOPS rates observed for bandwidths ranging from 32 to 256 and non-square matrices. In the plots, the m dimension of the matrices is fixed to 10000, while the n dimension is varied in the range 500–10000 in steps of 500. The plots reveal performance numbers that are very close to those observed for the reduction of square matrices, showing that the new TSR variants are not sensitive to the matrix shape.

Concluding Remarks
We have analyzed in detail the impact that static look-ahead exerts on the performance of two-sided routines that perform the reduction to compact band forms for SEVP and the SVD. Our study shows that a correct selection of the look-ahead variant as well as an appropriate mapping of tasks to cores are key to optimize performance. Even more importantly, the block size plays a crucial role in the computational throughput of these reduction routines. Decoupling this parameter from the target bandwidth is a must, and therefore we have to depart from the solution adopted in the corresponding routines included in the current versions of LAPACK, PLASMA and MAGMA, which simply set the block size to equal the bandwidth.
For the SVD, our analysis also advocates an alternative option that reduces the original dense matrix to a band form with the same upper and lower bandwidths, allowing an efficient exploitation of the look-ahead strategy. This choice thus overcomes some of the difficulties of the traditional reduction to triangular–band form that is adopted in LAPACK and MAGMA.