An implicit Hari--Zimmermann algorithm for the generalized SVD on the GPUs

A parallel, blocked, one-sided Hari--Zimmermann algorithm for the generalized singular value decomposition (GSVD) of a real or a complex matrix pair $(F,G)$ is here proposed, where $F$ and $G$ have the same number of columns and are both of full column rank. The algorithm targets either a single graphics processing unit (GPU) or a cluster of those, performs all non-trivial computation exclusively on the GPUs, utilizes their resources almost fully for large enough data, requires a minimal amount of memory that can reasonably be expected, scales satisfactorily with the number of GPUs available, and guarantees reproducible, bitwise identical output of runs repeated over the same input and with the same number of GPUs.


Introduction
The two-sided Hari-Zimmermann algorithm [7,23,8] is a Jacobi-type method for computing the generalized eigenvalue decomposition (GEVD) of a matrix pair (A, B), where both matrices are Hermitian of the same order and B is positive definite.
If $A$ and $B$ are instead given implicitly by their factors $F$ and $G$ (not necessarily square nor with the same number of rows), respectively, such that $(A, B) = (F^* F, G^* G)$, then the GEVD of $(A, B)$ can be computed implicitly, i.e., without assembling $A$ and $B$ in entirety from the factors, by a modification of the Hari-Zimmermann algorithm [17]. However, pivot submatrices of $A$ and $B$ of a certain, usually small order are formed explicitly throughout the computation.
The modified algorithm is a method that jointly orthogonalizes the pairs of columns of $F$ and $G$ by a sequence of transformations that are applied from the right side of the factors only. Such a one-sided algorithm computes $U$, $\Sigma_F$, $V$, $\Sigma_G$, and $Z$, where $FZ = U\Sigma_F$, $GZ = V\Sigma_G$, and $U^* U = V^* V = I$. The matrix $Z$ is square and nonsingular, while $\Sigma_F$ and $\Sigma_G$ are nonnegative, diagonal, and scaled such that $\Sigma_F^2 + \Sigma_G^2 = I$. The method thus implicitly computes the GEVD of $(A, B)$, but explicitly the generalized singular value decomposition (GSVD; see, e.g., [22,19]) of $(F, G)$, with the generalized singular values forming the diagonal of $\Sigma := \Sigma_G^{-1} \Sigma_F$ (all of them finite, since $\Sigma_G$ has a positive diagonal). Furthermore, the generalized singular values can be considered to be sorted descendingly by a symmetric permutation, i.e., $\Sigma' = P_0^T \Sigma P_0$, and thus $U' = U P_0$, $V' = V P_0$, and $Z' = Z P_0$, where $FZ' = U' \Sigma_F'$, $GZ' = V' \Sigma_G'$, and $\Sigma_F' = P_0^T \Sigma_F P_0$, $\Sigma_G' = P_0^T \Sigma_G P_0$ constitute a decomposition of $(F, G)$ possessing all other aforementioned properties.

This work has been supported in part by Croatian Science Foundation under the project IP-2014-09-3670.
* Corresponding author. Email addresses: novakoni@uji.es (Vedran Novaković), ssinger@fsb.hr (Sanja Singer).
The GEVD of $(A, B)$ can be recovered by letting $\Lambda := \Sigma^2$ and noting that $AZ = BZ\Lambda$, i.e., the columns of $Z$ are the generalized eigenvectors, and the diagonal of $\Lambda$ contains the generalized eigenvalues of $(A, B)$.
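To make the correspondence concrete, here is a minimal host-side C++ sketch (with a hypothetical helper name) of recovering a generalized eigenvalue from a normalized generalized singular value pair:

```cpp
#include <cmath>

// From a normalized pair (sF, sG), with sF^2 + sG^2 = 1 and sG > 0, recover
// the generalized singular value sigma = sF / sG and the generalized
// eigenvalue lambda = sigma^2 of the pair (A, B) = (F* F, G* G).
// Hypothetical helper for illustration only.
inline void gsv_to_gev(double sF, double sG, double& sigma, double& lambda) {
  sigma = sF / sG;
  lambda = sigma * sigma;
}
```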
If needed, the right generalized singular vectors $X := Z^{-1}$ can either be computed from $Z$, or can be obtained simultaneously with $Z$ by accumulating the inverses of the transformations that have been multiplied to form $Z$ [20] (i.e., if $Z = Z_0 Z_1 \cdots Z_N$, then $X = Z_N^{-1} \cdots Z_1^{-1} Z_0^{-1}$). The recent work [17] has shown that such a method can be successfully blocked and parallelized for the CPUs with shared memory, and for clusters of those, albeit only the real matrix pairs were considered therein. Even the sequential but blocked version outperformed the GSVD algorithm in LAPACK [1], and the parallel ones exhibited a decent scalability.
On the other hand, an efficient, parallel and blocked one-sided Jacobi-type algorithm for the "ordinary" and the hyperbolic SVD [14,15] of a single real matrix has been developed for the GPUs, that utilizes the GPUs almost fully, with the CPU serving only a controlling purpose in the single-GPU case.
This work aims to merge the experience of those two approaches, and present a parallel and blocked one-sided (also called "implicit") Hari-Zimmermann algorithm for the GSVD on the GPU(s) as an extension of the latter, but for the complex matrix pairs as well as for the real ones.
Even though the research in parallelization of the GSVD has a long history [11,2], three novel and major differences from the earlier, Kogbetliantz-based procedures aim to ensure both the high performance and the high relative accuracy of this one: using the implicit Hari-Zimmermann algorithm as the basic method, that is blocked to exploit the GPU memory hierarchy, and the massive parallelism of the GPUs that suits the algorithm (and vice versa) perfectly. This paper continues with section 2, where the complex and the real one-sided Hari-Zimmermann algorithms are introduced, together with the general, architecturally agnostic principles of their blocking and parallelization. In section 3 the single-GPU implementation is described in detail, while in section 4 the most straightforward multi-GPU implementation approach is suggested. The numerical testing results are summarized in section 5, and the paper concludes with some directions for future research in section 6. In Appendix A a nonessential but computationally cheap way for enhancing the accuracy of the real and the complex dot-products on the GPUs is proposed. The full testing results and some details of the chosen Jacobi strategies are provided as the supplementary material.
The complex and the real one-sided Hari-Zimmermann algorithms

In this section the complex and the real one-sided Hari-Zimmermann algorithms are briefly described. Please see [7,8] for a more thorough overview of the two-sided algorithms, and [17] for a detailed explanation of the real implicit Hari-Zimmermann algorithm. This paper closely follows the terminology and the implementation decisions of [20], where the complex generalized hyperbolic SVD based on the implicit Hari-Zimmermann approach has been introduced, but without the hyperbolic scalar products (i.e., the signature matrix $J$ is taken to be identity here) and without forming the right generalized singular vectors $X$.
Let the matrices $F$ and $G$ be of size $m_F \times n$ and $m_G \times n$, respectively, with $\min\{m_F, m_G\} \ge n$. Then, $Z$ is square of order $n$, and assume that $n \ge 2$. Otherwise, for $n = 1$, the GSVD of $(F, G)$ is obtained by taking $Z := [1/\sqrt{\|f\|_2^2 + \|g\|_2^2}]$, $U := f/\|f\|_2$, $V := g/\|g\|_2$, $\Sigma_F := \|f\|_2 Z$, and $\Sigma_G := \|g\|_2 Z$, where $f$ and $g$ are the single columns of $F$ and $G$. Even though the algorithm works on rectangular matrices, it might be beneficial performance-wise to avoid transforming very tall and skinny (block) columns by working on square matrices instead. To shorten $F$ and $G$, the problem is transformed by computing the QR factorization of $F$ with the column pivoting, $FP_1 = Q_F R_F$, and then $G$, with its columns prepermuted by $P_1$, is shortened by the column-pivoted QR factorization, $(GP_1)P_2 = Q_G R_G$. The square matrices $F' := R_F P_2$ and $G' := R_G$, both of order $n$, take the place of $F$ and $G$ in the algorithm, respectively. With $\Sigma = \Sigma'$ in the decompositions of $(F, G)$ and of $(F', G')$, the matrix $Z$ from the former, sought-for decomposition can be recovered by using $P := P_1 P_2$ and the computed $Z'$ from the latter as $Z := P Z'$.
It is assumed that $\operatorname{diag}(B) = I$, i.e., the column norms of $G$ are unity. Should it not be the case, $F$ and $G$ are then prescaled by a nonsingular, diagonal matrix $Z_0$, where $(Z_0)_{jj} := 1/\|g_j\|_F$, $g_j$ is the $j$th column of $G$, and $1 \le j \le n$; otherwise, $Z_0 := I$.
The iterative transformation phase starts with the matrix pair $(F_0, G_0)$, where $F_0 := FZ_0$ and $G_0 := GZ_0$. Implicitly, $A$ and $B$ have been transformed by a congruence with $Z_0$ as $A_0 := F_0^* F_0$ and $B_0 := G_0^* G_0$.
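The prescaling step can be sketched sequentially as follows (a host-side C++ illustration with a hypothetical `prescale` helper, not the GPU kernel of section 3):

```cpp
#include <cmath>
#include <vector>

// Sketch of the prescaling (F0, G0) := (F Z0, G Z0), where Z0 is diagonal
// with (Z0)_jj = 1 / ||g_j||.  Column-major storage; F is mF x n, G is
// mG x n.  Hypothetical host-side helper for illustration only.
inline void prescale(std::vector<double>& F, std::vector<double>& G,
                     std::vector<double>& z0, int mF, int mG, int n) {
  z0.assign(n, 1.0);
  for (int j = 0; j < n; ++j) {
    double s = 0.0;  // squared norm of the jth column of G
    for (int i = 0; i < mG; ++i) s += G[j * mG + i] * G[j * mG + i];
    z0[j] = 1.0 / std::sqrt(s);
    for (int i = 0; i < mF; ++i) F[j * mF + i] *= z0[j];
    for (int i = 0; i < mG; ++i) G[j * mG + i] *= z0[j];
  }
}
```

After the call, every column of $G_0$ has unit norm, so diag($B_0$) $= I$ holds at the start of the iterations.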

Simultaneous diagonalization of a pair of pivot matrices
An iteration (or "step") $k \ge 0$ of the sequential non-blocked Hari-Zimmermann algorithm consists of selecting a pair of indices $(i_k, j_k)$, $1 \le i_k < j_k \le n$, and thus two $2 \times 2$ pivot submatrices, $\widehat{A}_k$ of $A_k$ and $\widehat{B}_k$ of $B_k$, which are then jointly diagonalized by a congruence transformation with a nonsingular matrix $\widehat{Z}_k$, to be defined in subsections 2.1.1 and 2.1.2. However, due to the floating-point rounding errors, these equations might not hold exactly. To prevent $\operatorname{diag}(\widehat{B}_k)$ from drifting too far away from $\operatorname{diag}(I_2)$ as the algorithm progresses, the squared Frobenius norms of $g_{i_k;k}$ and $g_{j_k;k}$ could be recomputed. Suppose that $\widehat{Z}_k$ has been computed (by either version) such that it diagonalizes $\widehat{A}_k$ and $\widehat{B}_k$, but $a_{i_k i_k;k+1} < a_{j_k j_k;k+1}$. To keep $\operatorname{diag}(\widehat{A}_k)$ sorted descendingly, swap the columns of $\widehat{Z}_k$ by a permutation $\widehat{P}_k := \left[\begin{smallmatrix} 0 & 1 \\ 1 & 0 \end{smallmatrix}\right]$ to obtain $\widehat{Z}_k' := \widehat{Z}_k \widehat{P}_k$. Such a $\widehat{Z}_k'$ will swap the $i_k$th and the $j_k$th columns of $F_k$ and $G_k$ as it orthogonalizes them. Sorting in each step is a heuristic that speeds up the algorithm notably in practice (see section 5), but it makes reasoning about the convergence harder and is not strictly necessary.
Computing $\widehat{Z}_k$ from $\widehat{A}_k$ and $\widehat{B}_k$ is more involved in the complex case than in the real one. However, in both cases, it is first established whether the $i_k$th and the $j_k$th columns of $F_k$ and $G_k$ are numerically relatively orthogonal with respect to a tolerance proportional to $\varepsilon$, the precision of the chosen floating-point datatype.
If they are, no non-trivial transformation is to take place, and $\widehat{Z}_k := \widehat{P}_k$, since the column swap may still be warranted. Rescaling by $\widehat{D}_k$ is thus not performed even for $\widehat{D}_k \ne I_2$, since it might perturb the columns enough for them to cease to be numerically orthogonal.

The complex case
The transformation matrix $\widehat{Z}_k$ is sought in the form given in [7,20]. In the admissible ranges of the angles, for $\theta \in \{2\vartheta_k, \gamma_k\}$ the trigonometric identities $\cos\theta = 1/\sqrt{1 + \tan^2\theta}$ and $\sin\theta = \tan\theta \cos\theta$ hold when $\theta < \pi/2$. Otherwise, $\tan\theta = \infty$, $\cos\theta = 0$, and $\sin\theta = 1$. Then, compute $\cos(2\vartheta_k)$, $\sin(2\vartheta_k)$, $\cos\gamma_k$, and $\sin\gamma_k$, and with them finally obtain $\widehat{Z}_k$. The tolerance of the relative orthogonality criterion accounts for the rounding error of the dot-products [4] that form the elements of $\widehat{A}_k$ and $\widehat{B}_k$; while sensible in the real case, it is probably too tight in the complex case, where a more careful analysis of the complex dot-products might be employed in future work and a handful of transformations subsequently skipped.
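The identity-based recovery of the cosine and the sine from a tangent, including the $\tan\theta = \infty$ limit, can be sketched as follows (a hypothetical C++ helper; the explicit branch avoids the NaN that $\infty \cdot 0$ would produce):

```cpp
#include <cmath>

// Recover (cos t, sin t) from tan t for t in [0, pi/2], using
// cos t = 1 / sqrt(1 + tan^2 t) and sin t = tan t * cos t, with the
// limit tan t = inf handled explicitly (inf * 0 would otherwise be NaN).
// Note: for a huge but finite tan t, tan^2 t may overflow to infinity;
// a production version would guard that case as well.
inline void cs_from_tan(double tan_t, double& cos_t, double& sin_t) {
  if (std::isinf(tan_t)) { cos_t = 0.0; sin_t = 1.0; return; }
  cos_t = 1.0 / std::sqrt(std::fma(tan_t, tan_t, 1.0));
  sin_t = tan_t * cos_t;
}
```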

An exception. If, in addition, $a_{i_k i_k;k} = a_{j_k j_k;k}$, then $\tan\gamma_k$ is undefined, and $\tan(2\vartheta_k)$ might also be. In that case, it can be shown that $\widehat{A}_k$ and $\widehat{B}_k$ are diagonalized by

The real case
The transformation matrix $\widehat{Z}_k$ is sought in the form given in [7,17]. Note that $\cot(2\vartheta_k)$ and $\cot\vartheta_k$ (and the corresponding tangents) have the same sign in the range of $\vartheta_k$. Assuming that the floating-point arithmetic unit does not trap on $\pm 1/0$ and $1/\infty$, obtain $\tan\vartheta_k$ from $\cot(2\vartheta_k)$, and from it $\cos\vartheta_k$ and $\sin\vartheta_k$ using the same trigonometric identities as in the complex case. Finally, compute $\widehat{Z}_k$.

An exception. Since the real case is in fact a simplification of the complex case, when $\cot(2\vartheta_k)$ is undefined, being $0/0$, i.e., when $a_{i_k i_k;k} = a_{j_k j_k;k}$ and $a_{i_k j_k;k} = a_{i_k i_k;k} b_{i_k j_k;k}$ (or, in other words, when $\widehat{A}_k$ and $\widehat{B}_k$ are proportional), define $\widehat{Z}_k$ accordingly.
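In the special case $\widehat{B}_k = I_2$, the real transformation reduces to an ordinary Jacobi plane rotation; the following C++ sketch of that classical special case (not the general Hari--Zimmermann formula, which also involves $\widehat{B}_k$) shows how the tangent is obtained from the cotangent of the doubled angle and converted by the identities above:

```cpp
#include <cmath>

// Jacobi rotation zeroing the off-diagonal of a symmetric 2x2 matrix
// [[aii, aij], [aij, ajj]] in the special case B = I2:
//   cot(2t) = (ajj - aii) / (2 aij),
//   tan t   = sgn(cot 2t) / (|cot 2t| + sqrt(1 + cot^2 2t)),
//   cos t   = 1 / sqrt(1 + tan^2 t),  sin t = tan t * cos t.
// The smaller root of the quadratic for tan t is taken, keeping |t| <= pi/4.
inline void jacobi2x2(double aii, double ajj, double aij,
                      double& c, double& s) {
  if (aij == 0.0) { c = 1.0; s = 0.0; return; }
  const double ct2 = (ajj - aii) / (2.0 * aij);
  const double t = (ct2 >= 0.0 ? 1.0 : -1.0)
                 / (std::fabs(ct2) + std::sqrt(1.0 + ct2 * ct2));
  c = 1.0 / std::sqrt(1.0 + t * t);
  s = t * c;
}
```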

Parallelization of the one-sided algorithm
The sequential one-sided algorithm chooses in each step a single pivot index pair, according to some criterion called a sequential Jacobi strategy. However, at most $\lfloor n/2 \rfloor$ pivot column pairs of each matrix can be transformed concurrently if the indices in all index pairs are distinct.
In a parallel step $k \ge 0$ a sequence $(i_k^{(\ell)}, j_k^{(\ell)})$ of pivot index pairs, where $1 \le \ell \le \lfloor n/2 \rfloor$, such that each index in the range from 1 to $n$ appears at most (and for even $n$, exactly) once in the sequence, addresses $\lfloor n/2 \rfloor$ pivot column pairs of $A_k$ and $B_k$ to be transformed, each pair by a separate, concurrent task.
All permutations of a given sequence $(i_k^{(\ell)}, j_k^{(\ell)})$ are equivalent from the numerical point of view, since the resulting $A_{k+1}$ and $B_{k+1}$ are the same for every reordering of the sequence, and therefore any reordering represents the entire equivalence class.
For simplicity, a barrier is assumed between the successive parallel steps, i.e., all tasks of a step have to be completed before those of the following step are started.
A criterion to choose a pivot index pair sequence for each parallel step is called a parallel Jacobi strategy. Among the strategies that are simplest to compute are the ones that prescribe a pivot sequence for each step, until all $n(n-1)/2$ possible index pairs $(i, j)$ are selected at least once. Then, the choice of the steps is periodically repeated. Let $s$ be the shortest such period. The first $s$ steps constitute the first sweep, the following $s$ steps the second sweep, and so on.
If in any sweep exactly n(n − 1)/2 different index pairs are chosen, such a strategy is called cyclic; otherwise, some index pairs are repeated in a sweep, and the strategy is called quasicyclic.For even n, s ≥ n − 1, and the equality holds if and only if the strategy is cyclic.
A strategy is defined for a fixed n; however, by a slight abuse of the usual terminology, a single principle by which the particular strategies are generated for some given matrix orders will simply be called a strategy kind, or even a strategy for short.
Based on the previous experience with the one-sided Jacobi-like algorithms, two parallel Jacobi strategy kinds have been selected for testing: the modified modulus (mm; see, e.g., [16,17]), quasi-cyclic with $s = n$, and the generalized Mantharam-Eberlein (me; see [12,14]) cyclic one. Please see Figures 1 and 2 in the supplementary material, where a sweep of me and of mm, respectively, is shown two-sidedly on a matrix of order 32.
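A generic round-robin generator in the spirit of the cyclic strategies can be sketched as follows (the classical circle method; `round_robin` is a hypothetical helper and does not reproduce the exact me tables of the supplementary material):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Round-robin generator of a cyclic parallel strategy for even n:
// n - 1 steps, each with n / 2 disjoint index pairs, jointly covering
// all n (n - 1) / 2 pairs exactly once.  A generic illustration in the
// spirit of the cyclic strategies, not the paper's exact me tables.
inline std::vector<std::vector<std::pair<int, int>>> round_robin(int n) {
  std::vector<int> p(n);
  for (int i = 0; i < n; ++i) p[i] = i;
  std::vector<std::vector<std::pair<int, int>>> steps;
  for (int s = 0; s < n - 1; ++s) {
    std::vector<std::pair<int, int>> step;
    for (int i = 0; i < n / 2; ++i) {
      const int a = p[i], b = p[n - 1 - i];
      step.emplace_back(std::min(a, b), std::max(a, b));
    }
    steps.push_back(step);
    // rotate all positions but the first (the classical circle method)
    const int last = p[n - 1];
    for (int i = n - 1; i > 1; --i) p[i] = p[i - 1];
    p[1] = last;
  }
  return steps;
}
```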
In the blocked algorithm, a pivot index pair $(i_k^{(\ell)}, j_k^{(\ell)})$ addresses the $i_k^{(\ell)}$th and the $j_k^{(\ell)}$th block columns of $F_k$, $G_k$, and $Z_k$, each of width $w$, from which the pivot block submatrices $A_k^{(\ell)}$ and $B_k^{(\ell)}$ are formed. Now, $A_k^{(\ell)}$ and $B_k^{(\ell)}$ can either be jointly diagonalized by a matrix $Z_k^{(\ell)}$, which leads to the full block (fb) algorithm [10], as called in the context of the Jacobi methods, or their off-diagonal norms can be reduced by a sequence of congruences accumulated into $Z_k^{(\ell)}$, which is called the block-oriented (bo) algorithm [9]. The idea behind blocking is that $A_k^{(\ell)}$, $B_k^{(\ell)}$, and $Z_k^{(\ell)}$ fit, by choosing $w$, into the small but fast cache memory (e.g., the shared memory of a GPU), to speed up the computation with them, and that the subsequent block column updates by $Z_k^{(\ell)}$ employ BLAS 3 (matrix multiply) operations.

The computation of $Z_k^{(\ell)}$ in either fb or bo can be done by any convergent method; a two-sided method can be applied straightforwardly, but for the one-sided approach $A_k^{(\ell)}$ and $B_k^{(\ell)}$ have to be factorized first by, e.g., the Cholesky factorization, and then the same implicit Hari-Zimmermann method, either pointwise or blocked, and in both cases either parallel or sequential, can be recursively applied to the factors $F_k^{(\ell)}$ and $G_k^{(\ell)}$. In the single-GPU algorithm, there is only one level of such a recursion, i.e., one level of blocking. The block, outer level of the algorithm and the pointwise, inner level do not need to employ the same strategy kind. Both levels, however, are parallel. The sweeps of the outer level are called the block (or outer) sweeps, and those of the inner level are called the pointwise (or inner) sweeps, which for fb are limited to 30 ($A_k^{(\ell)}$ and $B_k^{(\ell)}$ are usually fully diagonalized in fewer sweeps than that), while bo is limited to only one inner sweep. Apart from that, there is no other substantial difference between fb and bo.
The Cholesky factorization is not the only way to form $F_k^{(\ell)}$ and $G_k^{(\ell)}$. One numerical stability improvement would be to use a diagonally pivoted version of the factorization instead [21], or to refrain from forming $A_k^{(\ell)}$ and $B_k^{(\ell)}$ explicitly altogether, by shortening the pivot block columns directly by the column-pivoted QR factorization [20]. In both cases, $P_{F;k}^{(\ell)}$ and $P_{G;k}^{(\ell)}$ are permutation matrices, while $Q_{F;k}^{(\ell)}$ and $Q_{G;k}^{(\ell)}$ are unitary and are not required to be stored, implicitly or explicitly, for any further computation.
However, the QR factorization (even without the column pivoting) of a pair of tall and skinny block columns comes with a significant performance penalty on a GPU compared to the Cholesky factorization of a small, square pivot submatrix [14], and the pivoted Cholesky factorization does not avoid the possibility of getting a severely ill-conditioned $A_k^{(\ell)}$ or $B_k^{(\ell)}$ when an ill-conditioned pair of block columns is multiplied by itself. Therefore, both of these enhancements are only briefly described here but have not been tested.
In the following, the blocked algorithm is assumed to form the pivot submatrices as the cross-products of the pivot block column pairs, i.e., each one by a ZHERK-like (DSYRK-like in the real case) operation in the BLAS terminology, and the non-pivoted Cholesky factorization is then used to obtain $F_k^{(\ell)}$ and $G_k^{(\ell)}$.
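The shortening of a pivot block column pair can be sketched sequentially (real case, unblocked textbook Cholesky; `cross_then_chol` is a hypothetical helper, not the GPU kernel):

```cpp
#include <cmath>
#include <vector>

// Form A := C^T C for an m x k block-column matrix C (a DSYRK-like
// cross-product), then compute the upper-triangular Cholesky factor R,
// A = R^T R; R then plays the role of the "shortened" k x k factor.
// Column-major storage; unblocked textbook sketch for illustration.
inline std::vector<double> cross_then_chol(const std::vector<double>& C,
                                           int m, int k) {
  std::vector<double> A(k * k, 0.0);
  for (int j = 0; j < k; ++j)
    for (int i = 0; i < k; ++i) {
      double s = 0.0;
      for (int r = 0; r < m; ++r) s += C[i * m + r] * C[j * m + r];
      A[j * k + i] = s;
    }
  std::vector<double> R(k * k, 0.0);  // upper triangle holds the factor
  for (int j = 0; j < k; ++j) {
    for (int i = 0; i < j; ++i) {
      double s = A[j * k + i];
      for (int p = 0; p < i; ++p) s -= R[i * k + p] * R[j * k + p];
      R[j * k + i] = s / R[i * k + i];
    }
    double d = A[j * k + j];
    for (int p = 0; p < j; ++p) d -= R[j * k + p] * R[j * k + p];
    R[j * k + j] = std::sqrt(d);  // requires A positive definite
  }
  return R;
}
```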

Rescalings
Observe that $Z_k^{(\ell)}$ is a product of several non-unitary matrices, the elements of which can be larger than 1 in magnitude, so the norm of $Z_k^{(\ell)}$ can build up significantly by such accumulation of the transformations. Also, if $Z_k^{(\ell)}$ diagonalizes $A_k^{(\ell)}$ and $B_k^{(\ell)}$, or reduces their off-diagonal norms, so does any matrix $Z_k^{(\ell)} \Theta_k^{(\ell)}$, where $\Theta_k^{(\ell)}$ is a real, diagonal matrix with its diagonal elements positive and smaller than 1.

Let $\Theta_k^{(\ell)}$ be such a matrix that reduces the norm of $Z_k^{(\ell)}$, with $(\Theta_k^{(\ell)})_{jj} := 1/\|\widetilde{g}_j^{(\ell)}\|_F$, where $\widetilde{f}_j^{(\ell)}$ and $\widetilde{g}_j^{(\ell)}$ stand for the $j$th column of the final, transformed $F_k^{(\ell)}$ and $G_k^{(\ell)}$, respectively, of which the latter thus obtains unit norm. This is exactly the same scaling as would be performed in the last, post-iterative phase of the algorithm; the transformed $F_k^{(\ell)}$ and $G_k^{(\ell)}$ themselves do not have to be rescaled and the norms of their columns do not have to be kept, as they are all discarded immediately after $\Theta_k^{(\ell)}$ has been computed.

The rescaled transformation, $\widetilde{Z}_k^{(\ell)} := Z_k^{(\ell)} \Theta_k^{(\ell)}$, is applied to the pivot block column pair of $F_k$, $G_k$, and $Z_k$ instead of $Z_k^{(\ell)}$, and is considered embedded into $Z_k$ in a similar way as $\widehat{Z}_k$ would be in the pointwise case, i.e., starting from $Z_k$ being $I_n$, for each $\ell$ the pivot block columns are updated in place, where the subscripting is to be interpreted as in Fortran.
To reduce the norm of the entire $Z_k$, a similar rescaling can be applied to $Z_k$, using the column norms of $F_k$ and $G_k$, after each but the last block sweep. After the last block sweep, a rescaling of all three matrices ($F_N$, $G_N$, and $Z_N$) is performed to obtain $U$, $V$, and $Z$, with the extraction of $\Sigma_F$, $\Sigma_G$, and $\Sigma$.

Convergence
The inner level of the algorithm stops when no transformations, apart from the sorting permutations, have been applied in a sweep, or when the prescribed maximal number of sweeps has been reached. Then, the pivot block column pairs of $F_k$, $G_k$, and $Z_k$ are updated concurrently for all $\ell$ by $\widetilde{Z}_k^{(\ell)}$, which can be skipped for those $\ell$ where $\widetilde{Z}_k^{(\ell)} = I_{2w}$. The same criterion could be used for the outer level, where the count of transformations applied in an outer sweep is the sum of all transformations applied in the inner level in all steps of the outer sweep. However, this criterion has to be relaxed [14,17], since the rounding errors in forming and factorizing the block pivot submatrices could spoil the attained numerical orthogonality of the original columns, and introduce a small number of unwarranted transformations that prevent the algorithm from detecting convergence even if it has in fact been reached. Therefore, the transformations are divided into two classes: "big" and "small". The latter are all $\widehat{Z}_k$ that are close to (a multiple of) the identity, by either of two definitions, and the former are all other transformations. Note that neither definition of the small transformations implies that $\sin\varphi_k$ or $\sin\psi_k$ is numerically equal to zero (and usually it is not). Also, since $x_k$ tends to zero, and therefore $t_k$ to one, in the last sweeps of the algorithm, the first and the second definition should not differ significantly.
There are separate counters of the big transformations, and of all transformations applied in the inner level of the algorithm.
The inner level still halts when no transformations of either class have been applied in a sweep, but the outer level stops when no big transformations have been applied in an outer sweep (i.e., small transformations are allowed to occur but do not spoil the overall convergence). Such a heuristic criterion prevents, in practice, a long sequence of outer sweeps at the end of the algorithm, with only a few transformations close to identity in each.
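The two-counter stopping logic can be sketched as a loop skeleton (`sweep` is a hypothetical stand-in returning the per-sweep transformation counts; not the paper's kernel driver):

```cpp
#include <utility>

// Skeleton of the outer convergence loop with the two counters: the
// outer level stops when a sweep applies no "big" transformations
// (small, near-identity ones are tolerated), or when max_sweeps is hit.
// sweep() is a hypothetical callable returning (big, all) counts.
template <typename Sweep>
int run_outer(Sweep sweep, int max_sweeps) {
  int sweeps = 0;
  while (sweeps < max_sweeps) {
    std::pair<long, long> counts = sweep();  // (big, all)
    ++sweeps;
    if (counts.first == 0) break;  // only small transformations remain
  }
  return sweeps;
}
```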

Variants of the algorithm
To summarize the variants of the algorithm, see Table 1.
The first column, ID, sets a shorthand for the corresponding variant. The second column specifies the convergence criterion used. The third column distinguishes between assuming the column norms of the second matrix to be unity and rescaling both matrices with each transformation. The fourth column relates to computing the dot-products the usual way, or by an enhanced, possibly more accurate procedure from Appendix A. Unless specified otherwise, the column sorting is employed in all cases. Thus, e.g., DHZ3-(mm-fb-me) refers to the double-precision real implicit blocked Hari-Zimmermann algorithm with ID equal to 3, using mm at the outer and me at the inner level, with blocking of type fb. Similarly, ZHZ0-(me-bo-me) stands for the double-precision complex implicit blocked Hari-Zimmermann algorithm with ID equal to 0 (the "standard" variant), using me at both levels, with blocking of type bo.
From now on, when a numeric variant ID is mentioned in the text, it is assumed that it should be looked up in Table 1.

The single-GPU implementation
In this section the single-GPU implementation of the complex and the real one-sided Hari-Zimmermann algorithms is described. The focus is on the complex algorithm, and the real one is commented on where substantially different. The target framework is CUDA C++ [18] for the NVIDIA GPUs (Kepler series and newer), but another general-purpose GPU programming environment with analogous concepts and constructs could probably be used as well.

Data layout and transfer
Due to the blocking employed by the algorithm, each matrix is viewed as column-striped, with the block columns containing $w = 16$ consecutive columns each. To simplify the implementation, assume that $n$ is a multiple of 32, and let $\overline{n} := n/w$ (so $\overline{n}$ is even). If the assumption does not hold for the input, the matrices should be bordered by appending $32 - (n \bmod 32)$ columns to the right, and as many rows to the bottom. The diagonal elements in the columns newly added to the matrix $Y \in \{F, G\}$ should be set to unity, to avoid introducing zero columns, since it is essential for $Y$ to be of full column rank. All other bordering elements should be set to zero, to prevent any transformation not implied by the original matrices from happening (see a bordering example in [16]).
Another assumption, made to simplify the loop unrolling in various parts of the code, is that $m_F$ and $m_G$ are multiples of 64. If that is not the case with the input, then, after a possible bordering as described above, $64 - (m_Y \bmod 64)$ rows of zeros should be appended to the bottom of the matrix $Y \in \{F, G\}$.
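The bordering can be sketched on the host side as follows (a hypothetical `border` helper; the exact placement of the unit elements, here on a shifted diagonal just below the original rows, is one choice consistent with the description above):

```cpp
#include <cstddef>
#include <vector>

// Bordering sketch: pad an m x n column-major matrix so that the column
// count becomes a multiple of 32 (appending as many rows as columns),
// and the row count a multiple of 64.  Each new column gets a single
// unit element below the original rows (preserving full column rank);
// all other bordering elements are zero, so no spurious transformations
// are introduced.  Hypothetical host-side helper for illustration.
inline std::vector<double> border(const std::vector<double>& Y,
                                  int m, int n, int& mb, int& nb) {
  nb = (n + 31) / 32 * 32;              // n rounded up to a multiple of 32
  mb = (m + (nb - n) + 63) / 64 * 64;   // add as many rows, then pad to 64
  std::vector<double> B((std::size_t)mb * nb, 0.0);
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < m; ++i)
      B[(std::size_t)j * mb + i] = Y[(std::size_t)j * m + i];
  for (int j = n; j < nb; ++j)          // unit element in each new column
    B[(std::size_t)j * mb + m + (j - n)] = 1.0;
  return B;
}
```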

The CPU and the GPU RAM layout and transfer
Data is laid out in the GPU RAM (also called "global memory" in the GPU context) as the six double-precision matrices (the real and the imaginary parts of $F$, $G$, and $Z$), followed by the vectors: $\Sigma$, $\Sigma_F$, and $\Sigma_G$, with double-precision elements, each of length $n$, and $C$, holding unsigned 8-byte integers. The vectors are output only, while the matrices are used both for input and output, i.e., they are constantly being read and overwritten within the GPU as the algorithm progresses. The matrices are loaded to the GPU at the beginning of the algorithm's execution, if they are not already in place as a result of another computation, and are optionally copied to the CPU at its end, as well as $\Sigma$, $\Sigma_F$, and $\Sigma_G$.
In the pre- and post-processing stages on the CPU, the input data ($F$, $G$) and the output data ($U$ in place of $F$; $V$ in place of $G$; and $Z$), respectively, are repacked from (or to) the standard representation of complex matrices, in which the successive elements are complex numbers $z = (\operatorname{Re}(z), \operatorname{Im}(z))$. Each double-precision matrix can therefore be loaded to, or copied from, the GPU with a single CUDA call.
The purpose of splitting the real and the imaginary matrix parts is twofold. First, as there is no direct support for the standard C (with _Complex types) or C++ (with std::complex types) complex arithmetic in CUDA, apart from the one provided by the datatypes and the routines from the cuComplex.h header file or the thrust library, a decision was made to keep all data in real-typed variables. Second, reading or writing only one component of the successive matrix elements is straightforward, and can be performed by a contiguous memory access.
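The repacking between the interleaved and the split (planar) representations can be sketched as follows (hypothetical helper names):

```cpp
#include <vector>

// Repack an interleaved complex array (re0, im0, re1, im1, ...) into two
// separate real planes, as the GPU code keeps the Re and Im parts split,
// and back again.  Split storage lets one component of the successive
// elements be read or written with a contiguous memory access.
inline void split(const std::vector<double>& z,
                  std::vector<double>& re, std::vector<double>& im) {
  const std::size_t n = z.size() / 2;
  re.resize(n);
  im.resize(n);
  for (std::size_t i = 0; i < n; ++i) {
    re[i] = z[2 * i];
    im[i] = z[2 * i + 1];
  }
}

inline std::vector<double> interleave(const std::vector<double>& re,
                                      const std::vector<double>& im) {
  std::vector<double> z(2 * re.size());
  for (std::size_t i = 0; i < re.size(); ++i) {
    z[2 * i] = re[i];
    z[2 * i + 1] = im[i];
  }
  return z;
}
```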
In the auxiliary vector $C$ there are two counters per thread block, $C_\ell^{(0)}$ and $C_\ell^{(1)}$, where $\ell$ is the index of a thread block in a grid of the main computational kernel. In $C_\ell^{(0)}$ the count of the "big" transformations, and in $C_\ell^{(1)}$ the count of all transformations applied in all kernel launches within a single block sweep are accumulated.
At the beginning of each block sweep C is zeroed out on the GPU, and is copied to a CPU vector C at the end of the sweep.

The shared memory layout
For each thread block, the non-constant, non-register data (comprising three complex matrices: $F$, $G$, and $Z$) for the main computational kernel is laid out in the shared memory as six double-precision matrices: the real and the imaginary parts of $F$, $G$, and $Z$. Each double-precision matrix is square, of order 32, with the elements stored in the Fortran array order, for a total shared memory consumption of 48 kiB.

The real case. In the real Hari-Zimmermann algorithm there are no $\operatorname{Im}(\cdot)$ matrices present, and no repacking of the input and the output data is needed. The other properties of the two data layouts still hold. The shared memory requirements are half of those for the complex algorithm, i.e., 24 kiB.

The constant memory layout
The constant memory on the GPU holds the pointers to the matrices and the vectors described above, with their dimensions, to avoid sending them as parameters in each kernel call.
The Jacobi strategy table for the first, pointwise level of the algorithm is also stored in the constant memory, since it does not depend on the actual input data.
The strategy table contains 31 (or 32) rows, same as the number of steps of a chosen (quasi-)cyclic parallel strategy.
Each row is an array of 16 index pairs (p, q), with p < q, where no two indices in a row are the same.A pair of such indices addresses a pair of columns of the matrices F and G to be transformed concurrently with all other column pairs in the step.

Constants in the global memory
The Jacobi strategy table for the second, block level of the algorithm might not fit in the constant memory for large $\overline{n}$, so it has to be stored in the global memory in such a case. It is formatted similarly to the table for the pointwise level, but with $\overline{n} - 1$ (or $\overline{n}$) rows, each with $\overline{n}/2$ index pairs. Here, a pair $(p, q)$, with $p < q$, addresses a pair of block columns of the matrices $F$ and $G$. No two indices in a row are the same, i.e., every integer between 0 and $\overline{n} - 1$ appears exactly once in a row. Each row encodes a step of the chosen block level (quasi-)cyclic parallel strategy, which does not have to be of the same kind as the one chosen for the pointwise level.
Both tables are precomputed on and preloaded from the CPU [15,20] before any computation starts on the GPU.

Arithmetic operations
Since the data is held in the real-valued arrays only, the complex arithmetic is performed manually, computing the real and the imaginary parts of the result separately, rather than assembling the complex operands in the CUDA format each time an operation has to be performed, and disassembling the result when it has to be stored back in memory.

Complex arithmetic
The arithmetic operations on complex numbers needed by the algorithm are addition, subtraction, negation, complex conjugation, multiplication by a complex or a real number (or an inverse of the latter), and taking the absolute value. Only $|z|$, $a \cdot b$, and an FMA-like operation $a \cdot b + c$ (i.e., a complex multiplication and an addition "fused") require special attention, while the rest are trivial to express using the real arithmetic directly in the code. For $c = 0$ it holds that zfma(a, b, c) = zmul(a, b) for all $a$ and $b$.
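A sketch of such manual complex arithmetic on split (re, im) pairs, with each component of the FMA-like operation built from two real std::fma calls (an illustration consistent with the properties described above, not the paper's exact device code):

```cpp
#include <cmath>

// Manual complex multiply and "fused" multiply-add on split (re, im)
// pairs.  Each component of zfma is built from two nested real FMAs;
// for c = 0, zfma(a, b, c) == zmul(a, b) for all a and b, since
// fma(x, y, 0) rounds x * y exactly as a plain product would.
struct Z { double re, im; };

inline Z zmul(Z a, Z b) {
  return { std::fma(a.re, b.re, -(a.im * b.im)),
           std::fma(a.re, b.im,  a.im * b.re) };
}

inline Z zfma(Z a, Z b, Z c) {
  return { std::fma(a.re, b.re, std::fma(-a.im, b.im, c.re)),
           std::fma(a.re, b.im, std::fma( a.im, b.re, c.im)) };
}
```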

Real arithmetic
The real arithmetic uses operations with the accuracy guarantees mandated by the IEEE 754 standard for floating-point arithmetic in rounding to nearest (ties to even) mode, except in the optional enhanced dot-product computation, where rounding to −∞ is also employed, as described in Appendix A.
A correctly rounded (i.e., with the error of no more than half an ulp) double-precision $\operatorname{rsqrt}(x) := 1/\sqrt{x}$ device function, provided by Norbert Juffa in private communication, that improves the accuracy of the CUDA library routine of the same name, is called wherever such an expression has to be computed.
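That device function is not reproduced here; as a generic illustration only, a common way to improve an approximate rsqrt is a Newton-Raphson step (not Juffa's routine, and a single step does not by itself guarantee correct rounding):

```cpp
#include <cmath>

// One Newton-Raphson refinement step for y ~ 1/sqrt(x):
//   y' = y * (1.5 - 0.5 * x * y * y),
// which roughly squares the relative error.  A generic illustration of
// accuracy improvement; a half-ulp (correctly rounded) guarantee, as for
// the device routine mentioned in the text, needs a final-step analysis.
inline double rsqrt_refine(double x, double y) {
  return y * std::fma(-0.5 * x * y, y, 1.5);
}
```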

Integer arithmetic
To keep the memory requirements low, the pointwise level indices in the strategy table are stored as unsigned 1-byte integers, while the block level indices occupy 2 bytes each (i.e., $\overline{n} \le 65536$, which is enough to exceed the RAM sizes of the present-day GPUs).
For dimensioning and indexing purposes the unsigned 4-byte integers (after a possible promotion) are used, since their range allows for addressing up to 32 GiB of double-precision floating-point data, which is twice the quantity of GPU RAM available on the testing hardware. However, 8-byte integers should be used instead if the future GPUs provide more memory than this limit.
Although the Fortran array order is assumed throughout the paper and the code, the indices on a GPU are zero-based. The CUDA thread (block) indices blockIdx.x, threadIdx.x, and threadIdx.y are shortened as $b_x$, $t_x$, and $t_y$, respectively.

Initialization of Z (with optional rescaling of F and G)
Here, the first of three computational kernels, initFGZ, is described. Its purpose is to initialize the matrix $Z$, having been zeroed out after allocation, to $Z_0$, a diagonal matrix such that $(Z_0)_{jj} := 1/\|g_j\|_F$, and to rescale $F$ and $G$ to $F_0$ and $G_0$, by multiplying the elements of each column $j$ of the matrices by $(Z_0)_{jj}$, in the variants 0, 1, 4, and 5. Else, $Z_0 = I_n$.
The kernel is launched once, before the iterative phase of the algorithm starts, with a one-dimensional grid of $n/2$ thread blocks, each of which is also one-dimensional, with 64 threads (i.e., 2 warps of 32 consecutively numbered threads).
Each warp is in charge of one column of $F$, $G$, and $Z$, i.e., its threads access only the elements $i$ of that column $j$ for which $i \equiv t_x \pmod{32}$, accumulating the per-thread partial sums of squares, which are then summed by a warp-level reduction. Such a computation occurs in the variants 0 and 4, while in the variants 1 and 5 the enhanced dot-product computation as in Appendix A updates the per-thread, register-stored partial sums. After a pass over the column completes, $s[t_x]$ are formed according to the rules of Appendix A and summed as above.
Either way, $z_j[t_x] := 1/\|g_j\|_F$ is then computed, and the $j$th column of $F$ is scaled by $z_j[t_x]$ in a loop similar to the one described above, i.e., for $i$ in steps of 32 while $i < m_F$, and then the same scaling is performed on $G$, with $i < m_G$.
Finally, $z_j[t_x]$ is written to $\operatorname{Re}(Z)_{lj}$ by the lowest-numbered thread in a warp, i.e., the one with $t_x \equiv 0 \pmod{32}$, where $l$ is an index making a physical column $j$ treated as a logical column $l$. In the single-GPU case, $l = j$. In the variants 2, 3, 6, and 7, $\operatorname{Re}(Z)_{lj}$ is set to 1 and no other processing occurs.

This and any other computation of the Frobenius norm of a vector via the sum of squares of its elements could overflow even if the result itself would not. See [14, Appendix A] for one of several possible remedies (not implemented here).

The rescale kernel's grid is identical, and its operation very similar, to that of initFGZ. First, $\|f_j\|_F$ is computed and, if non-unity and finite, $f_j$ is scaled by $1/\|f_j\|_F$; then $\|g_j\|_F$ is computed and, if non-unity and finite, $g_j$ is scaled by $1/\|g_j\|_F$. Then $z_j$ is scaled by $\theta[t_x]$, as well as $\Sigma_{F;j}[t_x]$ and $\Sigma_{G;j}[t_x]$, to obtain the final $\Sigma_{F;j}[t_x]$ and $\Sigma_{G;j}[t_x]$. Finally, $\Sigma_j$, $\Sigma_{F;j}$, and $\Sigma_{G;j}$ are written to the GPU RAM by a thread with $t_x \equiv 0 \pmod{32}$. All variables indexed by $t_x$ above are per-thread and register-stored, unless a register spill occurs.

The main computational kernel
The main kernel comes in the bstep1s and bstep1n versions; the former, with the column sorting, is the default, while the latter is the non-sorting version.
The kernel is called once per block step, and that call, followed by an explicit device-wide synchronization (cudaDeviceSynchronize), constitutes the entire block step. The synchronization also occurs after overwriting the memory with zeroes, and after the initFGZ and rescale calls. From the CPU perspective, the algorithm is therefore purely sequential.
The kernel's grid is one-dimensional, with n/2 two-dimensional thread blocks, each of them having 32 × w = 512 threads.
A thread block ℓ := b_x in the block step k′ := k mod n′ is in charge of one pivot block column pair, (p^(ℓ)_k′, q^(ℓ)_k′), of F, G, and Z, where n′ is n − 1 for the me, or n for the mm strategy kind. Since no two pivot block index pairs share an index, all thread blocks can be executed concurrently, without any interdependencies or data races. Due to the shared memory requirements, the high thread count, and the heavy register pressure, it is not expected that more than two (or, in the real case, three) thread blocks could share a single GPU multiprocessor. Therefore, for matrices large enough, only a fraction of all thread blocks in the grid can execute at the same time on a GPU. It is a presumption (but not a requirement) that the CUDA runtime shall schedule a thread block for execution at an early opportunity after a running one terminates, thereby keeping the GPU fully occupied despite the possible variations in execution time (i.e., in the number of the inner sweeps and the transformations required) among the thread blocks, especially in the fb case.
Note that t_y addresses a warp, 0 ≤ t_y < w, while t_x, 0 ≤ t_x < 32, denotes a lane (a thread) within the warp. Throughout a thread block, each warp is in charge of two "ordinary" (i.e., not block) columns, in the global or in the shared memory, but which two varies between and within the subphases.

Subphase 1 (two ZHERK or DSYRK like operations)
The task of this subphase is to form A^(ℓ)_k and then B^(ℓ)_k in the shared memory, occupying Re(F̃) (and Im(F̃)) and Re(G̃) (and Im(G̃)), respectively, by a single pass through the pivot block columns of F_k and G_k. The resulting matrices are Hermitian in theory but, unlike in BLAS, both the strictly lower and the strictly upper triangle of each matrix are explicitly computed, even though only the lower triangle is read in subphase 2, thus avoiding a possible issue with one triangle not being numerically the exact transpose-conjugate of the other.
A warp indexed by t_y is assigned two column indices, p^(ℓ)_{y;k} and q^(ℓ)_{y;k}, in the range of the first and the second pivot block column, respectively. Each thread holds four register-stored variables, initially set to zero, that hold the real (first two) and the imaginary (last two) parts of two (partial) dot-products of the columns of F_k and, in the second instance, of G_k, where t_y′ := t_y + w.
In a loop over i, starting from i := t_x and terminating when i ≥ m_F, with i := i + 64, in each step two consecutive chunks of 32 rows (i.e., 64 rows) of the columns p^(ℓ)_{y;k} and q^(ℓ)_{y;k} are read from Re(F_k) and Im(F_k) into Re(F_{64×32}) and Im(F_{64×32}). Each lane reads an element from the global memory and writes it into the shared memory, both in the coalesced manner, four times per chunk. The elements of the column p^(ℓ)_{y;k} are stored into the t_yth column, and those of the column q^(ℓ)_{y;k} into the t_y′th column of the shared memory buffer. The elements of the first chunk are stored into the t_xth row, and those of the second chunk into the (t_x + 32)th row of the buffer. The thread block is then synchronized, to complete the filling of the buffer by all warps.
An unrolled inner loop over j, 0 ≤ j < 64, updates the local partial dot-products.A synchronization call follows the loop.
For each j, let t_x′ := (t_x + j) mod 64; then two fused multiply-add operations perform the updates. The first updates constitute the computation of the dot-product of the t_xth and the t_yth column of F_{64×32}, updating the partial sum z_y with it, while the second ones form the dot-product of the t_xth and the t_y′th column and update z_y′ with it. Note that all the rows of the buffer are read exactly once, albeit in the modular (circular) fashion throughout the loop, with a different starting offset in each thread, to minimize the shared memory bank conflicts.
When the outer loop over i terminates, z_y and z_y′ are stored into F_{64×32} at the corresponding indices, and a synchronization barrier is reached, thus finalizing the formation of A^(ℓ)_k. The same procedure is repeated with G_k instead of F_k to obtain B^(ℓ)_k, substituting G_{64×32} and G_k for F_{64×32} and F_k, respectively, in the procedure described above. Note that F_{96×32} could (though it is unclear if it should) be used instead of F_{64×32}, i.e., three chunks instead of two would be read into the buffer, and the dot-products of columns of length 96 instead of 64 would be computed. That would not be possible for G, though, since A^(ℓ)_k, once formed, must not be overwritten until the next subphase.
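The skewed inner loop above can be simulated on the host as follows (a sketch for the real case, with hypothetical names; on the GPU the loop is unrolled and uses FMAs):

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Host-side simulation of the subphase-1 inner loop (real case): thread
// (t_x, t_y) accumulates the partial dot-products of the columns t_x & t_y
// and t_x & t_y' (with t_y' = t_y + w) of the 64 x 32 shared buffer.  The
// row index t_x' = (t_x + j) mod 64 starts at a different offset in each
// thread and walks the buffer circularly, reducing the shared memory bank
// conflicts; every row is still read exactly once per thread.
constexpr int ROWS = 64, COLS = 32, W = 16;
using Buf = std::array<std::array<double, COLS>, ROWS>;

static void thread_update(const Buf& F, int tx, int ty,
                          double& z_y, double& z_yp) {
    const int typ = ty + W;                 // t_y'
    for (int j = 0; j < ROWS; ++j) {        // unrolled on the GPU
        const int txp = (tx + j) % ROWS;    // circular (skewed) row index
        z_y  += F[txp][tx] * F[txp][ty];    // partial dot-product, cols tx, ty
        z_yp += F[txp][tx] * F[txp][typ];   // partial dot-product, cols tx, ty'
    }
}
```

Since every row index is visited exactly once, the result equals the plain dot-product up to the summation order.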

Subphase 2 (two ZPOTRF or DPOTRF like operations)
The Cholesky factorization of A := A^(ℓ)_k or B := B^(ℓ)_k consists of two similar, unrolled loops over j. The matrix (in the Fortran array order) is accessed and transformed columnwise, to avoid the shared memory bank conflicts, but then a transpose-conjugate operation must follow on the computed lower triangular factor, to obtain the corresponding upper triangular one. Along with the transposition-conjugation, the strictly lower triangle is zeroed out, since the following subphase makes no assumptions about the triangularity of the initial matrices.
The first loop iterates over 0 ≤ j < w.First, the jth diagonal element of Re(A), a j j , is read (the imaginary part is assumed to be zero) if t y = j and t x ≥ j (i.e., in the threads of the jth warp which correspond to the lower triangle, called the "active" threads), and the thread block is then synchronized.
The active threads then scale the jth column below the diagonal, each thread scaling the real and the imaginary part of its element in the t_xth row by 1/√a_jj, while the diagonal element is set to (√a_jj, 0), and the thread block is then synchronized.
Next, the columns to the right of the jth have to be updated, with all warps (but not all their threads) participating in the update. Let j′ := (j + 1) + t_y. Then, the j′th column is updated if t_x ≥ j′, and the thread block is synchronized. However, this only updates the columns from j + 1 to j + w. The same update has to be performed with j″ := j′ + w instead of j′, i.e., if t_x ≥ j″ (which also ensures that j″ < 32), and another thread synchronization occurs.
The second loop, over w ≤ j < 32, is identical to the first one, except that t_y′ is used instead of t_y, and the second updates (of the j″th columns) are not needed, since j″ ≥ 32.
The ensuing transpose-conjugation with the zeroing-out of the strictly lower triangle is performed by reading A[t_x, t_y] and A[t_x, t_y′] into the registers of the [t_x, t_y]th thread if t_x ≥ t_y and t_x ≥ t_y′, respectively (i.e., if the indices belong to the lower triangle of A). Otherwise, those registers are set to 0. After negating the imaginary parts in the former case, the values are written to A[t_y, t_x] and A[t_y′, t_x], respectively, unfortunately incurring the shared memory bank conflicts, and the thread block is synchronized, yielding F̃ := F̃^(ℓ)_k. The same procedure is then repeated with B instead of A, yielding G̃ := G̃^(ℓ)_k.
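The columnwise factorization of subphase 2 can be sketched on the host as follows (real case, row-major storage; the sketch zeroes the strict upper triangle of the computed lower factor, whereas the GPU kernel instead transposes the factor into the upper triangle and zeroes the strict lower one):

```cpp
#include <cassert>
#include <cmath>

// Host-side sketch of the columnwise Cholesky factorization: the jth
// diagonal entry becomes sqrt(a_jj), the column below it is scaled by
// 1/sqrt(a_jj), and the columns to the right of j are updated.  On exit,
// the lower triangle of A holds L, with A = L * L^T, and the strict upper
// triangle is zeroed, since the next subphase assumes no triangularity.
static void cholesky_lower(double* A, int n) {  // A is n x n, row-major
    for (int j = 0; j < n; ++j) {
        const double d = std::sqrt(A[j * n + j]);
        A[j * n + j] = d;
        for (int i = j + 1; i < n; ++i)         // scale below the diagonal
            A[i * n + j] /= d;
        for (int c = j + 1; c < n; ++c)         // update columns right of j
            for (int i = c; i < n; ++i)
                A[i * n + c] -= A[i * n + j] * A[c * n + j];
    }
    for (int i = 0; i < n; ++i)
        for (int c = i + 1; c < n; ++c)
            A[i * n + c] = 0.0;                 // zero the strict upper part
}
```

In the kernel, the scaling of the column below the diagonal is done by the "active" threads of one warp, and each column update to the right belongs to another warp, with the synchronizations described above.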

Subphase 3 (the pointwise one-sided algorithm)
The pointwise implicit Hari-Zimmermann algorithm is described in section 2, subsections 2.1 and 2.2, and in the relevant parts of subsections 2.4 and 2.5, and is implemented as follows.
If D_{11;l} ≠ 1, the first row of Z_l is scaled by D_{11;l}. If D_{22;l} ≠ 1, the second row of Z_l is scaled by D_{22;l}. Now the completed transformation Z_l has to be applied to the pivot columns of each Y ∈ {F, G, Z}. If one or both scaled cosines lying on the diagonal of Z_l are equal to one, the transformation can be (and is) simplified by removing the corresponding multiplications, without numerically affecting the result.
In bstep1s, to determine whether the column swap is required, the squares of the norms of the transformed columns of F̃ are computed as the sum-reduced sums of the squares of the magnitudes of the new (F′) elements, depending on the variant. Those two values are, however, not stored for the next step, because that would require an additional shared memory workspace that might not be available on all supported architectures.
In the real case it is easy to compute the transformed diagonal elements of the first pivot submatrix directly instead [17]. At the end of a sweep, if ŝ = 0, the loop is terminated; else, the counters S and B, set to zero at the start of this subphase, are incremented by s and b, respectively.
The same rescaling as in subsection 3.4 with f = false, but performed on the shared memory, yields Z̃^(ℓ)_k. Using the last values of F′[t_x, i] and G′[t_x, i], the squares of the norms of the ith columns of F′ and G′, respectively, are computed. Then, Z̃[t_x, i] is read (or its last value is used), scaled by 1/‖f′_i‖_2, stored, and the thread block is synchronized. The same procedure is repeated with j instead of i, giving Z̃^(ℓ)_k := Z̃. Finally, the thread with t_x = t_y = 0 stores S and B into C, and the thread block is synchronized.

Subphase 4 (three postmultiplications)
In this subphase the pivot block columns of F_k, G_k, and Z_k are multiplied by Z̃^(ℓ)_k and overwritten by the respective results. Each multiplication of a pair of pivot block columns (residing in the global memory) by Z̃^(ℓ)_k (residing in the shared memory in Z̃) and the following update are performed by a single pass over (i.e., a single read from and a single write to) the block columns, using the Cannon-like algorithm [3] for the parallel multiplication of two square matrices.
Reading the chunks of a block column pair from the global memory is identical to the one from subphase 1 in subsection 3.5.1, except that in each iteration of the outer loop (over i) only one chunk is read into Y, instead of two (which would also be a possibility). The number of loop iterations (in parentheses) depends on the number of rows of F_k (m_F/32), G_k (m_G/32), and Z_k (n/32). Here, Y is F_{64×32} when updating F_k and Z_k, and G_{64×32} when updating G_k. The thread block is then synchronized.
The per-thread variables that hold the product of the current chunk with Z̃ are set to zero. Each thread is in charge of forming the elements with the indices [t_x, t_y] and [t_x, t_y′] of the product Π.
The initial skews are defined as ı := (t_y + t_x) mod 32 and ı′ := (t_y′ + t_x) mod 32. Then, in each iteration of the unrolled inner loop over 0 ≤ j < 32, the local elements of Π are updated, after which ı and ı′ are cyclically shifted as ı := (ı + 1) mod 32 and ı′ := (ı′ + 1) mod 32. When the inner loop terminates, the thread block is synchronized. This procedure is called thrice, to update F_k, G_k, and Z_k, after which the kernel execution (i.e., the kth outer step) terminates.
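The skewed accumulation just described can be sketched on the host as follows (real case, hypothetical names; each GPU thread forms two such elements, using zfma in the complex case):

```cpp
#include <cassert>
#include <cmath>

// Host-side sketch of one element of the Cannon-like update of subphase 4:
// thread (t_x, t_y) forms Pi[t_x][t_y] = sum_k Y[t_x][k] * Z[k][t_y], but
// starts the summation at the skewed index (t_y + t_x) mod 32 and shifts it
// cyclically, so that at any moment different threads touch different rows
// of Z; the result is independent of the starting skew.
constexpr int B = 32;   // a chunk is B x B

static double cannon_element(const double Y[B][B], const double Z[B][B],
                             int tx, int ty) {
    double pi = 0.0;
    int i = (ty + tx) % B;            // the initial skew
    for (int j = 0; j < B; ++j) {     // unrolled on the GPU
        pi += Y[tx][i] * Z[i][ty];    // an FMA on the GPU
        i = (i + 1) % B;              // the cyclic shift
    }
    return pi;
}
```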

Dataflow and the shared memory perspective
In Figure 1 the sequence of operations performed on the shared memory of a GPU by a thread block of the main computational kernel is depicted.

The CPU part of the algorithm
Algorithm 1 summarizes the CPU-side actions with a single GPU. The same routine is called within the multi-GPU algorithm, except for allowing S to be a parameter (not the constant 30, as is assumed here), a variation of the final rescaling of Z, and some differences regarding the initFGZ call, described in subsections 3.3 and 4.2. The copy-ins and copy-outs of the majority of data, as well as the initialization of the constant memory, are left out from Algorithm 1, but are included in the single-GPU timing in section 5.
Apart from the copy-ins, copy-outs, and zeroings of data, there is no scope for using any non-default CUDA streams. Also, as no GPU computation, except the fast rescale, can be overlapped with any CPU task, the execution time of the algorithm depends almost solely on the GPU performance and the time required to set up a kernel call. Some overlapping might pay off for very large matrices, but it has not been so implemented.
Note that the algorithm stops when b = 0, i.e., when no big transformations occurred in an outer sweep.The "global" counts S and B of all and of big transformations applied during the execution of Algorithm 1 are only informative here, but they are consulted in the multi-GPU algorithm's stopping criterion.

The multi-GPU implementation
When the input data is larger than the available RAM on a single GPU, it is necessary either to split the workload among the several GPUs, or to resort to some out-of-core technique for swapping the parts of data in and out of the GPU as the computation progresses. Here, only the former approach is followed, since it is simpler, more efficient, and widely applicable now, when the multi-GPU installations are becoming ubiquitous.

Figure 1: The sequence of operations performed on the shared memory of a GPU by a thread block of the main computational kernel.
There is no single, best and straightforward way of generalizing a single-GPU algorithm to multiple GPUs.For the (ordinary and hyperbolic) SVD, the approach in [14] was to distribute the matrix over the GPUs, shorten the assigned part of the matrix, run the single-GPU algorithm on the shortened part, update the original (non-shortened) columns, and redistribute the parts.Despite its decent performance, such a three-level algorithm suffered from the increased memory usage and some numerical difficulties, both with the stopping criteria and with the relative accuracy obtained.
A different approach is taken here, with the goal of achieving the optimal GPU and CPU memory usage (without any work arrays) and a better accuracy, but with a possible performance penalty induced by transforming the tall and skinny parts of the matrices directly, without any shortening. As no floating-point computation is performed on the CPU, while the GPU computation still does not rely on any numerical libraries, it is guaranteed that the results stay bitwise identical in the repeated runs over the same input data with the same parameters and the same number of GPUs.

Algorithm setup
In this subsection the multi-GPU computational environment, the input and the output data distribution across the CPUs and the GPUs, the communication-aware Jacobi strategies, and the algorithm's initialization are explained.

MPI environment
Unlike in [14], where the multiple GPUs were assumed to belong to the same node, and thus a separate CPU thread of a single process could be dedicated to driving the computation on each GPU, here a more flexible solution has been chosen: a single-threaded MPI [13] process is assigned to each GPU. The GPUs can thus be selected from any number of nodes, with a different number of them on each node. Also, the GPUs are not required to be architecturally identical, or even similar, across the processes in an MPI job, as long as they all have enough RAM available.
The count of GPUs, and thus of their governing MPI processes (s), for the multi-GPU algorithm is not constrained in principle, save for being at least two (otherwise, the single-GPU algorithm suffices) and small enough so that at least two (but, for performance reasons, a multiple of 32) columns of each matrix are available to each process when the matrices are divided among them columnwise, as described below. The upper bound on the number of GPUs is a tunable parameter in the code, while the MPI implementation might have its own limit on the number of processes in a job.
The MPI processes need not be arranged in any special topology, i.e., only the predefined MPI COMM WORLD communicator is used.A GPU and its governing process are jointly referred to by the process' rank (r) in that communicator.
It is assumed that the MPI distribution is CUDA-aware in a sense that sending data from the GPU RAM of one process and receiving it in the CPU RAM of another (including the same) process is possible with the standard MPI point-to-point communication routines (i.e., no manual CPU buffering of the GPU data is necessary).
Also, the number of elements of each local submatrix has to be at most INT_MAX, which at present is the upper limit on the count of elements that can be transferred in a single MPI operation [6]. That limit is easily circumvented by transferring the data in several smaller chunks, but such chunking has not been implemented, since it was not needed for the amount of RAM (16 GiB) of the GPUs used for testing. The issue will have to be addressed for the future GPUs.
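The chunking mentioned (but, as said, not implemented) amounts to splitting the element count into pieces that fit the int-typed count argument of the standard MPI point-to-point routines; a hypothetical sketch of that split:

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <vector>

// Hypothetical sketch of chunking a transfer of `count` elements into
// pieces of at most INT_MAX elements each, so that every piece fits the
// `int count` argument of MPI_Send/MPI_Recv; one call would be issued per
// returned piece.
static std::vector<int> mpi_chunks(std::int64_t count) {
    std::vector<int> sizes;
    while (count > 0) {
        const std::int64_t c = count < INT_MAX ? count : INT_MAX;
        sizes.push_back(static_cast<int>(c));
        count -= c;
    }
    return sizes;
}
```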

Data distribution
The matrices F, G, and Z, and the vectors Σ F , Σ G , and Σ, are assumed to always stay distributed among the MPI processes, i.e., at no moment they are required to be present in entirety in any subset of the processes.The amount of the CPU and the

Algorithm initialization
First, the CPU memory is allocated in each process, and the data is loaded (e.g., from a file), assuming k = 0, i.e., the rth process contains the p 0,r th and q 0,r th stripes of F, G, and Z.
Then, the device memory is allocated, if it is not already available, and an MPI barrier is reached.Timing of the algorithm includes everything that occurs from this barrier on, except the final, optional deallocation of the device memory.
The constant memory on each GPU is then set up, and the stripes are copied to the device (global) memory, all of which could be, but has not been, done asynchronously.The device is then synchronized, becoming ready for the computation.
It remains to be decided how many sweeps S to allow in Algorithm 1. As at the pointwise level, there are two obvious choices: either some reasonably large number, e.g., 30 (as in fb), or 1 (as in bo). Now a variant of the multi-GPU algorithm is specified by the selected variant of the single-GPU algorithm, with the outermost strategy and the choice of S added; e.g., ZHZ0-(me-bo, me-fb-me) stands for the outermost me strategy with S = 1, using ZHZ0-(me-fb-me) at the single-GPU level.
As shown in subsection 5.3.2, the imbalance of the computational time each GPU requires with fb (i.e., one GPU may need more sweeps in Algorithm 1 than another to reach convergence within an outermost step) is significantly detrimental to the overall performance, contrary to the single-GPU case (see subsection 3.5). Unlike there, where such an imbalance between the thread blocks' sweep counts is offset by a large number of thread blocks being scheduled on a small number of multiprocessors, in the multi-GPU case there is a one-to-one correspondence between the number of tasks to perform and the number of execution units (GPUs) to perform them, so the time required for an outermost step depends on the slowest run of Algorithm 1 within it. Thus, bo is recommended here instead.

The main part of the algorithm
In the pre-iterative part of the algorithm, the initFGZ kernel is called (see subsection 3.3), once in each process; specifically, it is not called again in the context of Algorithm 1. Here, the row offset l in initFGZ is calculated according to the logical (not physical) index of a column, i.e., l := j + p_{0,r} · w if j < w, and l := j − w + q_{0,r} · w otherwise, with p_{0,r} and q_{0,r} passed to the kernel as parameters. The device is then synchronized, and the stripes of F_0, G_0, and Z_0 are ready on each GPU (copying them to the CPU is not needed) for the iterative part of the algorithm.

The first guiding principle for such a design of the communication is to keep it as general as possible. For example, any process topology (including no topology in particular) suggested by the communication pattern of the chosen Jacobi strategy can be accommodated with equal ease. The second principle is to facilitate, as much as practicable, hiding the communication overhead behind the GPU computation. Before a call of Algorithm 1 occurs within an outermost step of a given process, the non-blocking receives into the CPU stripes are started, in anticipation of an early finish of the GPU work of the step in the processes that are to send their transformed stripes to the process in question. That way, while the given GPU still computes, its CPU can in theory start, or even complete, receiving one or both transformed stripes required in the following step. There remains an issue with several slowly progressing processes that might keep the rest of them idle, but at least the point-to-point data transfers can happen soon after the data is ready, without waiting for a massive data exchange with all processes communicating at the same time.
The third principle is to minimize the memory requirements of both the CPUs and the GPUs, by sending the transformed data from the GPU RAM of one process to the CPU RAM of another two. That way, no separate, "shadow" GPU buffers are required to receive the data, since the CPU stripes have to be present anyhow.

[Listing residue of the multi-GPU driver and Algorithms 2-5: for each (host, device) matrix pair (F^(r), F^[r]), (G^(r), G^[r]), (Z^(r), Z^[r]), the real and the imaginary parts are copied with cudaMemcpy2D (possibly asynchronously, with cudaMemcpy2DAsync); in each outermost step, the receives of F^(r), G^(r), Z^(r) are started as in Algorithm 2, the single-GPU Algorithm 1 is called on F^[r], G^[r], Z^[r] with the chosen S, the sends of the transformed F^[r], G^[r], Z^[r] are started as in Algorithm 3, and the communication is completed, with the received stripes copied back to the device, as in Algorithm 4; the transformation counters are then reduced across the communicator, and the iteration breaks if the convergence has been reached, as in Algorithm 5; finally, rescale(true) and cudaDeviceSynchronize() perform the full rescaling of Z^[r], the results are optionally copied out, and MPI_Barrier(MPI_COMM_WORLD) completes the algorithm and its timing.]

A few sample runs of the multi-GPU algorithm on a small matrix have been tried on a combination of those two GPUs, with the code built for both architectures, to ensure that the algorithm functions correctly in such a heterogeneous environment.
The testing data is synthetic (not from any application domain) and is described in subsection 5.1.2.More information on its availability can be found in the supplementary material.

Testing environment
The testing environment comprises two Intel Xeon Silver 4114 CPUs, 384 GiB of RAM, and four NVIDIA Tesla V100-SXM2-16GB (Volta) GPUs per node, with a 64-bit Linux (CentOS 7.5.1804), the GCC 4.8.5 C++ compiler, CUDA 10.0, and a build of the Open MPI 3.0.0 distribution with the CUDA support.

Testing data
Two datasets have been generated: a "small" and a "large" one, with their names referring to the orders of the square matrices forming the pairs contained in them.The small dataset contains both the real and the complex matrix pairs, with each matrix stored in (and then read from) its unformatted binary file, while the large dataset contains only the complex matrix pairs.
The small dataset has 19 matrix pairs for each datatype. The real matrix pairs in the small dataset were generated in quadruple precision (REAL(KIND=16) in Intel Fortran) and rounded to double precision. The same test generation method was employed as in [17], with the required BLAS and LAPACK routines adapted as needed. The core of the generation are two quadruple-adapted LAPACK testing routines: xLAGSY, which generates a pseudorandom symmetric matrix, here of the full bandwidth, from a given diagonal prescribing the eigenvalues of the matrix; and xLAROR, which here multiplies a given matrix from the left by a pseudorandom orthogonal matrix. The diagonals of Σ_F, Σ_G, and Λ_X were generated by DLARND, a standard LAPACK pseudorandom number generator, here with the uniform probability distribution on (0, 1), such that only those values returned by it that were greater than 10^{-10} were accepted. Then, UΣ_F was generated from Σ_F, and VΣ_G from Σ_G, both using xLAROR, while X was obtained from Λ_X using xLAGSY. After that, F := UΣ_F · X and G := VΣ_G · X were formed. The generator finishes with a call to the preprocessing part, DGGSVP3, of the LAPACK GSVD method (see [1] and the routine's comments), to make the data usable for comparison with the DTGSJA Kogbetliantz-type GSVD routine from LAPACK. Such a comparison, however, has never been made, since the DTGSJA computation on any available CPU would be prohibitively expensive, as indicated in [17] and recently confirmed (albeit only in the complex case) in [20].
The complex matrix pairs in both datasets were generated by a much simpler procedure, described in [20].Namely, each matrix in a pair was generated by a call to ZLATMS LAPACK testing routine as Hermitian and positive definite, with its pseudorandom eigenvalues uniformly distributed in (0, 1).

Testing results for the single-GPU algorithm
When presenting the performance of several variants, a decision has been made to show the execution time plots and tables for the fastest variant on the largest matrix pair in the small dataset, separately in the real and in the complex case. To compare those results with the ones from the other variants, a useful measure is the relative slowdown on a given matrix order. Fixing the reference variant r, as suggested by the results in the plots, for another variant v and a matrix order n let T_n^{(r)} and T_n^{(v)} be the execution times of r and v on a matrix pair with the matrices of order n, respectively. The relative slowdown S_n^{(v:r)} of v compared to r on n, given in percentages of the execution time of r, is S_n^{(v:r)} := (T_n^{(v)} − T_n^{(r)})/T_n^{(r)} · 100%. The average relative slowdown across the entire dataset (with 19 matrix pairs here) is given as S_avg^{(v:r)} := Σ_n S_n^{(v:r)}/19.
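As a restatement in code of the measure just defined (a trivial sketch, with hypothetical names):

```cpp
#include <cassert>
#include <cmath>

// The relative slowdown S_n^{(v:r)}: the execution-time excess of a variant
// v over the reference variant r, in percent of r's time.
static double rel_slowdown(double t_v, double t_r) {
    return (t_v - t_r) / t_r * 100.0;
}
```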

Performance in the real case
In Figure 2, the wall execution time of four subvariants of DHZ0 (the fastest of the eight variants) is shown.

Figure 3: The wall execution time of four subvariants of ZHZ4 and the number of outer sweeps for the two fastest subvariants on the small complex dataset.

Accuracy in the complex case
In Figure 5 the normwise relative errors of four subvariants of ZHZ0 are shown in a logarithmic scale, alongside the spectral condition numbers κ 2 (F) in the small complex dataset, offering a justification for a smooth shape of the error graphs.
In Table 5 the spectral condition numbers κ_2(G) in the small complex dataset are given, but without a corresponding error figure, since it would be almost indistinguishable from Figure 5. From Figures 2 and 3 it is clear that the execution times across the variants do not differ widely, and that the enhanced dot-products from Appendix A add from a few percent to somewhat more than 15% to the wall time.
Generally, in both the real and the complex case the variant 0 with (me-fb-me) is a reasonable choice performance-wise.
Regarding the normwise relative errors on the matrices of moderate spectral conditions, as it is in the real case, the mm subvariants are almost indistinguishable, as are the me subvariants, with the latter being slightly more accurate than the former, as shown in Figure 4.With the matrices of small spectral condition, as it is in the complex case, all subvariants are almost indistinguishable, as shown in Figure 5.
In Table 6 the maximal relative normwise errors with respect to F and G, in the real and the complex case, on the small dataset, are given for all variants of the single-GPU algorithm with (me-fb-me). Taking a look at the minimal value in each data column, it is evident that, except for G in the real case, the enhanced dot-products offer a small advantage in accuracy, but not so significant that it would not be offset by a drop in performance when the focus is on the latter.

Testing results for the multi-GPU algorithm
Due to a limited availability of multiple GPUs in the testing environment, only the complex case in the variant 0 was tested on the large dataset and compared with a single-GPU baseline.
Here, the full accuracy testing was skipped, due to the huge computational demands of the matrix inversions and of the error calculation in extended precision. For n = 18432, the relative errors in the corresponding outputs with one and with two GPUs (in both cases in all variants that were timed) were compared, and all of them, for both F and G, were found to be less than 1.7 · 10^{-12}. Also, for a given matrix, the relative errors in all cases differed by less than 10^{-13}, indicating that the multi-GPU algorithm did not introduce any instability into the computation.

A single-GPU baseline
In Table 7 the wall time in seconds and the number of the outermost sweeps are shown for the me-fb-me and the me-bo-me variants, with and without the column sorting, of the ZHZ0 single-GPU algorithm, as a baseline for the comparison with the multi-GPU algorithm. As the similar benefits of the column sorting were obvious in other trial test runs, the non-sorting version was not considered for the full testing.

The multi-GPU performance
In Tables 8, 9, 10, and 11 the multi-GPU results are shown; where a test could not be run due to an insufficient amount of the GPU RAM, "n/a" is shown in the test's table cell. It follows from Tables 8, 9, 10, and 11 that the multi-GPU variant B is to be recommended when performance matters.
Dividing the shortest single-GPU baseline wall time from Table 7 for n = 18432 by the wall times from Table 9 for the same n but with a different number of GPUs, it can be derived that the speedup with two GPUs is 1.28×, with four GPUs 1.90×, and with eight GPUs 2.66×. These speedups could be even lower on a slower network, or if there were fewer than four GPUs per node in the testing environment, but they nevertheless increase satisfactorily with the number of GPUs.
It would be interesting to see for what number of GPUs the speedup for this and other test cases peaks or levels out, but that is unfortunately beyond reach of the testing environment.

Future work
Several generalizations of the algorithms' design are possible for other implicit Jacobi-type methods to be ported to the GPU(s). One such method is the computation of the generalized hyperbolic SVD (GHSVD) by a modification of the implicit Hari-Zimmermann algorithm, as described in [20].
While letting Ẑ_k be the identity matrix elsewhere, viewed two-sidedly the congruence with Ẑ_k transforms the pair (A_k, B_k) into the pair (A_{k+1}, B_{k+1}), where A_{k+1} := Ẑ*_k A_k Ẑ_k and B_{k+1} := Ẑ*_k B_k Ẑ_k. One-sidedly, the transformation by Ẑ_k orthogonalizes the i_kth and the j_kth pivot columns of F_k and G_k, to obtain F_{k+1} := F_k Ẑ_k and G_{k+1} := G_k Ẑ_k. Also, Ẑ_k is accumulated into the product Z_{k+1} := Z_k Ẑ_k. In a one-sided sequential step, only the i_kth and the j_kth columns of F_k, G_k, and Z_k are effectively transformed, in place (i.e., overwritten), by postmultiplying them with the 2 × 2 matrix Ẑ_k, while the other columns of these matrices remain intact. From the pivot submatrices of A_k and B_k it is then possible to compute Ẑ′_k, with the final Ẑ_k := D_k Ẑ′_k. In this version of the algorithm it is not necessary to rescale the columns of F and G by Z_0 at the start, since such a rescaling happens at each step, so Z_0 := I. If D_k = I_2, this version is equivalent to the standard (previously described) one, for which it can formally be set A′_k := A_k and B′_k := B_k.
and define sign(a, b) to be |a| with the sign of b for a and b real.Then, let t k := 1 − x 2 k , set

2.3. Blocking of the one-sided algorithm

Parallelization alone is not sufficient for achieving a decent performance of the algorithm on the modern architectures with multiple levels of the memory hierarchy. The pointwise algorithm just described is therefore modified to work on the block columns of the matrices, instead of on the columns proper. Each block column comprises an arbitrary but fixed number w, 1 < w < n/2, of consecutive matrix columns. Instead of the 2 × 2 pivot submatrices of A_k and B_k, in the blocked algorithm the 2w × 2w pivot submatrices A^(ℓ)_k and B^(ℓ)_k are formed in the kth (parallel or sequential) step by matrix multiplications.

The shared memory required amounts to 48 KiB for a thread block, and is configured with 8-byte-wide banks. No other kernel requires any shared memory. Let Re(F_{64×32}) stand for the contiguous memory space occupied by Re(F̃) and Re(G̃); Im(F_{64×32}) for Im(F̃) and Im(G̃); Re(G_{64×32}) for Re(G̃) and Re(Z̃); and Im(G_{64×32}) for Im(G̃) and Im(Z̃), as the real and the imaginary parts of the F_{64×32} and G_{64×32} matrices that share the same storage with F̃, G̃, and Z̃. Such an overlapping of data is necessary for the formation of F̃ and G̃ from the block columns of F and G, respectively, as described below. Also, let Re(F_{96×32}) stand for Re(F̃), Re(G̃), and Re(Z̃), and Im(F_{96×32}) for Im(F̃), Im(G̃), and Im(Z̃).
For multiplication, an inlineable routine (zmul) computes z := a · b and returns the result via two output-only arguments, referring to Re(z) and Im(z). With the CUDA FMA intrinsic __fma_rn, abbreviated as FMArn, it holds Re(z) = FMArn(Re(a), Re(b), −Im(a) · Im(b)), computed in a way that requires 3 floating-point operations but only 2 roundings. Note that the operations can be ordered arbitrarily; thus zmul could also be realized by multiplying the real parts of the factors first. Im(z) is obtained by Im(z) = FMArn(Re(a), Im(b), Im(a) · Re(b)), where only 2 instead of 3 floating-point operations are required, with 2 roundings, and again the choice of the real product arguments is arbitrary. In total, 5 operations (of which the negation is trivial) instead of the usual 6 are needed. The FMA-like operation is modeled after the CUDA one in the cuComplex.h header. Let z := a · b + c. Then, the zfma routine requires 3 operations with 2 roundings for Re(z) = FMArn(Re(a), Re(b), FMArn(−Im(a), Im(b), Re(c))), and 2 operations with 2 roundings for Im(z) = FMArn(Re(a), Im(b), FMArn(Im(a), Re(b), Im(c))).

3.4. Rescaling of Z and extraction of U, Σ_F, V, Σ_G, and Σ

After each block sweep, another kernel, rescale, is called, with a Boolean flag f indicating whether it is the last sweep. If f is false, only Z is rescaled according to the rules of subsection 2.4; otherwise, the full results of the GSVD computation (U, Σ_F, V, Σ_G, and Σ) are produced.
The computational subphases of bstep1(s/n)(k),
1. the formation of Â^(ℓ)_k and B̂^(ℓ)_k in the shared memory,
2. the Cholesky factorizations of Â^(ℓ)_k and B̂^(ℓ)_k as F̂^(ℓ)*_k F̂^(ℓ)_k and Ĝ^(ℓ)*_k Ĝ^(ℓ)_k, respectively,
3. the pointwise implicit Hari–Zimmermann algorithm on the matrix pair (F̂^(ℓ)_k, Ĝ^(ℓ)_k), yielding Ẑ^(ℓ)_k,
4. the postmultiplication of the pair of pivot block columns of F, G, and Z by Ẑ^(ℓ)_k,
are all fused into a single kernel, to effortlessly preserve the contents of the shared memory between them. All the required matrix algebra routines have been written as device functions with semantics similar to, but different from, the standard BLAS, due to the data distribution and the memory constraints. For example, a single call of the BLAS-compatible ZHERK (or DSYRK in the real case) operation for the subphase 1 is not possible, since the two pivot block columns do not have to be adjacent in the global memory. The postmultiplication subphase cannot use a single standard ZGEMM (or DGEMM) call for the same reason, but also because the block columns have to be overwritten in place, to avoid introducing any work arrays.
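The formation of such a pivot Gram matrix from two block columns that need not be adjacent (the reason a single xSYRK call does not suffice) can be illustrated, for the real case and in plain C rather than as a device function, as follows (pivot_gram and its storage layout are illustrative assumptions):

```c
#include <stddef.h>

/* Sketch of subphase 1 (real case): form the 2w-by-2w pivot Gram matrix
   A = P^T P, where P consists of two w-column block columns of an m-by-n
   column-major matrix F, starting at column indices p and q.  The two
   blocks need not be adjacent in memory. */
static void pivot_gram(const double *F, int m, int w, int p, int q,
                       double *A /* 2w-by-2w, column-major */)
{
    for (int c = 0; c < 2 * w; ++c) {
        /* the first w columns of P come from block p, the rest from block q */
        const double *fc = F + (size_t)(c < w ? p + c : q + c - w) * m;
        for (int r = 0; r < 2 * w; ++r) {
            const double *fr = F + (size_t)(r < w ? p + r : q + r - w) * m;
            double s = 0.0;
            for (int k = 0; k < m; ++k)
                s += fr[k] * fc[k];        /* inner product (f_r, f_c) */
            A[(size_t)c * (2 * w) + r] = s;
        }
    }
}
```

On the GPU the same result is produced cooperatively in shared memory; only symmetry (computing one triangle) would additionally be exploited there.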
In bstep1s it then suffices to compute a_11 := Ẑ_{11;ℓ}²·Â_{11;ℓ} + 2·Ẑ_{11;ℓ}·Ẑ_{21;ℓ}·Â_{12;ℓ} + Ẑ_{21;ℓ}²·Â_{22;ℓ} and a_22 := Ẑ_{12;ℓ}²·Â_{11;ℓ} + 2·Ẑ_{22;ℓ}·Ẑ_{12;ℓ}·Â_{12;ℓ} + Ẑ_{22;ℓ}²·Â_{22;ℓ}, and to swap the ith and the jth column when a_11 < a_22. If the norm of the ith column is smaller than the norm of the jth column, the values of Y[t_x, i] and Y[t_x, j] are swapped via an intermediary variable; else, or in bstep1n, no swaps occur. The values of the new variables are then stored in the shared memory, and b̂ is uniformly incremented, b̂ := b̂ + __syncthreads_count(b)/32, across the thread block. The ℓth step is now complete.
The subfigures (i)–(k) show the three postmultiplications taking place, with a chunk of data read from and written to the global memory. The two different hues of the first and the middle parts of the shared memory depiction indicate that chunks with 64 instead of 32 rows might be used there.

4.2.1. Point-to-point communication and reductions

Except for a single collective MPI_Allreduce operation required per outermost sweep, all other communication in the algorithm is of the non-blocking, point-to-point kind, occurring in every outermost step. The communication parts of the algorithm, from a given process's perspective, are formalized in Algorithms 2, 3, 4, and 5, and put together in Algorithm 6.

Algorithm 5: Convergence criterion (no big transformations anywhere) checking in the cth sweep of the rth process.
  MPI_Allreduce({S_c, B_c}, {sS_c, sB_c}, 2, MPI_UNSIGNED_LONG_LONG, MPI_SUM, MPI_COMM_WORLD); // +-reduction
  if sB_c = 0 then break; // is the sum of all per-process, per-sweep big-transformation counters 0?

Algorithm 6: The CPU part of the multi-GPU implicit Hari–Zimmermann algorithm (for the rth process).
  initFGZ(p_{0,r}, q_{0,r}); // compute F^[r]_0, G^[r]_0, Z^[r]_0
  for 0 ≤ c < 30 do // outermost sweep c
    {S_c, B_c} := {0, 0}; // per-process, per-sweep counters of all and of big transformations
    for 0 ≤ k < n do // outermost step k
      start receiving into F^(

S_c := S_c + S; B_c := B_c + B; // increment the transformation counters by those from Algorithm 1

The small dataset comprises matrix pairs with the orders of the matrices ranging from 512 to 9728 in steps of 512. The large dataset has 3 matrix pairs, with the orders of the matrices being 18·1024 = 18432, 24·1024 = 24576, and 36·1024 = 36864, so that the GPU RAM requirements do not exceed the memory provided by one, two, and four GPUs, respectively. No matrices in either dataset require bordering.

Figure 5: The relative normwise errors, ‖F − UΣ_F X‖_F/‖F‖_F, of four subvariants of ZHZ0, and the spectral condition numbers κ_2(F) in the small complex dataset.

In Tables 8, 9, 10, and 11 the wall time in seconds and the outermost sweep count are shown for the multi-GPU variants A := ZHZ0-(me-fb, me-fb-me), B := ZHZ0-(me-bo, me-fb-me), C := ZHZ0-(me-fb, me-bo-me), and D := ZHZ0-(me-bo, me-bo-me), respectively, run on two, four, and eight GPUs. The tests on two and four GPUs required a single node, and those on eight GPUs required two InfiniBand-connected nodes. When a particular

Table 1: Variants of the implicit, blocked Hari–Zimmermann algorithm.
32, with i mod 32 = t_x mod 32. A warp reads 32 consecutive elements of Re(G)_j and Im(G)_j at a time. Each thread updates its register-stored partial sums, ĉ_r[t_x] := ĉ_r[t_x] + Re(G)_{ij}² and ĉ_i[t_x] := ĉ_i[t_x] + Im(G)_{ij}², using one FMA operation for each update, and this is repeated by going to the rows i := i + 32 until i ≥ m_G. Initially, i = t_x mod 32 and ĉ_r[t_x] = ĉ_i[t_x] = 0. After passing through the entire column, those partial sums are added to obtain ŝ[t_x] := ĉ_r[t_x] + ĉ_i[t_x]. Then, the values ŝ[t_x] are summed and the result is distributed across the warp by a warp-shuffling [18] sum-reduction, described in Appendix B, yielding the sum of squares of the magnitudes of the elements in the column, i.e., ‖g_j‖²_F.

The local values of Π now have to be written back to the global memory, where one local part of Π, Π[t_x, t_y], overwrites Y_k[i, p^(ℓ)_{y;k}], while the other overwrites Y_k[i, q^(ℓ)_{y;k}], for Y being one of {F, G, Z}. The thread block is then synchronized and the next outer iteration, if any are left, follows.
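On the host, the same accumulation pattern can be modeled in plain C, with the 32 threads of a warp unrolled into a loop index and the warp-shuffle sum-reduction replaced by an ordinary summation (column_norm2 is a hypothetical stand-in, not the kernel itself):

```c
#include <stddef.h>

#define WARP 32 /* logical warp width */

/* Sketch of the per-warp computation of the squared Frobenius norm of the
   j-th column of a complex matrix stored as separate real and imaginary
   parts, column-major with leading dimension m_G.  "Thread" t accumulates
   the rows t, t+32, t+64, ... into its two partial sums, each update being
   a single FMA on the GPU. */
static double column_norm2(const double *re_G, const double *im_G,
                           int m_G, int j)
{
    double c_r[WARP], c_i[WARP], s = 0.0;
    for (int t = 0; t < WARP; ++t)
        c_r[t] = c_i[t] = 0.0;
    for (int t = 0; t < WARP; ++t)            /* one pass per "thread" */
        for (int i = t; i < m_G; i += WARP) { /* strided row traversal */
            double re = re_G[(size_t)j * m_G + i];
            double im = im_G[(size_t)j * m_G + i];
            c_r[t] += re * re;
            c_i[t] += im * im;
        }
    for (int t = 0; t < WARP; ++t) /* stands in for the warp-shuffle */
        s += c_r[t] + c_i[t];      /* sum-reduction of Appendix B    */
    return s;                      /* = ||g_j||_F^2 */
}
```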
Algorithm 2: The non-blocking receives in the kth step of the rth process.
  tag := 1; i := 0; // tag tells which stripe from a sender has to be received; i indexes the requests

Algorithm 3: The non-blocking sends in the kth step of the rth process.
  // variable i is assumed to hold the last value assigned to it in Algorithm 2 in the kth step

Algorithm 4: Completion of the communication and the host-to-device data transfers in the kth step of the rth process.
  // variable i is assumed to hold the last value assigned to it in Algorithm 3 in the kth step
  MPI_Waitall(i, r, statuses); // wait for all pending MPI_Irecv and MPI_Isend requests to complete

Table 1 on the largest matrix pair), the number of outer sweeps for the two fastest subvariants on the small real dataset, and the detailed execution time table for DHZ0-(me-fb-me), which is almost always the fastest of the four subvariants, are shown.

Table 2: The intervals of relative slowdown of other real single-GPU (me-fb-me) variants compared to DHZ4-(me-fb-me).

The wall execution time of four subvariants of DHZ0 and the number of outer sweeps for the two fastest subvariants on the small real dataset.
In Figure 4 the normwise relative errors of four subvariants of DHZ0 are shown on a logarithmic scale, alongside the spectral condition numbers κ_2(F) in the small real dataset, offering

Table 3: The intervals of relative slowdown of other complex single-GPU (me-fb-me) variants compared to ZHZ4-(me-fb-me). A negative slowdown is a speedup; e.g., ZHZ0 is faster than ZHZ4 on average, even if slower sometimes.
In Table 4 the spectral condition numbers κ_2(G) in the small real dataset are given, but without a corresponding error figure, since it is almost indistinguishable from Figure 4.

Table 4: The spectral condition numbers κ_2(G) in the small real dataset.

Table 5: The spectral condition numbers κ_2(G) in the small complex dataset.

Table 6: The maximal relative normwise errors with respect to F and G in the real and the complex case on the small dataset, for all variants of the single-GPU algorithm with (me-fb-me). The minimal value in each column is shown in bold.

Table 7: Wall time in seconds and the sweep count for the me-fb-me and the me-bo-me variants of the ZHZ0 single-GPU algorithm, with and without the column sorting, on a pair of matrices of order n = 18432.

Table 8: Wall time in seconds and the outermost sweep count for the variant A.

Table 9: Wall time in seconds and the outermost sweep count for the variant B.

Table 10: Wall time in seconds and the outermost sweep count for the variant C.

Table 11: Wall time in seconds and the outermost sweep count for the variant D.