Extending lyapack for the solution of band Lyapunov equations on hybrid CPU–GPU platforms

The solution of large-scale Lyapunov equations is an important tool for several engineering problems arising in optimal control and model order reduction. In this work, we investigate the case when the coefficient matrix of the equations exhibits a band structure. Exploiting the structure of this matrix, we can achieve relevant reductions in the memory requirements and the number of floating-point operations. Additionally, the new solver efficiently leverages the parallelism of hybrid CPU–GPU platforms, and it is integrated in the lyapack library to facilitate its use. The new codes are evaluated on several benchmarks, exposing significant runtime reductions with respect to the original CPU version of lyapack.


Introduction
Lyapunov (matrix) equations appear in a number of engineering and scientific problems, for example, in model order reduction and many areas of control theory; see [1,8,14]. Their solution has been widely studied, and several efficient and numerically reliable methods can be found in the literature. In many applications, the Lyapunov equations are of large dimension and, in consequence, their solution involves a large amount of data and a vast computational cost. In response, the use of High Performance Computing (HPC) techniques and architectures is mandatory. To illustrate this, the solver provided by lyapack makes intensive use of tuned HPC kernels from blas and lapack.
In a set of applications, the coefficient matrix of the Lyapunov equation exhibits a specific structure: symmetry, banded, etc. In such cases, the Lyapunov solver can be specialized to exploit the benefits of the related structure to reduce either the memory requirements, the computational cost, or both.
In recent years, graphics processing units (GPUs) have shown remarkable performance in the computation of large-scale matrix operations and, particularly, in the solution of matrix equations; see [4,5] among others. In addition to their performance, GPUs offer other appealing properties, such as a low ratio of Watts per floating-point operation and an affordable price.
In this paper, we address the solution of large Lyapunov equations with a banded coefficient matrix. Our approach exploits the structure of the matrix and the ample hardware parallelism of the GPUs to obtain a Lyapunov solver that can be applied to the solution of large-scale problems. Additionally, we have integrated the proposal in the lyapack library [13].
The rest of the paper is organized as follows. In Sect. 2, we introduce the Lyapunov equations and lyapack toolbox. The proposed solver is presented in Sect. 3 and evaluated in Sect. 4. Finally, some concluding remarks are outlined in Sect. 5.

Solution of Lyapunov equations
Consider the Lyapunov matrix equation with factored right-hand side

A X + X A^T + B B^T = 0,      (1)

where A ∈ R^{n×n} is the coefficient matrix, B ∈ R^{n×m}, and X ∈ R^{n×n} is the sought-after solution. Such matrix equations play a key role in many control theory applications. For example, the solution of one or more Lyapunov equations is required by SVD-based methods for balanced truncation model reduction [1,6], Newton's methods for the algebraic Riccati equation associated with linear-quadratic optimal control problems [2], stabilization methods and stability tests for linear dynamical systems, as well as the computation of the H_2-norm of stable linear control systems [8,10]. Complex control systems or discretizations of partial differential equations easily lead to matrices of order n = 10^4 to 10^5 or even larger.
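For a small dense instance, the factored-right-hand-side equation can be checked directly with SciPy's general-purpose Lyapunov solver (a minimal validation sketch; the sizes and data below are illustrative, and the large-scale band setting targeted in this paper requires the LR-ADI machinery discussed next):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(0)
n, m = 6, 2
# A stable (Hurwitz) coefficient matrix and a tall-skinny factor B
A = -3.0 * np.eye(n) + 0.3 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))

# solve_continuous_lyapunov solves A X + X A^T = Q, so we pass Q = -B B^T
X = solve_continuous_lyapunov(A, -B @ B.T)

residual = np.linalg.norm(A @ X + X @ A.T + B @ B.T)
print(residual)  # essentially zero (machine precision)
```

Note that the solution X is symmetric (positive semidefinite for a stable A), which is what the low-rank factorization X ≈ S S^T exploits.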
In this work, we focus on the solution of (1) when A is a band matrix, meaning that all the nonzero entries of A reside in a narrow band comprising the main diagonal and a set of consecutive super- and sub-diagonals.
We investigate the solution of Lyapunov equations via the LR-ADI (low-rank Alternating Direction Implicit) iteration [12], which exploits the low-rank property of B [1] to compute a low-rank approximation to a full-rank factor of X. This approach dramatically reduces the memory requirements and the number of arithmetic operations, facilitating the solution of large-scale problems.
Specifically, given an "l-cyclic" set of complex shifts { p_1, p_2, . . . }, with p_k = p_{k+l}, p_k = α_k + β_k j, and j = √−1, the cyclic LR-ADI iteration for the Lyapunov equation can be formulated as

V_1 := √(−2 α_1) (A + p_1 I_n)^{−1} B,      S_1 := V_1,
V_{k+1} := γ_k ( V_k − δ_k (A + p_{k+1} I_n)^{−1} V_k ),      S_{k+1} := [ S_k, V_{k+1} ],      (2)

where γ_k := √(α_{k+1}/α_k), δ_k := p_{k+1} + p̄_k, p̄_k stands for the conjugate of p_k, and I_n denotes the identity matrix of order n. Upon convergence, after k̂ iterations, a low-rank matrix S_k̂ ∈ R^{n×k̂m} is obtained such that X ≈ S_k̂ S_k̂^T. The convergence of the LR-ADI method can be accelerated by a careful selection of the shifts { p_1, p_2, . . . }. In practice, the computation of these parameters involves an Arnoldi iteration with the coefficient matrix A and its inverse. For further details on the convergence of the LR-ADI iteration and the properties of heuristic vs. optimal procedures to select the shift parameters, see [12,15].
From the computational point of view, the cost of the algorithm lies in two basic linear algebra operations: the matrix-matrix products required to compute the shifts, and the solution of the linear systems in (2). Tuned kernels that exploit the band structure of A yield relevant savings in both operations and, hence, in the Lyapunov solver.
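As a sketch of how the band structure enters the iteration, the following code implements the LR-ADI recurrence for a symmetric tridiagonal A, with every shifted system solved by SciPy's banded solver. The shift selection below (logarithmically spaced real negative values covering the spectrum) is a simple heuristic chosen for illustration, not the Arnoldi-based procedure used by lyapack:

```python
import numpy as np
from scipy.linalg import solve_banded

def lr_adi_tridiag(diag, off, B, shifts):
    """LR-ADI sketch for a symmetric tridiagonal A with real negative shifts.

    A is given by its diagonal `diag` and off-diagonal `off`;
    returns Z such that X ~= Z Z^T.
    """
    n = diag.size

    def shifted_solve(p, Y):
        # pack A + p*I into the (1,1)-banded storage used by solve_banded
        ab = np.zeros((3, n))
        ab[0, 1:] = off        # superdiagonal
        ab[1, :] = diag + p    # main diagonal
        ab[2, :-1] = off       # subdiagonal
        return solve_banded((1, 1), ab, Y)

    V = np.sqrt(-2.0 * shifts[0]) * shifted_solve(shifts[0], B)
    Z = V
    for k in range(1, len(shifts)):
        p_new, p_old = shifts[k], shifts[k - 1]
        gamma = np.sqrt(p_new / p_old)
        V = gamma * (V - (p_new + p_old) * shifted_solve(p_new, V))
        Z = np.hstack([Z, V])
    return Z

# 1-D Laplacian-like band matrix (symmetric, stable)
n = 50
diag = -2.0 * np.ones(n)
off = np.ones(n - 1)
B = np.ones((n, 1))
shifts = -np.logspace(np.log10(4e-3), np.log10(4.0), 16)  # heuristic shifts

Z = lr_adi_tridiag(diag, off, B, shifts)
A = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)
X = Z @ Z.T
res = np.linalg.norm(A @ X + X @ A.T + B @ B.T) / np.linalg.norm(B @ B.T)
print(res)
```

The cost per iteration is dominated by one banded solve with m right-hand sides, which is exactly the operation the band-aware kernels of the following sections accelerate.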

The lyapack toolbox
lyapack [13] is a MATLAB toolbox which offers a variety of numerical routines for the solution of Lyapunov and Riccati equations, model order reduction, and optimal control. The toolbox is appropriate for large and sparse problems and/or structured dynamical systems. Two main properties of lyapack are its user-friendly interface and a flexible modular architecture which facilitates its extension and customization.
The Lyapunov solver in lyapack is the cornerstone for most of the problems addressed by the toolbox. The solver is an efficient implementation of the LR-ADI method. The principal routines in lyapack perform the most expensive computations in terms of three types of basic matrix operations (BMOs): Z := AY , Z := A −1 Y , and Z := (A + pI n ) −1 Y . In all cases, one of the inputs of the BMO is the coefficient matrix of the Lyapunov equation, A, while Y is a matrix with a small number of columns and p is a scalar. lyapack's strategy to accommodate coefficient matrices with different nonzero patterns (e.g., sparse, tridiagonal, band,…) is a reverse communication interface also adopted, e.g., by ARPACK and PETSc. Thus, it is the user's responsibility to supply a tuned kernel to compute each BMO (hereafter, referred to as user-supplied functions or USFs) that efficiently leverages the nonzero pattern of the problem on a specific target architecture.
In summary, the use of USFs yields a light form of object-oriented polymorphism. Data from the matrix A are stored in hidden global variables, and the purpose of the USFs is to manipulate these structures (e.g., create and release the global data) and compute the corresponding BMOs. In consequence, the user has the opportunity to adapt the solver to the particular features of a given problem and the hardware architecture by providing tuned USFs, while the numerical methods and properties of lyapack are preserved.
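The USF contract can be mimicked in a few lines (a sketch only: the names au_m_i, au_m, and au_s below are loosely modeled on lyapack's USF naming conventions and are our assumptions; the real USFs are MATLAB/Mex routines). The three BMOs for a tridiagonal A are served from a hidden module-level variable, in the spirit of lyapack's hidden global data:

```python
import numpy as np
from scipy.linalg import solve_banded
from scipy.sparse import diags

_A_BAND = None  # hidden global data, as in lyapack's USF scheme

def au_m_i(diag, sub, sup):
    """Initialize the hidden band-matrix data (illustrative USF name)."""
    global _A_BAND
    _A_BAND = (np.asarray(diag), np.asarray(sub), np.asarray(sup))

def au_m(Y):
    """BMO 1: Z := A * Y, via a sparse band matrix product."""
    d, sub, sup = _A_BAND
    A = diags([sub, d, sup], offsets=[-1, 0, 1])
    return A @ Y

def au_s(Y, p=0.0):
    """BMOs 2 and 3: Z := (A + p*I)^{-1} * Y, via a banded solve."""
    d, sub, sup = _A_BAND
    n = d.size
    ab = np.zeros((3, n))
    ab[0, 1:] = sup
    ab[1, :] = d + p
    ab[2, :-1] = sub
    return solve_banded((1, 1), ab, Y)

au_m_i(-2.0 * np.ones(5), np.ones(4), np.ones(4))  # tridiag(1, -2, 1)
Y = np.ones((5, 2))
Z = au_s(au_m(Y))  # BMO round trip: A^{-1} (A Y) == Y
```

The point of the exercise is that the numerical driver never sees how A is stored; swapping these three functions for CUDA-enabled versions changes the backend without touching the solver.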

CPU-GPU version of lyapack
In [9], we provided a general framework to use CUDA kernels from lyapack with minimal changes to the library. Precisely, the design philosophy underlying this toolbox gives us the opportunity to leverage the acceleration of a GPU without modifying the contents of the library, by implementing the appropriate CUDA-enabled USFs. Concretely, this solution leverages the MATLAB Mex API, which provides interoperability between MATLAB and other programming languages such as C, Fortran or, in our case, CUDA. This approach also permits invoking hybrid multi-threaded and CUDA codes that simultaneously execute tasks on both the CPU and the GPU, attaining an efficient use of the available hardware resources. The use of Mex files allowed us to explicitly control the transfers between the CPU and GPU memory address spaces, offering a level of optimization that is difficult to match with tools that automatically off-load MATLAB routines to the GPU such as, e.g., AccelerEyes' Jacket toolbox. While a similar effect can be achieved via MATLAB's internal tools, the copy overhead incurred by the Mex interface is negligible compared with the cost of the CPU-GPU transfers and the computation time of the routines.

Solution of band linear systems

Each LR-ADI step requires the solution of a shifted band linear system, typically computed via the LU factorization A = LU (with partial pivoting) followed by two triangular solves, i.e., three steps in total. If A is a band matrix, then L and U are triangular band matrices. In particular, let k_l and k_u denote, respectively, the lower and upper bandwidths of A; then, L is a lower triangular band matrix with bandwidth k_l, and U is an upper triangular band matrix with bandwidth k_l + k_u. Therefore, exploiting the structures of A, L and U yields important reductions in the computational cost of the three steps. The library lapack provides a specific routine, gbsv, for the solution of band linear systems that exploits the structure of A in all the computations. The lapack solver relies on the packed storage and the specific kernels for band matrices in blas. The use of a multi-threaded implementation of blas provides an efficient parallel solver for current CPUs.
The storage format for band matrices defined in blas permits a regular access pattern and, at the same time, allows important memory savings (see Fig. 1 for an example). Nevertheless, it also constrains the performance of gbsv (see [3]).
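The packing itself is a one-liner per column: entry A[i, j] lands in row k_u + i − j of a (k_l + k_u + 1) × n array, which is where the memory savings come from (pack_band below is an illustrative helper, not a blas or lyapack routine):

```python
import numpy as np

def pack_band(A, kl, ku):
    """Pack a dense band matrix into BLAS/LAPACK band storage.

    Entry A[i, j] is stored in ab[ku + i - j, j]; entries outside
    the band are never referenced.
    """
    n = A.shape[0]
    ab = np.zeros((kl + ku + 1, n))
    for j in range(n):
        i0, i1 = max(0, j - ku), min(n, j + kl + 1)
        ab[ku + i0 - j : ku + i1 - j, j] = A[i0:i1, j]
    return ab

n, kl, ku = 8, 2, 1
rng = np.random.default_rng(1)
# a diagonally dominant band matrix (lower bandwidth kl, upper ku)
A = np.triu(np.tril(rng.standard_normal((n, n)), ku), -kl) + 6.0 * np.eye(n)
ab = pack_band(A, kl, ku)
print(A.nbytes, ab.nbytes)  # packed form stores only kl+ku+1 rows
```

For lapack's pivoted band factorization (gbtrf/gbsv), the packed array must additionally reserve k_l extra rows for the fill-in introduced by the row interchanges, which is the kind of padding the modified scheme of Fig. 1 (right) builds on.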
In a previous work [3], we presented a GPU-based solver for band linear systems that mitigates the limitations encountered in the lapack routines. In particular, we introduced minor changes to the matrix storage scheme that, at the cost of a moderate increase in the memory requirements (see Fig. 1, right), overcome the limitations of routine gbsv and report relevant gains in terms of performance and execution time. The solver is based on two hybrid CPU-GPU kernels that compute the LU factorization of a band matrix and solve a band triangular system. The implementation includes several HPC techniques, reaches a remarkable performance, and outperforms a lapack-like routine based on tuned cublas kernels. However, it still presents some drawbacks:
- Since it is a three-stage method, two synchronization points are mandatory.
- The matrix A is fully accessed twice in the routine: first during the computation of the LU factorization, and later for the solution of the two band triangular systems.
We next present a solver with a single synchronization point and a reduced number of memory accesses. The central idea is a reorganization of the procedure that merges two stages: the factorization of A and the solution of the triangular system with L. This new approach presents the following advantages:
- Data locality: the solution of the linear system involving L is anticipated, as this operation is performed while the procedure is still manipulating these blocks.
- Concurrent execution: two operations can be overlapped on two different processors, hiding the computational cost of the least time-consuming one.
- Fewer memory accesses are required, since the lower triangular part of A (i.e., L) is accessed only once.
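The merged stage can be illustrated with an unblocked, pivoting-free sketch (the actual solver is blocked, pivoted, and hybrid CPU-GPU, so this only conveys the idea): the right-hand side is updated with the multipliers of L inside the elimination loop, so the lower triangle is traversed exactly once and no separate forward solve remains.

```python
import numpy as np

def band_lu_fused_solve(A, Y, kl, ku):
    """Solve A X = Y for a band matrix A (dense storage, no pivoting).

    The forward solve with L is fused into the elimination loop:
    eliminating column j immediately applies the multipliers to Y.
    """
    A = A.copy()
    Y = Y.copy()
    n = A.shape[0]
    for j in range(n):
        jend = min(j + ku + 1, n)
        for i in range(j + 1, min(j + kl + 1, n)):
            l = A[i, j] / A[j, j]            # multiplier (entry of L)
            A[i, j:jend] -= l * A[j, j:jend]
            Y[i] -= l * Y[j]                 # fused forward solve with L
    # back substitution with the band upper factor U
    # (bandwidth ku here; with partial pivoting it grows to kl+ku)
    X = np.zeros_like(Y)
    for i in range(n - 1, -1, -1):
        iend = min(i + ku + 1, n)
        X[i] = (Y[i] - A[i, i + 1:iend] @ X[i + 1:iend]) / A[i, i]
    return X

# diagonally dominant tridiagonal example
n = 8
A0 = 4.0 * np.eye(n) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
Y0 = np.arange(2.0 * n).reshape(n, 2)
X0 = band_lu_fused_solve(A0, Y0, 1, 1)
```

In the hybrid solver, the fused lower-triangular work and the remaining upper-triangular solve are the two operations that can be mapped to different processors and overlapped.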

Band matrix-matrix multiplication
The band matrix-matrix product of the form

C := α A B + β C,      (3)

where A ∈ R^{n×n} is a band matrix with lower and upper bandwidths k_l and k_u, B, C ∈ R^{n×m}, and α, β are scalars, is an operation with a large degree of intrinsic parallelism. It is a level-3 blas-type operation and, hence, high performance can be expected from it. The blas specification includes a routine to compute the band matrix-vector product (gbmv), but none for the computation of a band matrix-matrix product. Consequently, libraries such as intel mkl or nvidia cublas, which implement the blas specification, provide efficient implementations of the band matrix-vector product but not of the related matrix-matrix product.
The most straightforward approach to compute this matrix-matrix product is to use the routine gbmv iteratively (one invocation per column of B and C). However, this approach leads to unnecessary memory accesses and an inefficient use of the memory hierarchy of the architecture. The drawbacks of this naive implementation can be overcome using a blocked algorithm. Next, we present such a blocked algorithm, gbmm_BLK, and two implementations that off-load the most expensive computations to the GPU.

Algorithm gbmm_BLK
The algorithm is divided into two loops; see Figs. 3 and 4. For simplicity, we assume momentarily that α = β = 1 in (3). The outer loop (Fig. 3) partitions the matrices B and C into blocks of c columns; at each iteration of the loop, the elements in the active column block of C are computed. The inner loop then computes these elements, traversing the columns of A from left to right. Figure 4 shows the operations performed in the inner loop. The matrices B and C are partitioned row-wise, while A requires a 3 × 2 → 5 × 3 re-partitioning. At each iteration, blocks A_11, A_21 and A_31 are accessed. One relevant property of this partitioning is that A_11 and A_31 are lower and upper triangular, respectively. Figure 2 details the blocks accessed and updated at a given iteration of the inner loop.
gbmm_BLK employs two block sizes, one per loop: c defines the number of columns of C computed in a given iteration of the outer loop, and b is the number of columns of A that are accessed at each iteration of the inner loop. These two parameters need to be tuned to optimize the execution of the algorithm on a particular target architecture. The main advantage of algorithm gbmm_BLK is that it can be implemented using blas-3 kernels and, hence, high performance can be expected from it.
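A serial NumPy sketch of the blocked idea follows (dense storage for clarity; c and b play the same roles as in the text, and only blocks lying inside the band are ever touched):

```python
import numpy as np

def gbmm_blk(alpha, A, B, beta, C, kl, ku, c=4, b=4):
    """Blocked band matrix-matrix product C := alpha*A*B + beta*C (sketch).

    A is n x n with lower/upper bandwidths kl/ku, stored densely here
    for clarity; only the blocks inside the band are accessed.
    """
    n = A.shape[0]
    C *= beta
    for j0 in range(0, n, c):          # outer loop: column block of B and C
        j1 = min(j0 + c, n)
        for k0 in range(0, n, b):      # inner loop: column block of A
            k1 = min(k0 + b, n)
            i0, i1 = max(0, k0 - ku), min(n, k1 + kl)  # rows inside the band
            C[i0:i1, j0:j1] += alpha * A[i0:i1, k0:k1] @ B[k0:k1, j0:j1]
    return C

rng = np.random.default_rng(4)
n, m, kl, ku = 10, 5, 2, 1
A = np.triu(np.tril(rng.standard_normal((n, n)), ku), -kl)  # band matrix
B = rng.standard_normal((n, m))
C = rng.standard_normal((n, m))
ref = 2.0 * A @ B + 0.5 * C
out = gbmm_blk(2.0, A, B, 0.5, C, kl, ku, c=3, b=4)
print(np.allclose(out, ref))  # True
```

Each inner iteration reads a b-column block of A whose topmost and bottommost sub-blocks are the lower and upper triangular A_11 and A_31 of Fig. 4.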

Implementation gbmm
The gbmm routine is a straightforward implementation of algorithm gbmm_BLK. In an initial stage, all the matrices are copied to the device. Then, βC is computed using the scal routine in cublas. This is not a blas-3 operation, but it presents a relatively small computational cost and can be computed efficiently on the device due to its inherent parallelism. Next, the product of the matrices is computed and, finally, C is transferred back to the CPU. The update of E_2 (see Fig. 4) requires a general matrix-matrix product (gemm), an operation supported by cublas. The updates of the blocks E_1 and E_3 require two matrix products involving the triangular matrices A_11 and A_31, respectively. Routine trmm of cublas implements this operation in the form

C := α T C,   with T triangular,      (4)

which unfortunately differs from the functionality required by gbmm_BLK, since (4) overwrites C with the result. To overcome this problem, routine gbmm performs two operations to update E_1:

W := A_11 D_1,      (5)
E_1 := E_1 + W,      (6)

where (5) is computed with routine trmm and (6) with a matrix addition. The update of E_3 is analogous. This procedure requires a reduced auxiliary storage W ∈ R^{b×c}. Regarding the partitioning in Fig. 4, which depicts the inner loop of the algorithm that computes C := A · B + C, with A a band matrix with upper and lower bandwidths k_u and k_l, respectively: D_1 has b rows; E_1 has 0 rows if m(D_0) < (k_u + 1) and b rows otherwise; and E_3 has 0 rows if m(D_0) > (n(A) − k_l − 1) and b rows otherwise.
High performance should be expected from this implementation since it fully employs tuned cublas routines.
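The in-place trmm constraint and the workspace workaround can be reproduced on the CPU with SciPy's BLAS wrappers, with dtrmm standing in for the cublas routine (block sizes and data below are illustrative):

```python
import numpy as np
from scipy.linalg.blas import dtrmm

rng = np.random.default_rng(2)
b, c = 4, 3
A11 = np.tril(rng.standard_normal((b, b)))  # lower triangular block of A
D1 = rng.standard_normal((b, c))
E1 = rng.standard_normal((b, c))
E1_ref = E1 + A11 @ D1                      # what the update must produce

# trmm overwrites its input matrix, so D1 is first copied into the
# auxiliary workspace (the counterpart of W in R^{b x c})
W = D1.copy()
W = dtrmm(1.0, A11, W, lower=1)             # (5): W := A11 * D1
E1 = E1 + W                                 # (6): E1 := E1 + W
print(np.allclose(E1, E1_ref))  # True
```

The scratch buffer keeps D_1 intact for the gemm update of E_2, at the price of one extra b × c copy per inner iteration.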

Implementation gbmm_ms
This implementation is based on the modified storage scheme (see Fig. 1, right) and presents some benefits over gbmm. In the modified storage scheme, the strictly lower triangular part of A_31 is conveniently placed in the added rows, as is the strictly upper triangular part of A_11. In consequence, there is no need to operate with these blocks as triangular matrices, and the updates of all the blocks in E can be performed in the same manner, namely via matrix-matrix products (routine gemm from cublas). In fact, this allows us to go a step further: the updates of [E_1^T, E_2^T, E_3^T]^T can be merged and performed by means of a single matrix-matrix product. This approach presents two main advantages: it performs a reduced number of invocations of cublas kernels, and it avoids operations involving small triangular matrices.
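The pay-off of storing the zero triangles explicitly can be seen in miniature: once the strict triangles of A_11 and A_31 hold explicit zeros, the three block updates collapse into a single gemm on the stacked blocks (hypothetical block sizes; NumPy stands in for cublas gemm):

```python
import numpy as np

rng = np.random.default_rng(3)
b, c = 3, 2
A11 = np.tril(rng.standard_normal((b, b)))  # lower triangular block
A21 = rng.standard_normal((b, b))           # full block
A31 = np.triu(rng.standard_normal((b, b)))  # upper triangular block
D1 = rng.standard_normal((b, c))
E = rng.standard_normal((3 * b, c))

# separate updates: two triangular products (trmm-style) plus one gemm
E_sep = E.copy()
E_sep[0:b] += A11 @ D1
E_sep[b:2 * b] += A21 @ D1
E_sep[2 * b:3 * b] += A31 @ D1

# merged update: the zero triangles are stored explicitly, so a single
# gemm on the stacked blocks produces the same result
G = np.vstack([A11, A21, A31])
E_mrg = E.copy()
E_mrg += G @ D1

print(np.allclose(E_sep, E_mrg))  # True
```

The merged form trades a few extra multiplications by stored zeros for one large, regular gemm, which is exactly the kind of operation GPUs execute best.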
On the other hand, this implementation also presents two drawbacks. First, the memory requirements are larger. In addition, the number of arithmetic operations also increases, as the routine operates with the null elements that lie in the strictly upper and lower triangles of A_11 and A_31.

Experimental evaluation
We evaluate the performance of the new codes on the solution of model order reduction problems extracted from the Oberwolfach benchmark collection [11]. In particular, we employ three instances of the same problem that differ in the dimension of the coefficient matrix; see Table 1. As the coefficient matrix in the RAIL problem is sparse but not banded, we employ the Reverse Cuthill-McKee method [7] to reorder it into a band matrix. The execution times of the proposed solver are compared with those obtained with the lyapack CPU-based solver. We also compared the performance of our band-based solvers with GPU implementations of two preconditioned iterative solvers: ILU0+BiCGStab, built using NVIDIA's CUSPARSE, and AMG+BiCGStab, using CUSP. Although not reported here, the results of these experiments demonstrated the superiority of our band solvers for the RAIL cases.
The performance evaluation was carried out using two different hardware platforms: an intel i7-2600 processor at 3.3 GHz with 8 GB of RAM connected to an nvidia S2070 (Platform I), and an intel i3-3220 processor at 3.4 GHz with 16 GB of RAM connected to an nvidia K20 (Platform II). On both platforms, we use the CentOS Rel. 6.4 O.S., the gcc v4.4.7 compiler, and MATLAB R2011b. The CPU solver makes intensive use of kernels from intel mkl 11.1, while the hybrid CPU-GPU solver is built upon intel mkl 11.1 and nvidia cuda/cublas 5.0 kernels.

Table 2 shows the results obtained by the hybrid CPU-GPU and the CPU-based lyapack solvers. The first three columns show, in that order, the platform, the instance of the RAIL problem, and the solver employed. The execution times required by the USFs and by the whole solver are reported in the next columns, while the last column shows the speed-up of the CPU-GPU implementation with respect to the CPU one.
These results exhibit the relevance of the USFs in the entire process, since on average they require more than 97 % of the total time. Thus, the optimization efforts must focus on the USFs. The new solver exhibits a higher scalability, yielding better speed-ups for larger problems. In particular, the smallest problem presents a 1.4× speed-up, while the largest problem evaluated reports a speed-up of up to 5.4×. The best times for both solvers are obtained on Platform I. This is explained by the superiority of the CPU in Platform I: the CPU solver of lyapack is more than 75 % faster on the platform equipped with the intel i7-2600 processor. The hybrid CPU-GPU solver also benefits from the more powerful CPU in Platform I; although most of its computations are off-loaded to the GPU, it is still a hybrid code with synchronization points between the CPU and the GPU. Nevertheless, in the hybrid approach, the relevance of the CPU is mitigated and, hence, while the CPU solver is nearly 2× faster on Platform I, the hybrid CPU-GPU solver is only 40 % faster. Note also that the GPU in Platform II is more powerful than that of Platform I.

Concluding remarks
We have presented new hybrid CPU-GPU routines that accelerate the solution of band Lyapunov equations by off-loading the computationally expensive operations to the GPU. We extended the lyapack toolbox with a CPU-GPU solver that includes efficient data manipulation to minimize transfers between host and device memories, a tuned band matrix-matrix multiplication reformulated to leverage blas-3 operations, and a novel band linear system solver based on a tailored storage scheme, look-ahead techniques, and overlapped CPU-GPU computations. The experimental results obtained using three test cases (with dimensions that vary from 5,177 to 79,841, and bandwidths from 131 to 550), extracted from the Oberwolfach benchmark collection, on two platforms equipped with an nvidia S2070 and an nvidia K20 GPU, respectively, reveal speed-ups of up to 5.4× when compared with the corresponding solver based on the intel mkl library. They also show the relevance of the USFs in the solver, as they require more than 97 % of the total time. In consequence, the lyapack design makes it possible to effectively adapt the solver to the specific features of a given problem.
As part of future work, we plan to enhance the performance of Lyapunov solvers for other structured matrices, e.g., symmetric band matrices, and to integrate these efforts into the M.E.S.S. library (the successor of lyapack). Furthermore, we plan to study the impact of the new GPU-accelerated algorithms on energy consumption.