Level-3 BLAS on a GPU: Picking the Low Hanging Fruit



Introduction
Every time a new architecture arrives, there is a mad dash for high performance. Since compilers, languages, and tools are still rudimentary, this means that some experts roll up their sleeves and achieve high performance the old-fashioned way: they earn it. The problem is that often there are only a few with the right expertise and interest, and therefore this yields only a few routines that are highly optimized. Furthermore, it is acceptable for code that achieves high performance to be messy. When others then come into the picture, they use such implementations as their inspiration, meaning that programmability does not enter the picture until much later in the game. In this paper, we show how insights from the FLAME project, in particular the importance of having a family of algorithms at one's disposal, allow considerable performance gains to be attained with minimal effort. We do so by focusing on the familiar and important matrix-matrix operations that are part of the Basic Linear Algebra Subprograms (BLAS) [5] and targeting the NVIDIA family of GPUs.
The arrival of NVIDIA's GPUs and IBM's Cell Broadband Engine and the recognition that they can be used for computation outside of the field of graphics has created the latest gold rush for performance. In scientific computing this has meant that considerable effort has been expended on implementing the most important kernel: matrix-matrix multiplication (gemm). Very admirable performance has been achieved [13].
Yet, even operations that are very similar to gemm, e.g., the other level-3 BLAS, did not achieve decent performance in the CUBLAS library for the NVIDIA GPUs when we started this study. Worse, the effort required to achieve the high performance for gemm is daunting enough that experts like ourselves have stayed on the sidelines, focusing our efforts on using the gemm implementation for higher-level operations like the Cholesky factorization, by using the accelerators only to compute subproblems that are matrix-matrix multiplications [10,9,8]. We all hoped that soon other functionality would be ported to the GPUs, but that some other poor soul would do it for us.
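To make the Cholesky example concrete, the following is an illustrative NumPy sketch (the function name and block size are ours, and this is not the cited implementations) of a blocked right-looking Cholesky factorization, in which most of the floating point operations land in the trailing syrk/gemm update, precisely the subproblem one would ship to the accelerator:

```python
import numpy as np

def chol_blocked(A, b=64):
    """Blocked right-looking Cholesky factorization (lower triangular).

    Most flops land in the rank-b update of the trailing matrix,
    which is what makes it attractive to offload that matrix-matrix
    multiply to a GPU. Illustrative NumPy sketch only.
    """
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, b):
        kb = min(b, n - k)
        # Factor the small diagonal block (kept on the host).
        A[k:k+kb, k:k+kb] = np.linalg.cholesky(A[k:k+kb, k:k+kb])
        L11 = A[k:k+kb, k:k+kb]
        if k + kb < n:
            # Triangular solve: A21 := A21 * inv(L11)^T.
            A[k+kb:, k:k+kb] = np.linalg.solve(L11, A[k+kb:, k:k+kb].T).T
            L21 = A[k+kb:, k:k+kb]
            # Trailing update A22 := A22 - L21 * L21^T -- the large
            # matrix-matrix multiply one would compute on the GPU.
            A[k+kb:, k+kb:] -= L21 @ L21.T
    return np.tril(A)
```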
In this paper we once again show that as new functionality and optimizations appear, there are, for those of us who have an aversion to hard work, opportunities to quickly and easily help improve performance in the short run while simultaneously prototyping how performance can eventually be improved by those who are willing to code at a lower level.
This paper is organized as follows: In Section 2 we briefly review the three commonly encountered matrix-matrix multiplication algorithms and use this to remind the reader of the FLAME notation for presenting algorithms. In Section 3 we discuss the corresponding algorithms for various level-3 BLAS operations, where these algorithms have been modified to take advantage of special structure in the matrices. The benefits of picking the right algorithmic variant are illustrated in Section 4. Concluding remarks are found in the final section.

Matrix-Matrix Product
At the top level, there are three variants of matrix-matrix product, which we have come to refer to as matrix-panel product (gemm mp) based, panel-matrix product (gemm pm) based, and (outer) panel-panel product (gemm pp) based (also known as rank-k updating) [6]. We will discuss these briefly in this section, so that we can refer to them later as we discuss algorithms for the other matrix-matrix operations.
In Figure 1, we illustrate the gemm mp based algorithm. At the beginning of the iteration, C0 and B0 have already been updated and used, respectively. In the current iteration the next panel of matrix C is updated: C1 := A B1 + C1. Then, the advancement for the next iteration shifts C1 and B1 into C0 and B0, which grow since they now contain more processed data. This visual description motivates the algorithm, in FLAME notation, given in Figure 2. In that figure, we also give the gemm pm and gemm pp based algorithms. Although all three perform the same number of floating point operations, the performance that is achieved can be very different depending on the matrix shapes and the cache subsystem.
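The distinction between the three variants can be sketched as follows (an illustrative NumPy sketch with hypothetical function names, not our GPU code; on the GPU each panel multiply becomes a CUBLAS gemm call):

```python
import numpy as np

def gemm_mp(C, A, B, b=2):
    # Matrix-panel: march through C and B by panels of b columns.
    n = C.shape[1]
    for j in range(0, n, b):
        C[:, j:j+b] += A @ B[:, j:j+b]
    return C

def gemm_pm(C, A, B, b=2):
    # Panel-matrix: march through C and A by panels of b rows.
    m = C.shape[0]
    for i in range(0, m, b):
        C[i:i+b, :] += A[i:i+b, :] @ B
    return C

def gemm_pp(C, A, B, b=2):
    # Panel-panel: march through the inner dimension, applying
    # a sequence of rank-b updates to all of C.
    k = A.shape[1]
    for p in range(0, k, b):
        C += A[:, p:p+b] @ B[p:p+b, :]
    return C
```

All three compute C := AB + C and perform identical flop counts; they differ only in which operand is streamed through memory, which is exactly why their performance can diverge on a given memory hierarchy.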
It is well-known that for each operation there are algorithms that cast most computation in terms of matrix-matrix multiplication, as was pioneered in [7]. Moreover, as part of the FLAME project we have long advocated that it is important to have multiple algorithmic variants at our disposal so that the best algorithm can be chosen for each situation [4]. The FLAME methodology advocates systematic derivation of these variants [3,11]. In Section 4 we will show that this is again the case for GPUs. We view our ability to rapidly develop different algorithms as a way of performing software acceleration, the natural (and much needed) counterpart to hardware acceleration. It yields a cheap (in terms of effort) boost to performance.
The algorithms presented in this section correspond naturally to the matrix-matrix multiplication algorithms given in Section 2, except that they take advantage of the special structure of one of the matrices. Thus, the ... pp algorithm corresponds to the gemm pp algorithm, etc.


SYMM Operation
For this operation we focus on C := AB + C, where A is symmetric and only the lower triangular part of this matrix is stored. In Figure 3 we give three algorithmic variants.
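The panel-panel variant, for instance, can be sketched as follows (an illustrative NumPy sketch under our notation; the function name is ours). Each iteration assembles one block column of the logically full symmetric A from the stored lower triangle and then applies a plain rank-b gemm update:

```python
import numpy as np

def symm_pp(C, A_lo, B, b=2):
    """C := A*B + C with A symmetric; only tril(A_lo) is stored.

    Panel-panel variant: build the p-th block column of the full A
    from the stored lower triangle, then apply a rank-b GEMM update.
    """
    n = A_lo.shape[0]
    for p in range(0, n, b):
        pb = min(b, n - p)
        panel = np.empty((n, pb))
        # Above the diagonal block: transpose of the stored rows
        # to the left of that block.
        panel[:p, :] = A_lo[p:p+pb, :p].T
        # Diagonal block: symmetrize from its stored lower half.
        D = np.tril(A_lo[p:p+pb, p:p+pb])
        panel[p:p+pb, :] = D + D.T - np.diag(np.diag(D))
        # Below the diagonal block: stored as-is.
        panel[p+pb:, :] = A_lo[p+pb:, p:p+pb]
        C += panel @ B[p:p+pb, :]   # rank-pb GEMM update
    return C
```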
SYRK Operation
We will focus on a representative case of this operation: C := C − AA^T, where C is symmetric and only the lower triangular part of this matrix is stored and computed. In Figure 6 we give three algorithmic variants for this operation.
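As an illustration (a NumPy sketch with a hypothetical function name, not the GPU code), a panel-matrix variant updates each block row of the lower triangle with one large gemm plus one small syrk on the diagonal block:

```python
import numpy as np

def syrk_ln(C, A, b=2):
    """C := C - A*A^T; only the lower triangle of C is stored/updated.

    Panel-matrix variant: one large GEMM per block row, plus a small
    SYRK restricted to the diagonal block.
    """
    n = C.shape[0]
    for i in range(0, n, b):
        ib = min(b, n - i)
        # GEMM for the blocks strictly left of the diagonal block.
        C[i:i+ib, :i] -= A[i:i+ib, :] @ A[:i, :].T
        # Small SYRK on the diagonal block; keep the lower half only.
        C[i:i+ib, i:i+ib] -= np.tril(A[i:i+ib, :] @ A[i:i+ib, :].T)
    return C
```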

SYR2K Operation
For this operation we focus on C := C − AB^T − BA^T, where C is symmetric and only the upper triangular part of this matrix is stored and computed. In Figure 7 we give three algorithmic variants.
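A sketch of one variant (illustrative NumPy, hypothetical name): each block column of the upper triangle receives two gemm updates, and the diagonal block a small syr2k:

```python
import numpy as np

def syr2k_un(C, A, B, b=2):
    """C := C - A*B^T - B*A^T; only the upper triangle is stored/updated."""
    n = C.shape[0]
    for j in range(0, n, b):
        jb = min(b, n - j)
        # GEMM updates for the blocks strictly above the diagonal block.
        C[:j, j:j+jb] -= A[:j, :] @ B[j:j+jb, :].T
        C[:j, j:j+jb] -= B[:j, :] @ A[j:j+jb, :].T
        # Small SYR2K on the diagonal block; keep the upper half only.
        D = A[j:j+jb, :] @ B[j:j+jb, :].T
        C[j:j+jb, j:j+jb] -= np.triu(D + D.T)
    return C
```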

TRMM Operation
For this operation we focus on C := AB + C, where A is upper triangular. In Figure 8 we give three algorithmic variants.
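Sketched in NumPy (hypothetical name; on the GPU the large multiply becomes a gemm call), a panel-matrix variant exploits the zeroes below the diagonal: each block row of C needs only a small triangular multiply plus one gemm with the part of A to the right of the diagonal block:

```python
import numpy as np

def trmm_up(C, A, B, b=2):
    """C := A*B + C with A upper triangular."""
    m = C.shape[0]
    for i in range(0, m, b):
        ib = min(b, m - i)
        # Small triangular block times the matching rows of B.
        C[i:i+ib, :] += np.triu(A[i:i+ib, i:i+ib]) @ B[i:i+ib, :]
        # GEMM with the part of the block row right of the diagonal.
        C[i:i+ib, :] += A[i:i+ib, i+ib:] @ B[i+ib:, :]
    return C
```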
TRSM Operation
For this operation we focus on XA^T = B, where A is lower triangular and B is overwritten with the solution X. In Figure 9 we give three algorithmic variants.
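One variant can be sketched as blocked forward substitution over block columns (illustrative NumPy, hypothetical name); all but the small triangular solves are cast as gemm updates:

```python
import numpy as np

def trsm_rltn(B, A, b=2):
    """Solve X * A^T = B with A lower triangular; B is overwritten by X."""
    n = A.shape[0]
    for j in range(0, n, b):
        jb = min(b, n - j)
        # GEMM: subtract contributions of already-computed columns of X.
        B[:, j:j+jb] -= B[:, :j] @ A[j:j+jb, :j].T
        # Small triangular solve with the diagonal block:
        # X_j * L_jj^T = B_j  <=>  X_j = B_j * inv(L_jj)^T.
        Ljj = np.tril(A[j:j+jb, j:j+jb])
        B[:, j:j+jb] = np.linalg.solve(Ljj, B[:, j:j+jb].T).T
    return B
```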
Other Cases of the BLAS-3 Operations
The same technique can be applied to the other cases of the BLAS-3 operations presented above. Similarly, it can be applied to the remaining BLAS-3 operations. We expect similar results, since the issues are the same.

Experimental Results
The current revision of this paper differs from the original primarily in that it compares performance against CUBLAS 2.2, which was released immediately after the first version of this note was published. And this is where Einstein's quote becomes relevant: we repeated the same experiment, and expected different results... In addition, we have developed and evaluated other BLAS-3 operations, and we have evaluated our methods on rectangular matrices, as they are often used in LAPACK.
The target platform used in the experiments was an NVIDIA T10 GPU (a single GPU of a four-GPU NVIDIA Tesla S1070) with 4 GBytes of RAM. The system is connected to a workstation with one Intel Xeon QuadCore E5405 processor (4 cores) at 2.83 GHz and 8 GBytes of DDR2 RAM. CUBLAS Release 2.2 and single precision real floating-point arithmetic were employed in the experiments. Performance is measured in terms of GFLOPS (billions of floating-point operations, or flops, per second). The time to transfer data from the host to the memory of the GPU is not included in the performance results.
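For reference, the rate reported follows the standard convention of dividing the operation's flop count by the measured time; the sketch below (names and block sizes are ours) uses the usual 2mnk count for gemm:

```python
import time
import numpy as np

def gflops(flop_count, seconds):
    """Billions of floating-point operations per second."""
    return flop_count / seconds / 1e9

# Standard convention: an m x n x k GEMM performs 2*m*n*k flops.
m = n = k = 256
A = np.random.rand(m, k).astype(np.float32)
B = np.random.rand(k, n).astype(np.float32)
t0 = time.perf_counter()
C = A @ B
t1 = time.perf_counter()
rate = gflops(2.0 * m * n * k, t1 - t0)
```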
Figure 4 reports the performance of the operations described in the previous section on square matrices. The left plots show the performance in GFLOPS of both the new methods and the corresponding CUBLAS routines; the right plots summarize the speedups obtained by the new implementations over the corresponding routines in CUBLAS. All new codes achieve about 300 GFLOPS or more on square matrices.
Figure 5 reports the performance obtained by some operations on rectangular matrices, as they are often used inside LAPACK codes. Again, the left plot shows the raw performance in GFLOPS of both the new methods and the CUBLAS routines; the right plot summarizes the speedups obtained by the new implementations over the corresponding routines in CUBLAS.
The results in both figures show the benefits of our approach.We believe them to be representative of other cases of the presented level-3 BLAS (those where matrices may have been transposed and/or stored in the other triangular part of the array) and the other level-3 BLAS.
The improvement in performance could have been even larger had we used storage-by-blocks, a well-known modification employed in more recent software. We did not employ it in order to maintain full compatibility with NVIDIA CUBLAS.

Conclusion
We have demonstrated that, with relatively little effort, considerable performance gains can be attained when new architectures arrive. The key is to recognize that there are many different algorithmic variants for the same operation and to program them in a productive manner. The programs we wrote for this paper required a few hours of time and could have been developed by a relative novice. Undoubtedly, in response to this paper, there will be a flurry of activity to further improve the performance of CUBLAS by coding at a much lower level and throwing programmability out the door. In that case, we will have indirectly made a contribution to the scientific computing community, because faster libraries will become available sooner. But we are confident that this just means we will be able to write yet another paper on how to improve the performance of high-level routines with functionality similar to that of LAPACK [1]. And before you know it, a new shift in computer architecture will come along and the mad dash will start all over again. Thus the quote from Einstein.
We are working on a tool, FLAMES2S [12], that can automatically translate algorithms represented in code with the FLAME/C API, used to implement our libflame library [14], into low-level code that uses loops and indexing. This tool could easily generate the code that was created manually for the experiments in this paper. With it, we will make further progress toward overcoming the programmability problem for this class of operations and codes.
Recently, Bientinesi developed a symbolic system for automatically generating families of algorithms from a high-level description of the target operation [2]. Central to this result is a methodology for deriving an algorithm from a given loop-invariant. Combining this automatic system with FLAMES2S would make it possible to derive algorithms and code directly from loop-invariants, dramatically simplifying the development of programs.

Figure 1: A visualization of the algorithm for the matrix-panel variant of the matrix-matrix product. A dark background means the block has already been processed.

Figure 4: Performance (left) and speedups (right) of the new implementations and the equivalent CUBLAS routines on square matrices.

Figure 5: Performance (left) and speedups (right) of the new implementations and the equivalent CUBLAS routines on rectangular matrices.

Figure 7: The three algorithmic variants for computing syr2k (Syr2k mp, Syr2k pm, and Syr2k pp) in FLAME notation, partitioned so that A1 and B1 have b rows (or columns) and C11 is b × b.