Analytical Modeling Is Enough for High-Performance BLIS

We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine tuning parameters for high-end instantiations of matrix-matrix multiplication. This is of both practical and scientific importance, as it greatly reduces the development effort required to implement the level-3 BLAS while also advancing our understanding of how hierarchically layered memories interact with high-performance software. It allows the community to move on from a valuable engineering solution (empirical autotuning) to scientific understanding (analytical insight).


INTRODUCTION
The field of Dense Linear Algebra (DLA) was among the first to realize, understand, and promote that standardizing (de facto) an interface to fundamental primitives supports portable high performance. For almost four decades, those primitives have been the Basic Linear Algebra Subprograms (BLAS) [Lawson et al. 1979; Dongarra et al. 1988, 1990]. The expectation was, and still remains, that some expert will provide the community with a high-performance implementation of the BLAS every time a new architecture reaches the market. For this purpose, many hardware vendors currently maintain a numerical libraries group and distribute well-maintained libraries (e.g., Intel's MKL [Intel 2015], AMD's ACML [AMD 2015], and IBM's ESSL [IBM 2015]), while over the years we have enjoyed a number of alternative open-source solutions (e.g., GotoBLAS [Goto and van de Geijn 2008a, 2008b] and ATLAS [Whaley and Dongarra 1998]). Nevertheless, these solutions all require considerable time and labor for each supported target architecture.
In order to reduce the effort of developing high-performance implementations, ATLAS and its predecessor PHiPAC [Bilmes et al. 1997] introduced autotuning to the field of high-performance computing. The fundamental rationale behind these two libraries is that manual optimization is difficult and time-consuming for a human expert, and that autogeneration in combination with autotuning is the answer.
The work in Yotov et al. [2005] demonstrated that the ATLAS approach to optimizing the general matrix-matrix product (GEMM) can be, by and large, analytically modeled. Thus, it revealed that autotuning may be unnecessary for the very operation that has been touted by the autotuning community as the example of the success of autotuning. The problem with that work [Yotov et al. 2005] is that ATLAS leverages an inner kernel, optimized by a human expert, which still involves substantial manual encoding. Precisely, the presence of this black-box inner kernel makes analytical modeling problematic for ATLAS, since some of the tuning parameters are hidden inside it.
GotoBLAS (now maintained as OpenBLAS) also builds on a substantial inner kernel that is implemented by a human, making the analysis equally difficult without a complete understanding of the complex details embedded into that inner kernel. Until recently, no paper existed that detailed the intricacies of the GotoBLAS inner kernel. BLIS (BLAS-like Library Instantiation Software) [Van Zee and van de Geijn 2015] is a new framework for instantiating the BLAS. A key benefit of BLIS is that it is a productivity multiplier, as the framework allows developers to rapidly deploy new high-performance implementations of BLAS and BLAS-like operations on current and future architectures. From the implementation point of view, BLIS modularizes the approach underlying GotoBLAS2 (now adopted also by many other implementations) to isolate a microkernel that performs GEMM, upon which all the level-3 BLAS functionality can be built. Thus, for a given architecture, the programmer only needs to develop an efficient microkernel for BLIS and adjust a few key parameter values that optimize the loops around that microkernel in order to automatically instantiate all the level-3 BLAS. Importantly, BLIS features a layering that naturally exposes the parameters that need to be tuned.
While BLIS simplifies the implementation of BLAS(-like) operations, a critical step to optimize BLIS for a target architecture is for the developer to identify the specific parameter values for both the microkernel and the loops around it. A conventional solution would be to apply empirical search. In this article, we adopt an alternative model-driven approach to analytically identify the required parameter values for BLIS. In particular, we apply an analysis of the algorithm and architecture, similar to that performed in Yotov et al. [2005], to determine the data usage pattern of the algorithm ingrained within the BLIS framework, and to build an analytical model based on hardware features in modern processor architectures.
The specific contributions of this work include the following:

-An analysis of the best known algorithm. We address the algorithm underlying BLIS (and therefore GotoBLAS and OpenBLAS) instead of the algorithm in ATLAS [Whaley et al. 2001]. This is relevant because it has been shown that, on most architectures, a hand-coded BLIS/GotoBLAS/OpenBLAS implementation often yields higher performance than an implementation automatically generated via ATLAS, even if ATLAS starts with a hand-coded inner kernel. This article is the first work that analyzes, in detail, the interaction between the hardware and the algorithm underlying BLIS/GotoBLAS/OpenBLAS.

-A more modern model. We accommodate a processor model that is more representative of modern architectures than those adopted in Yotov et al. [2005] and Whaley et al. [2001]. Concretely, our model is more general than previous models, as it allows for modern hardware features such as a vector register file and a memory hierarchy with multiple levels of set-associative data caches. This considerably improves upon the analytical model in Yotov et al. [2005], which considered one level of cache and ignored vector instructions.

-A more comprehensive model. Our analytical model is more comprehensive in that it includes the parameter values that characterize the microkernel. These are values that a developer would otherwise have to determine empirically. This is important because the best implementations provided by ATLAS often involve loops around a hand-coded (black-box) kernel. Since the internals of the hand-coded kernel are not known, the model in Yotov et al. [2005] was not able to identify the parameter values for the global algorithm.

-A more actionable model. Our analytical model is meant to provide actual values that can be used by the developer to customize the BLIS framework for a specific architecture. Thus, it differs from existing efforts such as those described in Whaley et al. [2001], where "models" are used as heuristics to guide the empirical search engine.

-Competitive with manually tuned implementations. Previous work [Yotov et al. 2005; Kelefouras et al. 2014] showed that analytical models could produce code implemented in C that is competitive with implementations generated via empirical search. However, a weakness of that work is that it does not produce code that is competitive with implementations that use hand-coded inner kernels. Our article shows that analytically obtained parameter values can yield performance that is competitive with manually tuned implementations that achieve best-in-class performance. Hence, this article provides an answer to the question posed in Yotov et al. [2005] (whether empirical search is really necessary in this context) by demonstrating that analytical modeling suffices for high-performance BLIS implementations. Since other papers show BLIS to be competitive with GotoBLAS, OpenBLAS, and vendor BLAS, this in turn shows that careful layering combined with analytical modeling is competitive with all of these implementations.
In this article, we restrict ourselves to the case of a single-threaded implementation of GEMM in double precision. How to determine optimal parameter values for a multithreaded implementation is orthogonal to this article, and is at least partially answered in Smith et al. [2014].

BLIS IMPLEMENTATION OF GEMM
The GEMM algorithm embedded within the BLIS framework is described in Van Zee and van de Geijn [2015], and the same approach is instantiated in BLAS libraries optimized by DLA experts, such as GotoBLAS [Goto and van de Geijn 2008a, 2008b] and its successor, OpenBLAS [OpenBLAS 2015]. Importantly, BLIS explicitly exposes three loops that are buried within the inner kernel used by GotoBLAS and OpenBLAS, which facilitates the analytical determination of the parameters. For completeness, we provide a brief overview of the algorithm next. In addition, we identify the data usage pattern and highlight the parameters that characterize the algorithm.

A High-Performance Implementation of GEMM
Without loss of generality, we consider the special case of the matrix-matrix multiplication $C := AB + C$, where the sizes of A, B, and C are $m \times k$, $k \times n$, and $m \times n$, respectively. In the following elaboration, we will consider different partitionings of the $m$-, $n$-, and $k$-dimensions of the problem. For simplicity, when considering a (vertical or horizontal) partitioning of one of the problem dimensions, say $q$, into panels/blocks of size (length or width) $r$, we will assume that $q$ is an integer multiple of $r$.
BLIS implements GEMM as three nested loops around a macrokernel, plus two packing routines. The macrokernel is implemented in terms of two additional loops around a microkernel. The microkernel is a loop around a rank-1 (i.e., outer product) update, and it is typically implemented in assembly or with vector intrinsics. In BLIS, the remaining five loops are implemented in C.
Pseudocode for the GEMM algorithm is given in Figure 1. The outermost loop (Loop 1, indexed in the figure by $j_c$) traverses the $n$-dimension of the problem, partitioning both C and B into column panels of width $n_c$. The second loop (indexed by $p_c$) processes the $k$-dimension, partitioning A into column panels of width $k_c$ and the current panel of B into blocks of length $k_c$. The third loop (indexed by $i_c$) iterates over the $m$-dimension, yielding partitionings of the current column panels of C and A into blocks of length $m_c$. The following loops comprise the macrokernel. Let us define $C_c = C(i_c : i_c + m_c - 1,\; j_c : j_c + n_c - 1)$. Loop 4 (indexed by $j_r$) traverses the $n$-dimension, partitioning $C_c$ and the packed block $B_c$ into micropanels of width $n_r$. Loop 5 (indexed by $i_r$) then progresses along the $m$-dimension, partitioning the current micropanel of $C_c$ into microtiles of length $m_r$ and the packed block $A_c$ into micropanels of the same length. The innermost loop (Loop 6, indexed by $p_r$), inside the microkernel, computes the product of a micropanel of $A_c$ and a micropanel of $B_c$ as a sequence of $k_c$ rank-1 updates, accumulating the result into a microtile of $C_c$, which can be described mathematically as
$$ C_c(i_r : i_r + m_r - 1,\; j_r : j_r + n_r - 1) \mathrel{+}= A_c(i_r : i_r + m_r - 1,\; 0 : k_c - 1) \cdot B_c(0 : k_c - 1,\; j_r : j_r + n_r - 1). $$
At this point, it is worth emphasizing the difference between $C_c$ and $A_c$, $B_c$. While the former is just a notational artifact, introduced to ease the presentation of the algorithm, the latter two correspond to actual buffers that are involved in data copies.
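To make this structure concrete, the following C sketch (our own reconstruction for illustration, not the actual BLIS source) mirrors the six loops of Figure 1. The blocking constants are placeholders, matrices are assumed column-major, and m, n, and k are assumed to be multiples of the corresponding blocking parameters; pack_A and pack_B are the packing helpers sketched in the next subsection, and micro_kernel is a scalar stand-in for the assembly microkernel.

```c
enum { NC = 4096, KC = 256, MC = 96, NR = 4, MR = 8 };  /* placeholder values */

void pack_A(int mc, int kc, const double *A, int ldA, double *Ac);  /* sketched below */
void pack_B(int kc, int nc, const double *B, int ldB, double *Bc);  /* sketched below */

/* Loop 6 (the microkernel): kc rank-1 updates accumulated into an MR x NR
   microtile of C. In BLIS this is the only routine written in assembly or
   with vector intrinsics; here it is a scalar stand-in. */
static void micro_kernel(int kc, const double *Ar, const double *Br,
                         double *C, int ldC)
{
    double ab[MR * NR] = {0};                     /* microtile accumulator */
    for (int pr = 0; pr < kc; pr++)               /* Loop 6 */
        for (int j = 0; j < NR; j++)
            for (int i = 0; i < MR; i++)
                ab[i + j * MR] += Ar[pr * MR + i] * Br[pr * NR + j];
    for (int j = 0; j < NR; j++)                  /* accumulate into C */
        for (int i = 0; i < MR; i++)
            C[i + j * ldC] += ab[i + j * MR];
}

/* Loops 1-5 around the microkernel (all written in C in BLIS). */
void gemm(int m, int n, int k,
          const double *A, int ldA, const double *B, int ldB,
          double *C, int ldC,
          double *Ac, double *Bc)                 /* MC*KC and KC*NC buffers */
{
    for (int jc = 0; jc < n; jc += NC)            /* Loop 1 */
        for (int pc = 0; pc < k; pc += KC) {      /* Loop 2 */
            pack_B(KC, NC, &B[pc + jc * ldB], ldB, Bc);
            for (int ic = 0; ic < m; ic += MC) {  /* Loop 3 */
                pack_A(MC, KC, &A[ic + pc * ldA], ldA, Ac);
                for (int jr = 0; jr < NC; jr += NR)       /* Loop 4 */
                    for (int ir = 0; ir < MC; ir += MR)   /* Loop 5 */
                        micro_kernel(KC, &Ac[ir * KC], &Bc[jr * KC],
                                     &C[(ic + ir) + (jc + jr) * ldC], ldC);
            }
        }
}
```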

The Importance of Packing
It is generally known that accessing consecutive memory locations (also known as in-stride access) is usually faster than accessing nonconsecutive memory locations. By packing elements of A in Loop 3 and elements of B in Loop 2 in a special manner into $A_c$ and $B_c$, respectively, we can ensure that the elements of these two matrices will be read with in-stride access inside the microkernel [Henry 1992].
Concretely, each $m_c \times k_c$ block of A is packed into $A_c$. Elements are organized as row micropanels of size $m_r \times k_c$, and within each micropanel of $A_c$, the elements are stored in column-major order. Each $k_c \times n_c$ block of B is packed into $B_c$. In this case, elements are packed into column micropanels of size $k_c \times n_r$, and each column micropanel is stored in row-major order. The reorganization of the entries of A and B into blocks of $A_c$ and $B_c$ with the packing layout illustrated in Figure 2 ensures that these elements are accessed with unit stride when used to update a microtile of C. Packing $A_c$ and $B_c$ has the additional benefit of aligning the data of $A_c$ and $B_c$ to cache-line and page boundaries. This enables the use of instructions for accessing aligned data, which are typically faster than their nonaligned counterparts.
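The two packing routines might look as follows in plain C (again a reconstruction under the same assumptions and placeholder constants as the previous sketch, not the BLIS source): pack_A stores each $m_r \times k_c$ row micropanel of $A_c$ in column-major order, while pack_B stores each $k_c \times n_r$ column micropanel of $B_c$ in row-major order, so that the microkernel reads both buffers with unit stride.

```c
/* Pack an mc x kc block of A (column-major, leading dimension ldA) into Ac:
   row micropanels of size MR x kc, stored one after another, with each
   micropanel laid out in column-major order. */
void pack_A(int mc, int kc, const double *A, int ldA, double *Ac)
{
    for (int ir = 0; ir < mc; ir += MR)        /* one micropanel per pass */
        for (int p = 0; p < kc; p++)           /* column-major within panel */
            for (int i = 0; i < MR; i++)
                *Ac++ = A[(ir + i) + p * ldA];
}

/* Pack a kc x nc block of B into Bc: column micropanels of size kc x NR,
   with each micropanel laid out in row-major order. */
void pack_B(int kc, int nc, const double *B, int ldB, double *Bc)
{
    for (int jr = 0; jr < nc; jr += NR)        /* one micropanel per pass */
        for (int p = 0; p < kc; p++)           /* row-major within panel */
            for (int j = 0; j < NR; j++)
                *Bc++ = B[p + (jr + j) * ldB];
}
```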
A third benefit of packing is that the data of $A_c$ and $B_c$ are preloaded into certain cache levels of the memory hierarchy. This reduces the time required to access the elements of $A_c$ and $B_c$ when using them to update a microtile of C. Since $A_c$ is packed in Loop 3, after $B_c$ has been packed in Loop 2, the elements of $A_c$ will likely reside higher up in the memory hierarchy (i.e., closer to the registers) than those of $B_c$. By carefully picking the sizes of $A_c$ and $B_c$, the exact location of $A_c$ and $B_c$ within the memory hierarchy can be determined.

Data Usage Pattern
Consider Loop 6. This innermost loop updates a microtile of C, say $C_r$, via a series of $k_c$ rank-1 updates involving a micropanel $A_r$, from the packed matrix $A_c$, and a micropanel $B_r$, from the packed matrix $B_c$. At each iteration, $m_r$ elements from $A_r$ and $n_r$ elements from $B_r$ are multiplied to compute $m_r n_r$ intermediate results that are accumulated into the microtile of C. Between two iterations of Loop 6, different elements from $A_r$ and $B_r$ are used, but the results of the rank-1 updates are accumulated into the same microtile. Hence, the microtile $C_r$ is the data reused between iterations of Loop 6. In addition, because $k_c$ rank-1 updates are performed each time Loop 6 is executed, $C_r$ is said to have a reuse factor of $k_c$. Since the columns of $A_r$ and the rows of $B_r$ are used exactly once each time through Loop 6, there is no reuse of these elements.
Identifying pieces of data that are reused is important because it tells us which data portions should be kept in cache in order to allow fast access and reduce cache thrashing. An analysis similar to that of Loop 6, performed on the remaining loops to identify the data reused in each of them, the size of the reused data, and the number of times the data is reused, yields the results summarized in Table I.
Notice that the size of the reused data becomes smaller as the loop depth increases. This nicely matches the memory hierarchy, where faster memory layers (caches) feature lower capacity than slower caches. As such, GEMM maps smaller reused data to faster layers in the memory hierarchy, as depicted in Figure 3.

Characteristic Parameters that Drive Performance
As described in the previous subsection and captured in Table I, the GEMM algorithm that underlies BLIS is characterized by the following five parameters: $m_c$, $k_c$, $n_c$, $m_r$, and $n_r$, which correspond to the block/panel/tile sizes for each of the loops around the microkernel. In addition, this set of parameters also determines the dimensions of the reused data and the reuse factor of each of the loops.
Loop 6 is defined by three of the five parameters ($m_r$, $n_r$, and $k_c$), while Loops 3, 4, and 5 are each characterized by four of the five parameters. In addition, three of the four parameters that characterize Loops 3, 4, and 5 carry over to the next inner loop as well. For instance, once the parameters that characterize Loop 5 have been identified, the only unknown parameter for Loop 4 is $n_c$. This observation suggests that the parameters should be identified from the innermost loop outward, as summarized in the sketch below. In the next section, we leverage this observation to optimize this collection of parameters analytically.
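The inside-out derivation order can be summarized as follows (our own summary of the dependency chain, not an actual BLIS data structure):

```c
/* Characteristic parameters of the BLIS GEMM, in the order in which the
   model of the next section determines them; each step consumes the values
   fixed by the previous ones. */
typedef struct {
    int mr, nr; /* 1. register level: FP latency/throughput (Loop 6)       */
    int kc;     /* 2. L1 cache: keep micropanel Br resident, given mr, nr  */
    int mc;     /* 3. L2 cache: keep block Ac (mc x kc) resident, given kc */
    int nc;     /* 4. L3 cache (if any): keep Bc (kc x nc) resident        */
} gemm_params;
```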

MAPPING ALGORITHM TO ARCHITECTURE
An effective mapping of the microkernel and the loops around it to the architecture is fundamental to attain high performance with the BLIS approach. This means that we need to identify values for the characteristic parameters that are tuned for the target machine architecture. As discussed in the previous section, we start by mapping the microkernel to the architecture and work our way outward in order to identify the values for the different characteristic parameters. In mapping the algorithm to the architecture, we faithfully adhere to the loop structures shown in Figure 1. This is because, with the exception of Loop 6 (i.e., the microkernel), the order for the remaining loops is hard-coded within the BLIS framework. This implies that traditional loop transformations, such as loop permutations/interchange and unroll-and-jam, are not considered in our approach.

Architecture Model
To develop an optimization model for mapping the GEMM algorithm onto an architecture, we first need to define a model of our target architecture. We make the following assumptions regarding our hypothetical processor:

-Load/store architecture and vector registers. Data must be loaded into registers before they can be used in computations, as assumed in previous models [Whaley et al. 2001; Yotov et al. 2005]. The processor is equipped with vector registers, each capable of holding $N_{\text{VEC}}$ elements, and with vector instructions that include a fused multiply-accumulate (VFMA) with a latency of $L_{\text{VFMA}}$ cycles and a maximum issue rate of $N_{\text{VFMA}}$ VFMA instructions per cycle.

-Caches. The data caches are set-associative and employ a least-recently-used (LRU) replacement policy. For simplicity, we assume that the size of a cache line is the same for all cache levels.

-Bandwidth. It is assumed that the bandwidth between caches and registers is not a bottleneck in the computation.

Parameters for the Body of the Innermost Loop: $m_r$ and $n_r$

Recall that the microkernel is characterized by three parameters: $m_r$, $n_r$, and $k_c$, where the former two determine the size of the microtile of C that is reused across every iteration. In addition, these two parameters also define the number of elements of $A_r$ and $B_r$ that are involved in the rank-1 update at each iteration of Loop 6. In this subsection, we discuss how $m_r$ and $n_r$ can be identified analytically.
3.2.1. Strategy for Finding $m_r$ and $n_r$. The main strategy behind our approach to identifying $m_r$ and $n_r$ is to choose them "large enough" so that no stalls due to the combination of dependencies and latencies are introduced in the floating-point pipelines during the repeated updates of the microtile $C_r$. In addition, the smallest values of $m_r$ and $n_r$ that satisfy this condition should be chosen, because this implies that $C_r$ occupies the minimum number of registers, releasing more registers for the entries of $A_r$ and $B_r$. This in turn enables larger loop unrolling for the innermost loop (Loop 6) and more aggressive data preloading [Hennessy and Patterson 2003] to reduce loop overhead.

3.2.2. Latency of Instructions. Consider the $k_c$ successive updates of $C_r$ occurring in Loop 6. In each of the $k_c$ iterations, each element of $C_r$ is updated by accumulating the appropriate intermediate result to it (obtained from multiplying corresponding elements of $A_r$ and $B_r$). This update can be achieved with a VFMA instruction for each element of $C_r$ in each iteration. Now, recall that there is a latency of $L_{\text{VFMA}}$ cycles between the issuance of a VFMA instruction and the issuance of another dependent VFMA instruction. This implies that $L_{\text{VFMA}}$ cycles must lapse between two successive updates to the same element of $C_r$. During those $L_{\text{VFMA}}$ cycles, a minimum of $N_{\text{VFMA}} L_{\text{VFMA}}$ VFMA instructions must be issued to avoid introducing a stall in the floating-point pipelines. Since each VFMA instruction computes $N_{\text{VEC}}$ output elements, at least $N_{\text{VFMA}} L_{\text{VFMA}} N_{\text{VEC}}$ output elements must be computed. Therefore, the size of the microtile $C_r$ must satisfy
$$ m_r n_r \ge N_{\text{VEC}} \, L_{\text{VFMA}} \, N_{\text{VFMA}}. \qquad (1) $$

3.2.3. Maximizing Register Usage. Ideally, $m_r$ should be equal to $n_r$, as this maximizes the ratio of computation to data movement during the update of $C_r$ in Loop 6 ($2 m_r n_r k_c$ flops versus $2 m_r n_r + m_r k_c + n_r k_c$ memory operations). However, there are two reasons why this is not always possible in practice. First, $m_r$ and $n_r$ have to be integer values, as they correspond to the dimensions of the microtile $C_r$. Second, it is desirable for either $m_r$ or $n_r$ to be an integer multiple of $N_{\text{VEC}}$. As the number of registers is limited in any architecture, it is necessary to maximize their use, and choosing $m_r$ (or $n_r$) to be an integer multiple of $N_{\text{VEC}}$ ensures that the registers are filled with elements from $A_r$, $B_r$, and $C_r$ such that data not used in a particular iteration is not kept in registers.
Therefore, in order to satisfy this criterion as well as Equation (1), $m_r$ and $n_r$ are computed as follows:
$$ m_r = \left\lceil \frac{\sqrt{N_{\text{VEC}} \, L_{\text{VFMA}} \, N_{\text{VFMA}}}}{N_{\text{VEC}}} \right\rceil N_{\text{VEC}} \qquad (2) $$
and
$$ n_r = \left\lceil \frac{N_{\text{VEC}} \, L_{\text{VFMA}} \, N_{\text{VFMA}}}{m_r} \right\rceil. \qquad (3) $$
The astute reader will recognize that one could have computed $n_r$ before computing $m_r$. Doing so is analogous to swapping the values of $m_r$ and $n_r$, and also yields a pair of values that avoids the introduction of stalls in the pipeline. We choose the values of $m_r$ and $n_r$ that maximize the value of $k_c$, which will be discussed in the following section.
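As an illustration, the following C helper evaluates Equations (1)-(3). The numeric values in main() are our own illustrative assumptions, not measurements: an AVX-like core with $N_{\text{VEC}} = 4$ doubles per register, $L_{\text{VFMA}} = 8$ (obtained as the sum of the multiply and add latencies on a core without a native VFMA), and $N_{\text{VFMA}} = 1$.

```c
#include <math.h>
#include <stdio.h>

/* Equations (2) and (3): the smallest mr x nr microtile satisfying
   mr * nr >= N_VEC * L_VFMA * N_VFMA (Equation (1)), with mr a
   multiple of N_VEC. */
static void mr_nr(int n_vec, int l_vfma, int n_vfma, int *mr, int *nr)
{
    double min_elems = (double)n_vec * l_vfma * n_vfma;   /* Equation (1) */
    *mr = (int)ceil(sqrt(min_elems) / n_vec) * n_vec;     /* Equation (2) */
    *nr = (int)ceil(min_elems / *mr);                     /* Equation (3) */
}

int main(void)
{
    int mr, nr;
    /* Assumed AVX-like core: N_VEC = 4 doubles, L_VFMA = 8, N_VFMA = 1. */
    mr_nr(4, 8, 1, &mr, &nr);
    printf("mr = %d, nr = %d\n", mr, nr);   /* prints: mr = 8, nr = 4 */
    return 0;
}
```

For these assumed inputs the sketch yields an 8 x 4 microtile: at least 32 elements of $C_r$ must be in flight to cover the VFMA latency, and 8 is the smallest multiple of $N_{\text{VEC}}$ no less than $\sqrt{32}$.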

Parameters for the Remaining Loops: $k_c$, $m_c$, and $n_c$

After analytically deriving the values of $m_r$ and $n_r$, we next discuss how to proceed for $k_c$, $m_c$, and $n_c$.
3.3.1. Strategy for Identifying $k_c$, $m_c$, and $n_c$. We recall that $k_c$, $m_c$, and $n_c$ (together with $n_r$) define the dimensions of the reused data for Loops 3-5 (see Table I). Since the reused blocks are mapped to the different layers of the memory hierarchy, there is a natural upper bound on $k_c$, $m_c$, and $n_c$, imposed by the sizes of the caches. In addition, because the reused data should ideally be kept in cache between iterations, the cache replacement policy and the cache organization impose further restrictions on the values of $k_c$, $m_c$, and $n_c$.
In the remainder of this section, we describe our analytical model for identifying values for these characteristic parameters. For clarity, the discussion focuses on identifying the analytical value of $k_c$. The values of $m_c$ and $n_c$ can be derived in a similar manner.

3.3.2. Keeping $B_r$ in the L1 Cache. Recall that the micropanel $B_r$ is reused in every iteration of Loop 5. As such, it is desirable for $B_r$ to remain in the L1 cache while Loop 5 is being executed. Since the L1 cache implements an LRU replacement policy, conventional wisdom suggests choosing the sizes of the different parts of the matrices such that the data required for two consecutive iterations of the loop are kept in cache. This implies that the cache should be filled with $B_r$, two micropanels of $A_c$, and two microtiles of $C_c$. The problem with this approach is that keeping two micropanels of $A_c$ and two microtiles of $C_c$ in cache reduces the size of the micropanel $B_r$ that fits into the L1 cache, which means that the value of $k_c$ is smaller, hence decreasing the amount of data that can be reused. In addition, $k_c$ is the number of iterations of Loop 6, and reducing this value also means fewer opportunities to amortize data movement with enough computation in the microkernel.
Instead, we make the following observations: (1) At each iteration of Loop 5, a micropanel $A_r$ from $A_c$ and the reused micropanel $B_r$ are accessed exactly once. This implies that the micropanel that is loaded first will contain the least recently used elements. Therefore, because of the LRU cache replacement policy, as long as $B_r$ is loaded after $A_r$, the entries of $A_r$ will be evicted from the cache before those of $B_r$.
(2) Since each micropanel $A_r$ is used exactly once per iteration, it is advantageous to overwrite the entries of the micropanel $A_r^{\text{prev}}$ that was used in the previous iteration with the corresponding entries of the new micropanel $A_r^{\text{next}}$ that will be used in the present iteration. Doing so potentially allows a larger $B_r$ to fit into the cache, hence increasing the amount of data being reused.
3.3.3. Evicting $A_r^{\text{prev}}$ from the Cache. We assume that within the microkernel (Loop 6), the elements of $B_r$ are loaded after those of $A_r$. From the preceding observation (2), a strategy for keeping a large micropanel $B_r$ in cache is to evict the old micropanel $A_r^{\text{prev}}$, loaded in the previous iteration of Loop 5, by replacing its entries with those of the new micropanel $A_r^{\text{next}}$ to be used in the current iteration. To ensure this, we need to enforce that the same entries of all micropanels of $A_c$ are mapped to the same cache sets. Since the L1 cache comprises $N_{L1}$ sets of cache lines of size $C_{L1}$ bytes, memory addresses that are $N_{L1} C_{L1}$ bytes apart will be mapped to the same set in the L1 cache. This implies that the corresponding elements of consecutive micropanels of $A_c$ must lie in memory an integer multiple (say $C_{A_r}$) of $N_{L1} C_{L1}$ bytes apart.
Recall that consecutive micropanels of $A_c$ are packed contiguously in memory, and each micropanel of $A_c$ contains exactly $m_r \times k_c$ elements. This means that the distance between the same entries of two consecutive micropanels of $A_c$ is $m_r k_c S_{\text{DATA}}$ bytes, where $S_{\text{DATA}}$ is the size in bytes of one element. Therefore, $A_r^{\text{prev}}$ will be replaced by $A_r^{\text{next}}$ only if
$$ m_r k_c S_{\text{DATA}} = C_{A_r} N_{L1} C_{L1}. \qquad (4) $$
Rearranging the preceding expression yields the following expression for $k_c$:
$$ k_c = \frac{C_{A_r} N_{L1} C_{L1}}{m_r S_{\text{DATA}}}, $$
which ensures that a newly read micropanel of $A_c$ is mapped to the same sets as the existing micropanel of $A_c$. Thus, solving for $k_c$ in Equation (4) is equivalent to finding $C_{A_r}$, since the remaining variables in the equation are hardware parameters.
3.3.4. Finding $C_{A_r}$. The astute reader will recognize that $C_{A_r}$ is the number of cache lines taken up by a micropanel $A_r$ in each set of the L1 cache. Similarly, we can define $C_{B_r}$ as the number of cache lines in each set dedicated to the micropanel $B_r$. In order for $B_r$ to remain in a $W_{L1}$-way set-associative cache, it is necessary that
$$ C_{A_r} + C_{B_r} \le W_{L1}. $$
In practice, the number of cache lines filled with elements of $A_r$ and $B_r$ has to be strictly less than the degree of associativity of the cache. This is because at least one cache line per set must remain available when the entries of a microtile $C_r$ are loaded into the registers. Since $C_r$ is not packed, it is possible that its entries are loaded into the same sets as the entries of $A_r$ and $B_r$. Hence, we update the previous expression to account for the loading of $C_r$ as follows:
$$ C_{A_r} + C_{B_r} \le W_{L1} - 1. \qquad (5) $$
As the micropanel of $B_c$ is packed into $k_c n_r S_{\text{DATA}}$ contiguous bytes, we can compute $C_{B_r}$ as follows:
$$ C_{B_r} = \left\lceil \frac{k_c \, n_r \, S_{\text{DATA}}}{N_{L1} C_{L1}} \right\rceil = \left\lceil \frac{C_{A_r} n_r}{m_r} \right\rceil, $$
where the second equality follows from Equation (4). Therefore, replacing this expression in the inequality in Equation (5), we get
$$ C_{A_r} + \left\lceil \frac{C_{A_r} n_r}{m_r} \right\rceil \le W_{L1} - 1, $$
which suggests choosing $C_{A_r}$ as large as possible and then allows us to solve for $k_c$.
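To make the derivation concrete, the following sketch (our own, not part of BLIS) picks the largest $C_{A_r}$ satisfying the final inequality and then evaluates $k_c$ via Equation (4). The cache parameters in main() are illustrative assumptions: a 32-KByte, eight-way L1 with 64-byte lines (hence $N_{L1} = 64$ sets), double-precision data ($S_{\text{DATA}} = 8$), and the 8 x 4 microtile from the previous sketch.

```c
#include <stdio.h>

/* Largest C_Ar with C_Ar + ceil(C_Ar * nr / mr) <= W_L1 - 1, then
   kc from Equation (4): kc = C_Ar * N_L1 * C_L1 / (mr * S_DATA). */
static int kc_from_l1(int w_l1, int n_l1, int c_l1,
                      int mr, int nr, int s_data)
{
    int c_ar = 0;
    for (int c = 1; c + (c * nr + mr - 1) / mr <= w_l1 - 1; c++)
        c_ar = c;                     /* keep the largest feasible value */
    return c_ar * n_l1 * c_l1 / (mr * s_data);
}

int main(void)
{
    /* Assumed L1: 32 KBytes, 8-way, 64-byte lines -> 64 sets. */
    int kc = kc_from_l1(8, 64, 64, 8, 4, 8);
    printf("kc = %d\n", kc);          /* prints: kc = 256 */
    return 0;
}
```

Under these assumptions, the inequality admits $C_{A_r} = 4$ (since $4 + \lceil 4 \cdot 4/8 \rceil = 6 \le 7$ but $5 + \lceil 5 \cdot 4/8 \rceil = 8 > 7$), which gives $k_c = 256$.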
3.3.5. Two-way Set Associativity. Our value for $k_c$ was chosen to force $A_r^{\text{next}}$ to be loaded into the cache lines previously occupied by $A_r^{\text{prev}}$. To achieve this effect, we made two assumptions:

-Each set in the cache has to be filled equally with the micropanel $A_r$; that is, the number of cache lines in each set containing elements of $A_r$ is the same across all sets. This assumption ensures that the corresponding elements of different micropanels will be assigned to the same cache set.

-One cache line in each set of the cache is reserved for the elements of $C_r$, so that loading them will not evict the micropanel $B_r$ that already resides in cache.
The problem with a two-way set-associative cache is that satisfying these two assumptions would leave no available cache lines to hold the elements of $B_r$, precisely the block that we want to keep in cache.
However, if the size of a micropanel $A_r$ is $N_{L1} C_{L1}/k$ bytes, where $N_{L1}$ is an integer multiple of $k$, then the micropanel of $A_r$ that is loaded in the $(k+1)$-th iteration will be mapped to the same cache sets as the first micropanel of $A_r$. When $k = 2$, this is identical to keeping two iterations' worth of data in the cache, which ensures that the micropanel $B_r$ is kept in cache. Any larger value of $k$ decreases the size of the micropanel of $A_r$, which implies that $k_c$ is reduced. Therefore,
$$ m_r k_c S_{\text{DATA}} = \frac{N_{L1} C_{L1}}{2}, $$
which implies that $k_c$ is given by the formula
$$ k_c = \frac{N_{L1} C_{L1}}{2 \, m_r S_{\text{DATA}}} $$
when the cache is two-way set associative.
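For instance, under these two-way assumptions and with a hypothetical 32-KByte, two-way L1 with 64-byte lines (so $N_{L1} = 256$ sets; these cache parameters are our own illustrative choices) and the 8 x 4 double-precision microtile from before, the formula evaluates to

$$ k_c = \frac{N_{L1} C_{L1}}{2 \, m_r S_{\text{DATA}}} = \frac{256 \cdot 64}{2 \cdot 8 \cdot 8} = 128, $$

that is, each micropanel $A_r$ occupies half of the sets, and two consecutive micropanels together cycle through the whole cache while $B_r$ survives in the lines they do not touch.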

VALIDATION
In this section, we compare the values derived via our analytical model with those employed by implementations that were either manually tuned by experts or empirically tuned. Unlike previous work [Yotov et al. 2005; Kelefouras et al. 2014] that compared performance against an implementation based on ATLAS, we chose to compare our parameter values against the parameter values of manually optimized implementations built with the BLIS framework. As the algorithms used are identical, any deviation is the result of a difference in the parameter values. For this experiment, we chose the following mixture of traditional (x86) architectures and low-power processors: Intel Dunnington (X7460), Intel SandyBridge (E3-1220), AMD Kaveri (A10-7850K), and Texas Instruments C6678. This sample includes both dated (end-of-life) and recent architectures in order to evaluate the accuracy and validity of the model, while taking into account different trends in processor architecture design. We consider double-precision real arithmetic to obtain our parameter values.

Table II lists the machine parameters necessary for computing $m_r$ and $n_r$, the values of $m_r$ and $n_r$ analytically derived using the model, and the values of $m_r$ and $n_r$ chosen by the expert when implementing a microkernel with the BLIS framework. In addition, we include the expert's values for the microkernel underlying OpenBLAS. The instruction latencies for the different architectures were obtained from the vendors' instruction set/optimization manuals. In those cases where the architecture did not include an actual VFMA instruction, we computed $L_{\text{VFMA}}$ by adding the latencies of a floating-point multiply and a floating-point add. Note that our analytical values match those chosen by the BLIS experts for all the architectures. Furthermore, the values of $m_r$ and $n_r$ used in OpenBLAS also match our analytical values in two out of three cases. We note with interest that OpenBLAS utilizes an 8 × 2 microkernel for the AMD Kaveri. While it differs from our analytical 4 × 6 microkernel, we note that AUGEM [Wang et al. 2013], an empirical search tool developed by the authors of OpenBLAS to automatically generate microkernels, generated a microkernel that operates on a 6 × 4 microtile. This suggests that the 8 × 2 microkernel currently used by OpenBLAS may not be optimal for this AMD architecture.

Evaluating the Model for $m_r$ and $n_r$

The results provide evidence that our analytical model for $m_r$ and $n_r$ is reasonably robust across a variety of architectures. Table III presents the machine parameters needed to derive $k_c$ and $m_c$ using the model, the analytically optimal values, and the empirical values adopted in BLIS after a manual optimization process (inside parentheses). We do not report results for $n_c$ because most of these processor architectures do not include an L3 cache, which means that $n_c$ is, for all practical purposes, redundant. For the SandyBridge processor, there was minimal variation in performance when $n_c$ was modified.

Evaluating the Model for $k_c$ and $m_c$

Again, our analytical model yields similar, if not identical, values for both $k_c$ and $m_c$. For $k_c$, the analytical model yields exactly the values that the expert picked during manual optimization.
We expected to encounter more variation between the analytically derived values and the expert-chosen ones for the $m_c$ parameter. This is because most L2 caches are unified (i.e., they contain both instructions and data), which makes predicting the replacement behavior of the L2 cache more difficult, as it depends on the amounts of both data and instructions in that level of the cache. Nonetheless, we note that, with the exception of the AMD Kaveri, the values computed by our analytical model for $m_c$ were similar, if not identical, to those chosen by the expert.

Experimental Validation
In this subsection, we assess the accuracy of the analytical model in more detail. Our goal is twofold: first, we analyze the behavior of BLIS from the performance and cache miss ratio perspectives on an architecture for which the analytical model exactly predicts the empirically optimal values of $m_c$ and $k_c$ (Intel SandyBridge); and second, we quantify and explain the deviations in performance for an architecture on which the analytical values do not match the observed ones (AMD Kaveri).
4.3.1. Empirical Validation of the Analytical Model: The Intel SandyBridge Case. We experimentally measure the behavior of the L1 cache misses/GFLOPS rate as a function of the parameter $k_c$. Figure 4 shows the outcome of this evaluation on the Intel SandyBridge for three different values of $m_c$: the analytical/empirical optimal ($m_c = 96$) and two values, one below and one above the optimal ($m_c = 32$ and $m_c = 512$, respectively). Let us relate the size of the blocks in the L1 cache to the performance lines and the compulsory (also known as cold), conflict, and capacity misses [Hennessy and Patterson 2003]. Consider the $m_c$-optimal case (green line) first. We can clearly identify three ranges or intervals in the results for the L1 cache misses (right-hand side plot):

-$k_c < 48$ (i.e., $k_c = 16$ or 32). For these two small values, a complete block $A_c$ ($m_c \times k_c$) fits into the L1 cache together with a single micropanel $B_r$ ($k_c \times n_r$); see the arithmetic check after this list. In this scenario, the volume of conflict/capacity misses is negligible, and most of the L1 cache misses are compulsory, occurring when $A_c$ is packed by the corresponding routine and each $B_r$ is loaded from the microkernel. (The number of L1 cache misses during the packing of $B_c$ is negligible as well.) We emphasize that no L1 cache misses are expected in the accesses to any of the micropanels $A_r$ from the microkernel, as they are all part of $A_c$ and this block already resides in the L1 cache after it is packed. The repeated use of the micropanels $A_r$ from within Loop 4 explains why this yields such a low rate of L1 cache misses. On the other hand, the low value of $k_c$ does not allow one to hide the cost of transferring the data from the cache, which produces the low GFLOPS rate observed for these two particular values.

-$48 \le k_c \le 256$. This range of values for $k_c$ implies that $A_r$ and $B_r$ occupy up to 16 and 8 KBytes, respectively (i.e., for the largest value of the range, $k_c = 256$). Therefore, no capacity misses are to be expected in the L1 cache. Furthermore, because of the eight-way set-associative organization of the L1 cache in this architecture, the carefully aligned packing of BLIS ensures that $A_r$ occupies up to $C_{A_r} = 4$ lines per set of the L1 cache, $B_r$ up to $C_{B_r} = 2$ lines/set, and the remaining 2 lines/set are left "empty" for entries of $C_c$. Therefore, no significant amount of conflict misses is to be expected. The L1 cache misses are mostly compulsory, and originate when the micropanels $A_r$ and $B_r$ are loaded from within the microkernel. (This time, the volume of L1 cache misses during the packing of both $A_c$ and $B_c$ is negligible.) Indeed, most of the misses come from the loads of $A_r$, as the block $A_c$ does not fit into the L1 cache and has to be repeatedly accessed in each iteration of Loop 4. We also note that the number of misses is almost constant for the full range of values. This is because $A_c$ is streamed by the microkernel, from the L2 cache through the L1 cache, approximately $n_c/n_r$ times within Loop 4, independently of the actual value of $k_c$. From the point of view of performance, the best option occurs at $k_c = 256$, because this value is in the range that matches the architecture configuration and is the largest one, so that it best amortizes the overheads of packing and overlaps data transfers from the L1/L2 caches with computation.

-$k_c > 256$. As $k_c$ exceeds the analytical value, first conflict misses and later capacity misses dominate, resulting in lower performance.
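As a quick arithmetic check of the first threshold (our own computation, assuming double-precision data and the 32-KByte L1 configuration implied above, with $m_c = 96$ and $n_r = 4$): the combined footprint of $A_c$ and one micropanel $B_r$ is $m_c k_c S_{\text{DATA}} + k_c n_r S_{\text{DATA}}$ bytes, so

$$ \underbrace{96 \cdot 32 \cdot 8}_{A_c} + \underbrace{32 \cdot 4 \cdot 8}_{B_r} = 25{,}600 \text{ bytes} < 32 \text{ KBytes}, \qquad \text{whereas} \qquad 96 \cdot 48 \cdot 8 = 36{,}864 \text{ bytes} > 32 \text{ KBytes}, $$

that is, $A_c$ alone no longer fits once $k_c$ reaches 48, which is exactly where the first interval ends.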
Three intervals can be distinguished as well for the suboptimal case $m_c = 32$ (blue line): $k_c < 144$, $144 \le k_c \le 256$, and $k_c > 256$. The first interval has the same explanation as in the preceding optimal scenario (i.e., $k_c < 48$ for $m_c = 96$), but given that the value of $m_c$ is now 3× smaller than the optimal, the value of $k_c$ that yields a block $A_c$ that fits into the L1 cache is multiplied by 3, yielding the upper threshold $k_c < 144$ for the first interval. Second, given that the second interval is defined by the dimensions of the micropanels $A_r$ and $B_r$ that fit into the L1 cache, it is not surprising that its upper threshold is also defined by $k_c = 256$ when $m_c = 32$, since $m_c$ plays no role in the dimensions of these two micropanels. Third, the larger number of L1 cache misses of this suboptimal case compared with the optimal one for $k_c \ge 144$ is explained by the fact that the number of times $A_c$ is repacked and repeatedly accessed is proportional to $m/m_c$.
The same comments from the previous case apply to the suboptimal case $m_c = 512$ (red line), with the only exception being the high number of L1 cache misses when $k_c$ is very small. The reason is that, for this large value of $m_c$, the packing of $A_c$ itself produces this effect.
4.3.2. Quantification of the Deviations of the Analytical Model: The AMD Kaveri Case. Let us analyze the discrepancy between the analytical and empirical values of $m_c$ on the AMD Kaveri.
In particular, the model fixes the analytical value at $m_c^A = 1{,}792$, while the experimentation found the empirical value to be $m_c^E = 1{,}368$. Unfortunately, BLIS enforces that $m_c$ be an integer multiple of both $m_r$ and $n_r$ (on this architecture, 4 and 6, respectively), so that the analytical value cannot be tested experimentally. To avoid exceeding the capacity of the cache, in the following analysis we therefore consider the adjusted analytical value $\tilde{m}_c^A = 1{,}788$, which corresponds to the closest integer multiple of both 4 and 6 smaller than $m_c^A$. Figure 5 relates the performance (GFLOPS) and L2 cache misses on the AMD Kaveri, using the analytical value $k_c = 128$, for a wide range of values of $m_c$ that includes both $m_c^E$ and $\tilde{m}_c^A$. The first aspect to note in the cache miss plot is that the analytical suboptimal $\tilde{m}_c^A$ correctly determines the dimension from which the number of L2 cache misses initiates an exponential growth due to capacity constraints (specifically, this occurs from $m_c \ge 1{,}800 > m_c^A > \tilde{m}_c^A$). Compared with the empirical value, the case $\tilde{m}_c^A$ increases the L2 cache misses by 8.85%, which, in principle, seems a large value but cannot be appreciated in the plot due to the scale. However, relative to the maximum number of cache misses in the plot (i.e., normalized against the misses for $m_c = 2{,}016$), the misses for $\tilde{m}_c^A$ only represent an increase of 0.78% with respect to the figure for $m_c^E$. With these parameters, the slightly larger number of L2 cache misses results in a small decrease in performance: from 23.95 GFLOPS for $m_c^E$ to 23.48 GFLOPS for $\tilde{m}_c^A$, that is, -1.97%.

Fig. 5. GFLOPS and L2 cache misses (top and bottom plots, respectively) on the AMD Kaveri, for a matrix multiplication with all three operands square of dimension m = n = k = 4,000, the optimal $k_c = 128$, and varying $m_c$ in steps of 24. The blue and green markers, respectively, identify the empirical optimal value $m_c^E = 1{,}368$ and the analytical suboptimal $\tilde{m}_c^A = 1{,}788$.
Let us now consider the empirical suboptimal $\tilde{m}_c^E = 1{,}656$, which corresponds to the largest possible value of $m_c$ that is still within the "noise" of the best performance. From a practical point of view, the difference in performance between $m_c^E$ and $\tilde{m}_c^E$ is negligible (23.946 GFLOPS for the latter, which represents a decrease of 0.017% with respect to $m_c^E$ and can be considered to be at the noise level). Taking into account the dimensions of $A_c$, for $\tilde{m}_c^E$ this block occupies 1,656 KBytes, that is, 85.55% of the L2 cache or, in this 16-way associative cache, close to 13 lines of each 16-line set (the exact value is 12.93). We can observe that this is not far from the analytical optimal $m_c^A$, which dictates that $A_c$ should occupy 87.50% of the L2 cache, or 14 lines of each 16-line set. The conclusion from this analysis is that the discrepancies between the analytical values $m_c^A$, $\tilde{m}_c^A$ and the empirical values $m_c^E$, $\tilde{m}_c^E$ seem to be due to problems with the alignment of $A_c$ in the 16-way associative cache, probably related to the packing of $A_c$ producing unexpected conflict misses, and not to a deviation in the analytical model. A second source of alignment problems may be that, for this architecture, $n_r$ (= 6) is not a power of 2. Therefore, it may occur that each micropanel of $B_c$ does not overwrite the previous micropanel of this block, in turn causing conflict misses, since the streaming of this matrix does not overwrite the micropanels of B already residing in the L2 cache.

CONCLUSION AND FUTURE DIRECTIONS
We have developed an analytical model that yields optimal parameter values for the GEMM algorithm that underlies BLIS as well as other manually tuned high-performance dense linear algebra libraries. The key to our approach is the recognition that many of the parameters that characterize BLIS are bounded by hardware features. By understanding how the algorithm interacts with the processor architecture, we can choose parameter values that leverage, more aggressively, hardware features such as the multilayered cache memory of the system. We compare the values obtained via our analytical approach with the values from manually optimized implementations (e.g., by the experts behind OpenBLAS, an open-source project to which none of the authors has contributed), and report that they are similar if not identical. Unlike similar work that compares with ATLAS, this is the first article that demonstrates that an analytical approach to identifying parameter values can be competitive with best-in-class, expert-optimized implementations.