Energy-aware strategies for task-parallel sparse linear system solvers

Summary We present several energy-aware strategies to improve the energy efficiency of a task-parallel preconditioned Conjugate Gradient (PCG) iterative solver on a Haswell-EP Intel Xeon. These techniques leverage the power-saving states of the processor, promoting the hardware into a more energy-efficient C-state and modifying the CPU frequency (P-states of the processors) of some operations of the PCG. We demonstrate that the application of these strategies during the main operations of the iterative solver can reduce its energy consumption considerably, especially for memory-bound computations.

(FIGURE 1: Algorithmic formulation of the PCG method. Here, max is an upper bound on the relative residual for the computed approximation to the solution.)

In this paper, we explore different new approaches (using both P-states and C-states) to save energy in the task-parallel version of ILUPACK's PCG on an Intel Xeon Haswell-EP processor. In order to yield an energy-efficient execution, we have analyzed the following strategies.
• We explore the benefits of the race-to-halt/race-to-idle (RTH) strategy, which reduces the energy consumption of ILUPACK by transforming the busy-wait behavior of idle threads into an idle-wait.
• In addition, we leverage the iterative nature of the method to progressively adjust the frequency of the processor cores in order to reduce idle periods and harvest energy, implementing a dynamic VFS (DVFS)-based approach.11 In rough detail, this strategy detects the threads that experience longer idle periods and adjusts the best P-state for the execution of each task in order to reduce the waiting time of these threads. The DVFS technique has been successfully applied in the works of Wang et al,12 Alonso et al,13 and Kolodziej et al.14
• We also implement a memory-bound aware (MBA) methodology that adjusts the P-state of all threads taking into account the energy consumption instead of the execution time. This technique is useful to adapt the frequency of the cores to the current arithmetic intensity of the executed codes, reducing the energy consumption. We will show that there are several alternatives to achieve the objective of minimizing energy consumption.
All these techniques have been combined to analyze their impact on the implementation of the OpenMP task-parallel version of ILUPACK, showing relevant energy gains on the Haswell-EP processor, particularly when applying the MBA technique.
The rest of this paper is structured as follows. In Section 2, we introduce the multilevel preconditioned iterative solver in ILUPACK and its task-parallel variant. In Section 3, we explain the different energy-aware strategies implemented to improve the energy efficiency of the OpenMP task-parallel version of ILUPACK. We evaluate the impact of the application of these strategies in Section 4. Finally, we close this paper with a few concluding remarks in Section 5.

Sequential ILUPACK
Consider the linear system Ax = b, where A ∈ R^{n×n} is a sparse coefficient matrix, b ∈ R^n is the right-hand side vector, and x ∈ R^n is the sought-after solution. The solution of these kinds of systems can be performed using ILUPACK, a software library, written in C and Fortran, for the iterative solution of large sparse linear systems. For symmetric positive definite (SPD) linear systems, the PCG method implementation in ILUPACK integrates an "inverse-based approach" into the ILU factorization of matrix A, in order to obtain an efficient preconditioner (A ≈ LL^T = M). This method combines dropping with pivoting to bound the norm of the inverse triangular factor L, obtaining a numerical multilevel hierarchy of partial inverse-based approximations.15 The algorithmic description of the PCG method implemented in ILUPACK is presented in Figure 1. The first step of the solver is the computation of the preconditioner M (PrCo). The following iterative loop comprises a matrix-vector product (SpMV), the preconditioner application (PrAp), and several vector operations (Dot products, Axpy operations, 2-norm). The computation and application of the preconditioner concentrate most of the computational cost of the solver. For this reason, in the remainder of this section, we mainly focus on these operations.
Computation of the preconditioner. ILUPACK relies on the commonly named inverse-based approach to compute the preconditioner. This approach enhances the robustness of classical incomplete LDL^T factorizations by restricting the growth of the entries in the inverses of the triangular factors. To verify this, consider the factorization A = LDL^T + R, where L is a unit lower triangular matrix, D is diagonal, and R is the error matrix that accumulates those entries dropped during the factorization. The preconditioned matrix is obtained by applying the preconditioner M = LDL^T to the original matrix.
Moreover, ILUPACK accommodates pivoting during the factorization to bound the norm of the inverse triangular factors, creating a multilevel hierarchy of partial inverse-based approximations (see Figure 2). When the multilevel method is employed over multiple levels, a cascade of factors is usually obtained. Besides, the computed multilevel factorization is adapted to the structure of the underlying system. Consequently, in this case, the multilevel preconditioner can be formulated recursively, at a given level l, as
M_l = \hat{D}^{-1}\,\hat{P}\,P \begin{bmatrix} L_B & 0 \\ L_F & I \end{bmatrix} \begin{bmatrix} D_B & 0 \\ 0 & M_{l+1} \end{bmatrix} \begin{bmatrix} L_B^T & L_F^T \\ 0 & I \end{bmatrix} P^T\,\hat{P}^T\,\hat{D}^{-1},
where P defines the inverse-based permutation; L_B, L_F, and D_B are blocks of the factors of the multilevel LDL^T preconditioner (with L_B unit lower triangular and D_B diagonal); P̂ and D̂ are, respectively, related to fill-in reduction and scaling matrices; and M_{l+1} is the preconditioner computed at level l + 1.
Application of the preconditioner. The application of the preconditioner at level l (ie, computing z := M_l^{-1} r) requires solving a system of linear equations. After applying several transformations to the initial system (r' := D̂r and r̂ := P^T P̂^T r'), we obtain a new system,16
\begin{bmatrix} L_B & 0 \\ L_F & I \end{bmatrix} \begin{bmatrix} D_B & 0 \\ 0 & M_{l+1} \end{bmatrix} \begin{bmatrix} L_B^T & L_F^T \\ 0 & I \end{bmatrix} w = \hat{r},
which is solved for w (= P^T P̂^T D̂^{-1} z) in the following three steps.

1. Lower triangular solve (LwTrSv): \begin{bmatrix} L_B & 0 \\ L_F & I \end{bmatrix} y = \hat{r}.
2. Recursive step: \begin{bmatrix} D_B & 0 \\ 0 & M_{l+1} \end{bmatrix} t = y, that is, a diagonal scaling with D_B and the application of the preconditioner M_{l+1} computed at the next level.
3. Upper triangular solve (UpTrSv): \begin{bmatrix} L_B^T & L_F^T \\ 0 & I \end{bmatrix} w = t.
Therefore, the PrAp in ILUPACK performs two sparse matrix-vector multiplications and solves two triangular linear systems (with L_B and L_B^T). Additionally, it performs three other types of operations: diagonal scaling, vector permutation, and vector updates of the form x := a − b.

Task-parallel ILUPACK
Nested dissection. Parallelism in ILUPACK can be exposed by means of nested dissection orderings, which reveal parallelism by exploiting the connection between sparse matrices and adjacency graphs. In particular, the nested dissection algorithm partitions the adjacency graph G(A) associated with the approximate factorization of A into a hierarchy of vertex separators and independent subgraphs.17 For example, in Figure 3, G(A) is partitioned after two levels of recursion into four independent subgraphs. This hierarchy can be constructed using the METIS software,18 which minimizes the size of the vertex separators while balancing the size of the independent subgraphs.19 Therefore, relabeling the nodes of G(A) according to the levels in the hierarchy produces a reordered matrix, A ← P^T AP, with a structure amenable to efficient parallelization. Concretely, the leading diagonal blocks of P^T AP associated with the independent subgraphs can be computed first and independently; after that, S_(2,1) and S_(2,2) can be processed in parallel; and, finally, the separator S_(1,1) is calculated. This type of concurrency can be expressed as a binary task-dependency graph (TDG), where the nodes represent concurrent tasks and the edges dependencies among them.
Computation of the preconditioner. In order to design a task-parallel version of ILUPACK, we decouple the computation of the preconditioner into tasks, identify the dependencies among them, and map the tasks to the execution resources. For that purpose, the task-parallel version exploits the connection between sparse matrices and adjacency graphs,20 extracting parallelism via nested dissection, as explained before. (FIGURE 3: Nested dissection reordering. The first ND level finds separator S_(1,1); the second ND level finds separators S_(2,1) and S_(2,2). In this example, G(A) is partitioned into four independent subgraphs. FIGURE 4: TDG of the diagonal blocks. We rename the nodes of the TDG such that task T_j is associated with block A_jj in (4).) Consider, for example, the TDG in Figure 4, such that ILUPACK partitions the coefficient matrix A into four independent submatrices. The reordered matrix (P^T AP) can be decomposed into submatrices as in (4). After this reorganization, the leading blocks of these four submatrices can be factorized in parallel, whereas the blocks of the other levels have dependencies with the ancestor tasks. These blocks start their factorization only when the ancestor tasks have finalized. This process continues traversing the TDG, until the root task factorizes its local submatrix.
Application of the preconditioner. As we stated in the previous section, this operation in ILUPACK requires the solution of two triangular systems (lower and upper triangular factors). The TDG has to be traversed twice per solve z_{k+1} := M^{-1} r_{k+1} at each iteration of the PCG. Hence, although the TDG associated with the first triangular system (LwTrSv) presents the same structure and dependencies as that related to the preconditioner computation, in the latter triangular solve (UpTrSv) the dependencies are reversed (from the root to the leaves). For that reason, the amount of parallelism expands/shrinks as we progress toward/away from the leaves.
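The two sweeps over the TDG can be sketched on a small binary tree. The task numbering below (root 0, inner nodes 1 and 2, leaves 3 to 6) is hypothetical, chosen only for illustration: LwTrSv visits children before parents (leaves first, parallelism shrinking toward the root), while UpTrSv reverses every dependency, so the root runs first and parallelism grows toward the leaves.

```c
#include <assert.h>

#define NTASKS 7
/* Binary TDG: children of node i; -1 marks a leaf. */
static const int left[NTASKS]  = {1, 3, 5, -1, -1, -1, -1};
static const int right[NTASKS] = {2, 4, 6, -1, -1, -1, -1};

/* LwTrSv / PrCo sweep: a valid schedule visits children before parents. */
static void post_order(int node, int *order, int *pos) {
    if (node < 0) return;
    post_order(left[node], order, pos);
    post_order(right[node], order, pos);
    order[(*pos)++] = node;
}

/* UpTrSv sweep: dependencies reversed, so the root is processed first. */
static void pre_order(int node, int *order, int *pos) {
    if (node < 0) return;
    order[(*pos)++] = node;
    pre_order(left[node], order, pos);
    pre_order(right[node], order, pos);
}
```

In the actual runtime the tasks at the same tree level run concurrently; the sequential orders above only encode the dependency constraints of each sweep.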
Other kernels in the PCG iteration. The vectors involved in the PCG are partitioned conformally with matrix A; see (4). Accordingly, the SpMV, Axpy, and Axpy-like kernels only operate with the data related to the leaves of the TDG. For example, for a matrix partitioned as in (4), the SpMV is decomposed into four matrix-vector products, so that the processing of each of these four leaves is totally independent from the others.
Mapping tasks to cores. In this paper, we explore a data-flow version of ILUPACK using an ad-hoc runtime based on OpenMP.21 The execution traces show that, even with 4× more leaf nodes than threads, significant idle times still appear due to the implicit barrier at the end of parallel regions for PrCo, SpMV, LwTrSv, and UpTrSv. This behavior motivates the development of different approaches to save energy, which are described in the next section.

ENERGY-AWARE TECHNIQUES ON ILUPACK
The reduction of the energy consumption in ILUPACK can be tackled by tuning the C-states and P-states using different approaches. In this section, we consider three energy-aware strategies.

Race-to-halt/race-to-idle
This strategy reduces the energy consumption during the concurrent execution of ILUPACK by transforming the "busy-wait" behavior (threads poll until a new task is available) into an "idle-wait" (threads are blocked until a new task is ready). Benefits should be obtained because the operating system promotes the hardware into a more energy-efficient C-state. Two types of idle periods appear in the execution traces:
• Red areas indicate an implicit OpenMP barrier at the end of parallel regions.
• Blue areas denote idle time related to task queues management.
The length of the first type can be reduced by assigning the value PASSIVE to the environment variable OMP_WAIT_POLICY. In this way, PCG operations where the computation only affects the leaf nodes (SpMV and vector operations) are optimized. For the second type, the solution is to synchronize the access to the task queues using mutex and condition variables. This solution can also be applied to the operations that traverse the TDG, that is, PrCo and PrAp.
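The second mechanism can be sketched as a task queue protected by a mutex and a condition variable. This is a hypothetical helper, not ILUPACK's actual runtime code: the point is that a consumer blocked in `pthread_cond_wait` does not poll, which lets the operating system move the core into a deeper C-state until a producer signals that a task is ready.

```c
#include <assert.h>
#include <pthread.h>

#define QCAP 64

typedef struct {
    int tasks[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
} task_queue;

void tq_init(task_queue *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

void tq_push(task_queue *q, int task) {
    pthread_mutex_lock(&q->lock);
    q->tasks[q->tail] = task;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty); /* wake one blocked (idle) thread */
    pthread_mutex_unlock(&q->lock);
}

int tq_pop(task_queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)               /* idle-wait: block, do not poll  */
        pthread_cond_wait(&q->not_empty, &q->lock);
    int task = q->tasks[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return task;
}
```

For the implicit barriers, no code change is needed: `OMP_WAIT_POLICY=PASSIVE` is set in the environment before launching the solver.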

Dynamic VFS
The main objective of this strategy is to apply a frequency-tuning policy that reduces the waiting time as soon as possible. To accomplish this, the code integrates a mechanism to detect which threads experience waits. Then, it adjusts the best P-state for the execution of each task in order to reduce the waiting time of those threads. Taking into account that the execution time of the tasks has to be measured several times to make the best decision, it does not make sense to apply this technique to the preconditioner computation, but it can be perfectly included in the PCG iteration. The overhead related to this technique suggests applying it only to the most expensive operations of the iterative solver: SpMV, LwTrSv, and UpTrSv.
In Figure 6, it is easy to identify the slowest thread of the SpMV, but its detection requires measuring the computational time of each task (time(th)) and, then, locating the thread with the largest execution time (th_slw). During this computation, the initial P-state of all the tasks is P0 and, after the thread th_slw is identified, the remaining threads increase the P-state of their last executed task. Afterwards, the execution time of each thread is measured again. If the measured value for a thread is greater than the current execution time of th_slw, the P-state of its last task is decreased, and the thread is removed from the procedure. Otherwise, a new increment of the P-state of its last task is made. When the last task of a thread reaches the maximum P-state and a new increment is due, the next-to-last task has to be modified; in fact, the detection of the last task of a thread always excludes the tasks whose P-state has reached the maximum. Figure 7 shows an algorithmic formulation of the described process.
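The core of this search can be illustrated with a simplified simulation (one task per thread, hypothetical frequencies and workloads; this is not the algorithm of Figure 7 verbatim). P0 is the fastest P-state; the predicted time of a thread at P-state p is modeled as work/freq[p], and every thread other than th_slw keeps raising its P-state while its predicted time does not exceed that of th_slw.

```c
#include <assert.h>

#define NTH 4
#define NPSTATES 4

/* Hypothetical frequencies for P0..P3, in GHz. */
static const double freq[NPSTATES] = {3.0, 2.5, 2.0, 1.5};

/* Raise the P-state of the fast threads as far as possible without
   making any of them slower than the slowest thread at P0. */
void tune_pstates(const double *work, int *pstate) {
    double t_slw = 0.0;
    for (int i = 0; i < NTH; i++) {
        pstate[i] = 0;                      /* all tasks start at P0     */
        double t = work[i] / freq[0];
        if (t > t_slw) t_slw = t;           /* locate the slowest thread */
    }
    for (int i = 0; i < NTH; i++)           /* slow down the other ones  */
        while (pstate[i] + 1 < NPSTATES &&
               work[i] / freq[pstate[i] + 1] <= t_slw)
            pstate[i]++;
}
```

In the real policy the decision is made from repeated measurements over PCG iterations rather than from a known work array, and the P-states are attached to individual tasks instead of threads.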
The SpMV computation only involves leaf nodes; therefore, the implementation of the DVFS approach is straightforward. In contrast, LwTrSv and UpTrSv also involve intermediate nodes. Taking into account that the weight of these nodes in the global computation is relatively small, the intermediate nodes of both operations are not considered in this strategy. In LwTrSv, the execution of the DVFS approach is quite stable because all the threads start the processing of the leaf nodes at the same time, whereas the technique ensures that all the threads finalize their computation at about the same time, reducing the idle time (blue area in the traces). On the other hand, the starting time of the leaf nodes in UpTrSv is unknown because the execution of the intermediate nodes can be delayed in some scenarios (see Figure 8). Hence, it is more complex to adjust the finalization of the threads in this operation. Moreover, the execution of leaf and intermediate nodes can be interleaved during the execution of the PCG, yielding more complex scenarios. In any case, the DVFS approach is also useful in UpTrSv, reducing the waiting time related to the implicit barriers (red area in the traces). Figure 8 shows how the trace changes when the DVFS approach is applied to the PCG.

Memory-bound aware
The energy-aware implementations differ for CPU-bound and memory-bound operations. For the former, the best option is usually to execute the code at the highest CPU frequency. However, this strategy may be suboptimal for the latter because their execution is limited by the memory transfer rate. (FIGURE 7: Algorithmic formulation of the DVFS approach. FIGURE 8: Execution traces of the PCG-DVFS iterative solver preconditioned with ILUPACK for 8 threads.) In this case, the selection of an appropriate CPU frequency makes it possible to match memory and CPU rates, trading off execution time for power consumption, so that the energy efficiency is properly improved.
Many of the computations in the PCG are BLAS-1 and BLAS-2 operations, which are memory-bound. As in the previous section, we only consider the most expensive operations (the leaf nodes of SpMV, LwTrSv, and UpTrSv), which comprise different operations applied on sparse matrices, with a low arithmetic intensity. Therefore, the best energy-aware implementation consists in determining the best P-state for the corresponding tasks, instead of simply selecting the highest frequency. However, the best frequency depends on the data (including its partition and its mapping) and on the machine architecture; therefore, it is not known prior to the solution of the linear system.
A procedure similar to that described in Figure 7 can be defined to find the optimal frequency (see Figure 9) but, in this case, the execution-time metric is replaced by energy consumption. Moreover, this procedure always considers the whole operation, rather than individual threads, and, consequently, the changes of the P-states affect all the tasks. Thus, the variable EnCon refers to the energy consumption of the analyzed operation. The main objective of the procedure is to increase the P-state of the threads as long as the last two energy-consumption readings satisfy a condition; otherwise, the P-state is decreased and the procedure finalizes. This condition incorporates the parameter ErrAllowed, which controls the relationship between the two readings. In this way, a value of the parameter greater than one allows changing the P-state of the threads even if it produces a small loss of performance.
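The stopping rule can be sketched with a simplified simulation (the energy readings below are hypothetical, and the function name `mba_search` is ours, not from Figure 9): the P-state keeps increasing while the next reading satisfies EnCon_new ≤ ErrAllowed · EnCon_prev, which is equivalent to stepping forward, detecting the violation, and stepping back one P-state.

```c
#include <assert.h>

#define NPSTATES 4

/* encon[p] models the measured energy consumption of the whole operation
   when all leaf tasks run at P-state p (P0 = highest frequency). */
int mba_search(const double *encon, double err_allowed) {
    int p = 0;
    /* Advance while the next reading passes the ErrAllowed test;
       stopping here is equivalent to increasing and then decreasing. */
    while (p + 1 < NPSTATES && encon[p + 1] <= err_allowed * encon[p])
        p++;
    return p; /* P-state selected for all leaf tasks */
}
```

With a typical memory-bound profile the energy first drops as the frequency is lowered and then rises again once the cores become too slow, so the search stops near the minimum; an ErrAllowed greater than one lets it tolerate a slightly worse reading.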

Combining energy-aware methodologies
The techniques described in the previous sections are orthogonal and, therefore, they can be combined. In theory, the maximum benefit should be achieved when the three methodologies are applied; this assertion will be confirmed by means of the experimentation included in the next section. (FIGURE 9: Algorithmic formulation of the MBA strategy.) Note that, when the DVFS approach and the MBA strategy are combined, the first one has to be changed because the application of DVFS has to be made once the MBA strategy has finalized. In this way, the first line of the algorithm in Figure 7 has to be removed, because the initial frequency of the threads has been previously fixed using the algorithm of Figure 9.

EXPERIMENTAL RESULTS
In this section, we present the impact of the energy-aware techniques on ILUPACK. With this aim, we compare the performance and energy efficiency of the implementation without any energy-aware technique, which is referred to as Performance-Oriented (PO), and the implementations in which some of the energy-aware techniques have been included. The first matrix in the experimentation (A200) was generated from the discretization of a partial differential equation with Dirichlet boundary conditions, u = g on ∂Ω, whose discretization yields a sparse symmetric positive definite system. The other two matrices in the experimentation (audikw_1 and ldoor) correspond to examples from the SuiteSparse Matrix Collection,24 with close to 1,000,000 rows/columns and different sparsity patterns. Table 1 shows some relevant features of these matrices.

Hardware and software setup
Energy was measured using Intel's RAPL (Running Average Power Limit) interface,25 which reflects the estimated consumption of the core-uncore (package), the DRAM, and the total (core, uncore, and DRAM) system. For the Haswell-EP, the isolated core-only consumption is not provided by RAPL.
The idle energy was obtained by executing the Linux sleep command for 60 seconds on all cores. This value was then subtracted from the total energy in order to obtain the net energy. The experiments were executed after a warm-up period of 120 seconds using a busy-wait loop, and each experiment was repeated 10 times, reporting the average values. Moreover, we used the cpufrequtils26 package to vary the CPU performance state. The experiments analyze the performance and energy efficiency of the different implementations when they are executed on a single socket.
Therefore, we only report the results of the corresponding 8-core processor.
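The net-energy accounting described above can be sketched as follows (a hypothetical helper; the function name and the sample numbers are ours): the idle power is estimated from the 60-second sleep measurement, and its share over the duration of the actual run is subtracted from the measured total energy.

```c
#include <assert.h>

/* Net energy = total energy minus the idle energy the machine would have
   consumed over the same interval, with the idle power estimated from a
   separate sleep measurement. All energies in joules, times in seconds. */
double net_energy(double e_sleep_joules, double sleep_seconds,
                  double e_total_joules, double run_seconds) {
    double p_idle = e_sleep_joules / sleep_seconds; /* watts */
    return e_total_joules - p_idle * run_seconds;   /* joules */
}
```

For example, if the 60-second sleep consumed 600 J (10 W idle) and a 5-second run consumed 120 J in total, the net energy attributable to the computation is 70 J.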

Experimental setup
The right-hand side vector b in the iterative solvers was always initialized to the product A(1, 1, … , 1)^T, and the PCG iteration was started with the initial guess x_0 = 0. The parameter that controls the convergence of the iterative process in ILUPACK, restol, was set to 10^{-6}, whereas the drop tolerance and the bound on the condition number of the inverse factors, which control ILUPACK's multilevel incomplete factorization process, were set to 0.01 and 5, respectively. The approximate number of PCG iterations required to solve the corresponding linear system using these parameters, for a 32-leaf-node TDG and employing IEEE 754 real double-precision arithmetic, is also included in Table 1.
For each implementation, we have considered two different scenarios: balanced and unbalanced mappings. The first one is the default use of the library, in which the leaf nodes are sorted in the shared task queue according to their number of nonzero elements, so that, during the preconditioner computation, the costlier nodes are factorized first. In contrast, the second one has been manually built to generate an unbalanced execution, making it possible to verify how the energy-aware strategies adapt to these situations. In each of these scenarios, the same mapping of tasks to cores has been used for all the energy-aware implementations. Under these conditions, the differences are only related to the saving features of the corresponding techniques.
We have applied an incremental methodology in order to analyze the impact of each technique on the energy-consumption improvement. Thus, each new tested implementation always incorporates the previous techniques. Concretely, the DVFS implementations also include the RTH strategy, and the MBA implementations also use the two previous strategies.
The DVFS approach and the MBA strategy require measuring the energy consumption of each core when the operations are executed. To avoid the impact of measurement errors, both strategies accumulate the results of several PCG iterations. In theory, evaluating a larger number of iterations should improve the accuracy of the measurements, but the adjustment period and its overhead are extended. In our experimentation, 2 and 4 iterations were tried, and similar improvements were obtained, although the overhead was smaller in the first case. Therefore, only the results for 2 iterations are shown next.
The parameterization of the MBA strategy requires fixing the parameter ErrAllowed, which was set to 1.000 or 1.005, and determining on which type of tasks the P-state is changed. Our experiments show that it is more efficient to modify only the frequency of the leaf nodes than to change the frequency of all the tasks.

Analysis of the results
The performance and energy efficiency results are displayed by means of tables, in which we expose the relative improvement of the corresponding variant with respect to the PO implementation. This improvement is calculated as
(ref_PO − IMP) / ref_PO,
where ref_PO is the value of the PO implementation for the matrix that is used in the studied implementation (IMP). Note that negative values of this expression reflect a decrease of the performance, whereas positive values reveal improvements. In the tables, we multiply this result by 100, in order to show the corresponding percentage.

Tables 2 to 4 show the performance and energy efficiency of the different strategies for the matrices in Table 1. In general, a first analysis of these tables concludes that the energy improvements for audikw_1 are always greater than those for A200, because the nonzero pattern of the former is irregular, which yields a more difficult scenario in which to define a perfectly balanced mapping. In this way, the execution of audikw_1 includes more idle periods and, therefore, it offers more opportunities to apply energy-aware techniques. The energy improvements of ldoor are the worst in all the strategies because the execution time per PCG iteration of this matrix is smaller. The reason is that this matrix and its factor are sparser and, therefore, there are fewer chances to reduce wait-time drains. This is especially clear when the MBA strategy improvements for ldoor and the other two matrices are compared. The comparison of the two mappings for the three matrices renders the same conclusion: these strategies are more profitable for the unbalanced mapping because there are more idle periods.

Race-to-halt strategy. The first remark after applying the RTH methodology is that, for the three matrices, types of mappings, and stages (PrCo and PCG), the impact of this strategy on the execution time is practically negligible, because it is always less than 0.06%.
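The relative-improvement metric reported in the tables can be sketched directly (the function name is ours): for metrics where lower is better, such as time and energy, a variant cheaper than the PO reference yields a positive percentage and a more expensive one a negative percentage.

```c
#include <assert.h>
#include <math.h>

/* 100 * (ref_PO - IMP) / ref_PO: positive values are improvements over the
   Performance-Oriented reference, negative values are degradations. */
double rel_improvement_pct(double ref_po, double imp) {
    return 100.0 * (ref_po - imp) / ref_po;
}
```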
In several cases, some minor improvements are achieved, reaching 0.44%.
The same conclusion can be drawn for the DRAM values, since the use of the memory in this implementation has not changed with respect to the PO case. Therefore, the DRAM changes are directly related to the changes in the execution time.
The analysis of the total and net values of energy reveals the main advantage of this implementation, that is, the reduction of the package values and, consequently, the global ones. This improvement is more visible for the net energy and net power values.
For both matrix mappings, the previous assertions are fulfilled: this strategy reduces the energy-consumption and maintains the execution time.
The relative variations of the net package energy for the PrCo are remarkable, because they are higher than 5%, 14%, and 9%, respectively, for the unbalanced mapping of the three matrices. Furthermore, the improvement of the net package energy for the PCG stage is interesting: it is higher than 3.3%, 5.3%, and 5.3%, respectively, when we use the unbalanced mapping of the matrices. Moreover, the improvements are also appreciable for the balanced mappings.

Dynamic VFS-based approach. Following the incremental methodology mentioned earlier, we applied several DVFS implementations on top of the RTH case. Here, we consider three implementations.
• DVFS_1: The DVFS approach is only applied on SpMV.
• DVFS_2: The DVFS approach is applied on SpMV and LwTrSv.
• DVFS_3: The DVFS approach is applied on SpMV, LwTrSv, and UpTrSv.

A new column (Add Steps) is included in Tables 2 and 3, reporting the number of P-state changes (see Figure 7) with respect to the previous row in the table. Therefore, all the values in the DVFS_1 row correspond to the application of the DVFS approach on SpMV, the DVFS_2 row shows the additional impact of the strategy on LwTrSv, whereas the DVFS_3 row is focused on UpTrSv.
In this approach, the impact on the execution time is still really small, since the variation is always below 0.9%. Some small improvements appear, close to 0.6%, revealing one of the strengths of the strategy. The growth of the execution time is basically due to the overhead of the code in Figure 7 and, therefore, it grows with the number of P-state changes.
Unlike the RTH strategy, now there is no direct relationship between the DRAM values and the execution time. One might expect that the memory would consume more energy because the threads spend a longer period in operation, executing and demanding data from memory, but the tables show that this assumption does not hold in many cases. If the increment of the execution time is mainly related to the overhead of this strategy, it does not have any influence on the memory. Probably, the values of the tables show that the reduction of idle periods also reduces the overhead related to C-state transitions and, therefore, the threads spend less time executing tasks, diminishing the memory working time. This conclusion holds except for the case with more P-state changes (the unbalanced mapping for audikw_1), presumably because, in that case, the additional active time of the threads is higher than the C-state savings.
Again, the improvements of the total and net energy values are the main benefits of this strategy. The tables exhibit the positive impact on the energy and power consumption of changing P-states, even though the transitions introduce additional overheads. Overall, the net package energy savings for the unbalanced mapping are close to 5%, 14.5%, and 14%, respectively, for the three matrices, whereas they are still visible for the balanced mapping of matrices A200 and audikw_1, improving 4.03% and 5.81%, respectively.

Memory-bound aware strategy. The main objective of this strategy is to minimize the energy consumption of the implementation. Here, it is possible to use several methodologies to achieve this objective. We have considered four different variants.
In this strategy, the meaning of the last column in the tables (Add Steps) is different from that in the DVFS approach because, now, the values in this column do not represent the number of P-state changes applied on single tasks, but refer to changes applied to all the leaf-node tasks. Therefore, to obtain the real number of P-state changes, the bold values in the tables should be multiplied by 32.
The increase of the execution time of the MBA strategy is directly related to the error threshold, such that this time grows when the allowed error is increased. In any case, the maximum performance decrease is around 10%. Moreover, the improvement of the total energy is always greater than the reduction of the performance. Furthermore, the comparison of the balanced and unbalanced mappings allows us to conclude that this behavior is due to the error rather than to the mapping. Thus, for A200, the error = 0.0 cases are close to 6.85%, whereas the error = 0.5 cases are close to 9%.
It is also remarkable that the bold values for the balanced and unbalanced cases are really similar, but matrix-dependent. For A200 and audikw_1, those values are, respectively, in the intervals (21,27) and (21,26), whereas the interval for ldoor is (7,12). The small values of the last interval are due to the corresponding reduced execution time, which makes it impossible to fully adjust the CPU frequency to the memory transfer rate. On the other hand, the similarity of the two former intervals identifies the range of bold values for which the CPU and the memory rates are perfectly adjusted.
The reduction of the energy consumption is relevant for the package and global values. Minimizing the global energy produces improvements on the package energy, occasionally even larger than if only the package energy is optimized. Moreover, allowing some error improves the energy consumption in many cases. The savings are huge, with the improvements of the net package power being greater than 39%, 53%, and 25%, respectively, for the unbalanced mapping of the three matrices, and close to 39%, 45%, and 15% for the balanced case. The similarity between the balanced and unbalanced cases for A200 and audikw_1 confirms that the adjustment of the CPU frequency and the memory transfer rate is achieved when the bold values are between 21 and 27. Finally, although one could expect the improvement obtained with an error of 0.5% to surpass that computed without any error, this is not always true.

CONCLUSIONS
Several energy-aware strategies have been introduced in this paper to improve the energy efficiency of the task-parallel version of ILUPACK, focusing the study on the iterative solution of SPD sparse linear systems. The RTH strategy and the DVFS strategy manage, respectively, the C-states and the P-states of the cores to reduce the energy consumption with a negligible impact on the execution time. Combining these two strategies, the improvements of the net global energy are, respectively, close to 4%, 13%, and 12% for the three tested matrices. Additionally, the inclusion of the MBA strategy allows us to achieve higher energy savings at the cost of a longer execution time. In this case, the net global energy figures are, respectively, larger than 21%, 29%, and 17%. Moreover, the MBA strategy fixes the bold values in the interval (21,27) in order to balance the CPU frequency and the memory transfer rate.
In future work, we will consider extending these techniques to other iterative solvers, on different multicore architectures, and on clusters of multicores. Nowadays, almost all architectures implement energy-aware techniques that manage C-states and, therefore, the RTH strategy can be directly applied. However, only some of them allow changing the core frequencies separately. If the P-state change affects all the cores of the processor, only the MBA strategy can be used; otherwise, both MBA and DVFS can be leveraged. Moreover, the MBA strategy requires the RAPL interface to obtain the energy consumption. On clusters, the implementation of the RTH strategy will depend on the MPI implementation, which should allow using "idle-wait" instead of "busy-wait" in the communications. Moreover, MBA will always adjust the CPU frequency and the memory transfer rate, whereas, if DVFS can be applied, it will allow reducing the desynchronization of the task execution both among the processor cores and among different nodes of the cluster.
For new CPU architectures in which the Energy-Aware Race to Halt (EARtH) algorithm adapts the CPU frequency to the workload (Intel's Speed Shift technology27), the strategies will have to be changed. Thus, the MBA strategy will be substituted by the EARtH algorithm to determine the best frequency to balance the CPU frequency and the memory transfer rate. Later, DVFS will be applied to fix the best frequency for each task, taking advantage of the knowledge of the PCG operations, such that maximum savings are obtained.