Assessing the impact of the CPU power-saving modes on the task-parallel solution of sparse linear systems

We investigate the benefits that an energy-aware implementation of the runtime in charge of the concurrent execution of ILUPACK (a sophisticated preconditioned iterative solver for sparse linear systems) produces on the time-power-energy balance of the application. Furthermore, to connect the experimental results with the theory, we propose several simple yet accurate power models that capture, on two distinct platforms based on multicore technology from AMD and Intel, both the variations of average power introduced by the energy-aware strategies and the impact of the P-states on ILUPACK's runtime.


Introduction
Power consumption has been identified as a crucial challenge that must be tackled in order to deliver a sustained EXAFLOPS throughput, at an affordable cost, in the high-performance computing (HPC) facilities to be deployed by the end of this decade [1][2][3]. In particular, although we have enjoyed considerable advances in the performance-power ratio of HPC systems during the last few years (around 5× in the MFLOPS/Watt ratio for the systems ranked first on the Green500 list over the last five years [4]), the ambitious goal of building an Exascale system by 2020 that dissipates only 20 MW demands even more favorable improvement rates.
A major part of the progress experienced in the energy efficiency of HPC systems in these past years has been due to the hardware; concretely, to the aggregation of low-power processors with an increasing number of cores and the adoption of hardware accelerators for HPC [5]. However, much remains to be done to leverage the power-saving mechanisms available in today's hardware, and a holistic, multidisciplinary effort has to be conducted in energy-efficient HPC, comprising applications, system software, and hardware [6]. ILUPACK [7] is a numerical package for the solution of sparse linear systems via Krylov-based iterative methods. The software implements multilevel ILU (incomplete LU) preconditioners for general, symmetric indefinite, and Hermitian positive definite systems, such as those arising in numerous scientific and engineering applications, in combination with inverse-based ILUs and Krylov subspace solvers.
In [8] and [9] we presented two concurrent versions of ILUPACK, for multicore architectures and distributed-memory (message-passing) platforms respectively, that efficiently exploit the task-parallelism intrinsic to the iterative solution of sparse linear systems (including the calculation of the preconditioner). In response to the increasing energy awareness in HPC, in [10] we analyzed the energy efficiency of the task-parallel calculation of the preconditioner on an AMD-based multicore platform, leveraging the CPU power-saving modes available in this architecture [11]. In this paper we extend our previous work with the following new contributions:
-We introduce energy-aware strategies into ILUPACK's runtime and we address the characterization (modeling) of power consumption in the preconditioned solution of symmetric positive definite (s.p.d.) sparse linear systems, thus expanding our previous work [10], which covered the computation of the preconditioner only, to the complete solver.
-We collect precise information on the CPU power-saving modes, identifying the sources of the power bottlenecks and the energy gains, and relating the power consumption and savings observed in our detailed experimental results to the power models.
-We target a server based on the AMD Opteron 6128 processor, already analyzed in [10], but we also model and evaluate an alternative platform equipped with two Intel Xeon E5504 processors. These servers are representative of current multicore technology and both abide by the CPU power-saving modes in the ACPI (advanced configuration and power interface) specification [11]. From the points of view of power consumption and performance, though, these platforms present two relevant differences. First, all cores of the Intel processor share the same power plane, so the core frequency can only be adjusted at the processor (socket) level; on the AMD server, instead, the frequency can be varied per core. Second, as our experiments will show, on the Intel platform the memory bandwidth is independent of the processor operation frequency while, on the AMD server, changes to the frequency also affect the memory throughput.
-Finally, we refine the experimental evaluation in [10], using a more accurate wattmeter with a much higher sampling rate than that utilized in our previous study.
As one of the goals of this paper is to analyze the effect that the exploitation of the CPU power-saving mechanisms exerts on the energy efficiency of a complex scientific application, we consider only the power dissipated by the components integrated in the system motherboard (e.g., CPU and RAM chips), discarding other power sinks due, e.g., to the network interface, disks, inefficiencies of the power supply unit, etc. For applications such as ILUPACK, which mainly exercise the floating-point units of the processors and the memory, we can expect the contribution of the discarded components to be a constant that can simply be added to our models.
There is a large number of software efforts to reduce energy consumption in HPC clusters; for brevity, we next reference only a few. The work in [12] presents a characterization of energy-saving techniques for clusters with two large groups: static power management (SPM) and dynamic power management (DPM). Within DPM, the authors further distinguish between component-based and power-scalable load balancing (LB) techniques. Our approach, based on the exploitation of C-states, can be considered analogous to that in [13,14]; however, we apply it at the processor level rather than at the node level. From that perspective, our techniques can be classified as LB. The authors in [15] leverage DVFS (dynamic voltage-frequency scaling) to reduce the power-energy consumption of precedence-constrained tasks running on a cluster. Their heuristic-based strategy minimizes the slack of non-critical tasks by reducing their operation frequency while maintaining the time-to-solution. A similar approach is applied to the execution of dense linear algebra algorithms on multicore processors in [16]. In [17], the authors leverage hierarchical genetic strategy-based grid scheduling algorithms to exploit slack via DVFS and reduce energy consumption in computational grids. Their simulation results show that the proposed scheduling methodologies fairly reduce the energy usage and can be easily adapted to dynamically changing grid states and various scheduling scenarios. Our strategy considers a static VFS-based approach, in which the voltage/frequency pair is fixed at the beginning of the execution and does not vary thereafter. Our energy savings come instead from the exploitation of the C-states which, in general, lead to considerable energy savings by relying on a race-to-halt/race-to-idle strategy. DVFS strategies are part of future work.
The rest of the paper is structured as follows. In Sect. 2 we briefly review the numerical approach underlying ILUPACK and the task-parallel implementation oriented to multicore architectures from [8]. In Sect. 3 we introduce the setup for our experiments. The next three sections contain the main contribution of the paper: the power model in Sect. 4; the power-aware runtime, together with a theoretical and experimental analysis of the impact of the C-states on the time-power-energy balance, in Sect. 5; and an analogous study, from the point of view of the P-states, in Sect. 6. Finally, the paper closes with a discussion of the results in Sect. 7.

Parallel ILUPACK for multicore processors
The approach to multilevel preconditioning in ILUPACK relies on the so-called inverse-based ILU factorizations. Unlike classical threshold-based ILUs, this approach directly bounds the size of the preconditioned error and results in increased robustness and scalability, especially for applications governed by PDEs, due to its close connection with algebraic multilevel methods [8]. Specifically, for efficient preconditioning, only a small amount of fill-in is allowed during the factorization, resulting in a modest number of floating-point arithmetic operations per non-zero entry of the sparse coefficient matrix.
Parallelism in the computation of ILUPACK's preconditioners is exposed by means of nested dissection applied to the adjacency graph representing the non-zero connectivity of the sparse coefficient matrix. Nested dissection is a partitioning heuristic which relies on the recursive separation of graphs: the graph is first split by a vertex separator into a pair of independent subgraphs, and the same process is next applied recursively to each independent subgraph. The resulting hierarchy of independent subgraphs is highly amenable to parallelization. In particular, the inverse-based preconditioning approach is applied in parallel to the blocks corresponding to the independent subgraphs while those corresponding to the separators are updated. When the bulk of the former blocks has been eliminated, the updates computed in parallel within each independent subgraph are merged together, and the algorithm enters the next level in the nested dissection hierarchy. The same process is recursively applied to the separators in the next level, and the algorithm proceeds bottom-up in the hierarchy until the root finally completes the parallel computation of the preconditioner; see Fig. 1.
The type of parallelism described above can be expressed by a binary task dependency tree, where nodes represent concurrent tasks and arcs specify dependencies among them. The parallel execution of this tree on multicore processors is orchestrated by a runtime which dynamically maps tasks to threads (cores) in order to improve load balance during the computation of the ILU preconditioner. At execution time, thread migration is prevented using the POSIX routine sched_setaffinity. The runtime keeps a shared queue of ready tasks (i.e., tasks with their dependencies fulfilled) which are executed by the threads in FIFO order. This queue is initialized with the tasks corresponding to the independent subgraphs. Idle threads have to wait for new ready tasks.
When a given thread completes the execution of a task, its parent task is enqueued provided the sibling of the former task has been already completed as well.
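The dynamic scheduling just described can be sketched as follows. This is a minimal illustration, not ILUPACK's actual (pthreads-based) runtime: it assumes a perfect binary tree stored in heap order, whereas the real tree is irregular and each task performs factorization work.

```python
import threading
from collections import deque

class TaskTreeRuntime:
    """Sketch of a dynamic runtime for a binary task dependency tree:
    the leaves seed a shared FIFO ready-queue, and a parent task is
    enqueued once both of its children have completed."""

    def __init__(self, num_leaves):
        # Heap numbering: node t has children 2t+1 and 2t+2;
        # the leaves occupy the last `num_leaves` slots.
        self.num_tasks = 2 * num_leaves - 1
        self.ready = deque(range(num_leaves - 1, self.num_tasks))  # leaves
        self.pending_children = {t: 2 for t in range(num_leaves - 1)}
        self.lock = threading.Lock()
        self.executed = []

    def worker(self):
        while True:
            with self.lock:
                if self.ready:
                    task = self.ready.popleft()
                elif len(self.executed) == self.num_tasks:
                    return                 # all work done
                else:
                    task = None            # no ready task yet
            if task is None:
                continue                   # busy-wait; Sect. 5 replaces this
            # ... execute the task's factorization/update kernel here ...
            with self.lock:
                self.executed.append(task)
                if task > 0:
                    parent = (task - 1) // 2
                    self.pending_children[parent] -= 1
                    if self.pending_children[parent] == 0:
                        self.ready.append(parent)  # sibling finished too

def run(num_leaves, num_threads):
    rt = TaskTreeRuntime(num_leaves)
    threads = [threading.Thread(target=rt.worker) for _ in range(num_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return rt.executed
```

Note that every task executes exactly once and the root necessarily finishes last, since its dependencies transitively cover the whole tree.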
The most expensive operation involved in the preconditioned iterative solution of the linear system is the application of the multilevel preconditioner, which is in turn decomposed into two steps: the multilevel forward (FS) and backward (BS) substitutions. The aforementioned task dependency tree also describes the parallelism available within both computations. However, while the FS proceeds bottom-up towards the root of the tree, the BS proceeds in the opposite direction. In order to maximize data locality during the parallel multi-threaded execution of both operations, the mapping of threads to tasks resulting from the (dynamic load-balancing) computation of the preconditioner is re-used, so that each thread knows in advance which tasks it is in charge of (i.e., static mapping). The runtime uses a different task queue for each thread and substitution algorithm. For the FS, the queue of each thread is initialized with the leaves it is in charge of, and new (ready) tasks are enqueued on the corresponding queues as soon as their dependencies are fulfilled (i.e., as soon as their children tasks are completed). For the BS, only the root task is initially included in the corresponding queue. As soon as the root task is completed, its children are enqueued on the corresponding queues, and the parallel execution (orchestrated by the runtime) proceeds top-down while taking care of task dependencies until the computation of the leaves is completed. The other operations involved in the preconditioned iterative solution stage (i.e., the sparse matrix-vector product and vector operations) are split and mapped conformally with the FS and BS steps in order to maximize data locality.

[Fig. 1: Nested dissection applied to the adjacency graph associated with a sparse matrix, and the corresponding task dependency tree.]
Moreover, a careful management of shared data, by maintaining consistent or inconsistent copies of the matrix and the vectors, avoids synchronization steps, except those reductions required in order to compute inner products. Further details on the mathematical foundations of the parallel algorithms, their implementation, and the runtime operation can be found in [8].

Environment setup
In all our experiments, we employ a scalable symmetric positive definite sparse linear system of dimension n = N^3, arising from the partial differential equation −Δu = f in the 3D unit cube Ω = [0, 1]^3 with Dirichlet boundary conditions u = g on ∂Ω, discretized using a uniform mesh of size h = 1/(N + 1). We set N = 252, which yields the largest linear system that fits into the main memory of the target machines, with about 16 × 10^6 unknowns and 111 × 10^6 nonzero entries in the coefficient matrix. All tests were performed using IEEE double-precision arithmetic. Execution time, power, and energy are always reported in seconds (s), Watts (W), and Joules (J), respectively.
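Assuming the standard 7-point finite-difference discretization of the Laplacian on the N × N × N interior grid (a plausible reading of the setup above, not stated explicitly), the reported problem sizes can be reproduced in a few lines:

```python
# Size of the 3D Poisson test problem for N = 252, assuming the
# standard 7-point finite-difference stencil (our assumption).
N = 252
n = N**3                    # number of unknowns (interior grid points)
# Each row holds up to 7 nonzeros; rows adjacent to the boundary lose
# one entry per touching face, i.e., 2*N^2 entries per dimension.
nnz = 7 * N**3 - 6 * N**2

# n   = 16,003,008  ~  16 x 10^6 unknowns
# nnz = 111,640,032 ~ 111 x 10^6 nonzero entries
```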
We employ two target servers in our experiments. The first platform, wt_amd, is equipped with a single AMD Opteron 6128 processor (8 cores), 24 Gbytes of RAM, and runs the Linux Ubuntu operating system (kernel 2.6.32-220.4.1.el6.x86_64). The second server, wt_int, contains two Intel Xeon E5504 processors (4 cores per socket), 32 Gbytes of RAM, and runs Linux Ubuntu (kernel 2.6.32-220.4.1.el6.x86_64) as well. These two types of processors are representative of current multicore technology, and adhere to the ACPI standard [11] for the CPU power-saving modes. Concretely, the AMD processor features 5 performance states (P-states P0-P4) and three operating (power) states (C-states C0, C1 and C1E). The Intel processor has 4 P-states (P0-P3) and 4 C-states (C0, C1, C3 and C6). Information on the voltage-frequency pairs (V_i, f_i) associated with each P-state P_i is collected in Table 1. State C0 is the normal operating mode (i.e., the CPU is operative), while higher numbers correspond to deeper sleep modes, in which more circuits and signals of the processor are turned off, saving more power but requiring longer times to return to C0. From the practical point of view, the AMD and Intel servers differ in two important aspects:
-The frequency of the AMD cores can be adjusted independently while, on the Intel platform, all cores in the same processor run at the same frequency. In particular, if the cores of a processor from wt_int operate at frequency f_i, and we instruct one of these cores to run at f_j > f_i (using the Linux cpufreq utility), the remaining three cores in the same socket will also transition to f_j. On the other hand, if the cores of a processor from wt_int run at frequency f_i, and we instruct one core to run at f_j < f_i, there will be no change.
-On the AMD platform, the bandwidth between the cores and the main memory varies with the processor frequency while, on the Intel platform, this bandwidth is independent of it. To illustrate this behavior, column BW_i of Table 1 reports the bandwidth to the main memory experienced by a single core running the stream microbenchmark [18] at different frequencies.
We note that the bandwidth-frequency dependence is a design decision specific to each processor type: more recent processors such as the Intel Xeon E5-2670 "Sandy Bridge" seem to follow the AMD 6128's strategy and reduce the bandwidth with the processor frequency [19]; on the other hand, some other processors, like the AMD 6274 "Interlagos", apparently abandon this approach [20].
In our experiments, power samples were obtained from the 12-Volt lines connecting the power supply unit to the motherboard of the target platform, using an internal wattmeter composed of a National Instruments (NI) analog input module (9205) plugged into an NI chassis (cDAQ-9178) and a board of current transducers (LEM HXS 20-NP). The wattmeter is connected via an Ethernet link to a separate power-tracing server that runs a daemon application to collect power samples from the internal wattmeter. The measurement application is built by calling routines from the pmlib library [22,23]. We set the sampling rate to 1 kSample/s, which is high enough to obtain reliable measures for the power model and the remaining experiments. Our multithreaded implementation of ILUPACK is built on top of the OpenMP interface available with Intel icc (version 12.1.3) on both platforms. Performance (core activity) traces were captured using the Extrae+Paraver (versions 2.2.1+4.3.4) tracing environment [24]. Traces of the CPU power modes were recorded using the Linux interface to read and write model-specific registers (MSRs), and dumped into Paraver-compatible files for interactive visualization and analysis.

The power model and the CPU power-saving modes
We open this section by revisiting the following simple model from [25] for the total (aggregate) power dissipated by an application at a given instant of time t:

P_T = P_P + P_Y = (P_U + P_C) + P_Y,    (1)

where P_P is the power dissipated by the processor(s) and P_Y is the power dissipated by the remaining components (system power corresponding, e.g., to the DDR RAM chips, motherboard, etc.). Furthermore, P_U is the power dissipated by the uncore [26] elements of the processor (e.g., last-level cache, memory controllers, core interconnect, etc.), and P_C is the power drawn by the cores (including the in-core cache levels, floating-point units, branch logic, etc.). While our power models refer to the power dissipated at a given instant of time t, in most of the experiments that follow we will report instead the average power for the application, as this allows us to compile the information for the complete execution in a single figure.
Furthermore, this easily connects the results with the total energy consumption. Given a platform with all cores in state P_i, for simplicity we will assume that P_Y_i and P_U_i (i.e., the system and uncore powers in state P_i) remain constant during the execution of the application. In practice, starting from an idle (cold) platform, these two factors grow with the system temperature until their sum reaches a plateau [27]. To avoid this effect, we assume that there is a continuous workload to run on the platform and, in order to mimic this situation, all our tests are performed on a "hot" system, with this state reached by initially warming the cores with an execution of the same kernel or application for a given period of time. Also, we will assume that P_U_i is independent of the application that runs on the platform. In practice, this is not the case, but our results will show that the errors introduced by this simplification are small and can be easily accommodated into the model.
To obtain practical values for the power model, we proceed as follows. For simplicity, let us assume that all c active cores of the platform run the same type of task (kernel) k, in the same state P_i, during the entire execution. In this scenario, the total power at instant t equals the average power. For P_Y_i, we thus simply set all the cores of the platform to each P-state using cpufreq, measure the power with the platform idle for 30 seconds, and average the results: between 71.86 and 84.83 W, depending on the state P_i, for wt_amd; and 33.43 W for all P-states on wt_int; see column P_Y_i in Table 2. The estimation of P_U_i and P_C_i is more elaborate, as it is difficult to separate these two components, and the second also depends on the application that is being run. To achieve this, let us start by refining (1) to capture the total power for the execution of c copies of task k, with the active cores in state P_i:

P_T_k,i(c) = P_Y_i + P_U_i + c · P_C1_k,i,    (2)

where P_C1_k,i denotes the power dissipated by a single core in state P_i running task k.
To estimate the missing parameters in (2), P_U_i and P_C1_k,i, we leverage three compute-intensive kernels: the cpuburn microbenchmark, a simple busy-wait test consisting of a "while (1);" loop, and the general dense matrix-matrix product (gemm) routine from Intel MKL operating on double-precision data. Specifically, we executed these tests for 60 s and averaged the power draw, for an increasing number of cores c (all in the same P-state) of the machines (8 cores on both wt_amd and wt_int), while the remaining cores stayed in an inactive C-state.
Applying linear regression to the data obtained from this experimental evaluation, we obtained linear models for the total power of the form

P_T_k,i(c) = α_k,i + β_k,i · c,    (3)

with the values for α_k,i and β_k,i in the corresponding columns of Table 2, and the relation between the models and the experimental data graphically captured in Fig. 2. These regression models show an almost perfect fit with the experimental data, offering (roughly) the same value α_k,i for all three kernel types: the largest variation between the three was 2.11 % and the average difference 0.61 %. Therefore, in the following we use α_i − P_Y_i as an estimation for P_U_i (see Table 2), the uncore power dissipated by a socket in state P_i (which agrees with our assumption that the uncore power is independent of the kernel type); and we set P_C_k,i(c) = β_k,i · c, so that P_C1_k,i = β_k,i.
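The fitting step can be sketched as follows; the power readings below are made-up placeholders for the actual wattmeter samples (the real α_k,i and β_k,i come from Table 2):

```python
# Sketch: fit the linear power model P(c) = alpha + beta*c of Eq. (3)
# to average-power samples taken with c = 1..8 active cores.
def fit_linear(cs, powers):
    """Ordinary least squares for P = alpha + beta*c (no numpy needed)."""
    m = len(cs)
    mean_c = sum(cs) / m
    mean_p = sum(powers) / m
    beta = (sum((c - mean_c) * (p - mean_p) for c, p in zip(cs, powers))
            / sum((c - mean_c) ** 2 for c in cs))
    alpha = mean_p - beta * mean_c
    return alpha, beta

# Illustrative data only: a 90 W system+uncore floor plus 7.5 W per core.
cs = list(range(1, 9))
powers = [90.0 + 7.5 * c for c in cs]
alpha, beta = fit_linear(cs, powers)
# alpha estimates P_Y_i + P_U_i; beta estimates the per-core power P_C1_k,i
```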

Leveraging the C-states in ILUPACK
We next investigate the exploitation of the C-states made by two implementations of the runtime underlying ILUPACK, and relate their costs to the previous power model. In the experiments in this section, we employ all the cores of the target platforms, and we set the Linux governor to ondemand, operating all the active cores in the same state P0 during the entire execution (i.e., we do not allow voltage-frequency changes). In Sect. 2 we exposed that the task-parallel calculation of the preconditioner in ILUPACK is organized as a directed task graph, with the structure of a binary tree and bottom-up dependencies, from the nodes (tasks) at each level to those in the level immediately above it. The subsequent iterative process basically requires the solution of (lower and upper) triangular linear systems per iteration, with tasks that are also organized as binary task trees, with bottom-up (lower triangular system) or top-down (upper triangular system) dependencies. In any case, when the tasks of these trees are dynamically mapped to a multicore platform by the runtime, the execution should result in periods of time during which certain cores are idle, depending on the number of tasks of the tree, their computational complexity, the number of cores of the system, etc. It is basically these idle periods that we could expect the operating system to leverage, by promoting the corresponding cores into a power-saving C-state (sleep mode).
Figure 3 presents the execution trace, power consumption, and C-states observed during the computation of the ILUPACK preconditioner, using the original (power-oblivious) runtime, with all cores of wt_int in state P0. Surprisingly, the results are quite different from what we had expected: idle periods do not show a transition of the corresponding core to a power-saving C-state, nor the associated reduction of the power rate.
Figure 4 reports an analogous behavior for the (preconditioned) iterative solution stage on wt_int (and similar results were also obtained for both stages on wt_amd). A closer inspection of the runtime that leverages the task-parallelism in ILUPACK reveals the reason for these surprising results. Concretely, in the original implementation of ILUPACK's runtime, upon encountering no tasks ready to be executed, "idle" threads simply perform a "busy-wait" (polling) on a condition variable until a new task is available. This strategy thus prevents the operating system from promoting the associated cores into a power-saving C-state, because the threads are not actually idle (but doing useless work). This performance-oriented decision is far from uncommon, being adopted in runtimes like OmpSs (SMPSs) [28] or libflame+SuperMatrix [29] as well. Furthermore, the same performance-oriented but power-oblivious behavior appears, for example, when a synchronous GPU kernel is invoked in the default operation mode of CUDA [30] (the CPU remains in an active poll, waiting for the GPU to finish), or with the polling mode of certain MPI implementations (e.g., MVAPICH [31]).
As an alternative to the previous power-hungry strategy, we developed a power-aware version of the runtime underlying ILUPACK, which applies an "idle-wait" (blocking) whenever a thread does not encounter a task ready for execution and thus becomes inactive. (Note that setting the necessary conditions for the operating system to promote the cores into a power-saving C-state is as much as we can do, since we cannot explicitly enforce the transition from the application code.) As in the original version of the runtime, upon completing the execution of a task, a thread updates the corresponding dependencies, identifying those tasks, if any, that have become ready for execution. However, in the power-aware runtime, the thread also ensures that the number of active (non-blocked) threads is, at least, equal to the number of ready tasks, releasing blocked threads if needed. The effect of the idle-wait on the power trace and on the use of the C-states of wt_int is illustrated in Fig. 5, for the computation of the preconditioner, and Fig. 6, for the iterative solution stage. Compared with the performance-oriented (but power-hungry) implementation of the runtime (see Figs. 3 and 4), the new runtime effectively allows inactive cores to enter a power-saving C-state, thus yielding the sought-after power reduction.
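The idle-wait strategy can be sketched with a condition variable. This is an illustrative Python analogue of the pthreads-based mechanism, not the actual ILUPACK code:

```python
import threading
from collections import deque

class PowerAwareQueue:
    """Sketch of the "idle-wait" (blocking) strategy: an idle worker
    blocks on a condition variable instead of spinning on the ready
    queue, so the OS can promote its core into a power-saving C-state."""

    def __init__(self):
        self.cv = threading.Condition()
        self.tasks = deque()
        self.done = False

    def put(self, task):
        with self.cv:
            self.tasks.append(task)
            self.cv.notify()              # release one blocked worker

    def get(self):
        with self.cv:
            while not self.tasks and not self.done:
                self.cv.wait()            # idle-wait: no useless work
            return self.tasks.popleft() if self.tasks else None

    def shutdown(self):
        with self.cv:
            self.done = True
            self.cv.notify_all()

# Demo: a single worker consumes tasks in FIFO order, sleeping whenever
# the queue is empty (the stand-in for an idle core entering C1+).
q, out = PowerAwareQueue(), []

def worker():
    while (task := q.get()) is not None:
        out.append(task)                  # stand-in for executing the task

th = threading.Thread(target=worker)
th.start()
for i in range(5):
    q.put(i)
q.shutdown()
th.join()
```

The key contrast with the busy-wait version is that `cv.wait()` deschedules the thread entirely, which is precisely the condition the operating system needs to move the core into a sleep state.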
The pending question, however, is whether the adoption of the power-aware runtime comes with a performance penalty that may blur the energy benefits, as in most cases the key factor is energy instead of power. Table 3 compares the execution time, average power, and energy consumption of the two runtimes, showing that, fortunately, this is not the case: combined with the negligible impact of the runtime on the execution time, the observed power reductions translate into similar energy savings.
Let us now relate the power-energy reductions attained by the reimplementation of the runtime that leverages the CPU C-states to the power model of the previous section. For this purpose, we need to (i) account for the periods of "idle" time during the execution of ILUPACK, with both the original and the energy-aware variants, and (ii) assess the impact of replacing a busy-wait (polling) with an idle-wait (blocking). To tackle (i), we follow a pragmatic approach and simply execute the actual codes, measuring the actual idle and computation times using the tracing framework Extrae+Paraver, while collecting power samples with the internal wattmeter and the pmlib library. For (ii), we use the data in Table 2 for P_Y_0 and P_U_0, and estimate P_C1_ilu,0 = β_ilu,0 using a procedure for ILUPACK analogous to that exposed for the three benchmark kernels, combined with linear regression. Finally, we assume that a core promoted to a sleep state does not dissipate any core power.
Let P_T_pilu,0(c) and P_T_bilu,0(c) denote, respectively, the total power dissipated during the execution of ILUPACK using the power-oblivious (polling, pilu) and power-aware (blocking, bilu) runtimes, with c cores in state P0. Since now some cores may be inactive during a certain part of the execution, we need to adapt (2), which becomes

P_T_pilu,0(c) = f_pilu,0 · (P_Y_0 + P_U_0 + c · P_C1_ilu,0) + (1 − f_pilu,0) · (P_Y_0 + P_U_0 + c · P_C1_polling,0).    (4)

The first term of the addition captures the cost of the cores performing useful work during the computation of ILUPACK (like (2)), and appears multiplied by f_pilu,0, the fraction of the total time that this computation occupies. The second term corresponds to the remaining fraction of the total time, (1 − f_pilu,0), and captures the power dissipated by the cores while polling. In our evaluation, we set P_C1_polling,0 = P_C1_busy,0, as the underlying procedures are similar.
On the other hand, for P_T_bilu,0(c) we have

P_T_bilu,0(c) = f_bilu,0 · (P_Y_0 + P_U_0 + c · P_C1_ilu,0) + (1 − f_bilu,0) · (P_Y_0 + P_U_0),    (5)

as we assumed that a core in blocking mode wastes no power (i.e., P_C1_blocking,0 = 0). Table 4 compares the values of the theoretical ratios P_T_bilu,0(c)/P_T_pilu,0(c) with the experimental data (averaged over 10 different executions), showing a very close match between the two: below 2 % for wt_int and slightly larger, about 4 %, for wt_amd. These results confirm the benefits of the power-aware runtime, but also the accuracy of the power model. For all other frequencies, as we will see next, the model always predicted the power ratio with an error below 2 %.
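The predicted ratio can be evaluated numerically. The figures below are hypothetical placeholders for the measured system power, uncore power, per-core powers, and activity fraction (the real values come from Table 2 and the traces):

```python
# Predicted total power for a runtime whose cores compute during a
# fraction f of the time and otherwise either poll (pilu) or block (bilu).
def total_power(P_Y, P_U, c, f, P_core_busy, P_core_idle):
    """Weighted sum of the 'computing' and 'waiting' phases of the run."""
    return (f * (P_Y + P_U + c * P_core_busy)
            + (1.0 - f) * (P_Y + P_U + c * P_core_idle))

# Hypothetical inputs (stand-ins for Table 2 and the measured traces).
P_Y, P_U, c, f = 75.0, 20.0, 8, 0.85
P_pilu = total_power(P_Y, P_U, c, f, 8.0, 8.0)  # polling burns core power
P_bilu = total_power(P_Y, P_U, c, f, 8.0, 0.0)  # blocked core ~ 0 W
ratio = P_bilu / P_pilu                          # < 1: blocking saves power
```

Note that with polling the per-core term never drops, so P_pilu reduces to the plain model (2), whereas the blocking variant removes the core power for the entire idle fraction of the run.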

Impact of the P-states on ILUPACK
In this section we evaluate the effect of the different P-states available on each processor on the performance-power-energy trade-off of ILUPACK. For that purpose, we set the Linux governor to userspace and operate all the cores of the platforms in the same P-state. In the following, we always employ the power-aware version of the runtime. Therefore, we assume that, when idle, a core remains in one of the deep power-saving C-states (C1 or higher), consuming a negligible amount of power.
The general consensus is that, for a memory-bound computation, some benefit may result from operating the system cores at low frequencies. The reason is that, although there exists a linear dependence between core performance and frequency, the effect on the execution time of a memory-bound algorithm should be minor, because the key for this type of computation is not core performance but memory bandwidth. On the other hand, for current multicore technology, a reduction of frequency is associated with a decrease of voltage (see Table 1) and, because of the relation of static power to V^2 and of dynamic power to V^2 · f, in principle we can expect a significant reduction of the power draw. However, the balance between these two factors, time and power, on the energy efficiency is delicate, and other elements also play a role. Whether these variations of time and power yield a loss or a gain for ILUPACK from the point of view of energy efficiency is thus the question to investigate in this section.
Table 5 reports the impact of the P-states on the time, (average) power consumption, and energy efficiency of the two stages of ILUPACK, calculation of the preconditioner and iterative solution, on both platforms. To help with the analysis of these results, Table 6 offers the corresponding variation ratios with respect to state P0. The first aspect to notice is that the presumed independence between execution time and core frequency does not hold on wt_amd. This should not be a surprise, as our experiment in Table 1 already revealed that there is a strong connection between the core frequency and the memory bandwidth on this platform (see also column BW_i in Table 6). The combined decreases of frequency and memory bandwidth when moving from P0 to a higher P-state (between −25 and −60 % for the former, and from −18.68 to −53.78 % for the latter) explain the increases of execution time for both the preconditioning stage (22.60-106.88 %) and the iterative solver (12.54-73.60 %) on this platform.
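The trade-off can be illustrated with a back-of-the-envelope calculation that ignores static power and uses hypothetical voltage-frequency pairs (not the actual Table 1 values):

```python
# Relative dynamic power and energy when moving from state P0 to a lower
# P-state, assuming dynamic power ~ V^2 * f. The (V, f) pairs below are
# hypothetical placeholders, not the measured Table 1 entries.
V0, f0 = 1.20, 2.0e9          # "P0" (hypothetical)
V2, f2 = 1.05, 1.5e9          # "P2" (hypothetical)

power_ratio = (V2 * V2 * f2) / (V0 * V0 * f0)    # large power drop

# CPU-bound code: time scales like 1/f, eroding part of the gain.
energy_cpu_bound = power_ratio * (f0 / f2)
# Memory-bound code: time is (ideally) flat, so energy tracks power.
energy_mem_bound = power_ratio * 1.0
```

Under these assumptions a memory-bound code captures the full power reduction as an energy saving, while for a CPU-bound code the longer runtime claws back much of it; the AMD platform complicates the picture further because its memory bandwidth also drops with the frequency.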
The behavior of wt_int is quite different, which is partially explained by the fact that, on this platform, a reduction of frequency does not bring a decrease of memory bandwidth. Still, for the preconditioner, the reduction of frequency when moving from P0 to a higher P-state (−6.50 % for P1, −13.50 % for P2 and −20.00 % for P3) basically matches the increase of execution time for this stage (5.21, 11.35 and 19.86 %, respectively). We can take this as an indicator that the computation of the preconditioner (or, at least, parts of it) is not such a memory-bound computation as one could, in principle, presume. The results are different for the iterative solver. In this case, there is no significant difference in the execution time when running the stage in states P0 or P1, and the time increase when moving from P0 to P2/P3 is 2.16/11.27 %, which is still lower than what could be explained by the reduction of frequency alone.
From the performance point of view, the major conclusion of this analysis is that the best solution is to always run ILUPACK with all the cores operating at the highest frequency (i.e., in state P0), though in some cases (in particular, the iterative solver executed in states P0, P1 and P2) the differences are small on wt_int.
Performance is crucial but, under some circumstances, energy efficiency is also vital. From that point of view, a reduction of power is beneficial only if it does not yield an increase of execution time that blurs the positive effects on energy consumption. For the particular case of ILUPACK, the results in Table 5 show that, on wt_amd, the most energy-efficient solution is to execute the preconditioner with all cores in state P0 but the iterative solver in state P1. For wt_int, however, using states P1, P2 or P3 for the iterative solver results in small but significant energy savings, from −1.93 to −4.25 %. Let us connect again the power variations attained with the different P-states and the models for total power. For this purpose, we relate P^T_bilu,i(c) and P^T_bilu,0(c) using the model and the experimental data. Table 7 reports the accuracy of our model in capturing the experimental behavior due to the variations of the P-state on ILUPACK, with an error of at most 3.08 % for wt_amd and even smaller for wt_int.
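The trade-off just described reduces to the identity E = P · T: a lower P-state saves energy only when the relative power reduction outweighs the relative time increase. The following sketch makes this arithmetic explicit; the percentages fed to it are illustrative, not the measurements of Tables 5-6:

```python
# Energy = average power x time: a lower P-state pays off only when
# the relative power reduction exceeds the relative time increase.
# The percentages below are illustrative, not the paper's measurements.

def energy_change(dt_pct, dp_pct):
    """Relative energy variation (in %) given a time increase dt_pct
    and a power variation dp_pct, both relative to P0."""
    t = 1.0 + dt_pct / 100.0
    p = 1.0 + dp_pct / 100.0
    return (t * p - 1.0) * 100.0

# Small slowdown, larger power cut -> net energy saving (negative result).
print(energy_change(dt_pct=2.0, dp_pct=-8.0))    # -> -6.16
# Large slowdown swamps the power cut -> net energy loss (positive result).
print(energy_change(dt_pct=25.0, dp_pct=-15.0))  # -> +6.25
```

This is precisely the pattern observed above: on wt_int the solver's mild slowdown leaves room for a net saving, while on wt_amd the bandwidth-driven slowdowns at higher P-states erase most of the power benefit.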

Concluding remarks
We can list two main contributions of this work: (i) the implementation of an energy-aware runtime for the complete preconditioner+iterative solve process in ILUPACK; and (ii) the elaboration and experimental characterization of simple models for the (average) power that explain the variations observed for the new energy-aware runtime and the effect of the different P-states for this particular application on two different multicore architectures. The introduction of the energy-aware runtime results from the experimental observation that, in an energy-oblivious execution of the original runtime for ILUPACK, idle threads with no useful task to execute simply poll until new work is available. As a result, these threads dissipate a significant amount of power on current processors, for no practical performance benefit in the particular case of ILUPACK. Our energy-aware implementation replaces this behavior with a more power-friendly one that blocks idle threads until new work is available. This requires a careful reorganization of the underlying runtime, to avoid deadlocks and ensure a rapid response that does not impair performance. In our experiments, we observed energy savings between 7 and 13 %, at practically no cost from the performance point of view, which our power model clearly connects to the impact of the C-states.
In summary, the approach adopted in this paper is based on the exploitation of idle periods during the concurrent execution of a task-parallel version of ILUPACK for multithreaded architectures. By replacing the busy-waits of the runtime in charge of execution with idle-waits, we favor the introduction of race-to-idle, which in turn allows the operating system to promote the hardware into a more energy-efficient C-state. We believe that the same technique can be applied to other task-parallel scientific codes and, as part of ongoing work, we are currently embedding this approach into a general runtime like OmpSs, modified to embrace idle-wait, so that we can evaluate the performance-power-energy trade-offs for other task-parallel applications.
Busy-wait and idle-wait are analogous to well-known concepts of operating systems like spinlock and mutex, respectively. The reason that idle-wait is beneficial for ILUPACK is that the number of changes between busy and idle periods is moderate and the duration of these periods is "long enough". This depends on a number of factors including not only the number and duration of the periods but, e.g., also the costs in time and energy of blocking/releasing the threads, the costs in time and energy of the changes between different C-states, etc.
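The two waiting strategies can be contrasted with a minimal task-queue sketch in Python threads; this is only an analogy to the runtime mechanism described above (the `TaskQueue` class and task names are invented for illustration), not ILUPACK's actual implementation:

```python
# Sketch contrasting busy-wait and idle-wait for a worker thread.
# The TaskQueue class and the task strings are illustrative inventions.
import threading
import time

class TaskQueue:
    def __init__(self):
        self._tasks = []
        self._cond = threading.Condition()
        self._done = False

    def put(self, task):
        with self._cond:
            self._tasks.append(task)
            self._cond.notify()          # wake one blocked worker

    def close(self):
        with self._cond:
            self._done = True
            self._cond.notify_all()

    def get_idle_wait(self):
        """Idle-wait: block on a condition variable until work arrives,
        letting the OS promote the core into a deeper C-state."""
        with self._cond:
            while not self._tasks and not self._done:
                self._cond.wait()        # thread sleeps; no polling
            return self._tasks.pop(0) if self._tasks else None

    def get_busy_wait(self):
        """Busy-wait: poll until work arrives; the core stays in the
        active state C0 and keeps dissipating power while spinning."""
        while True:
            with self._cond:
                if self._tasks:
                    return self._tasks.pop(0)
                if self._done:
                    return None

results = []
q = TaskQueue()
worker = threading.Thread(target=lambda: results.append(q.get_idle_wait()))
worker.start()
time.sleep(0.05)                # worker blocks (idle-waits) during this gap
q.put("factorize block 0")
worker.join()
print(results)                  # -> ['factorize block 0']
```

The blocking variant is the analogue of the mutex: the cost of putting a thread to sleep and waking it is paid once per idle period, which is why the technique pays off only when those periods are few and long.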
In theory, there exists a linear relation between performance and frequency, which could be expected to be even sublinear (at least on the Intel processor) for a presumably memory-bound computation like ILUPACK, and a quadratic/cubic relation between energy and voltage-frequency. However, the analysis of the time-power-energy trade-off when the cores operate in a certain P-state (voltage-frequency pair), with the energy-aware version of the runtime, reveals the high impact of idle and, to a minor degree, uncore power, which clearly favor shorter execution times over lower power dissipation rates. This is also contrasted to and accurately captured by our power model.