Communication in Task-Parallel ILU-Preconditioned CG Solvers using MPI+OmpSs

SUMMARY We target the parallel solution of sparse linear systems via iterative Krylov subspace-based methods enhanced with ILU-type preconditioners on clusters of multicore processors. In order to tackle large-scale problems, we develop task-parallel implementations of the classical iteration for the CG method, accelerated via ILUPACK and ILU(0) preconditioners, using MPI+OmpSs. In addition, we integrate several communication-avoiding (CA) strategies into the codes, including the butterfly communication scheme and Eijkhout's formulation of the CG method. For all these implementations, we analyze the communication patterns and perform a comparative analysis of their performance and scalability on a cluster consisting of 16 nodes, with 16 cores each.


INTRODUCTION
The solution of linear systems is a ubiquitous computational problem, appearing in fundamental numerical simulations as well as in recent methods for data analytics [1]. In order to tackle sparse instances of these linear algebra problems, ILUPACK† provides sequential implementations of several key Krylov subspace methods, such as CG, BiCG, SQMR and GMRES [2], enhanced with a sophisticated multilevel incomplete LU (ILU) preconditioner [3].
The notable cost of computing and applying ILUPACK's preconditioner, in the framework of the CG method for large symmetric positive definite (s.p.d.) linear systems [2], motivated the design of a task-parallel procedure for clusters of multicore processors (see [4] and the references therein). For this type of computer architecture, this parallel version of ILUPACK exposes task-parallelism via nested dissection applied to the coefficient matrix of the linear system. As a consequence, it usually performs more floating-point operations (flops) than the original ILUPACK, with the overhead growing mildly with the number of tasks [5]. Furthermore, the parallel implementation calculates a preconditioner which differs from that computed by the sequential ILUPACK, yielding distinct convergence rates (though not necessarily slower ones).
- The butterfly transformation scheme for collective communication primitives [9,10].
- Eijkhout's formulation of the PCG method [11], which shifts all reduction operations into a single synchronization point per iteration.
- A modification to merge, and thus reduce the number of, tasks corresponding to multiple "communication-less" vector operations, which diminishes the scheduling overhead.
• We discuss in detail the communication and synchronization patterns appearing in these 8 parallel implementations (two preconditioners × four formulations: the classical one plus three enhanced with CA strategies).
• For all these implementations we perform a comparative analysis of their weak and strong scalability on a cluster consisting of 16 nodes, with two 8-core sockets per node. Our experiments show that, in order to offer a significant acceleration, the CA strategies will require additional optimization and/or an evaluation at a larger scale. Unfortunately, the current parallelization approach offers a limited degree of scalability. The primary reason is that, as the amount of computational resources grows, the additional concurrency that is explicitly exposed by partitioning the computational load (sparse matrix/adjacency graph) of the problem into further levels does not compensate for the overhead introduced by the splitting.
The communication analysis performed in this work carries over to several alternative CA variants of the CG solver (see the related work below), enhanced with an ILU preconditioner such as ILU(0), ILU(p), ILUT, etc. [2], that rely on task-parallelism. On the other hand, the analysis of the numerical properties of these distinct preconditioners and CA variants has been performed elsewhere and is out of scope for this paper. We instead focus on their parallel properties, communication patterns/costs, and synchronization overheads.

Related work
Task-parallelism is usually exploited by parallel direct solvers for sparse linear systems [12,13,14,15,16]. Our approach also leverages task-parallelism but, when applied to iterative ILU-preconditioned solvers, yields significantly different properties for the task dependency graph (TDG) associated with the problem and for the distribution of the computational cost among the graph nodes [5]. These factors dictate the regular communication patterns that yield the parallel efficiency discussed in this work.
The cost of process/thread synchronization in parallel computers has motivated a number of research efforts pursuing CA re-formulations of iterative solvers for linear systems. In general, these solutions trade off rounding errors, which may produce slower convergence to the solution, for higher parallel performance. A collection of CA variants of the CG method for s.p.d. linear systems is surveyed in [11,17,18]. Our work differs in that we target the solution of sparse linear systems using task-parallel CG solvers and CA variants of these, in combination with (simple) ILU(0) as well as (complex) ILUPACK preconditioners.
Figure 1. Partitioning of the coefficient matrix A (left) and task dependency tree for the diagonal blocks (right). Task T_j is in charge of processing the diagonal block A_{jj}. T_6 is associated with the separator obtained from the first application of nested dissection, and T_4, T_5 with those obtained after the second one.

ILUPACK
Consider the linear system Ax = b, where A is the n × n sparse coefficient matrix, b is the right-hand side vector, and x is the sought-after solution, with both b, x of size n. For s.p.d. linear systems, ILUPACK's implementation of the CG method integrates an efficient preconditioner obtained from an incomplete LU factorization A ≈ LU = LL^T = M. The method relies on dropping combined with pivoting to bound the norm of the inverse of the triangular factor L, yielding a numerical multilevel hierarchy of partial inverse-based approximations [19,20]. (For the s.p.d. case, ILUPACK specifically computes the factorization A ≈ L(DL^T) = LU, where L is a lower triangular factor and D is a diagonal matrix. For simplicity, we will skip the diagonal scaling in our discussion.) From the perspective of a parallel implementation, the computation and application of the preconditioner are two challenging operations that, furthermore, concentrate most of the computational cost of the solve. We next review the approach to extract task-parallelism for these operations; for details on the numerical foundations of the method, see, e.g., [5].

Exposing task-parallelism
The parallel version of ILUPACK recursively applies nested dissection in order to identify a collection of independent blocks in A via row/column permutations [5]. This process defines a hierarchy of subgraphs and separators, which then dictates the order in which the blocks have to be factorized during the preconditioner computation and in which the triangular systems need to be solved during the subsequent PCG iterations. Concretely, the hierarchy defines a TDG with a binary-tree form, where the subgraphs occupy the leaves of the graph/tree and the separators correspond to the internal nodes. Figure 1 (left) illustrates the effect of applying two nested dissection steps to a sparse matrix, yielding 4 subgraphs and 3 separators. Note that, due to symmetry, A_{ij} = A_{ji}^T. Figure 1 (right) shows the corresponding TDG of the transformed matrix. The edges of that graph specify the dependencies between the diagonal blocks (tasks) for the preconditioner computation (i.e., the approximate matrix factorization).
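The binary-tree TDG and its dependencies can be made concrete with a small sketch. The following C fragment is purely illustrative (the type and function names are ours, not ILUPACK's): it builds the tree of Figure 1 and processes the diagonal blocks in post-order, i.e., every separator task is factorized only after both of its children have completed.

```c
/* Minimal sketch (not ILUPACK code): a binary-tree TDG where each node holds
 * a diagonal block, processed in post-order so that every task runs only
 * after both of its children (its dependencies) are done. */
#include <stdio.h>

typedef struct tdg_node {
    int id;                        /* block/task index, e.g., 0..6 in Figure 1  */
    struct tdg_node *left, *right; /* children in the dissection tree (NULL for leaves) */
} tdg_node;

static void factorize_block(int id) {
    /* placeholder for the incomplete factorization of diagonal block A_{id,id} */
    printf("factorize task T%d\n", id);
}

/* Post-order traversal: sub-blocks (children) first, separator last. */
static void process(tdg_node *t) {
    if (!t) return;
    process(t->left);
    process(t->right);
    factorize_block(t->id);
}

int main(void) {
    /* Tree from Figure 1: leaves T0..T3, separators T4, T5, root T6. */
    tdg_node n[7];
    for (int i = 0; i < 7; i++) { n[i].id = i; n[i].left = n[i].right = 0; }
    n[4].left = &n[0]; n[4].right = &n[1];
    n[5].left = &n[2]; n[5].right = &n[3];
    n[6].left = &n[4]; n[6].right = &n[5];
    process(&n[6]);   /* prints T0..T3, then T4, T5, and finally T6 */
    return 0;
}
```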
Computing the preconditioner  In order to increase the degree of concurrency during the computation of the preconditioner, the submatrices associated with the graph in Figure 1 are further decomposed, with each separator block split into the addends contributed by the corresponding subtrees.

Applying the preconditioner  The subsequent application of the preconditioner M = LL^T during the iterative PCG solve requires the solution of two triangular systems, for the lower triangular factor L and its transpose, per iteration. Consider, for example, the solution of the linear system Ly = c, with L featuring the same block structure as the lower triangular part of A in Figure 1 (left), and assume y = [y_0, y_1, y_2, y_3 | y_4, y_5 | y_6], c = [c_0, c_1, c_2, c_3 | c_4, c_5 | c_6] denote partitions conformal to that of A in the same figure. In this case, to unleash a parallel execution, the triangular solve Ly = c is decomposed by blocks: the leading blocks L_{00}, ..., L_{33} define four lower triangular linear systems that can be solved concurrently for y_0, ..., y_3. After that, forward substitution yields updated values for c_4, c_5, which are used in the solves with L_{44}, L_{55} for y_4, y_5; and repeating the process we obtain an updated value c_6 to be used in the final solve with L_{66} for y_6. The TDG for the lower triangular system therefore presents the same binary-tree structure identified via nested dissection; see Figure 1 (right). For the triangular solve involving L^T, though, the dependencies are reversed, pointing down from the root to the leaves.
The key to high performance during a task-parallel execution of ILUPACK's CG solver lies in concentrating most of the computational work on the leaves of the TDG, so that the internal nodes contribute little to the global computation from the perspective of the operation count [5]. The reason is that, as one proceeds upwards in the binary-tree TDG, the amount of concurrency is reduced. This lack of concurrency can be compensated for if most of the computational work is concentrated in the leaves while the internal nodes only represent a small fraction of the cost.

Other ILU-PCG solvers
The previous approach to formulate a task-parallel execution of ILUPACK's PCG solver also applies to other ILU-type solvers in general, and to ILU(0) [2] in particular. As a result, the communication patterns that we identify in the next two sections apply to a generic task-parallel ILU-PCG solver.

The iterative PCG solve
We focus the analysis in this section on the iterative PCG solve, as this phase is usually more expensive than the computation of the preconditioner. Figure 2 offers an algorithmic description of the classical iterative PCG scheme. The loop body consists of a sparse matrix-vector product (SPMV, S1), the preconditioner application (equivalent to two triangular solves, S5), three DOT products (S2, S6 and S9), three AXPY(-like) operations (S3, S4 and S8), and a few scalar operations.
Figure 2. Classical formulation of the iterative ILU-PCG solver with annotated computational kernels. The threshold τ_max is an upper bound on the relative residual for the computed approximation to the solution. In the notation, <•, •> computes the DOT (inner) product of its vector arguments.
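To make the kernel labels concrete, the following C sketch reproduces one common arrangement of the classical PCG recurrences with the kernels S1-S9 named above; the exact ordering and the convergence test in Figure 2 may differ slightly, and spmv/apply_prec stand for the task-parallel kernels discussed in the remainder of the paper.

```c
/* Sketch of the classical PCG loop body (kernels S1-S9); spmv() and
 * apply_prec() are placeholders for the task-parallel SPMV and the
 * preconditioner application (two triangular solves). The convergence
 * test is simplified to an absolute residual check. */
#include <math.h>

static double dot(int n, const double *x, const double *y) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

static void axpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] += a * x[i];      /* y := y + a*x */
}

void pcg_iterate(int n, int max_it, double tau_max,
                 void (*spmv)(const double *d, double *w),
                 void (*apply_prec)(const double *r, double *z),
                 double *x, double *r, double *z, double *d, double *w) {
    /* assumes r = b - A*x, z = M^{-1}*r and d = z on entry */
    double beta = dot(n, r, z);                        /* beta = <r, z>        */
    double tau  = sqrt(dot(n, r, r));
    for (int k = 0; k < max_it && tau > tau_max; k++) {
        spmv(d, w);                                    /* S1: w := A*d         */
        double rho = beta / dot(n, d, w);              /* S2: DOT <d, w>       */
        axpy(n,  rho, d, x);                           /* S3: x := x + rho*d   */
        axpy(n, -rho, w, r);                           /* S4: r := r - rho*w   */
        apply_prec(r, z);                              /* S5: z := M^{-1}*r    */
        double beta_new = dot(n, r, z);                /* S6: DOT <r, z>       */
        double alpha = beta_new / beta;                /* scalar operations    */
        beta = beta_new;
        for (int i = 0; i < n; i++)                    /* S8: d := z + alpha*d */
            d[i] = z[i] + alpha * d[i];
        tau = sqrt(dot(n, r, r));                      /* S9: DOT for the test */
    }
}
```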

Communications in the classical PCG
For clarity, in the remainder of the paper we will drop the superindices that denote the iteration count in the variable names. Thus, for example, x^(k) becomes x, where the latter basically stands for the storage space employed to keep the sequence of approximations x^(0), x^(1), x^(2), ... computed during the iteration. For the communication analysis, we will first explore the following simplified scenario:
1. The parallel platform consists of two processes, P0 and P1, where the term "process" is generic and refers to either MPI ranks or OmpSs threads. Without loss of generality, we will assume a message-passing communication mechanism operating in a distributed-memory setting.
2. The target TDG results from a single dissection step and consists only of two leaves connected by a root node. Correspondingly, the coefficient matrix A is decoupled into 3 × 3 blocks and is disassembled between the two processes, where the separator block A_{22} = A_{22}^0 + A_{22}^1 is not built explicitly but instead maintained as two separate addends.
State  Hereafter, we will say that an operand like this is in an inconsistent state, and we will distinguish this property by adding a hat to its variable name, as in Â. This implies that recovering a consistent state requires some arithmetic (e.g., adding the two addends of A_{22}).
Storage  The data for an inconsistent matrix (or vector) is always distributed between the processes. For the particular case of matrix A, P0 and P1 respectively store the disassembled parts A^0 and A^1.
This means that some blocks of the original matrix are kept by a single process only (e.g., A_{00} and A_{20} in P0), while others can only be retrieved after some communication (e.g., forming A_{22} requires that either A_{22}^0 or A_{22}^1 is transferred from one process to the other).

3. All vectors are partitioned conformally with the 3 × 3 block structure of A; e.g., x = [x_0, x_1, x_2].
4. The residual r̂ and ŵ are maintained in an inconsistent state. For example, the subvectors r̂_2^0 and r̂_2^1 (whose sum yields r_2) are respectively stored in P0 and P1. This is conformal with the inconsistent state (and storage) of Â.
5. The solution vector x and the recurrence vectors z, d maintain consistent information; i.e., no arithmetic is required to recover their complete state. In most cases, the data entries of consistent variables are partially replicated so that, e.g., P0 and P1 store x_0, x_2 and x_1, x_2, respectively. Thus, both processes keep their own copy of the "common" part x_2, while each process stores its "local" part, with x_0 in P0 and x_1 in P1.
6. At each iteration, during the application of the preconditioner, a consistent/partially replicated copy is created for r.
7. The scalars α, β, ρ, τ are globally replicated.
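As an illustration of assumptions 1-7, the sketch below (our own simplified data layout, not ILUPACK's data structures) shows how each process could store a vector as a "local" leaf part plus the separator part, and how a consistent separator block is recovered from an inconsistent one by the arithmetic mentioned above.

```c
/* Illustrative per-process storage for the two-process scenario: a leaf
 * ("local") part plus the separator part shared with the other process. */
typedef struct {
    int     n_local;  /* size of the leaf subvector: x_0 in P0, x_1 in P1    */
    int     n_sep;    /* size of the separator subvector x_2 (or r_2)        */
    double *local;    /* owned exclusively by this process                   */
    double *sep;      /* consistent/partially replicated vector: full copy of
                         the separator part; inconsistent vector: only this
                         process's addend (e.g., r_2^0 in P0)                */
} dist_vector;

/* Recovering a consistent separator block from an inconsistent vector
 * requires the arithmetic mentioned in the text: r_2 := r_2^0 + r_2^1. */
static void add_separator_addends(int n_sep, const double *addend0,
                                  const double *addend1, double *r2) {
    for (int i = 0; i < n_sep; i++) r2[i] = addend0[i] + addend1[i];
}
```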
Sparse matrix-vector product  Let us analyze next each one of the operations comprised by the loop body of the PCG iteration, starting with the SPMV (S1). The inputs to this operation are Â (inconsistent) and d (consistent/partially replicated). Therefore, the following computations can proceed concurrently:
P0 : w_0 := A_{00} d_0 + A_{20}^T d_2,  ŵ_2^0 := A_{20} d_0 + A_{22}^0 d_2;
P1 : w_1 := A_{11} d_1 + A_{21}^T d_2,  ŵ_2^1 := A_{21} d_1 + A_{22}^1 d_2.
Thus, by keeping the result inconsistent (in our notation, ŵ := Â d), this operation requires no communication. This is a direct consequence of the states of the input/output operands, defined in the introductory part of this section.
DOT products  The next operation is the DOT product S2 between d and ŵ. Here, it is easy to see that the following partial results can be computed concurrently:
P0 : σ_0 := <d_0, w_0> + <d_2, ŵ_2^0>;   P1 : σ_1 := <d_1, w_1> + <d_2, ŵ_2^1>;
after which P0 and P1 exchange σ_0 and σ_1, and then compute the globally replicated scalar ρ := β/(σ_0 + σ_1). The same idea applies to the DOT products in S6 and S9.
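In MPI terms, this distributed DOT product amounts to local partial sums followed by a single all-reduce. A minimal sketch, under the storage assumptions above: the replicated separator entries of d are multiplied by the local addend of ŵ, so that the separator contribution is counted exactly once in the reduction.

```c
/* Hedged sketch of the distributed DOT product described above: each process
 * computes its partial result sigma_p and a single global reduction produces
 * the replicated scalar. */
#include <mpi.h>

double dist_dot(int n_local, const double *d_loc, const double *w_loc,
                int n_sep, const double *d_sep, const double *w_sep_addend,
                MPI_Comm comm) {
    double sigma = 0.0, result = 0.0;
    for (int i = 0; i < n_local; i++) sigma += d_loc[i] * w_loc[i];
    /* hat{w}_2 is inconsistent, so each process multiplies its own addend of
       hat{w}_2 by the (replicated) d_2; the sum inside the reduction then
       yields <d_2, w_2> exactly once. */
    for (int i = 0; i < n_sep; i++)   sigma += d_sep[i] * w_sep_addend[i];
    MPI_Allreduce(&sigma, &result, 1, MPI_DOUBLE, MPI_SUM, comm);
    return result;  /* globally replicated <d, w> */
}
```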
AXPY(-like) vector updates  Consider now the AXPY operation in S3, which involves the vectors x, d and the scalar ρ. Both processes replicate part of this computation (concretely, that involving x_2, d_2) to obtain the result without any communication:
P0 : x_0 := x_0 + ρ d_0,  x_2 := x_2 + ρ d_2;   P1 : x_1 := x_1 + ρ d_1,  x_2 := x_2 + ρ d_2.
The same applies to S8 if we require that z is maintained consistent/partially replicated during the iteration. On the other hand, for the AXPY in S4, the input vectors r̂, ŵ are inconsistent. This means that the updated result can be maintained in the same state. Moreover, it can be computed concurrently, without any communication, as:
P0 : r_0 := r_0 − ρ w_0,  r̂_2^0 := r̂_2^0 − ρ ŵ_2^0;   P1 : r_1 := r_1 − ρ w_1,  r̂_2^1 := r̂_2^1 − ρ ŵ_2^1.

Lower triangular solve  Let us partition the triangular factor resulting from the ILU factorization conformally with A, with nonzero blocks L_{00}, L_{11}, L_{20}, L_{21} and L_{22}. This factor is maintained in a consistent state while its data is disassembled and (fully) distributed among the processes, with L_{00}, L_{20} stored in P0 and L_{11}, L_{21}, L_{22} in P1. Thus, the processes do not keep duplicate copies of any block of the triangular factor, but the information in it is consistent (no arithmetic is required to recover any block of L). Hereafter, we distinguish this by adding a bar to the name of such variables, as in L̄. Now, the lower triangular system in S5.1 boils down to L̄ ȳ = r̂, and we can then solve it by first computing in parallel
P0 : y_0 := L_{00}^{-1} r_0,  t_2^0 := r̂_2^0 − L_{20} y_0;   P1 : y_1 := L_{11}^{-1} r_1,  t_2^1 := r̂_2^1 − L_{21} y_1.
After this, P0 sends t_2^0 to P1 (the owner of L_{22}), which then proceeds to compute
P1 : y_2 := L_{22}^{-1} (t_2^0 + t_2^1),
while P0 remains idle. We note that the solution shares the properties of L̄: it is consistent but distributed conformally, with y_0 in P0 and y_1, y_2 in P1. We will therefore refer to it as ȳ.
Upper triangular solve  The solve in S5.2, L̄^T z = ȳ, can be partitioned conformally, where we will leverage the fact that the triangular matrix L̄ and the right-hand side ȳ are both consistent and distributed. In this case, we commence by solving a triangular system in P1:
P1 : z_2 := L_{22}^{-T} y_2,
after which this process sends the result z_2 to P0. Next, in parallel, we obtain
P0 : z_0 := L_{00}^{-T} (y_0 − L_{20}^T z_2);   P1 : z_1 := L_{11}^{-T} (y_1 − L_{21}^T z_2).

Transformation of the residual  In preparation for the DOT product in S9, the application of the preconditioner creates a consistent/partially replicated version of r. For this purpose, during the lower triangular solve, P0 sends r̂_2^0 (together with the message for t_2^0) to P1. This information is then used by P1 to obtain r_2 := r̂_2^0 + r̂_2^1, building a consistent but distributed copy of the residual vector. In our notation, this is equivalent to the transformation r̂ → r̄. Next, during the subsequent upper triangular solve (and together with the message containing z_2), P1 sends a copy of r_2 to P0, so that upon completion the platform stores the sought-after consistent/partially replicated residual vector after the preconditioner application: r̄ → r.
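A minimal two-process sketch of the resulting message pattern for the preconditioner application is given below. The solve routines are omitted and the tags and buffer names are illustrative; in the actual codes, t_2^0 and r̂_2^0 (respectively z_2 and r_2) would travel in a single message, as described above.

```c
/* Hedged sketch of the two-process message pattern for S5: P0 sends its
 * separator contributions to P1, which completes the lower solve with L_22;
 * P1 then returns z_2 (and the consistent r_2) during the upper solve. */
#include <mpi.h>
#include <stdlib.h>

void apply_prec_2proc(int rank, int n_sep, double *t2, double *r2, double *z2,
                      MPI_Comm comm) {
    if (rank == 0) {
        /* P0: after its local lower solve, t2 = t_2^0 and r2 = r_2^0.        */
        MPI_Send(t2, n_sep, MPI_DOUBLE, 1, 0, comm);
        MPI_Send(r2, n_sep, MPI_DOUBLE, 1, 1, comm);
        /* P0 remains idle while P1 solves with L_22 ...                      */
        MPI_Recv(z2, n_sep, MPI_DOUBLE, 1, 2, comm, MPI_STATUS_IGNORE);
        MPI_Recv(r2, n_sep, MPI_DOUBLE, 1, 3, comm, MPI_STATUS_IGNORE);
        /* ... and then completes the upper solve for z_0 using z_2 (not shown). */
    } else {
        double *t2_0 = malloc((size_t)n_sep * sizeof *t2_0);
        double *r2_0 = malloc((size_t)n_sep * sizeof *r2_0);
        MPI_Recv(t2_0, n_sep, MPI_DOUBLE, 0, 0, comm, MPI_STATUS_IGNORE);
        MPI_Recv(r2_0, n_sep, MPI_DOUBLE, 0, 1, comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < n_sep; i++) {
            t2[i] += t2_0[i];   /* t_2 := t_2^0 + t_2^1                       */
            r2[i] += r2_0[i];   /* r_2 := r_2^0 + r_2^1 (residual transform)  */
        }
        /* P1: y_2 := L_22^{-1} t_2 and z_2 := L_22^{-T} y_2 (solves not shown) */
        MPI_Send(z2, n_sep, MPI_DOUBLE, 0, 2, comm);
        MPI_Send(r2, n_sep, MPI_DOUBLE, 0, 3, comm);
        free(t2_0); free(r2_0);
    }
}
```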
Generalization  Let us consider now a more general scenario, consisting of a parallel platform with 8 processes, P0-P7, and a TDG resulting from three nested dissection steps, composed of 8 subgraphs (leaf nodes 0-7) and 7 separators (internal nodes 8-14); see Figure 3.

The inconsistent state of r̂ implies that the subvectors r_8, r_9, ..., r_14 are maintained implicitly as
r_i = r_i^j + r_i^{j+1}, i = 8, ..., 11, j = 2(i − 8);
r_i = r_i^j + r_i^{j+1} + r_i^{j+2} + r_i^{j+3}, i = 12, 13, j = 4(i − 12); and
r_14 = r_14^0 + r_14^1 + ... + r_14^7;
where the superindex indicates the process that owns that subvector. Thus, for example, the 8 subvectors necessary to build r_14 = r_14^0 + r_14^1 + ... + r_14^7 are mapped to P0-P7, respectively. Figure 4 shows the messages exchanged and the computations performed in order to transform the residual from its inconsistent state r̂ to the consistent/distributed r̄. This communication pattern can be easily recognized as a binary-tree reduction following the mapping of TDG nodes to processes in Figure 3 (right). The same pattern appears during the simultaneous lower triangular solve L̄ ȳ = r̂, with the difference between transformation and solve residing in the specific operator that is applied to r̂. For example, in Level 3 of the solve procedure, the following computations occur initially in P0 and P1:
P0 : y_0 := L_{00}^{-1} r_0,  t_{14}^0 := r_{14}^0 − L_{14,0} y_0,  t_{12}^0 := r_{12}^0 − L_{12,0} y_0,  t_8^0 := r_8^0 − L_{8,0} y_0;
P1 : y_1 := L_{11}^{-1} r_1,  t_{14}^1 := r_{14}^1 − L_{14,1} y_1,  t_{12}^1 := r_{12}^1 − L_{12,1} y_1,  t_8^1 := r_8^1 − L_{8,1} y_1.
Next, P0 sends t_{14}^0, t_{12}^0, t_8^0 to P1, which are then used by the latter process to aggregate them with the local information, forming t_{14}^{0−1}, t_{12}^{0−1}, t_8^{0−1} (= t_8) in preparation for the next level of the triangular solve. The upper triangular solve L̄^T z = ȳ basically reverses the communication pattern in Figure 4, yielding a binary-tree broadcast.
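The rank arithmetic behind this binary-tree reduction can be sketched as follows, assuming (as in Figure 3) that each internal node is handled by the highest-ranked process of its subtree; the single buffer, the tag and the power-of-two process count are illustrative simplifications.

```c
/* Hedged sketch of the binary-tree reduction pattern of Figure 4: at every
 * level, the process holding the partial sum of the lower half of a group
 * forwards it to the last process of the group, which aggregates. */
#include <mpi.h>
#include <stdlib.h>

void tree_reduce(double *buf, int count, MPI_Comm comm) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);             /* assumed to be a power of two  */
    double *tmp = malloc((size_t)count * sizeof *tmp);
    for (int g = 2; g <= nprocs; g *= 2) {    /* group sizes 2, 4, 8, ...      */
        int pos = rank % g;                   /* position inside the group     */
        if (pos == g / 2 - 1) {               /* e.g., P0 at g=2, P1 at g=4    */
            MPI_Send(buf, count, MPI_DOUBLE, rank + g / 2, g, comm);
            break;                            /* this process is done (idles)  */
        } else if (pos == g - 1) {            /* e.g., P1 at g=2, P3 at g=4    */
            MPI_Recv(tmp, count, MPI_DOUBLE, rank - g / 2, g, comm,
                     MPI_STATUS_IGNORE);
            for (int i = 0; i < count; i++) buf[i] += tmp[i];
        }
    }
    free(tmp);   /* with 8 processes, the full sum ends up on P7 */
}
```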
After some initial local computations, from the communication point of view, each DOT product simply involves a global reduction (an Allreduce in MPI terms).
To close this discussion, the dimension of each message is directly derived from the number of elements in the corresponding subvector of r̂, which in turn depends on the dimension of each diagonal block of Â. Thus, all r_i^j and t_i^j have, at least, the same size as the number of columns of A_{ii}.
Summary  The previous elaboration identifies the communication patterns underlying a task-parallel ILU-PCG solver, as well as some interesting observations for a TDG of arbitrary dimension:
1. The SPMV does not require any sort of communication.
2. The AXPY(-like) operations do not involve any communication either. This type of kernel is embarrassingly parallel, but those calls operating with consistent/partially replicated vectors replicate part of the computation to avoid the communication.
3. Each DOT product requires a global reduction (synchronization point). The loop body of the PCG solve can be re-arranged so that S9 is pushed up next to S6. A simultaneous execution of these two reductions then decreases the number of synchronization points from 3 to 2 per iteration in the classical PCG iteration.
4. The application of the preconditioner requires a binary-tree reduction for the lower triangular solve, followed by a binary-tree broadcast for the upper triangular one.
5. The transformation r̂ → r̄ → r occurs during the application of the preconditioner, and the message exchanges required for this operation can be combined with those for the triangular solves.
6. The price to pay in order to expose increased levels of concurrency in some kernels is a partial replication of arithmetic, in particular due to the operation with the inconsistent matrices and vectors. An appropriate application of nested dissection aims to find a partitioning with small separators that concentrates most of the computational work on the leaves.
The transitions between states and storage patterns dictating the communications are summarized in Figure 5.

Mapping the TDG
The task-parallel implementation of ILUPACK's PCG method for clusters of multicore processors in [6] partitions the TDG at a certain level, mapping the tasks in the part of the tree above this division to MPI ranks and those in the lower levels to OmpSs threads. The same observation applies to similar task-parallel implementations of alternative ILU-PCG solvers. Figure 6 shows this partitioning between the MPI and OmpSs "worlds" for a simple TDG consisting of 8 leaves that represents the solve with the lower triangular factor L̄. In the figure, two OmpSs threads per node collaborate to execute the tasks within the bottom two levels. For the top two levels, four MPI ranks take over the computation in order to process the tasks, while the OmpSs threads remain idle. The same partitioning between MPI and OmpSs occurs during the other operations of the PCG iteration. On the leaves, the OmpSs threads operate with their local submatrices/subvectors. For the DOT products, the partial results are then propagated into the MPI world in order to obtain the global result.
From the perspective of synchronization, there is little difference between the communication operations performed inside the MPI and OmpSs worlds. However, the combination of MPI+OmpSs allows this hybrid solution to exploit dynamic scheduling within the cluster nodes via OmpSs. This implies that there is no a priori mapping of the tasks to the threads of the team, offering higher flexibility than a pure-MPI solution and, in general, higher performance [6].
Inside OmpSs, task dependencies are mostly controlled at a fine granularity that operates at the level of the nodes of the TDG. Thus, for example, the tasks for the DOT product in S2 commence as soon as the tasks for the SPMV (S1) generating the corresponding parts of ŵ have completed their execution, with these dependencies in the loop body controlled by the OmpSs runtime. Inside the triangular solves (S5.1 and S5.2), the task dependencies are also controlled with a granularity dictated by the nodes of the TDG. The fact that some of these operations require explicit (MPI) communication points turns these synchronizations among OmpSs tasks unavoidable, in practice yielding a barrier. This occurs, for example, within the DOT products as well as in the transition between the lower and upper triangular solves.
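A hedged sketch of how such intra-node dependencies can be expressed is shown below, using OmpSs-style in/out clauses over array sections (with recent OpenMP one would write depend(in: ...)/depend(out: ...) instead). The leaf kernels spmv_leaf and dot_leaf are placeholders, not the actual routines of our codes.

```c
/* OmpSs-style sketch (not the actual code) of the S1 -> S2 dependency on the
 * leaf parts of w: the partial DOT task for a leaf may start as soon as the
 * SPMV task producing that leaf's part of w has finished. */
extern void   spmv_leaf(int leaf, const double *d, double *w);
extern double dot_leaf(int n, const double *d, const double *w);

void leaf_spmv_and_dot(int nleaves, const int *len,
                       double **d, double **w, double *sigma) {
    for (int j = 0; j < nleaves; j++) {
        #pragma omp task in(d[j][0;len[j]]) out(w[j][0;len[j]])
        spmv_leaf(j, d[j], w[j]);                 /* S1: leaf part of w := A d */

        #pragma omp task in(d[j][0;len[j]], w[j][0;len[j]]) out(sigma[j])
        sigma[j] = dot_leaf(len[j], d[j], w[j]);  /* S2: partial DOT <d, w>    */
    }
    #pragma omp taskwait  /* the partial sums then enter the MPI reduction     */
}
```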
At this point, we recognize that a similar task-parallel hybrid (MPI+X) version could have been obtained using, e.g., OpenMP instead of OmpSs. The primary reason for adopting OmpSs in our codes is that, when we commenced our development of a multi-threaded task-parallel version of ILUPACK, the OpenMP standard did not support tasks. During the past few years, OpenMP has integrated this mechanism (in part, under the influence of task-parallel programming languages like OmpSs), and the differences between the OpenMP and OmpSs tasking mechanisms are now blurry. However, we believe that OmpSs still provides some advanced features, such as sophisticated support for nested parallelism and task priorities, that can deliver slightly higher performance for some of our implementations.

Butterfly communication pattern
The butterfly transformation is an efficient communication scheme for collective operations [9,10]. In the framework of the task-parallel PCG, this technique can be leveraged during the application of the preconditioner in S5, provided the lower triangular factor L is consistent and partially replicated. For our simplified scenario, this implies that P0 stores L_{00}, L_{20}, P1 stores L_{11}, L_{21}, and both of them keep a copy of L_{22}. The direct consequence is an increase in memory usage, but there are also some changes in the communication patterns for the preconditioning step, which we review next.
Lower triangular solve  In this new scenario, with L consistent and partially replicated, we solve the lower triangular system in S5.1 by first computing in parallel
P0 : y_0 := L_{00}^{-1} r_0,  t_2^0 := r̂_2^0 − L_{20} y_0;   P1 : y_1 := L_{11}^{-1} r_1,  t_2^1 := r̂_2^1 − L_{21} y_1.
However, now P0 and P1 exchange t_2^0 and t_2^1, and then both replicate the computation y_2 := L_{22}^{-1} (t_2^0 + t_2^1), so that both processes obtain y_2.
The consistent/partially replicated residual r produced by the preconditioner application is only required to compute the tolerance threshold τ. This in turn is necessary in order to check the convergence test at the beginning of the following iteration. If we perform the convergence test only every s iterations, we can save the transformation r̂ → r during the lower triangular solve in S6.2 (and the DOT product for τ) during s − 1 iterations. With this option, the binary-tree reduction still occurs, but it involves less data.
In exchange for this, the solver may incur up to s − 1 extra iterations.Nevertheless, a rough analysis of the convergence rate for the iteration can offer some heuristic strategy to reduce this risk.
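Returning to the butterfly exchange for the preconditioner application, the following two-process sketch illustrates the pattern described earlier: the separator contributions are swapped with a single MPI_Sendrecv, and the solve with the replicated L_22 is then performed redundantly by both processes. Buffer names and the tag are illustrative.

```c
/* Hedged two-process sketch of the butterfly exchange: with L_22 replicated,
 * P0 and P1 swap t_2^0 and t_2^1 and both redundantly solve with L_22, so no
 * separate broadcast is needed for the subsequent upper triangular solve. */
#include <mpi.h>

void butterfly_exchange(int rank, int n_sep, double *t2_mine, double *t2_other,
                        MPI_Comm comm) {
    int peer = 1 - rank;                       /* P0 <-> P1 */
    MPI_Sendrecv(t2_mine,  n_sep, MPI_DOUBLE, peer, 7,
                 t2_other, n_sep, MPI_DOUBLE, peer, 7,
                 comm, MPI_STATUS_IGNORE);
    for (int i = 0; i < n_sep; i++)
        t2_mine[i] += t2_other[i];             /* t_2 := t_2^0 + t_2^1        */
    /* both processes now solve y_2 := L_22^{-1} t_2 redundantly (not shown)  */
}
```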
For the experimental analysis, we relied on a s.p.d. coefficient matrix arising from the finite difference discretization of a 3D Laplace problem; see Table I and [5] for details. For the linear systems, the right-hand side vector b was initialized with all entries set to 1, and the PCG iteration was started with the initial guess x^(0) ≡ 0. The parameters that control the fill-in and the convergence of the iterative process in ILUPACK were set to droptol = 1.0E-2, condest = 5, elbow = 10, and restol = 1.0E-6.
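For reproducibility, a matrix of this type can be generated with a few lines of code. The sketch below is an assumption about the test problem (the standard 7-point stencil on an N × N × N grid), not the authors' generator; it emits the upper triangular part only, matching how nz is counted in Table I.

```c
/* Sketch: 7-point finite difference stencil for the 3D Laplacian on an
 * N x N x N grid, printed in coordinate (row, col, val) form, upper
 * triangular part only. */
#include <stdio.h>

static long idx(int i, int j, int k, int N) {
    return (long)i + (long)N * (j + (long)N * k);
}

void laplace3d_upper(int N) {
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++) {
                long row = idx(i, j, k, N);
                printf("%ld %ld %g\n", row, row, 6.0);                       /* diagonal  */
                if (i + 1 < N) printf("%ld %ld %g\n", row, idx(i + 1, j, k, N), -1.0);
                if (j + 1 < N) printf("%ld %ld %g\n", row, idx(i, j + 1, k, N), -1.0);
                if (k + 1 < N) printf("%ld %ld %g\n", row, idx(i, j, k + 1, N), -1.0);
            }
}
```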
We have chosen the Laplacian operator since this particular problem is easy to generate and to verify. Certainly, there exist alternative problems that are harder to solve, and tackling this equation is easy from the mathematical point of view. However, the Laplacian problem represents a worst-case scenario for parallel computing. Concretely, since solving the individual subsystems is easy and fast, the communication overhead for the Laplacian case is more significant and dominating than for other problems. Moreover, the structure of the underlying graph reveals all aspects of the domain decomposition obtained by a graph partitioner. This is worse than for many other problems because its uniform structure leads to relatively large interfaces, increasing the work that is required beyond the leaf tasks.
In the triangular solves, the computations corresponding to non-leaf tasks also contribute a mild cost compared with that involving the leaves. The traces identify a synchronization point in the transition between the lower triangular solve and the upper one, in addition to those imposed by the DOT products. This yields three synchronization points per iteration for the classical PCG solver, compared with only two for Eijkhout's variant. The negative impact of these synchronizations comes from the communication overhead and, especially for this single-node execution, from the workload imbalance that leads to idle waits and a waste of computational resources.

Optimal configuration
In [6] we evaluated the performance of the parallel MPI+OmpSs version of the classical PCG solve preconditioned with ILUPACK for different combinations of MPI ranks and OmpSs threads per node (configurations). Given the node architecture (the same computer targeted in the present work), with 2 sockets and 8 cores per socket, we employed 1, 2, 4, 8 or 16 MPI ranks per node and the corresponding number of OmpSs threads that filled all the cores of the node: 16, 8, 4, 2 or 1, respectively. The analysis of performance and weak/strong scalability in that work offered three key insights:
1. In general, the best configuration maps 2 MPI ranks/8 OmpSs threads per node, mimicking the internal socket/core architecture of the servers.
2. The MPI+OmpSs implementation outperforms its pure-MPI counterpart by a non-negligible margin, in the range of 5-10%.
3. The best configuration employs 1 leaf per core. While an increase in the number of leaves can potentially produce a more balanced distribution of the workload, it also increases the number of levels and the cost of processing the intermediate (non-leaf) tasks. At a certain point during the application of the preconditioner, the number of tasks in a level becomes smaller than the number of cores, yielding idle waits.
We thus adopt the same decisions for the remaining experiments, mapping 2 MPI ranks per node, with 8 OmpSs threads per MPI rank; and creating a TDG composed of 1 leaf per core.

Scalability
We next evaluate the scalability of the task-parallel ILU-PCG solvers, focusing this analysis on the effect of the two major CA techniques: the butterfly transformation and Eijkhout's variant. The top two graphs in Figures 10 and 11 show the execution time per iteration of the task-parallel PCG solver for the A400 problem as the resources are increased from 1 to 16 nodes, using respectively a single socket or both sockets per node. For reference, the latter figure also includes the results for an implementation that relies on MPI only. We note that the results collect the average iteration time. The execution time of the individual iterations, for the distinct versions, showed relatively small variations, which already provides some information on the dispersion of the values. For example, in the execution with a single node and 16 cores, the average time per iteration for the classical variant was about 0.278 s, and the standard deviation of the iteration times was only around 0.0219 s. Similar dispersion results were observed for the other variants and parallel configurations. In general, all implementations show a similar decrease in the iteration time as the number of cores grows, exposing their scalability.

The bottom two graphs in both figures assess the weak scalability of the task-parallel ILU-PCG solvers. To perform this evaluation, we set the number of non-zeros of the sparse matrix (nz) to be roughly proportional to the number of cores. However, we note that nz only offers an estimation of the computational cost, as the fill-in/quality of the preconditioner, among other factors, plays a significant role. The bottom row of plots reports the execution time per iteration of the parallel implementations for the different matrices in Table I. These results show that, for all implementations, the execution times grow with the number of nodes/cores and the problem dimension at a similar pace. The reason is that the number of flops per iteration increases with the number of cores and, therefore, with the number of levels, due to the overhead caused by the operation with inconsistent data. [...] execution that employs a single node for the strong scaling experiment and 8-16 nodes for the weak scaling one. For the butterfly transform, a separate analysis of communication traces revealed that the use of MPI's Sendrecv primitive (necessary to implement the binary-tree reduction-broadcast that is present in this scheme) produces a slightly worse overlap of communication and computation than the regular MPI message exchanges via Send and Recv.

We next discuss in detail the comparison between the classical PCG iteration and Eijkhout's variant. For this purpose, we will employ traces for an execution using 8 nodes, with 2 MPI ranks per node and 8 OmpSs threads per MPI rank, and the A400 problem. This yields a TDG with 128 leaf tasks (16 leaves per node, 8 per MPI rank), organized into a binary tree consisting of 8 levels.
Figure 12 reports traces for the classical iteration vs Eijkhout's variant, in both cases accelerated with ILUPACK's preconditioner. The timelines in the figure reveal several observations:
• The limited dimension of the problem compared with the amount of resources (128 cores) produces a substantial increase of the overheads due to inter-process communication/synchronization and to the computation with the top levels of the tree hierarchy during the triangular solves (bars in magenta color).
• The generation of a TDG with one leaf per OmpSs thread is the source of a significant workload imbalance, especially visible for the SPMV and the triangular solves.
• The binary-tree structure imposes a special synchronization pattern, with the same shape in both executions, because the upper triangular solve cannot commence until the result of the lower triangular system is available.
• For the classical iteration (first and third timelines), the SPMV is immediately followed by a DOT product, which introduces a synchronization in the execution. The duration of these two operations is determined by that of the slowest core. A similar situation appears at the end of the iteration, when the upper triangular solve is followed by an AXPY and a DOT product.
• Instead of the two periods of idle time per iteration present in the classical PCG solve, we easily identify a single one in Eijkhout's variant. However, the workload imbalances due to the upper triangular solve and the SPMV aggregate, so that the duration of this single period is approximately equal to the sum of the two periods in the classical PCG iteration. As a result, the higher execution time of Eijkhout's variant is determined by the larger number of flops per iteration (4+4 AXPY(-like) operations and DOT products vs 3+3).
• Unfortunately, increasing the number of leaves/levels of the TDG, which could improve the distribution of the workload, will likely shift more cost to the computation with the non-leaf tasks and increase the amount of idle resources.

CONCLUSIONS
We have proposed, analyzed and evaluated a number of parallel implementations of the PCG method. All these parallel routines extract task-parallelism via nested dissection and leverage the message-passing programming model underlying MPI and the task-parallel approach in OmpSs to map the execution of the iterative PCG solve to a cluster of multicore processors. They differ in the definition of the iteration recurrences (classical PCG iteration vs Eijkhout's CA formulation) and the type of preconditioner they integrate (ILU(0) vs ILUPACK). In addition, we accommodate complementary communication/synchronization-avoiding strategies such as the butterfly communication scheme and a strategy to merge OmpSs tasks.
Our experimental results show the strong and weak scalability of all variants on up to 16 nodes/256 cores. (Our experiences with asynchronous versions of MPI Send did not show any performance differences.) These results can be expected to carry over to any task-parallel formulation of a PCG solver that relies on an ILU-type preconditioner, as well as to Chronopoulos and Gear's, Meurant's and Saad's CA formulations of the CG method, among others.
The analysis also marks a research direction that aims to increase the asynchronism of our task-parallel PCG implementations, and to improve the workload balance, in order to render a faster execution. For this purpose, we plan to decompose some of the computational kernels in the iteration (especially the SPMV) to expose further levels of task-parallelism that can then be exploited via OmpSs. This is part of ongoing work, together with the analysis of the communication patterns for s-step Krylov methods and the use of asynchronous versions of MPI Recv.

Figure 3. Left: Partitioning of the inconsistent matrix Â using three nested dissection steps, and conformal partitioning of the inconsistent vector r̂. (For clarity, the variable names for the matrix blocks and subvectors, and some of the nonzero blocks in Â, are not shown.) Right: Corresponding TDG consisting of 4 levels (Level 0-Level 3) and mapping of nodes/tasks to processes (P0-P7).

Figure 4. From top to bottom, sequence of messages and computations required to carry out the transformation r̂ → r̄. In parentheses: additional messages and operations to perform the transformation r̂ → r in one step.

Figure 5. Detailed implementation of the classical formulation of the task-parallel ILU-PCG solver, annotated with state and storage modes for the variables and communication patterns: v for consistent/partially replicated; v̂ for inconsistent; and v̄ for consistent/distributed. Note the reorganization of the code to enable the merge of S6 and S9.

Table I. Matrices employed in the experimental evaluation. In the table, nz counts the non-zeros in the upper triangular part only.
Figure 10. Scalability of the task-parallel ILU-PCG solvers using a single socket (8 cores) per node.