Iteration-fusing conjugate gradient for sparse linear systems with MPI + OmpSs

In this paper, we target the parallel solution of sparse linear systems via an iterative Krylov subspace method enhanced with a block-Jacobi preconditioner on a cluster of multicore processors. To tackle large-scale problems, we develop task-parallel implementations of the preconditioned conjugate gradient method that improve the interoperability between the message-passing interface (MPI) and the OmpSs programming model. Specifically, we progressively integrate several communication-reduction and iteration-fusing strategies into the initial code, obtaining more efficient versions of the method. For all these implementations, we analyze the communication patterns and perform a comparative analysis of their performance and scalability on a cluster consisting of 32 nodes with 24 cores each. The experimental analysis shows that the techniques described in the paper outperform the classical method by a margin that varies between 6 and 48%, depending on the evaluation scenario.


Introduction
Linear systems appear in a myriad of applications, including fundamental numerical simulations of physical phenomena as well as recent methods for data analytics [11]. When the coefficient matrix of the linear system is large and sparse, iterative methods based on Krylov subspaces, enhanced with some sort of preconditioner, often provide an efficient means to solve this type of linear algebra problem, avoiding the fill-in memory issues of their direct counterparts [15].
The conventional formulations of Krylov subspace methods (KSMs) such as conjugate gradient (CG), biconjugate gradient (BiCG), biconjugate gradient stabilized (BiCGStab) and generalized minimal residual method (GMRES) [15] present several "synchronization points" per iteration. In consequence, as technological advances widen the gap between computational performance and communication costs of computer architectures [8,12], the performance of KSMs suffers, as their data dependencies and memory-bound nature limit the amount of communication overhead that can be hidden [17]. In response, in recent years, there has been intensive research aimed at reducing the negative impact of synchronization bottlenecks in KSMs. The result has been a number of communication-reduction techniques, hierarchical and "enlarged" reformulations of the solvers, iteration-fusing variants, s-step methods ("communication-avoiding"), pipelined schemes ("communication-hiding"), etc. An up-to-date source of references to these variants can be found in [6].
Our work builds upon a number of previous papers that address the task-parallel implementation of KSMs on multicore architectures and clusters of multicore processors. First, the authors of [1] proposed a parallel implementation of a CG solver, enhanced with a sophisticated ILUPACK preconditioner, that leverages MPI and OmpSs [13,14] to improve the performance of a pure MPI-based solution; and this approach was then generalized to other types of incomplete LU (ILU)-based preconditioners and communication-reduced variants of CG in [2]. Independently, the authors of [18] presented an iteration-fusing variant of the pipelined CG [9], for multicore processors, that combines a task-parallel re-formulation of the method with a relaxation of the convergence test in order to break the strict barrier between consecutive iterations of the method.
In this paper, we extend the iteration-fusing techniques to an MPI + OmpSs implementation of the pipelined CG solver for clusters of multicore processors. Specifically, our work makes the following contributions:

• We develop message-passing, task-parallel implementations of the CG method, preconditioned with a block-Jacobi (BJ) strategy, that exploit a "hybrid" programming solution combining MPI and OmpSs. (We will refer to this method as BJCG hereafter.)

• We integrate several strategies into the classical BJCG to reduce the number of synchronization points, increase task asynchronism and diminish the communication overhead. In particular, our message-passing, task-parallel implementation of the pipelined BJCG solver reduces the number of synchronization points to only two per iteration. Furthermore, it overlaps communication with computation, making it possible to execute tasks from two consecutive iterations in parallel.

• We examine in detail the communication and synchronization patterns, as well as the dependencies between tasks, appearing in these parallel implementations.

• For all these implementations, we perform a comparative analysis of their weak scalability on a cluster consisting of 32 nodes, using a 24-core socket per node.
The rest of the paper is structured as follows. Section 2 offers a brief review of the BJCG solver, and Sect. 3 presents an initial message-passing, task-parallel parallelization version of BJCG. In Sect. 4, we describe a number of variants of BJCG that increase the task asynchronism of the initial implementation. In Sect. 5, we analyze the performance and scalability of these different parallel versions. Finally, Sect. 6 summarizes our work and offers a few concluding remarks.

Preconditioned CG solver
Let us consider the linear system Ax = b, where the coefficient matrix A ∈ ℝ^(n×n) is symmetric positive definite (s.p.d.) and sparse, with n_z nonzero entries; b ∈ ℝ^n is the right-hand side vector; and x ∈ ℝ^n is the sought-after solution vector. Figure 1 offers an algorithmic description of the iterative preconditioned CG method. The most time-consuming operations in this algorithm are the computation of the preconditioner (before the iteration commences), the sparse matrix-vector product (SpMV) (at each iteration) and the application of the preconditioner M (also once per iteration). The remaining operations are scalar computations or simple vector kernels such as the dot product (dot) and axpy-type vector updates [4].

Fig. 1 Formulation of the preconditioned CG solver annotated with computational kernels. The threshold τ_max is an upper bound on the relative residual for the computed approximation to the solution. In the notation, ⟨⋅, ⋅⟩ computes the dot (inner) product of its vector arguments

For simplicity, in our implementation of the CG method, we integrate the block-Jacobi preconditioner [15]. In particular, consider a decomposition of the coefficient matrix A that extracts its diagonal blocks into D = (D_1, D_2, …, D_S), where D_s ∈ ℝ^(m_s×m_s), s = 1, 2, …, S, and n = m_1 + m_2 + ⋯ + m_S. (In the classical Jacobi preconditioner, m_s = 1 for all s = 1, 2, …, S, so that D only contains the diagonal entries of A.) The block-Jacobi preconditioner defined by the block-diagonal matrix M = diag(D) ∈ ℝ^(n×n) is particularly effective if the structural blocks D_s reflect the nonzero block structure appearing in the coefficient matrix A, as is the case for many linear systems that arise from a finite element discretization of a partial differential equation (PDE) [3]. To achieve this, we can leverage a supervariable agglomeration heuristic to determine the block-diagonal structure of the block-Jacobi preconditioner [5]. This procedure captures the block structure of the coefficient matrix in the blocks of the preconditioner, with a given upper bound on the block size.
An additional appealing property of the block-Jacobi preconditioner is that its computation can be realized via a straightforward extraction of the structural blocks from the coefficient matrix A, followed by an explicitly independent inversion of each one of these blocks. The application of the preconditioner then boils down to a simple (dense) matrix-vector multiplication per inverted block. Alternatively, one can extract the structural blocks and then decompose each into the product of two triangular factors using, for example, the Cholesky factorization [10]: D_s = L_s L_s^T, where L_s ∈ ℝ^(m_s×m_s) is lower triangular. The application of the preconditioner then requires the solution of two triangular linear systems per block, with the factor L_s and its transpose L_s^T. For simplicity, hereafter we will drop the superindices that denote the iteration count in the variable names. Thus, for example, x^(l) becomes x, where the latter stands for the storage space employed to keep the sequence of approximations x^(0), x^(1), x^(2), …, computed during the iterations.
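The two-stage construction described above (block extraction followed by explicit inversion, with application as one dense matrix-vector product per block) can be sketched as follows. This is a minimal serial illustration with dense storage, a uniform block size and our own function names, not the paper's actual code:

```c
#include <assert.h>
#include <math.h>

/* Invert an m x m block in place via Gauss-Jordan elimination
   (no pivoting; assumes small, well-conditioned s.p.d. blocks). */
static void invert_block(int m, double D[m][m]) {
    double inv[m][m];
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++)
            inv[i][j] = (i == j) ? 1.0 : 0.0;
    for (int p = 0; p < m; p++) {
        double piv = D[p][p];
        for (int j = 0; j < m; j++) { D[p][j] /= piv; inv[p][j] /= piv; }
        for (int i = 0; i < m; i++) {
            if (i == p) continue;
            double f = D[i][p];
            for (int j = 0; j < m; j++) {
                D[i][j] -= f * D[p][j];
                inv[i][j] -= f * inv[p][j];
            }
        }
    }
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++) D[i][j] = inv[i][j];
}

/* Build the block-Jacobi preconditioner: extract the diagonal blocks of
   size m from the n x n matrix A and invert each one independently
   (in the task-parallel codes, one task per block). */
void build_block_jacobi(int n, int m, double A[n][n], double M[n][n]) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) M[i][j] = 0.0;
    for (int s = 0; s < n / m; s++) {
        double D[m][m];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++) D[i][j] = A[s*m + i][s*m + j];
        invert_block(m, D);
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++) M[s*m + i][s*m + j] = D[i][j];
    }
}

/* Apply the preconditioner: z := M r, i.e., one dense matrix-vector
   product per (already inverted) diagonal block. */
void apply_block_jacobi(int n, int m, double M[n][n],
                        const double r[], double z[]) {
    for (int s = 0; s < n / m; s++)
        for (int i = 0; i < m; i++) {
            z[s*m + i] = 0.0;
            for (int j = 0; j < m; j++)
                z[s*m + i] += M[s*m + i][s*m + j] * r[s*m + j];
        }
}
```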

Message-passing BJCG solver
In this subsection, we perform a communication analysis of a basic message-passing implementation of the BJCG solver under the following considerations:

1. The parallel platform consists of np processes, denoted as P_1, P_2, …, P_np. Without loss of generality, we will assume a message-passing algorithm that operates in a distributed-memory platform. (In the MPI context, the term process refers to an MPI rank.)

2. The coefficient matrix A is decoupled (i.e., partitioned) into np blocks of rows (i.e., one row-block per process), with the kth distribution block A_k ∈ ℝ^(n_k×n) stored in P_k, and n_1 + n_2 + ⋯ + n_np = n. This block-row distribution is particularly appealing when the sparse matrix is maintained in CSR (compressed sparse row) format [15].

3. Unless otherwise stated, vectors are partitioned and distributed conformally with the block-row distribution of A. For example, the residual vector r is partitioned into blocks r_k ∈ ℝ^(n_k), with r_k stored in P_k.

4. The partitioning of A into np row-blocks also induces a conformal partitioning by blocks on the preconditioner, M = diag(M_1, M_2, …, M_np), where M_k ∈ ℝ^(n_k×n_k) is stored in P_k. For simplicity, we assume that each one of the distribution blocks M_k contains an integer number of structural blocks of the preconditioner M = diag(D_1, D_2, …, D_S). In other words, the entries of a structural block D_s are all mapped to a single distribution block M_k. Note that this can always be achieved by slightly adjusting the values of the distribution block sizes n_k.

5. The scalars of the method (α, β, …) are replicated in all np processes.
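For illustration, the row counts n_k and offsets of such a block-row partition coincide with the recvcounts/displs arguments that an MPI_Allgatherv of a conformally distributed vector would take. They can be computed as in the following sketch (a hypothetical helper of our own, not taken from the paper's code):

```c
#include <assert.h>

/* Block-row partition of an n-row matrix among np processes:
   counts[k] = n_k rows owned by process P_k, displs[k] = first row of P_k.
   These are exactly the per-rank counts/displacements an MPI_Allgatherv
   of a conformally distributed vector would use. */
void block_row_partition(int n, int np, int counts[], int displs[]) {
    int base = n / np, rem = n % np, offset = 0;
    for (int k = 0; k < np; k++) {
        counts[k] = base + (k < rem ? 1 : 0);  /* sum of counts equals n */
        displs[k] = offset;
        offset += counts[k];
    }
}
```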
Let us next analyze the individual operations (or computational kernels) S1-S8 that constitute the loop body of a single BJCG iteration, as shown in Fig. 1.
Sparse matrix-vector product (S1) The input operands to this kernel are the coefficient matrix A, which is distributed by blocks of rows, and the vector d, which is partitioned and distributed conformally with A. Therefore, in order to compute this kernel, we first obtain a replicated copy of the distributed vector d in all processes, denoted as d → e. (Vector e is the only array that is replicated in all processes. In contrast, the remaining vectors are all distributed.) After this communication stage, each process can compute its local piece of the output vector w concurrently, as w_k := A_k e, k = 1, 2, …, np. This kernel thus requires collecting the distributed pieces of d into a single vector e that is replicated in all processes (in MPI, for example, via an MPI_Allgatherv). The computation can then proceed in parallel, yielding the result w in the expected distributed state with no further communication involved. At the end, each MPI rank owns a piece of w.
dot products (S2, S6, S8) The next kernel in the loop body is the dot product S2 between the distributed vectors d and w. Here, it is easy to derive that a partial result ρ_k := d_k^T w_k can first be computed concurrently in each process, after which these intermediate values have to be reduced into a globally replicated scalar ρ := ρ_1 + ρ_2 + ⋯ + ρ_np. The same idea applies to the dot products in S6 and S8, yielding a total of three process synchronizations due to dot products (in MPI, via MPI_Allreduce).

axpy(-type) vector updates (S3, S4, S7) Consider now the axpy kernel in S3, which involves the distributed vectors x, d and a globally replicated scalar α. All processes can perform their local parts of this computation, x_k := x_k + α d_k, to obtain the result without any communication. The same applies to S4 and, given that the scalars it involves are globally replicated, also to S7.
Application of the preconditioner (S5) The kernel in S5, where the input preconditioner M is distributed by blocks of rows as A, and the input vector r is distributed conformally with the same matrix, can be performed concurrently in all processes, with z_k := M_k^(-1) r_k, k = 1, 2, …, np. This kernel should exploit the fact that M_k consists of multiple structural blocks. Thus, assuming, for example, that M_1 = diag(D_1, D_2, D_3, D_4), with conformal partitionings r_1 = (r̂_1, r̂_2, r̂_3, r̂_4), z_1 = (ẑ_1, ẑ_2, ẑ_3, ẑ_4), the first process computes ẑ_s := D_s^(-1) r̂_s, s = 1, 2, 3, 4.

Summary The previous elaboration exposes several insights for the message-passing BJCG solver:

1. Prior to each SpMV, we need to gather a copy of the distributed vector d within all nodes, implying a process synchronization at the beginning of each iteration.

2. Each dot product requires a global reduction and, therefore, a process synchronization point. The loop body of the BJCG solver can be re-arranged so that S8 is pushed up next to S6. A simultaneous execution of these two reductions then decreases the number of process synchronizations, due to global reductions/dot products, from 3 to 2 per iteration in the loop body of the BJCG solver.

3. The axpy(-type) kernels and the application of the block-Jacobi preconditioner do not involve any sort of communication.

4. All (non-scalar) data are distributed among the processes, except for the replicated copy of vector d in e.
The reorganized algorithm and communications are summarized in Fig. 2.


Task-parallel BJCG solver
When the target platform is a cluster of multicore processors, from the performance perspective, it often pays off to expose an extra level of parallelism, which can then be exploited within each node of the cluster using, for example, OpenMP or OmpSs. This can be achieved by leveraging the loop parallelism within the kernels S1-S8. Alternatively, the same goal can be pursued by explicitly dividing each kernel into a collection of finer-grain operations, or tasks, to then exploit this type of task parallelism. This second option, analyzed in the iteration-fusing CG scheme presented in [18] and the ILU-preconditioned CG solver in [2], aims to avoid the introduction of a thread-synchronization point after each kernel, enabling a partially concurrent execution of two or more kernels, as explained next.
In the following analysis, we will only consider kernels S1-S2 and S4-S7 in the loop body of the BJCG solver. (See Fig. 2.) For simplicity, we merge the execution of S3 with that of S4, and S8 with S6. The operations in the solver are then intertwined by a series of data dependencies which, in principle, dictate a strict order of execution:

S1 -w→ S2 -α→ S4 -r→ S5 -z→ S6 -β→ S7 -d→ S1 (next iteration).

(The name on top of each dependency arrow indicates the variable that generates the corresponding dependency.) However, as described next, exploiting task parallelism enables an overlapped execution where some of these kernels can be (partially) computed concurrently (denoted with the symbol "‖"), breaking the strict inter-kernel barriers due to the dependencies; in particular, we aim to attain a parallel execution where S1 ‖ S2 and S4 ‖ S5 ‖ S6.

Sparse matrix-vector product S1 ‖ dot product S2 For the SpMV, we can partition the operands local to process P_k, splitting A_k into I × J blocks Ã_{i,j}, the replicated vector e into J blocks e_j, and the local output w_k into I blocks w̃_i, so that w̃_i = ∑_{j=1}^{J} Ã_{i,j} e_j. Here, we can consider each one of the small SpMV operations Ã_{i,j} e_j as a fine-grain task, where the specific sparsity pattern of A_k determines the dependencies between an output block w̃_i and the input e. This yields a global task dependency graph (TDG) for SpMV that reflects the sparsity pattern of A. Consider, for example, the SpMV in Fig. 3, where I = 3 and J = 5 and there are 7 blocks with nonzero entries in A_k. Now, due to the particular nonzero block structure of A_k, we note that, for example, w̃_1 = ∑_{j=1}^{J} Ã_{1,j} e_j = Ã_{1,1} e_1 + Ã_{1,4} e_4, yielding the dependencies {e_1, e_4} → w̃_1. Here, we consider that each one of the matrix-vector products Ã_{1,1} e_1 and Ã_{1,4} e_4 can be annotated as a task, producing partial results which finally have to be accumulated (in a serialized update) in order to obtain w̃_1.
For the dot product S2, the computation local to P_k can be decomposed into I tasks, which can be calculated in parallel by partitioning the input operands d_k, w_k into I pieces, with each task producing a partial result ρ̃_i := d̃_i^T w̃_i. (Note the difference between the partitioning of the full replicated vector e ∈ ℝ^n into J blocks, involved in the local SpMV, and that of the local piece of vector w as w_k = (w̃_1, w̃_2, …, w̃_I) ∈ ℝ^(n_k) appearing in the previous dot product.) These partial results are then reduced to produce a single value local to the kth node, ρ_k := ∑_{i=1}^{I} ρ̃_i, and these local values are next further reduced across all np nodes to produce the globally replicated scalar ρ := ∑_{k=1}^{np} ρ_k. The key here is that, although there exists a strict dependency S1 → S2, by breaking these two operations into fine-grain tasks, the execution of some tasks of the second kernel can commence as soon as the corresponding results of the former one are available, yielding a partially concurrent execution of these two kernels. Unfortunately, the global reduction required at the end of S2 imposes a task/process synchronization point that prevents extending this idea beyond that point.

axpy vector update S4 ‖ preconditioner application S5 ‖ dot product S6 A similar division of the three kernels S4-S6 into fine-grain tasks permits their concurrent execution, but again a task/process synchronization appears right after S6.

axpy vector update S7 and SpMV S1 (subsequent iteration) The convergence test, together with the need to perform the replication d → e at the beginning of each iteration, introduces a process synchronization that prevents the parallel execution of the local tasks corresponding to these two kernels.
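The fine-grain, block-wise SpMV described above can be sketched serially as follows. The block pattern encoded by the bi/bj arrays in the usage below is a made-up example, not the actual pattern of Fig. 3; in the task-parallel code, each nonzero block would be one task, and updates of the same w̃_i would be serialized:

```c
#include <assert.h>

/* Fine-grain decomposition of the local SpMV w_k := A_k e into block
   tasks. The local row block is split into I x J logical blocks of size
   bs x bs; only the nblk blocks listed in (bi, bj) are nonzero. Each
   nonzero block contributes one partial product that is accumulated into
   the corresponding output piece w̃_i. */
void blocked_spmv(int I, int J, int bs, int nblk,
                  const int bi[], const int bj[],
                  const double *blocks,   /* nblk dense bs x bs blocks */
                  const double *e,        /* replicated input, J*bs long */
                  double *w) {            /* local output, I*bs long */
    (void)J;                              /* J fixes the shape of e only */
    for (int i = 0; i < I * bs; i++) w[i] = 0.0;
    for (int t = 0; t < nblk; t++) {      /* one "task" per nonzero block */
        const double *B = blocks + t * bs * bs;
        for (int r = 0; r < bs; r++)
            for (int c = 0; c < bs; c++)
                w[bi[t]*bs + r] += B[r*bs + c] * e[bj[t]*bs + c];
    }
}
```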

Implementation using MPI + OmpSs
Once we have analyzed the parallelism that the application offers, in this subsection we describe how to exploit it via a combination of two parallel programming interfaces: MPI and OmpSs. Following the OmpSs programming model [14], in our task-parallel implementations, tasks are annotated with the directive #pragma oss task, using clauses to specify the operands' directionality as input (in), output (out) or both (inout). For example, the routine for a dot product, calculating σ := x^T y, x, y ∈ ℝ^q, is annotated with in clauses for both input vectors and an out clause for the scalar result; for a routine calculating the axpy y := y + αx, the annotation declares x as in and y as inout.

As already hinted during the presentation of the message-passing task-parallel BJCG solver, the replication of vector d into e is performed across the processes using the MPI collective MPI_Allgatherv. To ensure that an updated version of d is fed into the MPI collective, we introduce a task barrier, using the OmpSs directive #pragma oss taskwait, before the invocation of the communication primitive. This enforces that all tasks up to that point are completed before the communication (in that process) is allowed to proceed, creating a task synchronization point which is leveraged to perform the convergence test (τ > τ_max?) right after it. This is followed by an MPI synchronization across processes implicit in the MPI collective primitive. The global reductions are realized via MPI_Allreduce. Similar to the previous case, we insert a #pragma oss taskwait on the specific variable being reduced prior to the invocation of the MPI collectives for the reduction, so that all tasks operating on that variable have been completed before the reduction across nodes can begin. Furthermore, the results from each reduction task (e.g., the partial results of S8) are accumulated into the local result (τ_k) using atomic updates. Figure 4 summarizes the previous elaboration, focusing on the operations computed by process P_k during iteration l, using a simple example with the sparse coefficient matrix defined as in Fig. 3. Note the partitioning of the kernels into multiple tasks, with I = 3, J = 5.
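A minimal sketch of the two annotated routines mentioned above might look as follows. This assumes the OmpSs-2 array-section syntax x[start;length]; a plain C compiler simply ignores the pragmas, which also makes the sketch testable serially:

```c
#include <assert.h>

/* Dot-product task: both vectors are inputs, the scalar result is an
   output. In the solver, the partial results of several such tasks are
   combined atomically into the process-local value. */
#pragma oss task in(x[0;q], y[0;q]) out(*sigma)
void task_dot(int q, const double *x, const double *y, double *sigma) {
    double acc = 0.0;
    for (int i = 0; i < q; i++) acc += x[i] * y[i];
    *sigma = acc;
}

/* axpy task y := y + alpha * x: x is an input, y is read and written. */
#pragma oss task in(x[0;q]) inout(y[0;q])
void task_axpy(int q, double alpha, const double *x, double *y) {
    for (int i = 0; i < q; i++) y[i] += alpha * x[i];
}
```

The runtime builds the task dependency graph from these clauses, so no explicit ordering between a dot task and the axpy tasks feeding it is needed.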

Increasing task asynchronism in the BJCG solver
The analysis of the parallel BJCG method in the previous section reveals three synchronization points per iteration. As the degree of hardware concurrency grows, these synchronizations can become more expensive than some of the computational kernels, impairing the parallel scalability of the method; see, for example, [7,9]. In this section, we focus on the pipelined formulation of the CG solver proposed in [9] and its iteration-fusing task-parallel variant in [18], analyzing how to reduce the synchronization points in a message-passing implementation.

Pipelining
Message-passing pipelined BJCG solver Figure 5 displays the pipelined BJCG solver. Let us make the same assumptions as those stated in the presentation of the message-passing BJCG solver in Sect. 3.1; that is, A, M are both partitioned and distributed by blocks of rows, all vectors are partitioned/distributed conformally unless otherwise stated, and the scalars are globally replicated. Then, we can easily derive that the number of synchronizations has been reduced to only two:

• The gathering of the distributed vector d into the globally replicated copy e (S2.1), prior to the SpMV (S2.2).

• A combined global reduction for the dot products in S11, S12 and S13.
In this case, the convergence criterion can be tested right after the computation of the dot product in S13.
Task-parallel pipelined BJCG solver Exploiting two levels of parallelism, by combining MPI with OpenMP or OmpSs, is also possible in the pipelined variant of the method. The dependency pattern in this algorithm can be expressed in the form of a (simplified) TDG as shown in Fig. 6.
For clarity, in this case, we merge all axpy(-type) updates into a single kernel (S10) and the three dot products into a single one as well (S12). The purpose of dividing the kernels into fine-grain tasks is to overlap (execute partially in parallel) tasks from different kernels. In the pipelined algorithm, the preconditioner application (S1) is immediately followed by the gathering of vector d (S2.1), in preparation for the SpMV. This introduces a synchronization point which impedes a concurrent execution of these two kernels. In contrast, by dividing the SpMV itself (S2.2) into tasks, these can be executed in parallel with some of the tasks for the subsequent axpy (S10), as well as tasks for other axpy vector updates. Furthermore, a task parallelization also allows a partially concurrent execution of the axpy and the final dot product (S12). In summary, S2.2 ∥ S10 ∥ S12. Unfortunately, after the combined reduction, there appears a new synchronization point implicit to the communication, which is leveraged to test the convergence criterion. Figure 7 offers simplified TDGs of the classical and pipelined versions of the BJCG solver (left and right, respectively), exposing which kernels can proceed in parallel by dividing their execution into multiple tasks. Comparing both TDGs, we can conclude that the classical version comprises fewer vector operations than the pipelined version, but the former requires an additional synchronization step.

Implementation using MPI + OmpSs
The codes for this scheme annotate tasks using the OmpSs directives/clauses. Furthermore, they employ MPI collective primitives for the replication of vector d (MPI_Allgatherv), just before the SpMV, and the global reduction (MPI_Allreduce) of the three scalars at the end of the iteration. As in the task-parallel implementation of the plain CG method, an intra-process task synchronization is enforced, via the introduction of a #pragma oss taskwait, before the MPI_Allgatherv collective. This enforces that all operations on d have been completed before its replication into e. Similarly, the insertion of #pragma oss taskwait directives on the three scalars that are reduced, before the MPI_Allreduce, guarantees the correct behavior of these operations.

Iteration fusing
In [18], the authors propose a variant of the pipelined scheme, named iteration-fusing CG (IFCG), that enables the concurrent execution of kernels from consecutive iterations on multicore processors. Specifically, their variant IFCG1 decomposes all kernels into fine-grain tasks, as described in the previous sections, and relaxes the periodicity of the convergence test, performing it only every t iterations. For those iterations where the test is not checked, on a shared-memory processor, this allows the dot products at the end of one iteration to proceed in parallel with the application of the preconditioner and the SpMV at the beginning of the subsequent iteration, as there exist no dependencies between them and there is no need for MPI communication primitives; see Fig. 7.

Fig. 7 Simplified dependencies between kernels in the classical BJCG solver and the pipelined variant (left and right, respectively). In both cases, a single iteration comprising the operations/kernels between two consecutive convergence tests is shown

Our message-passing implementation of the pipelined CG scheme already decomposes the iteration kernels into tasks of fine granularity. At first sight, it could appear that relaxing the periodicity of the convergence test could then unleash the same "iteration-fusing effect" on a message-passing architecture. Unfortunately, the MPI primitive for the (combined) global reductions at the end of the iteration hinders this, as collective routines are blocking. Moreover, the same problem appears in the SpMV, as it is necessary to replicate vector d before this operation can commence.
An easy way to solve the previous issues would be to use tasks to implement both the computation and communication phases, relying on task dependencies to deal with inter-node and intra-node synchronizations. However, this approach cannot be efficiently implemented with the current MPI and OpenMP specifications. Concretely, MPI provides the MPI_THREAD_MULTIPLE mode, which supports the concurrent invocation of MPI calls from multiple threads, but this is not sufficient to efficiently support task-based programming models such as OpenMP or OmpSs-2. The main issue is that tasks are not aware of the synchronous MPI primitives, which might block not only the task but also the underlying hardware thread that runs it. Even if the MPI implementation does not rely on busy waiting to check for operation completion and the hardware thread becomes idle, the task runtime has no means to discover that the hardware thread is available without an explicit notification from the MPI side. Without this notification mechanism, if the number of blocked in-flight MPI operations reaches the number of hardware threads, the application will suffer a deadlock. With the current specification of MPI, it is the responsibility of the application developer to avoid this situation. However, this severely constrains the ability of application developers to fully benefit from task-based programming models.
The Task-Aware MPI (TAMPI) library [16] overcomes these limitations by introducing a new MPI threading level called MPI_TASK_MULTIPLE. When MPI is initialized with this new threading model, each time a synchronous MPI operation blocks, instead of waiting until the operation completes, the TAMPI library notifies the runtime so that it can schedule another task, avoiding any potential deadlock and improving CPU utilization. Using the TAMPI library, we can implement both the computation and communication phases using tasks, relying on fine-grained synchronizations to naturally overlap both phases and avoiding intra-node and inter-node global synchronization points. Therefore, using TAMPI, we can break the MPI_Allgatherv routine into MPI_Send and MPI_Recv tasks, with the purpose of overlapping computation and communication. In this case, each MPI process sends its part of the vector to the others and, when an MPI process receives one part of the vector, it can start the computation with that part, without waiting for the whole vector, as was the case without TAMPI. Furthermore, the global reductions can also be overlapped with some computations.
Fig. 8 Execution trace of the three versions of the task-parallel BJCG solver in two nodes, using 1 MPI rank per node and 4 OmpSs threads per MPI rank

Figure 8a provides a trace, obtained with the Extrae+Paraver tools, that displays the execution of one iteration of the task-parallel BJCG solver using 2 MPI ranks and 4 OmpSs threads per MPI rank. (Details on the parallel platform are given in Sect. 5.) We can distinguish there the sequence of operations detailed in Fig. 2, consisting of the initial MPI_Allgatherv which replicates the contents of vector d (S1.1); followed by the (local part of the) SpMV (S1.2) and a dot product (S2); the MPI_Allreduce for the communication stage of the latter; and the final part composed of two combined axpy(-type) vector updates (S3 and S4), the preconditioner application (S5), two dot products (S6 and S8) with the corresponding communication, and an axpy(-type) vector update (S7). Figure 8b offers the trace of a single iteration of the task-parallel pipelined BJCG solver, employing the same configuration as that in Fig. 8a. In this case, the sequence commences with the application of the preconditioner (S1); this is directly followed by the MPI_Allgatherv (S2.1), the (local) SpMV (S2.2) and the axpy(-type) vector updates (S3-S10); finally, the sequence is completed with the three combined dot products (S11-S13) and the corresponding MPI_Allreduce.

Analysis of the execution traces
These two traces expose that the SpMV and preconditioner application dominate the execution time of the iteration, but the cost of the vector operations is not negligible. Furthermore, the traces identify several synchronization points: one prior to the SpMV, to replicate vector d, and some additional ones imposed by the dot products. Overall, this yields three synchronization points per iteration for the classical BJCG solver compared with only two for the pipelined variant. The negative impact of these synchronizations comes from the communication overhead and workload imbalance that leads to idle waits and waste of computational resources. Figure 8c shows a trace for the execution of several iterations (each marked with a different color) of the iteration-fusing BJCG solver. This visual result confirms that the execution of some of the kernels of distinct iterations can be overlapped in time, confirming that the synchronization points that were visible in the classical and pipelined BJCG have been removed with this technique.

Setup
The experiments in this section employed IEEE754 double-precision arithmetic and were carried out on the MareNostrum4 (MN4) supercomputer at Barcelona Supercomputing Center. This platform consists of SD530 Compute Racks with an Intel Omni-Path high-performance network interconnect. Each node includes two 24-core Intel Xeon Platinum 8160 processors (2.10 GHz) and 96 GBytes of DDR3 RAM. We performed our evaluation on up to 32 compute nodes of this infrastructure, using a single socket per node in the experimental evaluation (24 cores). The platform runs the SuSE Linux Enterprise Server operating system. The codes were compiled using Mercurium C/C++ (version 2.1.0), with MPICH (version 3.2.1).
Other software included OmpSs-2 based on the Nanos 6 Runtime Library; MKL (version 2017.4) for the vector kernels and preconditioner application; and Extrae+Paraver (versions 3.5.1 + 4.7.2) to obtain and visualize execution traces.
For the experimental analysis, we leveraged a sparse s.p.d. coefficient matrix arising from the finite difference discretization of a 3D Poisson equation with a 27-point stencil. The fact that the vector involved in the SpMV kernel has to be replicated in all MPI ranks constrains the size of the largest problem that can be solved. Therefore, in order to analyze the weak scalability of the method, we permuted the original matrix, transforming it into a band matrix, where the size of the band depends on the number of nodes employed in the evaluation. Concretely, the upper and lower bandwidth (BW) was set to 100×#nodes, taking values from 100 to 3,200 as the number of nodes increases from 1 to 32. With this adjustment, we can maintain the size of the matrix at n = 8,000,000 rows/columns, but still increase its bandwidth (and, therefore, the computational workload) proportionally to the hardware resources. The sparsity degree of the coefficient matrix is not greatly affected by this modification, as it remains around 0.08%.
For the linear systems, each entry of the right-hand side vector b was initialized with the sum of the nonzero elements of the corresponding row of the matrix, so that the exact solution has all its entries equal to one and can be used to verify correctness, and the PCG iteration was started with the initial guess x0 ≡ 0. The parameters that control the convergence threshold and the number of FUSE steps of the iterative process (see [18]) were set, respectively, to 1.0E-6 and 20.

Performance analysis
We next evaluate the performance of the task-parallel BJCG solvers, focusing on the effects on performance of the synchronization-reduction techniques (pipelining and iteration fusing) compared with the classical BJCG algorithm. Our complete experimentation included many configurations of MPI ranks and OmpSs threads. Concretely, given the node architecture, we employed 1, 2, 3, 4 or 6 MPI ranks per node and a number of OmpSs threads that filled all the socket cores of the node, that is, 24, 12, 8, 6 or 4, respectively. For brevity, in the following we only present the results of the best configuration for each algorithm and number of nodes. Figure 9 reports the execution time per iteration (averaged over 10 executions) of the task-parallel BJCG solvers when we vary the number of nodes from 1 to 32, using a single socket per node. Specifically, we present a weak scaling evaluation, setting the number of nonzeros of the sparse matrix (nz) to be roughly proportional to the number of cores by increasing the size of the band of the matrix, as discussed in Sect. 5.1. At this point, we note that nz only offers an estimation of the computational cost, as the communication, among other factors, may play a significant role. Therefore, although we can assume that the amount of data per node is the same, we cannot expect the execution time to be maintained when we increase the resources proportionally to nz. This is confirmed by the plot in Fig. 9, which exhibits similar execution times for the algorithms executed on 1, 2, 4 and 8 nodes, but a considerable increase in this metric when the number of nodes grows to 12, 16, 20, 24 and 32, in principle due to the communication and synchronization overheads.
The comparison of the execution times of the distinct BJCG solvers reveals significant differences, especially as we increase the number of nodes. On the one hand, for 4 and 8 nodes (96 and 192 cores, respectively), there are only small time variations between the methods, though we can observe that exploiting the MPI/OmpSs interoperability offers some small benefits (around 6%). On the other hand, for 12 to 24 nodes (288 to 576 cores), the difference is more notable (around 15%); for 32 nodes (768 cores), the time difference between the classical and pipelined BJCG becomes remarkable, exposing the positive effects of exploiting the interoperability provided by the TAMPI library for this particular application. At that point, the IFCG algorithm outperforms the others by 48%.
We next analyze the impact of increasing the number of MPI ranks and, consequently, the volume of communication, for the different algorithms. Concretely, Fig. 10 displays the iteration time of the solvers when they are executed on 32 nodes, with 1, 2, 3, 4 or 6 MPI ranks per node, adjusting the number of OmpSs threads in order to keep all 24 cores per node in use. We observe that the execution time of the classical and pipelined BJCG grows with the number of MPI ranks while, in contrast, the execution time of IFCG is maintained or even decreases. Thus, in this case, the interoperability layer of the TAMPI library properly counterbalances the increase in communication volume when there are several MPI ranks per node.
Finally, we evaluate the strong scaling of the task-parallel versions of the BJCG solver. For this analysis, the problem size and the bandwidth are fixed, respectively, to n = 12,000,000 and BW = 400, whereas the resources are increased from 4 nodes/96 cores to 32 nodes/768 cores. The reason for setting the baseline to 4 nodes instead of a single node is that the maximum problem size that fits in one node is too small to obtain good scalability when the number of nodes is large, as the computation-to-communication ratio becomes rather small. Figure 11 shows the execution time per iteration of the solvers, averaged over 10 executions. In general, as expected, the iteration time decreases as the number of cores grows. Comparing the different versions, the results demonstrate that the IFCG variant consistently outperforms the other two, by a margin of around 8–15%.

Concluding remarks
In this paper, we have presented a message-passing task-parallel implementation of the pipelined BJCG that, in practice, reduces the number of synchronization points to only two per iteration. To attain this goal, our iteration-fusing variant relaxes the periodicity of the convergence test in the iteration while leveraging the TAMPI library to eliminate the negative blocking effects of the collective communication primitives. The combination of this relaxation technique with TAMPI enables the concurrent execution of tasks from two consecutive iterations, which notably increases the parallel performance of the proposed message-passing task-parallel pipelined BJCG solver. Concretely, our experimental results using up to 32 nodes (768 cores) show a speedup with respect to the classical BJCG and a plain pipelined version that consistently grows with the number of nodes (cores).
As part of future work, we plan to analyze alternative implementations of SpMV that avoid the replication of data and can be used to experiment with larger numbers of nodes or on platforms with a reduced amount of RAM per node.