Computing the matrix sine and cosine simultaneously with a reduced number of products

A new procedure is presented for computing the matrix cosine and sine simultaneously by means of Taylor polynomial approximations. These are factorized so as to reduce the number of matrix products involved. Two versions are developed, to be used in single and double precision arithmetic. The resulting algorithms are more efficient than schemes based on Padé approximations for a wide range of matrix norms.


Introduction
Many dynamical systems are modeled by differential equations for which closed-form solutions cannot be found, so one has to compute approximate solutions. These differential equations usually preserve some underlying geometric structure which reflects the qualitative nature of the phenomena they describe. It is then relevant that the approximations share these qualitative properties with the exact solution of the differential equation, so as to render a faithful description. The design and analysis of numerical integrators preserving some of these geometric structures constitutes the realm of Geometric Numerical Integration (GNI), an active and interdisciplinary research area and the subject of intensive development during the last decades [6,10,15,18,20,23].
Exponential integrators can be considered as a class of GNIs tailored to stiff and oscillatory equations [7,8,13,14,16]. For large systems of equations these schemes usually require computing the action of the exponential of a matrix on a vector [13,14]. However, for problems of moderate size it may be more appropriate to compute directly the exponential of the matrices involved.
When the problem is oscillatory, very often the formal solution involves both the sine and cosine of a matrix. Consider, for example, the Schrödinger equation in quantum mechanics,
$$ i\,\frac{\partial \psi}{\partial t} = H(t)\,\psi, $$
where H(t) is a Hermitian operator and ψ is a complex wave function. A usual procedure to get numerical approximations involves first a spatial discretisation or working on a finite dimensional representation. In any event, one ends up with a matrix equation with a similar structure. If A is a real and constant matrix, the unitary evolution operator is given by
$$ e^{-itA} = \cos(tA) - i\,\sin(tA). \qquad (1) $$
There are different techniques to compute efficiently the exponential of a matrix [2,3,5,12,21,25,26,27]. However, using any of these general algorithms to approximate the unitary matrix $e^{-itA}$ in (1) involves products of complex matrices, which makes them computationally expensive. Alternatively, we propose an efficient procedure to compute the matrix sine and cosine that only involves a small number of products of real matrices. The algorithm is used in combination with the squaring
$$ \cos(2A) = 2\cos^2(A) - I, \qquad \sin(2A) = 2\,\sin(A)\cos(A). $$
In this way, it only requires two products per squaring (instead of four products when considering the square of complex matrices), thus making the overall procedure more efficient.
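To illustrate the idea, here is a minimal sketch of the scaling-and-squaring strategy built on these identities; the kernel `cos_sin_taylor` (returning Taylor approximations of the cosine and sine of the scaled matrix) and the choice of the scaling parameter `s` are placeholders, not prescriptions from the paper.

```python
import numpy as np

def cos_sin(A, s, cos_sin_taylor):
    """Approximate cos(A) and sin(A) by scaling and squaring.

    The matrix is scaled by 2**(-s), a Taylor-based kernel produces the cosine
    and sine of the scaled matrix, and the double-angle identities
    cos(2B) = 2 cos(B)^2 - I and sin(2B) = 2 sin(B) cos(B)
    recover the result with two real matrix products per squaring step.
    """
    I = np.eye(A.shape[0])
    C, S = cos_sin_taylor(A / 2.0**s)      # kernel: Taylor approximations for the scaled matrix
    for _ in range(s):
        C, S = 2.0 * C @ C - I, 2.0 * S @ C
    return C, S
```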
There are other examples where the computation of the sine and cosine of a matrix can be of interest. For instance, for wave equations given by the generic second order system
$$ y'' + A\, y = g(y), \qquad y \in \mathbb{R}^N, $$
exponential integrators frequently require to solve separately the linear homogeneous problem
$$ y'' + A\, y = 0. \qquad (2) $$
Writing (2) as a first order system, the solution is given by
$$ \begin{pmatrix} y(t) \\ y'(t) \end{pmatrix} = e^{tM} \begin{pmatrix} y_0 \\ y'_0 \end{pmatrix}, \qquad M = \begin{pmatrix} 0 & I \\ -A & 0 \end{pmatrix}, $$
and
$$ e^{tM} = \begin{pmatrix} \cos(t A^{1/2}) & A^{-1/2}\sin(t A^{1/2}) \\ -A^{1/2}\sin(t A^{1/2}) & \cos(t A^{1/2}) \end{pmatrix}. \qquad (4) $$
Notice that the dimension of M is twice the dimension of A, and so the cost of matrix-matrix multiplications grows, in general, by a factor of eight.
On the other hand, a closer look at the functions to be approximated clearly indicates that the same algorithm used to evaluate the matrix sine and cosine for the unitary matrix (1) should not be used directly, since it would first require computing the square root of the matrix, $B = A^{1/2}$, in addition to a multiplication and an inversion of this matrix. As a matter of fact, an efficient approximation to the exponential (4) was already presented in [4]. We propose in this case an improved algorithm based on a modification of the methods to compute the matrix sine and cosine, with the goal of computing simultaneously the functions
$$ c(t^2 A) = \cos\big(t A^{1/2}\big), \qquad s(t, A) = A^{-1/2}\,\sin\big(t A^{1/2}\big), $$
which are well defined for any matrix A through their Taylor series. For the double angle we will take into account that
$$ c\big((2t)^2 A\big) = 2\, c(t^2 A)^2 - I, \qquad s(2t, A) = 2\, s(t, A)\, c(t^2 A), $$
thus requiring only two products per squaring. Notice that we do not use the property $\cos(2A) = I - 2\sin^2(A)$ since the function $\sin(A)$ is not computed in this case. In summary, the purpose of this paper consists in developing algorithms that allow one to compute $\cos(A)$ and $\sin(A)$, or $c(t^2 A)$ and $s(t, A)$, simultaneously, providing full accuracy up to single or double precision with a reduced computational cost. Thus, in particular, we propose an algorithm that, with only four products, approximates $\cos(A)$ with an error of order $O(A^{17})$, and with two extra products it also approximates $\sin(A)$ with an error of order $O(A^{18})$. The same procedure allows one, with one extra product (seven products in total), to approximate $\cos(A)$ and $\sin(A)$ with errors of order $O(A^{25})$ and $O(A^{24})$, respectively.
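Analogously, a minimal sketch of the squaring stage for c(t^2 A) and s(t, A); it assumes the approximations for the scaled time step have already been produced by some Taylor-based kernel, which is not shown here.

```python
import numpy as np

def c_s_squaring(C, S, n_squarings, n):
    """Given C ~ c(ts^2 A) and S ~ s(ts, A) for the scaled step ts = t / 2**n_squarings,
    recover c(t^2 A) and s(t, A) via the double-angle relations
    c((2t)^2 A) = 2 c(t^2 A)^2 - I and s(2t, A) = 2 s(t, A) c(t^2 A);
    each step costs two real matrix products."""
    I = np.eye(n)
    for _ in range(n_squarings):
        C, S = 2.0 * C @ C - I, 2.0 * S @ C
    return C, S
```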
Although one can find in the literature several algorithms to compute cos(A) (see [25] and references therein), only a few of them are designed to compute the sine and cosine simultaneously (see [1] and references therein). As our analysis shows and several numerical examples confirm, the technique we propose here outperforms all of them.

The algorithms
The search for fast algorithms for evaluating matrix polynomials has received considerable interest in the recent literature [2,3,17,19,22,24,28,29]. We next briefly summarize how to approximate the matrix sine and cosine functions by means of certain polynomials involving a reduced number of matrix products. This reduction essentially follows the same approach used in [9] to minimise the number of commutators appearing in different Lie-group integrators, and was successfully adapted to the Taylor expansion of the matrix exponential in [2] and especially in [3].
Generally speaking, the strategy consists first in devising a recursive procedure to compute the polynomial approximating the matrix cosine with the minimum number of products; these same products are then reused to approximate the matrix sine as accurately as possible and in the cheapest possible way.
Clearly, the most economical way to construct polynomials of degree $2^k$ is by applying the following sequence, which requires the evaluation of only k products: first we form the intermediate matrices $A_2, A_4, \ldots, A_{2^k}$, each obtained as a product of linear combinations of previously computed matrices, and finally we take a linear combination of all of them. Here the indices in $A_2, \ldots, A_{2^k}$ are chosen to indicate the highest attainable power, i.e., $A_{2^j}$ contains the power $A^{2^j}$. Of course, there are many redundancies in the coefficients, since some of them can be absorbed by others.
It is a simple exercise to check that any polynomial of degree up to four can be computed with two products, whereas polynomials of degree up to eight can generically be computed with only three products. This does not mean, however, that all such polynomials can be written with just three products; this is the case, in particular, of $P_7(A) = A^7$, as can be readily seen. When a given polynomial cannot be reproduced by following the previous approach, new terms can be incorporated into the intermediate matrices, and this generalises the procedure. We use this technique in the sequel to approximate first $\cos(A)$ and $\sin(A)$ simultaneously with the minimum number of products, and then we apply the same procedure to $c(t^2 A)$ and $s(t, A)$.
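As a simple illustration of the counting argument (not of the specific factorizations used later in the paper), a generic polynomial of degree four, $P(A) = b_0 I + b_1 A + b_2 A^2 + b_3 A^3 + b_4 A^4$, can be evaluated with two matrix products; the helper name below is ours.

```python
import numpy as np

def poly_degree4(b, A):
    """Evaluate b0*I + b1*A + b2*A^2 + b3*A^3 + b4*A^4 with two matrix products.

    Product 1: A2 = A @ A.
    Product 2: A2 @ (b3*A + b4*A2), supplying the cubic and quartic terms.
    """
    I = np.eye(A.shape[0])
    A2 = A @ A                                                              # first product
    return b[0] * I + b[1] * A + b[2] * A2 + A2 @ (b[3] * A + b[4] * A2)    # second product
```

With a third product one can append a further term built from $I$, $A$, $A_2$ and the previous product, which is how degrees up to eight are typically reached.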

Computing cos(A) and sin(A) simultaneously
Let us denote by
$$ T^c_{2m}(A) = \sum_{k=0}^{m} \frac{(-1)^k}{(2k)!}\, A^{2k}, \qquad T^s_{2m+1}(A) = \sum_{k=0}^{m} \frac{(-1)^k}{(2k+1)!}\, A^{2k+1} $$
the Taylor polynomial approximations of $\cos(A)$ and $\sin(A)$ up to order $2m$ and $2m+1$ in A, respectively, and by $T^s_{2m+1,\ell}$, with $\ell > 2m+1$, any polynomial of degree $\ell$ such that $T^s_{2m+1,\ell}(A) = T^s_{2m+1}(A) + O(A^{2m+2})$.

$k = 3$ products. This constitutes a trivial problem, but it nevertheless illustrates the general procedure. With two products we can compute $T^c_4$,
$$ A_2 = A^2, \qquad T^c_4(A) = I - \tfrac{1}{2} A_2 + \tfrac{1}{24} A_2^2, $$
and with one extra product we can get
$$ T^s_5(A) = A\left( I - \tfrac{1}{6} A_2 + \tfrac{1}{120} A_2^2 \right). $$

$k = 4$ products. With three products we can compute $T^c_8$, since it is a polynomial of degree four in $A_2 = A^2$ and the construction described above applies. With one extra product we can approximate the matrix sine, but only up to order seven; according to the previous notation, $T^s_{7,9}(A) = T^s_7(A) + O(A^8)$. The order of approximation of the matrix sine can be increased up to nine by incorporating one extra product.

$k = 6$ products. $T^c_{16}(A)$ can be expressed with only four products by a suitable choice of the intermediate matrices. In fact, we get two families of solutions depending on a free parameter, $x_1$, which is chosen to (approximately) minimize the 1-norm of the vector of parameters $(x_1, \ldots, x_8)$; this results in the coefficients collected in (13). Some of the coefficients are irrational numbers because they correspond to solutions of a nonlinear system of equations.
With two extra products we can approximate the matrix sine up to order $O(A^{18})$, i.e., up to a higher order than the matrix cosine.

$k = 7$ products. With five products we can compute $T^c_{24}$. The best solution we have obtained is reported with 20 digits for the coefficients; they can in fact be determined with arbitrary accuracy. With two extra products we can approximate the matrix sine up to order $O(A^{23})$.
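To make the counting concrete, the following sketch implements the trivial $k = 3$ case above (the Taylor approximations $T^c_4$ and $T^s_5$); the higher-order kernels follow the same pattern but use the fitted coefficients discussed above, which are not reproduced here.

```python
import numpy as np

def taylor_cos_sin_order4(A):
    """Return T^c_4(A) and T^s_5(A) using three matrix products in total.

    Products: A2 = A @ A, A4 = A2 @ A2, and one final product by A for the sine.
    """
    I = np.eye(A.shape[0])
    A2 = A @ A                              # product 1
    A4 = A2 @ A2                            # product 2
    Tc = I - A2 / 2.0 + A4 / 24.0           # cos(A) + O(A^6)
    Ts = A @ (I - A2 / 6.0 + A4 / 120.0)    # product 3; sin(A) + O(A^7)
    return Tc, Ts
```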

Computing c(t^2 A) and s(t, A) simultaneously

Let us denote by $P^c_m(t^2 A)$ and $P^s_m(t, A)$ the Taylor expansions of the functions $c(t^2 A)$ and $s(t, A)$ up to order $m$ in A, respectively, with A a real matrix. Notice that they are approximations up to order $2m$ and $2m+1$ in t to the respective functions. Analogously, we will denote by $P^s_{m,\ell}$, with $\ell > m$, any polynomial of degree $\ell$ such that $P^s_{m,\ell} = P^s_m + O(A^{m+1})$. Next we show how the previous algorithms to approximate the sine and cosine functions can be adjusted to approximate $c(t^2 A)$ and $s(t, A)$. As before, we proceed according to the number of products involved.

$k = 3$ products. With two products we can compute $P^c_4(t^2 A)$. With the same number of products we can also evaluate $P^s_{3,4}(t, A)$, whereas with one extra product the order of the sine approximation can be increased further.

$k = 4$ products. With three products we can compute $P^c_8(t^2 A)$, whose coefficients are the same as those given in (13).
with the solution for the coefficients $a_{i,j}$ given in (17), whereas with one extra product we can approximate $P^s_{11}(t, A)$, with the same coefficients as in (19).
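As a quick numerical consistency check of the functions $c(t^2 A) = \cos(t A^{1/2})$ and $s(t, A) = A^{-1/2}\sin(t A^{1/2})$ approximated here, the sketch below compares their truncated Taylor series with the closed-form expressions on a symmetric positive definite matrix, for which the square root is unambiguous. The matrix, step size and truncation length are arbitrary illustrative choices, and SciPy is assumed to be available.

```python
import numpy as np
from math import factorial
from scipy.linalg import cosm, sinm, sqrtm, solve

def c_s_series(A, t, terms=30):
    """Truncated Taylor series of c(t^2 A) and s(t, A):
    c = sum_k (-1)^k t^(2k) A^k / (2k)!,  s = sum_k (-1)^k t^(2k+1) A^k / (2k+1)!."""
    n = A.shape[0]
    C = np.zeros((n, n))
    S = np.zeros((n, n))
    P = np.eye(n)                                       # P holds A^k
    for k in range(terms):
        C += (-1)**k * t**(2 * k) / factorial(2 * k) * P
        S += (-1)**k * t**(2 * k + 1) / factorial(2 * k + 1) * P
        P = P @ A
    return C, S

# Consistency check on a symmetric positive definite matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5))
A = X @ X.T + 5.0 * np.eye(5)
t = 0.7
B = sqrtm(A)                                            # principal square root
C_ref, S_ref = cosm(t * B), solve(B, sinm(t * B))       # cos(t A^(1/2)), A^(-1/2) sin(t A^(1/2))
C, S = c_s_series(A, t)
print(np.linalg.norm(C - C_ref, 1), np.linalg.norm(S - S_ref, 1))
```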

Padé approximations
At this point it is useful to briefly review the schemes presented in [1] to compute the matrix sine and cosine simultaneously, since they will be compared in section 4 with our own procedure. The methods presented in [1] are rational (Padé) approximations $c_m$, $s_m$ sharing a common denominator. It is clear that $s_4$, $c_4$ can be computed simultaneously with 5 products ($A_2$, $A_4$, $A_6$, $A_8$, and the extra product for the numerator in $s_4$) and the computation of two matrix inverses. Since both denominators are the same, only one LU factorization is necessary. The total cost is $(7 + \tfrac{1}{3})$ products. Notice that the same order (with very similar accuracy, as we will see) is obtained with our novel approach at the cost of only 4 products (and a smaller number of matrices needs to be stored).

Error analysis
Next we analyse how to bound the truncation errors of the previously considered Taylor polynomial approximations of order $2m$ and $2m+1$ for the cosine and sine functions, respectively. They have the form
$$ \cos(A) - T^c_{2m}(A) = \sum_{k=m+1}^{\infty} \frac{(-1)^k}{(2k)!}\, A^{2k}, \qquad \sin(A) - T^s_{2m+1}(A) = \sum_{k=m+1}^{\infty} \frac{(-1)^k}{(2k+1)!}\, A^{2k+1}. \qquad (29) $$
On the other hand, the truncation errors of the approximations of the cosine and sine functions obtained by using Padé approximants for $e^{iA}$ [1] can be written as analogous power series in A,
$$ \cos(A) - c_m(A) = \sum_{k} b_k A^{k}, \qquad \sin(A) - s_m(A) = \sum_{k} \tilde b_k A^{k}. \qquad (30) $$
Clearly, the series (29) and (30) can be bounded in terms of $\theta := \|A\|$ by replacing each power of A by the corresponding power of $\theta$ and each coefficient by its absolute value, e.g.
$$ \|\cos(A) - T^c_{2m}(A)\| \le \sum_{k=m+1}^{\infty} \frac{\theta^{2k}}{(2k)!} \qquad (31) $$
and
$$ \|\sin(A) - T^s_{2m+1}(A)\| \le \sum_{k=m+1}^{\infty} \frac{\theta^{2k+1}}{(2k+1)!}, \qquad (32) $$
with analogous bounds for the Padé approximants. We denote by $\theta^M_{2m}$ the largest value of $\theta$ such that the bounds (31), (32) do not exceed a prescribed accuracy, u, for each method $M \equiv T^c_{2m}, T^s_{2m+1}, c_m, s_m$. To achieve maximum accuracy, we bound the previous forward absolute errors by the unit round off, $u = 2^{-53}$ and $u = 2^{-24}$ in double and single precision floating-point arithmetic, respectively. We have truncated the series of the corresponding functions after 150 terms to find $\theta^M_{2m}$. The corresponding values for the new Taylor approximations of the cosine and sine functions are collected in Tables 1 and 2. For completeness, we also include the values of $\theta^M_{2m}$ for the Padé approximations, as given in [1], together with the total number of matrix products $\Pi_{2m}$ corresponding to each procedure. In the case of Padé approximants, we have added the cost of evaluating two inverse matrices sharing the same LU factorization, i.e., $(2 + \tfrac{1}{3})$ products, to the total $\pi_m$.
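A sketch of how the thresholds $\theta^M_{2m}$ can be computed for the Taylor cosine approximations, following the recipe above (series truncated after 150 terms, bisection on $\theta$); the function names and the bracketing interval are ours.

```python
from math import factorial

def cos_error_bound(theta, m, terms=150):
    """Bound (31): sum_{k=m+1}^{m+terms} theta^(2k) / (2k)!, accumulated term by term."""
    k = m + 1
    term = theta**(2 * k) / factorial(2 * k)
    total = 0.0
    for _ in range(terms):
        total += term
        k += 1
        term *= theta * theta / ((2 * k - 1) * (2 * k))
    return total

def theta_max(m, u=2.0**-53, lo=0.0, hi=20.0, iters=100):
    """Largest theta with cos_error_bound(theta, m) <= u, found by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cos_error_bound(mid, m) <= u:
            lo = mid
        else:
            hi = mid
    return lo

# Example: threshold for T^c_16 (m = 8) in double precision arithmetic.
print(theta_max(8))
```

The sine bound (32) is handled analogously, with odd powers of $\theta$ and $(2k+1)!$ in the denominator.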

Numerical experiments
We measure the performance of the new Taylor polynomials (denoted 'cosmsinmT') and the Padé approximations (denoted 'cosmsinmP') [1] for computing the matrix cosine and sine functions simultaneously. The platform for all numerical experiments is MATLAB R2013a, and the matrix 1-norm has been used in implementing the algorithms. The experiments have been carried out for 2500 matrices (adjusted in order to have different norms) of the following types:
• 52 test matrices have been chosen from the MATLAB gallery function [11] (blue); 690 samples with different norms were tested from this set.
• Using the rand() and randn() functions in MATLAB to randomly generate matrices with entries drawn from different distributions: 400 matrices with normally distributed entries, 500 matrices with entries uniformly distributed in the interval (0, 1), and 501 matrices with entries in the interval (−0.5, 0.5).
• Using the spdiags() and rand() functions in MATLAB to construct 400 triangular nilpotent matrices with random rank (red).
The test matrices have been generated as in Remark 5 of [3], and all matrices are adjusted to have 1-norms within a prescribed range; some test matrices come from the Matrix Function Toolbox [11]. The reference solutions of the matrix cosine and sine have been calculated with Mathematica with 100 digits of precision. We have computed the relative error
$$ Er = \frac{\| f(A) - F \|_1}{\| f(A) \|_1}, $$
where F is the approximation to f(A) (with f = cos or sin) returned by each algorithm. Figs. 2 and 3 exemplify the behaviour of the algorithms on a set of the test matrices in terms of relative errors, number of products and computational times. The performance plot shows the percentage of problems (y-axis) that are within a given factor (x-axis) of the best method [30]. In the experiments illustrated by the performance profiles in Fig. 4, the 2500 matrices of dimension ≤ 64 × 64 have been tested. We have observed that the cosmsinmT method has a lower relative error than the cosmsinmP method for 877 and 1257 of the 2500 matrices when computing the matrix cosine and sine functions, respectively (358 and 348 results are equal). These results are evident from the top of Fig. 4. The bottom of Fig. 4 shows clearly that the cosmsinmT method is less expensive than cosmsinmP.
The performance profiles in Fig. 5, obtained from the 2500 matrices of dimension ≤ 1024 × 1024, confirm the superiority of the cosmsinmT method in terms of computational cost.
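For reference, a performance profile in the sense used above (the fraction of problems for which a method is within a factor α of the best method on each problem [30]) can be computed as in the following sketch; the array layout and the sample data are illustrative assumptions, not the data of the experiments.

```python
import numpy as np

def performance_profile(costs, alphas):
    """costs: array of shape (n_problems, n_methods), e.g. relative errors or run times.
    Returns an array of shape (len(alphas), n_methods) with the fraction of problems
    for which each method is within a factor alpha of the best method on that problem."""
    costs = np.asarray(costs, dtype=float)
    ratios = costs / costs.min(axis=1, keepdims=True)      # performance ratio per problem
    return np.array([(ratios <= a).mean(axis=0) for a in alphas])

# Example with two methods (say, 'cosmsinmT' and 'cosmsinmP') and illustrative run times.
times = np.array([[1.0, 1.3],
                  [2.1, 2.0],
                  [0.9, 1.8]])
print(performance_profile(times, alphas=[1.0, 1.5, 2.0]))
```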

Conclusions
We have presented a new algorithm to compute the matrix cosine and sine. The algorithm contains several methods that are optimised for different values of the norm of the matrix and the desired accuracy, and can be combined with the scaling and squaring technique. Each of these methods is obtained by following a sequence in which each stage uses the results from all previous ones. An error analysis is also carried out, and we have shown, both theoretically and in the numerical experiments, that the new algorithm is superior to other procedures from the literature based on Padé approximations to the matrix cosine and sine.
The new algorithm only involves matrix-matrix products and does not require computing matrix inverses, as is the case for the Padé approximations. The cost of computing the inverse of a dense matrix can be taken as 4/3 the cost of the product of two dense matrices. However, for sparse matrices the computational cost of the proposed algorithms grows nearly linearly, while the cost of Padé approximations grows much faster because, in general, the inverse of a sparse matrix is a dense matrix.