Application of Multi-core and GPU Architectures on Signal Processing: Case Studies

Alberto González, José A. Belloch, Gema Piñero, Jorge Lorente, Miguel Ferrer, Sandra Rogeri, Carles Roig, Francisco J. Martínez, María de Diego, Pedro Alonso, Víctor M. García, Enrique S. Quintana-Ortí, Alfredo Remón and Antonio M. Vidal

Correspondence author: agonzal@dcom.upv.es

Audio and Communications Signal Processing Group (GTAC) iTEAM, Universidad Politécnica de Valencia Department of Information Systems and Computation (DISIC) Universidad Politécnica de Valencia Department of Computer Science and Engineering (ICC) Universidad Jaime I de Castellón

Abstract

In this article part of the techniques and developments we are carrying out within the INCO2 group are reported. Results follow the interdisciplinary approach with which we tackle signal processing applications. Chosen case studies show different stages of development: We present algorithms already completed which are being used in practical applications as well as new ideas that may represent a starting point, and which are expected to deliver good results in a short and medium term.

Keywords: Multi-core/GPU Architectures, Structured linear systems, FFT, Convolution, MIMO detection, LDPC codes, Array processing, Adaptive algorithms.

1. Introduction

INCO2 [1] is a group of excellence in the Comunidad Valenciana (Spain), recognized as such by the local government through the PROMETEO 2009/013 project award. The members of the INCO2 group address problems arising in Signal Processing applications from an interdisciplinary perspective, designing solutions based on high performance hardware and developing algorithmic techniques with a modern and advanced software conception. In [2], both the architectural design and programming models of current general-purpose multi-core processors and graphics processors (GPUs) were covered, with the goal of identifying their possibilities and impact on signal processing applications. Probably the best form of appreciating the effect of these new architectures on signal processing is to analyze the performance attained by multi-core/GPU architectures in the solution of a variety of applications. As a natural continuation of that work, in this paper we present several case studies that show how parallelization on multi-core/multi-core architectures can be applied to specific problems and the outcome of it.

The rest of the paper is organized as follows. In Section 2 we show how to parallelize a detection method for MIMO digital communications systems on multi-core architectures. An evaluation of several packages to compute the FFT is presented in Section 3. Section 4 is devoted to the solution of Toeplitz linear systems on GPU and the parallelization of a beamforming algorithm for microphone arrays in Section 5. In Section 6 adaptive algorithms in digital signal processing systems with parallel convex combinations are presented. We dedicate Section 7 to present two potential applications to be developed in GPU in the near future by INCO2: Multichannel convolution and the decoding of LDPC codes. Finally some concluding remarks are reported in Section 8.

2. Direct search methods for MIMO Systems

An emerging technology for communication is transmitting through many input and output systems, which are known as MIMO systems [3]. This technology provides, among other advantages, an increase in the bandwidth and reliability of communications [4]. In this section, we will focus on the efficient detection of digital symbols transmitted through a MIMO system.

A wireless MIMO communication can be modeled by a system composed of M transmitting antennas and N receiving antennas. A complex signal s = [s1, s2, ..., sM]T is sent, where the real and imaginary parts of each component belong to a discrete and finite set A (the constellation or alphabet), and a signal x ∈ C0 is received. Signal x is a linear combination of the transmitted signal s, perturbed with additive white Gaussian noise v ∈ C0, therefore, x can be written as x = Hs + v, where the entries of the channel matrix H are complex. The optimal or maximum-likelihood (ML) detection of the sent signal means that, for each signal, the following discrete minimization problem must be solved: minx||s - Hx||2||, further details about MIMO detection can be found in [4].

When the dimensions of the problem and/or the size of the constellation grow, the computation of the optimal solution becomes very expensive [5]. In response to this, many heuristic techniques have been examined as alternatives. Our research group has studied the application of parallel computing to the different existing solvers. An approach is to use standard discrete minimization algorithms and adapt them to the problem, such as the Direct Search methods, which were first described in [6] and more recently revisited in [7]. These methods can be parallelized with two different goals in mind, following a common practice, we could use parallelism to reduce the computing time; alternatively, it can also be used to increase the probability of obtaining the optimal (ML) solution. This can be achieved by performing several searches in parallel using different initial points. We have adapted these methods to the MIMO detection problem, first with sequential versions and later with parallel versions of the sequential algorithms [8].

One of the most popular techniques for MIMO decoding is the Sphere Decoding algorithm [9]. This algorithm restricts the search to a sphere centered in the received signal x and with a given radius; it can be described as a search in a tree of solutions. The parallelism in this case can be exploited by assigning different branches of the tree to different processors. Several versions of this algorithm have been parallelized by the authors, using different parallel schemes, and different technologies (OpenMP and a hybrid method) [10]. The different versions were tested in a multi-core cluster composed of two PRIMERGY RX1600, each one with four Dual-Core Intel Itanium2 processors (1.4 GHz, 4 GB of shared RAM). The versions were tested with different problems, of increasing size (total number of nodes in the solution tree). The result is reported in terms of speed-up, which is the ratio between the time obtained with p processors and the best execution time obtained using a single processor. Figure 1 shows the speed-up attained with the parallel version based on OpenMP.

For all the problems tested, the best speedup is achieved with six processors; compared with the time consumed by the serial version (one processor), the execution time is reduced by a factor of 5. Of course, these results strongly depend on the problem, and the results are comparatively better when the dimension of the problem is increased. Nevertheless, these results offer an idea of the possibilities of using parallel computing for this problem.

3. FFT on multi-core/multi-core architectures

The discrete Fourier transform is one of the most important operations in Digital Signal Processing. Given a vector x = [x1, x2, ..., xN] its DFT is defined as the matrix-vector product: Y = X \cdot x, where X = e^{i2\pi f_i \cdot \mathbf{n}}, \quad i=1 \ldots N. The DFT can be used, among other things, to obtain the frequency spectrum of a signal.

In many applications, the cost of computing the DFT [11] is too high; this is the case, e.g., of real-time applications. In those cases, the fast Fourier transform (FFT) can alleviate this problem of calculating the DFT. In particular, given a vector of size x, the computational cost of the DFT is O(N^2) flops (floating-point arithmetic operations) while FFT requires only O(N \log N) flops. In several experiments, we have evaluated different implementations of the FFT from different libraries on two different parallel architectures based on a multi-core processor and a GPU (see Table 1). Specifically, on the multi-core processor three libraries have been used: MKL (Intel), IPP (Intel) and...
Among the most widespread options in signal processing, the new multi-core / GPU architectures will be likely present in the next few years.

Table 1. Characteristics of the architectures used in the experiments.

<table>
<thead>
<tr>
<th>Processors</th>
<th>#cores</th>
<th>Frequency (GHz)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Intel Xeon Quadcore E5405</td>
<td>8</td>
<td>2.33</td>
</tr>
<tr>
<td>NVIDIA Tesla C1060</td>
<td>240</td>
<td>1.3</td>
</tr>
</tbody>
</table>

The experiments comprised the computation of several FFT of a vector using single-precision arithmetic, with the size of the input vector varying from 8 to 8200 elements. The number of FFT computed in each experiment is proportional to the vector size, so the product between the vector size and the number of executions equals 8388608 (this number ensures more than 1000 executions with the biggest vector size used and is also a multiple of all the employed vector sizes). The performance (in terms of GFLOPS or 10^12 flops per second) is computed using the same reference cost $SV_{\text{ref}} = 1$ for all experiments.

Table 2. Libraries evaluated in the experiments.

<table>
<thead>
<tr>
<th>Library</th>
<th>Version</th>
<th>Architecture</th>
</tr>
</thead>
<tbody>
<tr>
<td>FFTW</td>
<td>3.2.1</td>
<td>Multi-core</td>
</tr>
<tr>
<td>Intel MKL</td>
<td>10.1</td>
<td>Multi-core</td>
</tr>
<tr>
<td>Intel IPP</td>
<td>6.0.2.876</td>
<td>Multi-core</td>
</tr>
<tr>
<td>NVIDIA CUFFT</td>
<td>2.1.1</td>
<td>GPU</td>
</tr>
<tr>
<td>Volkov</td>
<td></td>
<td>GPU</td>
</tr>
</tbody>
</table>

Figure 2 shows the performance obtained when the number of elements of the input vector is a power of two. As it can be seen, the performance of the kernels that operate on the GPU is notoriously higher than that of the multi-core counterparts. Other experiments were carried out for instance taking a prime number of elements of the input vector. In this case, all the FFT kernels suffered an important degradation, with the decrease being especially important for CUFFT, which yields the lowest performance.

A preliminary conclusion from this study is that the FFT kernels in current libraries for the GPU clearly outperform those in libraries for the multi-core processors. However, much work remains to be done to fully optimize both types of kernels.

4. Solving structured systems on GPUs

Structured linear systems can be defined as:

$$T X = B$$

where $T = T_{\text{sym}}$ is a structured matrix, $B = B_{\text{steady}}$ contains the right-hand side vector, and $X = X_{\text{steady}}$ is the sought-after solution vector.

Some structured matrices, like Toeplitz, are characterized by an external structure (e.g., in the Toeplitz matrix all elements along diagonals are equal). Hankel and Vandermonde are also examples of structured matrices with an explicit external structure. The field of structured matrices also includes some classes with non-external structure, like the inverse of a Toeplitz matrix or Cauchy-like matrices. A formal definition of structured matrices is based on the property known as displacement structure [12], which basically sets that there exist one symmetric case or two non-symmetric case matrices $M = (m_{ij})$ containing the same information as an $n \times n$ structured matrix.

Structured matrices appear in a wide range of engineering applications, as, i.e., digital signal processing. Other appealing feature of structured matrices deals with the existence of fast algorithms which allows to solve problem (1) with an order of magnitude lower than classical algorithms that do not exploit its special structure.

We will describe next our approach to solve problem (1) where $T$ is a symmetric Toeplitz matrix using a GPU. To tackle this problem, we first searched for algorithms with an intrinsic parallelism, which allowed the use of a large number of threads working concurrently on the problem. One such an algorithm performs the triangular decomposition (LLDLT, for $L$, lower triangular and D diagonal) of symmetric Cauchy-like matrices. This algorithm works on a so-called generator matrix $G = \text{Gen}_{sym}$, with the property that, with only two columns, it contains all the information of a given symmetric Cauchy-like matrix. The following is a scheme of the algorithm:

```plaintext
for j = 1, n
    for i = j, n
        Use ith row of G to compute li,j entry end for
end for
```

The outer loop processes the columns of $I$, while the inner loop inspects the $i$-th row of the $j$-th column. All the entries of a given column $j$ (inner loop) can be computed concurrently. This motivates the use of a linear array of blocks of threads to compute all these entries in parallel on a GPU.

The solution of (1) using the previous algorithm is carried out by transforming the Toeplitz matrix $T$ into a Cauchy-like one. This is performed by means of the Discrete Sine Transform, an FFT-related transformation that is carried out first in the CPU. When applied to symmetric Toeplitz matrices, this transformation exhibits the property that it results into two independent Cauchy-like matrices of order $n/2$. This benefit allows the solution of larger problems, by solving the two independent systems in turns, overcoming the memory limits of GPU (4GB). Furthermore, it also enables the use of two GPUs in the solution of a single symmetric Toeplitz problem.

Figure 3 shows the performance improvement obtained by using one GPU over the CPU to solve problem (1) with different numbers of independent vectors.

5. GPU array processing

A microphone array is a set of several microphones distributed in the space forming a specific pattern. Since a few decades ago, beamforming algorithms for microphone arrays have been studied and developed in order to improve the Signal-to-Noise Ratio (SNR) of the received signals, or to recover spatially separated signals considering their different Directions of Arrival (DoA) [13]. Nevertheless, one of their main limitations has been their high computational cost in practical acoustic environments where real-time sound processing must be carried out. Therefore we propose in this section an approach to the parallelization of some computations that are common to different beamforming designs.

System Model

Consider the system of Figure 4 where two loudspeakers are emitting two independent signals, $s_1(t)$ and $s_2(t)$, respectively, where $t$ denotes the discrete time instant. At the other part of the room, three microphones are recording the mix of the two signals plus noise. This system can be modeled as a multichannel system with 2 inputs (loudspeakers) and 3 outputs (microphones), and the generalization to a multiple-input multiple-output (MIMO) system can be easily done.

Regarding the system of Figure 4, the problem is how to recover $s_1(t)$ or $s_2(t)$ by means of the signals recorded at the microphones. The approach taken herein makes use of signal processing algorithms to design the broadband beamformers (filters $g_1$, $g_2$, and $g_3$ in Figure 4), once all the room channel responses ($h_i$, in Figure 4) are known. This problem is commonly known as signal de-convolution, and plays an important role in teleconferencing where the speech of interest has to be extracted from the observations of the microphone array but is usually corrupted by noise, room reverberation and other interfering sources.

According to Figure 4, the output of the $n$-th microphone is given by:

$$x_n(t) = \sum_{k=1}^{N} h_{nk} s_{g_k}(t)$$

where $n = 1, 2, \ldots, N$, being $N$ the number of microphones and $m$ the number of source signals, that is equal to the number of loudspeakers in Figure 4.
4. $Z_g$ is the length of the longest room impulse response of all acoustic channels $b_n$. The noise contribution has not been considered for sake of clarity.

This signal model can be rewritten in vector/matrix form as $x_i(k) = H_i x_i(k)$, where $x_i(k)$ is a column vector and $x(k)$ and matrix $H$ are defined as $x(k) = [x_1(k), x_2(k), \ldots, x_N(k)]^T$, where $x(k) = [x_1(k), x_2(k), \ldots, x_N(k)]^T$, and $H = [H_1, H_2, \ldots, H_N]$, where $x_1(k), x_2(k), \ldots, x_N(k)$ for $n=1,2,3$ and $m=1,2$. $f_i$ denotes the transpose of a vector or a matrix and $L_g$ is the length of the longest channel impulse response. Once the recorded signals $x_i(k)$ have been modelled, the broadband beamformers ($f_i$) have been designed and calculated. Benesty et al. [14] present an excellent state-of-the-art of the main algorithms used in audio applications. Some of them make use of the channel matrix $H$ exclusively and calculate $f_i$ based on its inverse (or pseudo-inverse), whereas, other methods take also into account the correlation matrix of the recorded signals. Under perfect estimation of channel impulse responses, both types of filters show similar good performance, but in practical experiments, where the $h_n$s are estimated under some uncertainties, filters based on the estimated correlation matrix outperform those based on the channel inversion.

The estimated correlation matrix of spatially sampled signals, $x_i(k)$, is commonly known in the literature as the Sample Correlation Matrix (SCM), and its expression is given by:

$$R_{xx}(k) = \frac{1}{N} \sum_{n=1}^{N} x_i(n) x_i(n)^T$$

where $x_i(k) = [x_1(k), x_2(k), \ldots, x_N(k)]^T$.

Regarding the dimensions of SCM, $(L_g \times N_g)$, $L_g$ depends on the length of the room impulse response $l_n$ and is usually greater than 150. Considering that (3) has to be recalculated at short time intervals due to the non-stationary nature of sound signals, and that $N > L_g$, we always ensure that $R_{xx}$ is full-rank and invertible, then an efficient parallelization of the computation in (4) is required. Three different implementations have been considered in order to obtain the matrix correlation as fast as possible. In one hand the sequential implementation and in the other two different parallel implementations: one in multiple cores of CPU and the other in the GPUs.

The sequential implementation

The sequential implementation of the Sample Correlation Matrix of (3) is recursively done by a ‘for’ loop which calculates one vector outer multiplication and one matrix sum correspondent to index k of the total sum at each iteration. Its implementation can be seen schematically in Figure 5, where vector $x_i(k)$ is split in smaller overlapping vectors of $Z_g$ length each.

Parallel implementations of the Sample Correlation Matrix of (3)

1) Parallelization in CPU multi-core:

In this case the parallelization consists in dividing the sequential tasks described above in different CPU cores. To achieve that the Matlab toolbox for parallel computing has been used, more specifically the functions parallelpool and spmd [15].

2) Parallelization on GPU:

In this case, the parallelization is performed at a lower level than in CPU. For this purpose, the software interface called Jacket [16], which allows running code in the GPU through Matlab, has been tested. The following steps have been taken:

- First, to send microphone array signals $x_i(k)$ to the GPU using Jacket function gddualu().
- Second, to parallelize (3) so no iteration of ‘for’ loop must depend on a previous result. Then parallelization of the for loop is done using the Jacket function gspmd().

The step 2 has been carried out splitting each vector $x_i(k)$ of (4) in basic blocks of variable length $Z$. The performed parallelization in GPU for $Z = L_g/2$ can be seen in Figure 6. Let us denote $x_i(k)^{(n)}$ as the n-th block. Considering that $x_i(k)$ has length $L_g$, the number of available blocks $N_{g,n}$ is $N_g/L_g - Z$. Therefore, the single outer product $x_i(k) x_i(k)^T$ of (4) is now computed in parallel at the GPU though $(2N_g)k$ outer products $x_i(k)^{(n)} x_i(k)^{(n)}$. Figure 6 shows the case for $N_g = 3$ microphones, so there are $3N_g = 6$ basic blocks available, and $x_i(k) x_i(k)^T$ will be computed within $(2N_g)^2 = 36$ outer products in parallel.

Three different lengths of basic blocks have been tested in GPU: $L_g/4$, $L_g/2$ and $L_g$. Which results in $N_g' (L_g/4)$ and $N_g' (L_g/2)$ outer products of block vectors $x_i(k)^{(n)}$ for the system depicted in Figure 4 with $N_g' = 3$ the different sizes of 2 give a parallelization of 9, 36 and 81 independent outer products for step 2, respectively.

Testing Results

Sequential implementation and both parallel implementations explained above for 3 recorded signals $x_i(k)$ at sampling frequency of 11 kHz have been tested with an i7 CPU of Intel and a NVIDIA GTX285 GPU. Results obtained for signals of duration 4 seconds (44 samples) can be seen in Table 3. The GPU parallel method has been carried out with 3 cores, it has been proved that for this kind of computation it was the best configuration. As we can see in Table 3, the parallel implementation with multiples cores of a CPU only obtain speed up greater than 1 when $Z_g \leq 210$ comparing to sequential implementation, achieving almost double velocity in the best case of $Z_g = 110$. An explanation for this low performance may be that too much time is lost distributing tasks into the different cores of the CPU, whereas the code to be distributed consists only in a few lines. Moreover, results of Table 3 show that computational time grows exponentially when $Z_g$ exceeds $l_n = 210$; we suppose that in this case, length of filters $f_i$ is very large and buffer memory of the cores collapses: big amounts of data are replicated in all buffers, and that makes such a significant time increase. For $l_n = 300$, Matlab returns a memory error because there is not enough memory allocated to matrices with such big dimensions.

Regarding GPU implementation, Jacket performs with matrix of maximum 65,536 elements. As we see in figure 6, dimensions of SCM depend on $l_n$, so working within GPU configuration divided in 9 parts, when $l_n \geq 260$ dimensions of SCM exceeds 65,536 elements, so Jacket program returns a memory error. To solve this problem we divide the calculation of SCM in more number of parts, and as Table 3 show, as $L_g$ grows, the number of parts used in the calculation of the matrix correlation must be incremented to reach the best time results. Otherwise, if we took the most efficient case for each length of filter, $l_n$, it can be shown that the speed-up in all the cases is near to 4, which means a significant time saving. Same results of Table 3 are depicted in Figure 7, where the graph at right shows those methods whose computation times are below 2 seconds. It should be noted that GPU outperforms sequential and multi-core implementations in all cases.

Finally it should be noted that, considering the duration of the recorded signals, 4 seconds delay in calculating the matrix correlation of one to three tenths of a second is attainable for real-time applications.

6. Adaptive algorithms with parallel combinations on multi-core platforms

During last years, adaptive systems [17] have been the objective of many studies due to their
and low residual noise level. However, this bet- ter behavior appears at the expense of doubling the computational cost, since two algorithms have to be executed at the same time, in parallel. The parallel nature of this structure allows the distribution of the computation within parallel hardware like the multi-core systems, where the computational load can be easily dealt out among different cores and thus the execution time reduced. Therefore, the adaptive algorithm could be used in systems working at a higher sampling rate. The computations needed could be carried out in two kernels, using a third kernel to combine both algorithms, or using one of the first kernels to combine the signals if there are only two kernels. Next, Figure 8 shows the block diagram of the convex structure executed over a multi-core platform.

As it can be checked in Figure 8, apart from the convex combinations, the rest can be executed in a parallel way. Therefore, the execution time has been reduced to the time that a single filters needs to carry out the computations. In other words, think of this structure and the use of two kernels, the time required by the process becomes the time needed by one single kernel, instead of the double time required by a sequential implementation using monochrome structures. Figure 9 exhibits the algorithm runtime per iteration and the comparison with the sequential version executed in a single core system. It shows the relation between the execution time and the length of the adaptive filters used in the convex structure when LMS algorithm is used as a controller of the adaptive filters. This test has been carried out on an Intel Core i7 CPU 920 @ 2.66GHz, and the algorithm was run in a Matlab R2009b platform using Parallel Computing Toolbox V4.2.

As it can be seen in Figure 9, the reduction of the algorithm runtime using a platform of two kernels is really significant. This structure only needs half runtime of the sequential one. The most important conclusion is that it will be possible to work with higher sampling frequencies in order to deal with signals with higher bandwidth, or just to have adaptive algorithms which require high computational load needing less time to carry out this operation.

7. Future Prospects

In this section, we present two potential applica- tions in Signal Processing which are focused on the implementation of the multichannel convolu- tion and the decoding of LDPC codes using the capabilities of GPU computing.

Multichannel convolution
It can be shown that, the computation of the convolution operation consists of several scalar multiply and add operations [22], where a certain parallelism can be identified. In order to compute the convolution, the architecture of the GPUs allows different levels of parallelism. At a first level, a single convolution operation of two signals can be efficiently implemented in parallel inside a GPU. The second level of parallelism allows carrying out different convolutions of different chan- nels parallelly. Note that, obviously, the benefits of using a GPU increase when both levels of par- allelism are exploited simultaneously.

The possibilities that GPUs offer are varied. How- ever, the main challenge when implementing an algorithm GPU relies on adapting the require- ments of the GPU to obtain the best performance de- pending on the properties of the signals (mono- channel, multichannel, etc.) and, of course, the type of processing that wants to be carried out:

Convolution of all the signals either by the same k[i] or with different h{k} and with c[i]=h[k]...+h[l]=1, convo- lution of some signals by k[i] and others by h[k]... and all of them at the same time, etc.

Recently, the new CUDA toolkit 3.0 lets use CUFFT [11] with the property concurrent copy and ex- ecution, and therefore, implementing real-time applications where the latency of transferring the samples from the GPU to the CPU for processing and vice versa overlapped by computation.

LDPC Codes on GPU

Low-Density Parity-Check codes (LDPC codes) are linear block channel codes for error con- trol coding with a sparse parity-check matrix (a matrix that contains few ones in comparison to the amount of zeroes). They have been recently adopted by several data communication stand- ards such as DVB-S2 and WiMax. The concept of LDPC coding was first developed by Robert G. Gallager in his doctoral dissertation at the MIT in the beginning of sixties [23] but quickly forgot- ten due to its impractical implementation at that moment and the introduction of Reed-Solomon codes. They were rediscovered by MacKay andNeal in 1996 [24].

These codes provide a performance very close to the Shannon capacity limit of the channel, low error floor, and linear time complexity for decoding (lower than turbocodes). We can find simple tutorials to understand the basics of these kind of codes in [25], [26], and software to test them in [27]. LDPC codes are inherently suited for par- allel hardware and software implementations as we can read in [28], [29] and [30] using CUDA and other GPU programming tools.

LDPC codes can be represented graphically by a Tanner graph [31] (an undirected bipartite graph with variable nodes, c_i, and check nodes, f_i). An example is shown in Figure 10, that corresponds with the parity-check matrix on its left H:

LDPC decoders are based on variations of belief propagation, sum-product or message pass- ing algorithms. In any of these algorithmic de- nomenclations, information flows from/to variable nodes and from/to check nodes until the algo- rithm converges to a stable state, finding the most likely transmitted codeword. An easy example can be observed in the bit flipping al- gorithm (hard decision decoding). The iterations are divided in two dependent steps:

1. Each variable node sends the majority vot- ed bit to all its connected check nodes (at the beginning, the only information avail- able is the received bit)
2. Each check node estimates each connected variable node bit as the parity-check matrix dictates (using the estimations of the rest of the connected variable nodes and excluding the value that is estimating) and send this information to this connected variable node.

Figure 10. Tanner graph of a linear block code parity-check matrix H.
These steps are executed iteratively until the estimated word is a codeword. Better results are obtained when soft decision is used [32]. It can be observed that the computations within the check nodes and within the variable nodes are alternated and interdependent in time, so they must be executed one after another because of their inherent sequentiality. The computations in each check node are independent, so they are perfectly parallelizable; the same happens with the variable nodes computations. Within a check node, it must be computed a different result to every variable node that is connected to it. Something similar is observed regarding to the variable node computations. Figure 11 shows the dependency graph of the parallel algorithm.

We are focusing our implementations on concentrating the operations within each node because each result in a node shares nearly all the multiplication factors that it contains. Another important question is how the accesses to the global and shared memory are arranged in order to make a coalesced access and to avoid conflicts in the memory banks. This will ensure a good speedup in a real time environment.

9. Conclusions

Throughout this article it has become obvious the importance of new multi-core / GPU architectures in the field of signal processing. Among the most widespread options in signal processing, these new architectures will be likely present in the next few years. However, it is also very likely that FPGA devices keep a good share of the market, as they cover a large part of very specific applications.

The purpose of this work was to serve as a show-case of different signal processing applications in which new Multi-core/GPU architectures can be competitive. Different applications, in which researchers of INCO2 Group are working, have been used as case studies. The aim of this group is precisely the application of high performance computing in next-generation parallel architectures (particularly multi-cores and GPUs) in the solution of problems in signal processing. We believe this option is a sure bet in one of the most promising areas of current technology, in general, and the Information Technology area, in particular, where the duo computer-communications can not be dissociated.

Acknowledgements

This work was supported by Generalitat Valenciana Project PROMETEO/2009/013 and partially supported by Spanish Ministry of Science and Innovation through TIN2008-06570-C04 and TEC2009-13741 Projects.

References

[1] www.inco2.upv.es

[20] M. Ferrer, M. de Diego, A. Gonzalez, and G. Piñero, “Convex combi node that is connected to it. Something similar is observed regarding to the variable node computations. Figure 11 shows the dependency graph of the parallel algorithm.

We are focusing our implementations on concentrating the operations within each node because each result in a node shares nearly all the multiplication factors that it contains. Another important question is how the accesses to the global and shared memory are arranged in order to make a coalesced access and to avoid conflicts in the memory banks. This will ensure a good speedup in a real time environment.

9. Conclusions

Throughout this article it has become obvious the importance of new multi-core / GPU architectures in the field of signal processing. Among the most widespread options in signal processing, these new architectures will be likely present in the next few years. However, it is also very likely that FPGA devices keep a good share of the market, as they cover a large part of very specific applications.

The purpose of this work was to serve as a show-case of different signal processing applications in which new Multi-core/GPU architectures can be competitive. Different applications, in which researchers of INCO2 Group are working, have been used as case studies. The aim of this group is precisely the application of high performance computing in next-generation parallel architectures (particularly multi-cores and GPUs) in the solution of problems in signal processing. We believe this option is a sure bet in one of the most promising areas of current technology, in general, and the Information Technology area, in particular, where the duo computer-communications can not be dissociated.

Acknowledgements

This work was supported by Generalitat Valen-

biographies

Francisco José Martinez Zaldívar
Pedro Alonso
Alfredo Remón
Maria de Diego was born in Valencia, Spain, in 1970. She received the Telecommunication Engineering degree from the Universidad Politécnica de Valencia (UPV) in 1994, and the Ph.D degree in 2003. Her dissertation was on active noise conformation of enclosed acoustic fields. She is currently working as Associate Professor in digital signal processing and communications. Dr. de Diego has been involved in different research projects including active noise control, fast adaptive filtering algorithms, sound quality evaluation, and 3-D sound reproduction, in the Institute of Telecommunications and Multimedia Applications (iTEAM) of Valencia. She has published more than 40 papers in journals and conferences about signal processing and applied acoustics. Her current research interest include multichannel signal processing and sound quality improvement.

Miguel Ferrer was born in Almería, Spain. He received the Ingeniero de Telecomunicacion degree from the Universidad Politécnica de Valencia (UPV) in 2000, and the Ph.D degree in 2008. In 2000, he spent six months at the Institute of applied research of automobile in Tarragona (Spain) where he was involved in research on Active noise control applied into interior noise cars and subjective evaluation by means of psychoacoustics study. In 2001 he began to work in GTAC (Grupo de Tratamiento de Audio y Comunicaciones) that belongs to the Institute of Telecommunications and Multimedia Applications. He is currently working as assistan professor in digital signal processing in communications Department of UPV. His area of interests includes efficient adaptive algorithm and digital audio processing.

Sandra Roger was born in Castellón, Spain, in 1983. She received the degree in Electrical Engineering from the Universidad Politécnica de Valencia, Spain, in 2007 and the MSc. degree in Telecommunication Technologies in 2008. Currently, she is a PhD grant holder from the Spanish Ministry of Science and Innovation under the FPU program and is pursuing her PhD degree in Electrical Engineering at the Institute of Telecommunications and Multimedia Applications (iTEAM). In 2009, she was a guest researcher at the Institute of Communications and Radio-Frequency Engineering of the Vienna University of Technology (Vienna, Austria) under the supervision of Prof. Gerald Matz. Her research interests include efficient data detection, soft demodulation and channel estimation for MIMO wireless systems.

Jorge Lorente was born in Algemesí, Spain in 1985. He received the Ingeniero Técnico de Telecomunicación degree from the Universidad Politécnica de Valencia, Spain, in 2007 and the MSc. degree in Telecommunication Technologies in 2010. Currently, he is working at the Institute of Telecommunication Technologies and Multimedia Applications (iTEAM). His research focuses on microphone-array beamforming algorithms and parallelization of signal processing problems on the different cores of a CPU and also on GPU.

Carles Roig was born in Alginet, Spain, in 1986. He received the degree in Telecommunication Engineering from the Universidad Politécnica de Valencia, in 2010. Currently, he works as a research assistant within the Institute of Telecommunications and Multimedia Applications (iTEAM). His research interests include adaptive filtering and its applications to the active noise control.