The fine structure of spectral properties for random correlation matrices: an application to financial markets

We study some properties of eigenvalue spectra of financial correlation matrices. In particular, we investigate the nature of the large eigenvalue bulks which are observed empirically, and which have often been regarded as a consequence of the supposedly large amount of noise contained in financial data. We challenge this common knowledge by acting on the empirical correlation matrices of two data sets with a filtering procedure which highlights some of the cluster structure they contain, and we analyze the consequences of such filtering on eigenvalue spectra. We show that empirically observed eigenvalue bulks emerge as superpositions of smaller structures, which in turn emerge as a consequence of cross-correlations between stocks. We interpret and corroborate these findings in terms of factor models, and and we compare empirical spectra to those predicted by Random Matrix Theory for such models.

We study some properties of eigenvalue spectra of financial correlation matrices. In particular, we investigate the nature of the large eigenvalue bulks which are observed empirically, and which have often been regarded as a consequence of the supposedly large amount of noise contained in financial data. We challenge this common knowledge by acting on the empirical correlation matrices of two data sets with a filtering procedure which highlights some of the cluster structure they contain, and we analyze the consequences of such filtering on eigenvalue spectra. We show that empirically observed eigenvalue bulks emerge as superpositions of smaller structures, which in turn emerge as a consequence of cross-correlations between stocks. We interpret and corroborate these findings in terms of factor models, and and we compare empirical spectra to those predicted by Random Matrix Theory for such models.

I. INTRODUCTION
In Physics, Random Matrix Theory (RMT) is mainly used to model systems of particles interacting according to unknown laws. This is particularly handy for studying energy levels of complex systems such as heavy nuclei and mesoscopic systems. In such cases, the Hamiltonian operator can be conveniently described by a random matrix featuring some suitable symmetry properties. In particular, two matrix ensembles have been commonly used: the Gaussian Orthogonal Ensemble of real symmetric random matrices, and the Gaussian Unitary Ensemble of Hermitian random matrices [1,2]. In both these cases, for proper normalization of matrix elements, the asymptotic statistical properties of the eigenvalues follow the so-called semicircle law: where ρ(λ) is the marginal probability density function of the eigenvalues. Until recent years, physicists often neglected the study of random correlation matrices, even though they find applications in very diverse fields ranging from biology to econometrics. For this reason, applied mathematicians have studied such objects since the 1920s [3]. The asymptotic eigenvalue statistics in this case is given by the Marčenko-Pastur distribution [4], which will be extensively discussed in the following sections. Since the late 1990s, thanks to the growing interest in financial markets as prototypes of complex systems, physicists started working on random correlation matrices [5,6], and this will be the subject of this paper as well.
We consider a set of N stocks whose spot price at time t we denote as S i (t), i = 1, . . . , N . Let t 1 , . . . , t T +1 be T + 1 equally spaced time instants, then we introduce the corresponding log-returns typically, one can think of the t i as days. This notation is a little redundant, and we can simply denote time steps as j = 1, . . . , T + 1. Now, we can assume that the T recorded log-return values are realizations of N × T random variables R (i) j , so that we globally end up with N T observations r ij , i = 1, . . . , N , j = 1, . . . , T . Equivalently, the vector r i . = (r i,1 , . . . , r i,T ) (3) containing all the observations of the ith asset returns, can be seen as a realization of a vector random variable R (i) . Such a framework is fully characterized by finite probability distributions [7,8]: 1 , . . . , R T ∈ A where A (i) j ∈ R ∀i, j and B (i) ∈ R T ∀i. Depending on the choice of the random variables R (i) s, such a picture allows for a huge variety of possible descriptions of the stochastic dynamics of financial data. Most simply, a standard assumption, according to which the log-returns are described by uncorrelated Gaussian processes ((r 1,i , . . . , r N,i ) ∼ N (0, 1 N ), where 1 N represents the N × N identity matrix), could be adopted. However, as is well known [9], correlations often play a major role, and a realistic description of financial markets should by no means neglect them. Still, a Gaussian framework can be retained by observing that a set of zero-mean correlated Gaussian numbers generated by a stationary stochastic process is completely characterized by its expectation value vector µ and covariance matrix E: Following [10], in this paper we shall simplify this structure to the assumption that cross-correlations between assets and auto-correlations in time factorize: In the above equation C ik (A jl ) represents an element of a N × N (T × T ) positive-definite symmetric matrix C (A). We shall keep this same kind of notation, i.e. denoting matrices by bold letters and the corresponding matrix elements by the same non-bold letters, throughout the rest of the paper. We shall assume the C matrix of cross-correlations to be constant over time. Also, most importantly, we shall neglect all possible correlations in time by assuming A = 1 T : This last assumption is well motivated both from an empirical viewpoint [9] and a theoretical one, since asset returns can be shown not to display auto-correlations whenever assets are assumed to be described by a sub-martingale. As a matter of fact, from the sub-martingale property one can show that where assuming no dividend payment in the period and . This relation means that returns are uncorrelated (not necessarily independent) random variables, as can be empirically verified [9]. In the following, we shall always assume the previously mentioned condition (S i (t j+1 ) − S i (t j ) ≪ S i (t j )) to be fulfilled, thus allowing to identify log-returns and returns (as in equation (9)). The Gaussian probability measure leading to the correlation structure (7) can be shown to be where R is a rectangular N × T matrix containing all of the returns observations ( T j=1 dR ij is the flat integration measure over matrix elements. Being symmetric, the C matrix in (10) is made of N (N + 1)/2 independent entries. Now, the typical challenge to be faced in many multivariate analysis problems is to estimate these numbers from N time series of T observations, i.e. N T data points. When such data are collected in a N × T matrix R as in (10), then a standard estimator for C is given by the matrix c . = RR T /T . In other words, an estimator for the matrix element C ij in C is given by which is the well-known Pearson estimator for large T (for small values of T the 1/T factor would need to be replaced by 1/(T − 1)). Of course, the c ij s are a noise-dressed representation of the C ij s. As a matter of fact, even though the random variables in R were exactly described by the probability distribution in (10) (i.e. by the correlation matrix C), the finiteness of the data sample under study would anyway cause the c ij s to deviate, on average, from their "true" counterparts C ij s. As it is intuitively clear, the two will become closer as more observations are collected, i.e. as T → ∞, or equivalently as q → 0, where q is the so called "rectangularity ratio": However, realistic situations in financial practice typically involve large numbers of variables and similarly large numbers of observations. Ideally, this is not far from a "thermodynamic limit" situation in which Remarkably, this is precisely the regime under which some powerful RMT results are valid [4]. In particular, this is the limit under which it is possible to make analytical statements about the relation between the eigenvalue spectra of the theoretical covariance matrix C and its estimator (11) [11][12][13]. We shall exploit those results in the following sections.
A very general class of models fulfilling (7) is the one of the so called factor models [14][15][16][17][18]. Such models aim at describing the time evolution of each asset in terms of a few "driving forces", or factors, which typically describe the impact that a market sector or the whole market itself have on a given asset. In a K factors model, the time evolution of asset returns is given by where g (j) i and g (0) i are constant parameters, whereas m j and ǫ i are independent and identically distributed normal random variables. We shall assume these latter to be normalized as follows: In the next section we shall specialize the model in (14) to a particular case. However, in a very general fashion, factor models have proven to be able to reproduce, at least qualitatively, some relevant features of empirical covariance matrix eigenvalue spectra. The general appearance of the return covariance matrix eigenvalue spectrum of a given number of assets (for zero mean and unit standard deviation data) is the one depicted in Figure 1 for the log-returns of the daily prices for the assets composing the S&P500 and FTSE350 Indices. Three main features are clearly visible: a large bulk close to zero, a number of larger eigenvalues "leaking out" of such bulk, and a much larger and isolated eigenvalue. Since the pioneering works [5,6], RMT has become a standard tool to analyze these macroscopic features. More specifically, the aforementioned eigenvalue bulk has mostly been identified with the Marčenko-Pastur distribution [4], i.e. the limiting eigenvalue marginal probability density for the (already introduced) matrix c = RR T /T when all the entries R ij are drawn from a normal distribution N (0, σ). Quite importantly, this result is rigorously derived only in the thermodynamic limit (13) of infinite matrix sizes growing to infinity at a fixed rate. In this limit, the Marčenko-Pastur distribution reads where q is the rectangularity ratio defined in (12). However, as it can be seen in Figure 1 (b)-(d), the Marčenko-Pastur distribution actually provides a very poor fit of empirical distributions when q and σ are assumed to be equal to N/T and 1 (for standardized data), respectively. The aforementioned eigenvalue bulks are reasonably well fitted by a Marčenko-Pastur distribution only when q and σ are assumed to be free parameters, whose values are to be determined via fitting. In particular, this typically causes q to deviate from the ratio N/T , thus introducing the concept of effective system size.
Since the Marčenko-Pastur distribution emerges as the limiting density for the covariance matrix of N uncorrelated time series made of T observations, identifying eigenvalue bulks such as the ones in Figure 1 with it basically amounts to state that most of the information contained in empirical covariance matrix spectra is actually no information at all, being equivalent to the spectrum one would obtain in the presence of pure noise [12,20]. On the other hand, this viewpoint allows one to give a specific meaning to the "large" eigenvalues out of the bulk. As it would also be possible to verify with Principal Component Analysis (PCA) [21], such eigenvalues correspond to groups of correlated assets, most typically belonging to the same market sector. Analogously, the largest eigenvalue of the distribution is usually identified with the "market mode": such an eigenvalue appears as a consequence of those fluctuations that involve the market as a whole, and as a matter of fact the PCA can easily show it to account for a large part of the return variance.
As already anticipated, factor models (14) represent good candidates to reproduce most of the empirical features shown in Figure 1. In the following sections, we shall make use of such models to challenge the previously mentioned common knowledge, according to which the eigenvalue bulks in empirical covariance matrix spectra essentially correspond to noise. Such a common knowledge has already been revised critically in a number of works (see for example [13,[22][23][24][25]), and in this paper we wish to present an additional amount of evidence in this direction. The paper is organized as follows. In Section II the "direct" problem of analytically estimating eigenvalue densities is addressed. In particular, some specific versions of the factor model in equation (14) will be introduced and the eigenvalue spectra for the correlation matrix C of such model will be derived (sometimes performing approximations). Then, the RMT results provided in [12,13] will be applied in order to derive exact results for the noise-dressed version c of the correlation matrix. Eventually, a subsection will be devoted to discuss the results obtained via Monte Carlo simulations in order to validate those analytical results. In the light of such numerical results, we shall also briefly discuss again the applicability limits of the Marčenko-Pastur distribution. In Section III, the "inverse" problem of inferring eigenvalue densities from the empirically observed ones will be discussed. More specifically, a filtering procedure will be devised in order to highlight some of the cluster structure in empirical correlation matrices. Such a procedure will be performed on two data sets (relative to the S&P500 and the FTSE350 Indices), and the results will be interpreted in terms of factor models. Eventually, in Section IV some conclusions and possible future perspectives of this work will be outlined.

II. THEORY: THE DIRECT PROBLEM
A. Cluster models: heuristic analysis Let us now specialize the factor model (14). In particular, let us start from the situation where all asset returns obey the following equation where In the previous equation m N represents a common mode driving all assets with the same "intensity" γ N . We shall now build K clusters of correlated assets from equation (17). Thus, let there be K groups of N k variables (k = 1, . . . , K) withN . = K k=1 N k ≤ N , and let us order the assets so that r i belongs to the kth assets for i = 1 + We can also denote the generic element in the kth cluster as r where γ k ∈ [0, 1], m k (t) ∼ N (0, 1) is a cluster mode and r i is as in equation (17). Thus, we can rewrite the previous relation as We still simply call r i (i = 1 +N , . . . , N ) those elements which do not belong to any cluster, and we assume them to evolve according to (17). We always have Recalling the relations in (15), which can be generalized to include m N in a straightforward way, we can calculate all possible covariance matrix elements between assets described by (17) and (19). Four separate cases can be distinguished: We are now in position to compute the correlation matrix C of the model, whose matrix elements read with straightforward generalizations to those involving elements belonging to clusters. We shall focus for now on the limiting case in which correlations between cluster elements are very strong, i.e. when γ k → 1 in each cluster. One can see from (20) that under this assumption the model's correlation matrix has a simple block-diagonal structure: where E M is the M ×M matrix whose entries are all equal to unity ( matrix with a slightly more complicated structure: The block structure in (22) allows for the computation of the eigenvalue spectrum. In fact, since we have the characteristic equation for the C matrix reads: This eigenvalue spectrum is able to reproduce, at least on a heuristic level, some of the features of empirical spectra (see Figure 1): each cluster gives rise to a large eigenvalue equal to the cluster size N k , and the common mode produces It is worth mentioning that this latter eigenvalue might not necessarily be the largest one: as a matter of fact, large enough N k s and a small γ N can lead to situations in which the largest eigenvalue is given by max k N k . Even though this seems not to be the case in most financial applications, it is still worth stressing that the largest eigenvalue in empirical spectra should not be labelled as the "market eigenvalue" right away, but only after some further checks (as, for example, the inspection of the corresponding eigenvector). Going back to equation (25) ) can be recognized. Also, equation (25) indicates that each cluster gives rise to N k − 1 zero modes, altogether forming a group ofN − K zero modes. In a noise-marred situation, as it can be verified by means of Monte Carlo simulations, the degeneracies in (25) are broken and give rise to two bulks. In the highly correlated cluster assumption (γ k → 1) yielding (25) the two bulks typically remain well separated. However, when such assumption is relaxed, allowing for small values of the γ k s, the two bulks get closer, and for properly chosen values of the parameters they eventually "collide" and merge into one single structure (see Subsection II C and the figures in it). This latter might be identified with the typical eigenvalue bulks appearing in empirical spectra (see Figure 1). It is important to stress, already at this heuristic level, that the emergence of such a bulk in this factor model stems from the presence of (weak) correlations between the assets, oppositely to the Marčenko-Pastur distribution (16), which in turn, as already discussed, originates from pure noise. Nevertheless, quite subtly the Marčenko-Pastur distribution can still provide good fits to such bulks in a number of situations, as we shall illustrate later. The previous factor model, yielding equation (25) for the eigenvalue spectrum of its correlation matrix, can be further simplified to the case where no common factor is driving the asset returns. This can be directly achieved on the eigenvalue spectrum by setting γ N = 0 in equation (25). This gives This spectrum still yields one large eigenvalue for each cluster and two degenerate eigenvalues equal to zero and one, respectively. Just like in the previous discussion, let us now relax the assumption of strong correlations (γ k → 1) within clusters. For the sake of simplicity, let us assume that all assets in each cluster are mutually correlated with the same correlation coefficient ρ k ∈ [0, 1] (which can be explicitly computed from (20)). So, the correlation matrix of the model would read = 1, while the last block is now given by the identity matrix (as it can be seen from the first relation in (20) for γ N = 0). One can verify that In order to further simplify things, let us consider the case where we have just one cluster ofN assets with mutual correlation ρ. Equation (25) in this case would need to be modified to read thus giving one large eigenvalue, and two degenerate eigenvalues equal to (1 − ρ) and one. The latter emerges as a consequence of the N −N mutually uncorrelated assets, i.e. as a consequence of pure noise, while the former is due to the presence of a cluster. Just like in the case discussed previously, a noise-dressed version of (29) would lead to two eigenvalue bulks, and suitably chosen values of ρ would make the two bulks merge into one (see Subsection II C). Thus, in this case too, the emergence of a main bulk would not be a consequence of pure noise alone.

B. Cluster models: exact results
The proper mathematical framework to deal with covariance matrices featuring degenerate spectra (as the ones in equations (25), (26) and (29)) is the one provided in [12,13], and we shall exploit it extensively in the following. So, first let us introduce some basic notions and notations of RMT. Just like we did so far, we shall denote the eigenvalues of the correlation matrix C of a given model as Λ i (i = 1, . . . , N ), while the eigenvalues of the corresponding estimator (11) will be denoted as λ i . Quite straightforwardly, one can define the eigenvalue density for the theoretical correlation matrix as and this is related to the matrix moments M In analogy to (30), one can define an expected spectral density for the estimator c in equation (11): where the expectation is to be meant with respect to the probability measure (10). Generalizing (31), we can then define the expected matrix moments as The two corresponding resolvents, or Green's functions, are given by: where Z, z ∈ C. Then, one can introduce the moment generating functions, and it is possible to show that they are closely related to the Green's functions in the following way where we have Moreover, from the well known relation lim ǫ→0 + (λ + iǫ) −1 = P(λ −1 ) − iπδ(λ) (where P denotes the principal value), one can show that the eigenvalue densities (30) and (32) can be directly derived from the corresponding Green's functions (34): ρ c (λ) = − 1 π lim ǫ→0 + Im g c (λ + iǫ). So, basically, the Green's function contains the same information as the whole eigenvalue density, and the same, through (35), is also true for the moment generating function. In particular, for Λ, λ > 0, the previous relations can be converted into: A fundamental relation between between the moment generating functions of a "true" correlation matrix and its estimator in the infinite matrix size limit (13) can be derived [12,13] either in the framework of Free Random Variables [10,26,27] or using planar diagrammatic methods [12,28]. This derivation will be outlined in Appendix A. The starting point is the following simple relation between moment generating functions where the two complex arguments are related by the following transformation: Once M C (Z) is known, m c (z) can be derived in principle from the following functional equation: Bearing in mind the previous discussion on factor models, we shall focus on correlation matrices whose spectra display degenerate eigenvalues. Let us then assume the correlation matrix C to have L distinct eigenvalues Λ i (i = 1, . . . , L) with degeneracies n i . The moment generating function for such a matrix is given by where the weights w i = n i /N have been introduced. Thus, from (41) we get For each fixed z, this becomes a polynomial equation of degree L + 1 in m c (z)m, yielding as many solutions. The problem arises of choosing the right one: as extensively discussed and detailed in [29], the right branch of the map in equation (40) to pick up is the one giving Z → z for z → ∞. In the simplest case one has C = I N and, of course, the correlation matrix has just one N -fold degenerate eigenvalue equal to one: it can be shown that, in this case, equation (43) leads precisely to the Marčenko-Pastur distribution (16), as one would expect. On the other hand, already when considering two distinct eigenvalues, quite different scenarios are possible, including the previously discussed cases of well separated or merging bulks (see the next subsection). Let us also remark that equation (43) cannot be applied to the large non-degenerate eigenvalues typically displayed by factor models. This is because, as already stated, we shall always work in the thermodynamic limit (13), where the weight (1/N ) of such eigenvalues vanishes as N → ∞. As a matter of fact, this kind of eigenvalues need to be investigated per se, and actually extensive areas of the RMT literature are devoted to the study of statistical properties of single eigenvalues as well as order statistics [30]. In particular, it has been shown in [31] that large non-degenerate sample eigenvalues of correlation matrices follow a normal distribution (see the next subsection for a numerical confirmation).

C. Monte Carlo simulations
In this subsection we present and detail the Monte Carlo simulations we performed in order to test and validate the analytical results described so far. In all cases, we generated T realizations of N stochastic processes described by the factor model introduced in equations (18), (19) and (20) (from a numerical viewpoint, this just boils down to the generation of standard Gaussian random numbers). By choosing different parameter values, we implemented the different versions of the model which were discussed in the previous subsection, corresponding to different theoretical correlation matrices (equation (22), (27)). The eigenvalues of the corresponding estimators (11) were obtained via numerical diagonalization (by means the diagonalization algorithm provided by Matlab R ). In Figure 2, a first example of eigenvalue spectra deriving from factor models is presented. In this first example a common mode (introduced via a non-zero γ N coefficient, see the figure caption for all the details on parameter  (18), (19) and (20) with N = 500, T = 2000 (q = 0.25). One cluster made of N1 =N = 100 variables, correlated via a coefficient γ1 = 0.7, is present. A common mode is introduced via a coefficient γN = 0.3. As expected from equation (25), two eigenvalue bulks are clearly visible. The mean value in the bulk on the left is 0.04, while equation (25) would predict zero as a consequence of the strong correlation limit (γ k → 1) approximation. On the other hand, the mean value in the bulk on the right is 0.85, in remarkable agreement with the predicted value (1 − γN ) 2 /((1 − γN ) 2 + γ 2 N ) = 0.84. For the sake of readability, the "large" eigenvalues are not shown. (b) By setting γ1 = 0.4, the two bulks in (a) merge into a single one. Such a structure, despite emerging as a consequence of (weak) correlations, is very well fitted by a Marčenko-Pastur distribution (see equation (16)) with q = 0.29 and σ = 0.88. values) as well as a correlated cluster of variables are present. As already discussed in the previous subsection, the degeneracies in equation (25) are broken, and, in the limit of strong correlations in the cluster (γ k → 1), two well separated eigenvalue bulks emerge (Figure 2 (a)). On the other hand, when such correlations get weaker, the two eigenvalue bulks get closer, eventually melting into one single structure ( Figure 2). Remarkably, such a structure is quite well fitted by a Marčenko-Pastur distribution, which is however characterized by values of the q and σ parameters that differ from the ones which would be obtained for standardized uncorrelated data (q = N/T and σ = 1). Such features are further illustrated in Figure 3, which refers to the case of a factor model with no common mode (γ N = 0). Again, the progressive fusion (induced by weaker correlations) between separated eigenvalue bulks is shown. Also, we compare the numerically obtained spectra to the eigenvalue densities obtained from the solution of equation (43), obtaining a very good agreement between the two (Figure 3 (a)-(b)). Just like in the previous case, the Marčenko-Pastur distribution seems to provide quite a good fit of the "limiting" eigenvalue bulk obtained for small correlations (Figure 3 (c)). However, we performed a Kolmogorov-Smirnov [32] test under the null hypothesis of data distributed according to a Marčenko-Pastur distribution, and we found such hypothesis to be rejected for all the significance levels we considered (see the caption of Figure 3 for further details). On the other hand, the same test prevented us from rejecting the hypothesis of data distributed according to the eigenvalue density obtained from the solution of equation (43), its degenerate eigenvalues being given by equation (29). This is quite surprising, given the great similarity between the two densities (see Figure 3 (d)), which would be almost undistinguishable if plotted on the scale of the whole distribution (as in Figure 3 (c)). Nevertheless, we believe this result to be quite relevant, since it strongly suggests that the theoretical framework outlined in the previous subsection might be the right way to describe and analyze empirical correlation matrices which display a cluster correlation structure, at least to some extent. In the following section, we shall apply these ideas to financial data. We also believe these findings to provide some interesting evidence against the use of the Marčenko-Pastur distribution whenever non-negligible correlations are present between random variables. Despite being close, in a number of situations, to the eigenvalue densities deriving from the solution of equation (43), the Marčenko-Pastur distribution always needs to be fitted on the data under study, even when they are completely under control (as in the case of Monte Carlo simulations). Then, as already pointed out, the presence of correlations causes the parameters q and σ to deviate from the corresponding values which would be obtained in a pure noise situation. In particular, given the definition in equation (12), this leads to the introduction of the artificial, and possibly misleading, concept of effective system size. Eventually, concluding this subsection on Monte Carlo simulations, in Figure 4 we show a numerically obtained distribution of the largest sample eigenvalue for a factor model yielding the eigenvalue spectrum in (29). As it can be seen by direct inspection, the corresponding histogram is well fitted by a Gaussian distribution, as already anticipated in the previous subsection. Moreover, three statistical tests (whose details are provided in the caption) were performed under the null hypothesis of Normally distributed data, and all the results we obtained prevent from rejecting such hypothesis.

III. EMPIRICAL DATA: THE INVERSE PROBLEM
The goal of this section is to show that some of the features displayed in correlation matrix spectra of factor models are actually present in empirical spectra of financial correlation matrices too. In particular, our goal is to show that the empirically observed eigenvalue bulks, as the ones shown in Figure 1, cannot be regarded as a consequence of pure noise at all. Indeed, by suitably filtering empirical data, it is possible to show a peak separation similar to those shown in Figure 2 and Figure 3. Starting from the two data sets already introduced in a previous section (396 assets from the S&P500 Index and 243 assets from the FTSE350 Index), we shall restrict our attention only to a relatively small number of properly chosen assets. This will be done in order to ideally recreate, as best as possible, the conditions under which the previously discussed eigenvalue bulks emerge from factor models. We shall attempt to empirically recreate the block-diagonal correlation matrix in (27), which, when a single cluster is considered, yields the eigenvalue spectrum of equation (29). Let us rewrite that matrix in this specific case: recalling thatẼ( In order to reproduce the structure in (44), we start from the empirical covariance matrices (let us denote them as c, according to the previously adopted notation) of our data sets and apply the following procedure.
• We first identify a small cluster ofN strongly mutually correlated assets. If we denote the corresponding set of indices as I U , then we have for some threshold value ρ U > 0.
• Then, all assets which are weakly correlated to the elements in the cluster are pointed out. Amongst those, only the ones with small mutual correlations are retained. By grouping their indices in another set I D , we can write The first condition in (46) is meant to reproduce the zero off-diagonal blocks in (44), while the second one is meant to reproduce the identity matrix in the right-lower block.
• If we now redefine N to be the total number of stocks in I U and I D , so that I D contains N −N elements, and we properly sort them, then the approximation to (44) is given as follows where c IU and c ID are square matrices (of dimensionsN and N −N respectively) containing the correlation matrix elements pertaining to the two sets I U and I D . On the other hand, the c IU ,ID matrix (c ID ,IU being its transpose) contains the "interaction" terms between the two sets.
The goal of such a construction is to empirically make contact with the spectrum in (29). As a matter of fact, for suitably chosen threshold values ρ U , ρ ′ D and ρ ′′ D , we expect the eigenvalue spectrum of the c matrix in (47) to be the noise-dressed version of the one in (29). In particular, small values of ρ ′ D and ρ ′′ D should guarantee the c ID block to yield N −N eigenvalues close to one. On the other hand, the blockẼ (N) in (44) yieldsN − 1 small eigenvalues equal to 1 − ρ and a large one equal toNρ + (1 − ρ). Now, it is reasonable to assume ρ to be equal to the average mutual correlation between the assets in I U and to suppose that c IU will produceN − 1 eigenvalues close to this value. In the following we present and discuss the results we obtained applying this procedure to our datasets. In the case of the S&P500 Index, we identified a cluster made ofN S&P = 7 strongly mutually correlated assets (ρ S&P = 0.712, with ρ S&P computed as in (48)), all of which happen to belong to the energy sector. We then identified a group of 33 stocks, belonging to various sectors, which satisfy the previously described requirements: a small mutual correlation (mean value = 0.099) and a small correlation with theN elements in the cluster (mean value = 0.096). So, all in all we have N S&P = 40. Analogously, also in the FTSE350 Index case we were able to identify a cluster made of N FTSE = 7 highly mutually correlated stocks (ρ FTSE = 0.707), all corresponding to investment trusts. In this case, however, we only found 21 more stocks (so that N FTSE = 28) satisfying the aforementioned requirements (mean value of mutual correlation = 0.015, mean value of correlations with elements in the cluster = 0.014). In Figure 5, graphical representations of the empirical correlation matrices we obtained, and a comparison to the theoretically expected ones, are shown. As can be seen by direct inspection, the correlation matrix we obtain in the FTSE350 Index case is remarkably similar to the one in (44), whereas the one we obtain for the S&P500 Index has some further inner structure as a consequence of the much higher mean correlations. In Figure 6 the eigenvalue spectra we obtained from the previously discussed correlation matrices (the ones reported in Figure 5 (a)-(b)) are shown. In particular, in Figure 6 (a)-(b) we plot the spectra obtained, respectively, from the S&P500 and FTSE350 correlation matrices constructed according to the clustering procedure outlined previously. In both cases, two distinct eigenvalue bulks can be noticed. The smaller bulks on the left are made of 6 eigenvalues, and, since we haveN S&P =N FTSE = 7, this is in agreement with the prediction given by equation (29) forN = 7. Also, in the FTSE350 Index case ( Figure 5 (b)) the larger eigenvalue bulk around one is made of N FTSE −N FTSE = 21 eigenvalues, which is again in agreement with (29), while the largest eigenvalue in the spectrum (not shown in the plot) is equal to 5.235, remarkably close to the prediction given byN FTSE ρ FTSE + (1 − ρ FTSE ) = 5.242 (see again equation (29)). On the other hand, the spectrum relative to the S&P500 Index yields two large eigenvalues (not shown in Figure 6 (b)) equal to 3.552 and 6.483, and neither value is in agreement to the large eigenvalue prediction N S&P ρ S&P + (1 − ρ S&P ) = 5.272. Such discrepancy is due to the unresolved correlation structure in the empirical S&P matrix (see Figure 5 (a)), which gives rise to additional sub-clusters. In Figure 6 (c)-(d) the two eigenvalue bulks we just discussed are fitted with the eigenvalue density deriving from the second equation in (38) when applied to the solution of (43), i.e. the moment generating function m c of the noisedressed version of a correlation matrix C with degenerate eigenvalues. In both cases we consider correlation matrices with two degenerate eigenvalues in order to try to fit the two main bulks. The smaller eigenvalue Λ 1 , responsible for the emergence of the smaller bulks on the left, is assumed to be equal to 1 − ρ, accordingly to equation (29). So, in the two different cases we analyzed we have where the two slightly different weights are justified by the previously mentioned fact that in the S&P case there are two isolated eigenvalues which separate from the main bulks, while in the FTSE case there is only one such eigenvalue. On the other hand, the larger eigenvalue Λ 2 , which according to (29) should be exactly equal to one, is assumed to be equal to the empirical mean value of the main bulks on the right in Figure 6 (c)-(d). These are found to be and one might notice that, again, the value obtained in the FTSE case is in excellent agreement with the theoretically expected one. So, all in all, the curves drawn in Figure 6 (c)-(d) are obtained from the values in equations (49) and (50). Such curves, as already mentioned, are fitted to the empirical spectra. However, a bootstrap approach was adopted in order to improve the statistics. More specifically, for each bootstrap iteration a random sampling on the weakly correlated stocks was performed, picking 30 out of 33 in the S&P case and 18 out of 21 in the FTSE case. On the contrary, the stocks forming the highly correlated clusters were always kept (thus keeping the eigenvalue bulks on the left almost unchanged). As it can be seen in Figure 6 (c)-(d) the agreement between theory and prediction is very poor. This is essentially due to the additional correlation structures in the empirical correlation matrices (see Figure 5), which are neglected in the model matrix (44) and in its eigenvalue spectrum (29). All the bulks displayed in Figure 6 (c)-(d) appear to be "smeared" versions of their theoretical counterparts, even the small ones relative to the eigenvalues in (49). Interestingly, this shows that inhomogeneities in correlation structures have quite an impact on eigenvalue spectra even on a "small scale" (let us recall thatN S&P =N FTSE = 7). In Figure 6 (e)-(f) the same fit as the one just discussed is performed, the only difference being that an additional random reshuffling of the returns is performed on the bootstrapped assets. Such an operation is meant to destroy all possible correlations, and this leads to a quite good agreement between data and predictions on the bulks on the right (the theoretical densities being now computed with Λ 2 = 1, accordingly to equation (29)). This essentially confirms that the substantial deviations shown in Figure 6 (c)-(d) can entirely be imputed to the unresolved cluster structures in the empirical correlation matrices. The same kind of analyses (bootstrap and reshuffling) were not performed on the stocks belonging to the correlated clusters because of their very small number. (e)-(f) As in (c)-(d) but with weakly correlated data reshuffled before bootstrap, leading to a much better agreement between data and theory. The reference eigenvalue for the bulks on the right is now assumed to be equal to one (see main text).
All in all, the previous observations definitely suggest that the empirically observed eigenvalue bulks cannot be regarded as a consequence of the noisiness in financial correlation matrices. On the contrary, in the light of the previous discussions it could be conjectured that bulks emerge from the interplay of several cluster structures like the ones we isolated (see Figure 5) [12,13].

IV. SUMMARY AND CONCLUSIONS
Let us now summarize the main messages in the paper.
• Several rough but useful results about spectral properties of financial correlation matrices, such as the position of large non-degenerate eigenvalues, can be inferred by a clever application of the direct problem (see Section II A). This only involves algebraic calculations, namely the solution of suitable secular equations. This approach can be used either when the cluster structure is known a priori, or when there are good reasons to assume a certain correlation structure. Combining the direct analysis with Monte Carlo simulations can provide a clear picture in a number of situations, avoiding the analytical difficulties of Random Matrix Theory, and keeping the finite-sized nature of the problem. Typically, one wishes to reproduce observed spectra starting from a factor model, and this can be done as follows.
1. Identify the cluster structure in the dataset under analysis, using clustering algorithms [33].
2. Estimate the average correlations within clusters.
3. Build a theoretical, "mean field", matrix model C from the above estimates.
4. Run Monte Carlo simulations of the matrix model.

5.
Compare the outcome of the simulation to the empirical spectrum.
If the comparison is statistically satisfactory, the matrix model C can be retained and used for further analyses, such as portfolio selection. If not, the model is to be refined by abandoning the mean field assumption, at least for some cluster interactions.
• As far as the largest eigenvalue is concerned, its distribution is not Tracy-Widom, but Normal [30,31]. Moreover, this distribution cannot be derived from the thermodynamic limit formula (43). In fact, such an eigenvalue is typically non-degenerate and its weight in a diagrammatic expansion of the Green's function would vanish as 1/N , for N → ∞.
• For factor models, the bulks in empirical eigenvalue spectra come as the noise-dressed version of degenerate eigenvalues. Thus, such bulks encode the information on the cluster structure of the empirical correlation matrix c, and this can be evidenced by means of proper clustering methods, as done in Section III.
While there would be no difficulty in studying non-Gaussian multivariate models by means of Monte Carlo simulations, the analytical results presented in Section II B and in the Appendix cannot be easily generalized. In fact, the integrals needed to calculate g c in (34) in the Gaussian case can be exactly obtained by virtue of Wick's theorem, whereas different stochastic models would require painful calculations. The diagrammatic method outlined in the Appendix allows, in principle, for the exact evaluation of the Green's function g c for any finite size N × T , as a function of N and T . Nevertheless, this is a series of 1/z powers, whose convergence properties would be interesting to investigate in the near future.
where in the last line the identity T ×T matrix has been systematically inserted between R and R T matrices, whenever they appear. The z −1 matrices in equation (A1), being multiples of the identity matrix, could be safely pulled out of the expectation map (which is to be meant with respect to the probability measure in (10)). So, in order to compute a generic matrix element of the Green's function g c , one would need to compute n-point correlation functions of the following kind: E [R i1t1 R i2t2 . . . R i2nt2n ]. Following [12], two different kinds of indices have been considered in this expression: indices of the N -type (ranging from 1 to N ) and indices of the T -type (ranging from 1 to T ). Moreover, an even number of matrix elements has been considered, since the Gaussian probability measure (10) ensures the vanishing of all odd moments. Also, Wick's Theorem allows us to split any n-point correlation function into the sum of all possible products of two-point correlation functions (or propagators). For example, the four-point correlation function would read where the two-point correlation function is as in equation (7): Let us now represent the main ingredients in the Green's function expansion as follows (see Figure 7). N -type indices will be represented as black circles, while T -type indices will be represented as grey circles. A z −1 matrix element will be represented as a straight solid line connecting two N -type indices, while elements of the I T /T matrix will be represented as a dashed line connecting two T -type indices. The propagator in equation (A3), in turn, will be depicted as a double arc connecting two pairs of indices, each made of an N -type and a T -type index. With the previous positions, the first few terms in the expansion of the Green's function look like in Figure 8. The Green's function is represented as a grey circle within N -type indices, while the other diagrams correspond to the different contributions in the expansion in (A1). As can be seen, such diagrams are divided into two categories: those with crossing lines (such as the one in the fourth line of Figure 8) and those without crossing lines, also known as planar diagrams. It can be shown that the contribution of diagrams belonging to the former group vanishes in the infinite matrix limit (13). Intuitively speaking, this is because in planar diagrams closed loops (which give a contribution of order N for black lines and a contribution of order T for grey lines) and external horizontal lines (giving contributions of order 1/N for black lines and of order 1/T for gray lines) occur in equal numbers. So, in the thermodynamic limit the two contributions balance each other. On the other hand, non planar diagrams have extra 1/N factors which are not compensated by closed loops. So, in the thermodynamic limit, the Green's function is only composed of planar diagrams, whose building blocks are horizontal lines and "rainbow-like" structures, as it is easily seen from Figure 8.
More formally, such rainbow diagrams are usually called one-line-irreducible (1LI) since they cannot be split into two energy functions (which contain all possible 1LI diagrams). Recalling the form of propagators (see equation (A3)), such relations can be written as Σ(z) = C Tr [ g c (z)] (A7) Σ(z) = I T Tr [C g c (z)] .
Equations (A4), (A6) and (A7) form the so called set of Dyson-Schwinger equations, which can be solved for g c by consecutively eliminating Σ c , g c and Σ c . This yields where G C is as in equation (34). By carrying out all calculations explicitly one finds .
Recalling the definition of the moment generating functions (see equations (35) and (36)), one can see that the relations in (A8) and (A9) complete the derivation of equations (39) and (40), which was the goal of this Appendix. Eventually, let us mention (as already pointed out in [12]) that in the limit of very large samples, i.e. when T → ∞ with N fixed, one has q = 0 and consequently equations (A8) and (A9) yield g c (z) = G C (z). This, of course, would cause the corresponding spectral densities to be identical, giving a rigorous meaning to the intuitive statement that in the limit of a large number of observations the eigenvalue spectrum of the C matrix is faithfully reproduced by its estimator c.