Estimating Space and Space-Time Covariance Functions for Large Data Sets: A Weighted Composite Likelihood Approach

In this article, we propose two methods for estimating space and space-time covariance functions from a Gaussian random field, based on the composite likelihood idea. The first method relies on the maximization of a weighted version of the composite likelihood function, while the second one is based on the solution of a weighted composite score equation. This last scheme is quite general and could be applied to any kind of composite likelihood. An information criterion for model selection based on the first estimation method is also introduced. The methods are useful for practitioners looking for a good balance between computational complexity and statistical efficiency. The effectiveness of the methods is illustrated through examples, simulation experiments, and by analyzing a dataset on ozone measurements.


INTRODUCTION
The estimation of covariance functions poses an important problem for space and space-time random fields, especially in environmental, meteorological, or geophysical contexts. For instance, the kriging predictor (Cressie 1993) depends on knowledge of the covariance function of the random field. Since a covariance function must be positive definite, practical estimation generally requires selecting a parametric class of covariance functions and estimating the corresponding parameters.
The maximum likelihood (ML) and restricted maximum likelihood (REML) methods are generally considered the best options for estimating the covariance model parameters. Nevertheless, evaluation of the objective function under the Gaussian assumption requires solving a system of linear equations. For a dataset of n observations, the computational burden is O(n^3), making this method computationally impractical for large datasets. This fact motivates the search for estimation methods with a good balance between computational complexity and statistical efficiency.
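To make the O(n^3) bottleneck concrete, here is a minimal pure-Python sketch (not the authors' implementation) of the exact Gaussian log-likelihood for sites on a line, assuming the exponential covariance used in the later examples; the cubic cost sits entirely in the Cholesky factorization. All names (`exp_cov`, `gauss_loglik`) are illustrative.

```python
import math

def exp_cov(d, sigma2=1.0, phi=2.0):
    # exponential covariance, practical-range parameterization:
    # C(d)/C(0) is approximately 0.05 at d = phi
    return sigma2 * math.exp(-3.0 * d / phi)

def cholesky(A):
    # O(n^3) Cholesky factorization A = L L'
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def gauss_loglik(z, sites, sigma2, phi):
    # -(1/2) [ log det(Sigma) + z' Sigma^{-1} z + n log(2 pi) ]
    n = len(z)
    Sigma = [[exp_cov(abs(sites[i] - sites[j]), sigma2, phi)
              for j in range(n)] for i in range(n)]
    L = cholesky(Sigma)
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    # forward substitution L y = z; then z' Sigma^{-1} z = y' y
    y = []
    for i in range(n):
        y.append((z[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    quad = sum(v * v for v in y)
    return -0.5 * (logdet + quad + n * math.log(2.0 * math.pi))
```

Every evaluation of this objective redoes the factorization, which is what makes ML and REML impractical when n reaches the tens of thousands.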
The alternatives can be classified into two groups: methods that build an easily invertible approximation of the covariance matrix of the observations, and methods that consider a different objective function.
Significant computational gain is achieved when the sampling locations form a regular lattice. In this case, the spectral representation of the Gaussian process allows us to build a computationally tractable approximation of the log-likelihood (Whittle 1954). Rue and Tjelmeland (2002) approximated the inverse of the covariance matrix to be the precision matrix of a Gaussian Markov random field wrapped on a torus.
For irregularly spaced data, Fuentes (2007) proposed averaging the process over grid cells, obtaining an approximation to the likelihood on a lattice structure. Another idea (Cressie and Johannesson 2008; Stein 2008) is to consider a low-rank structure for the covariance matrix. This allows us to calculate the inverse and the determinant of a large covariance matrix by inverting, and calculating the determinant of, a matrix of lower dimension.
Since inversion is less demanding when using algorithms for sparse matrices, part of the literature has focused on methods yielding sparse matrices. In the tapering approach, suggested by Kaufman, Schervish, and Nychka (2008), sparse matrices are obtained through the Schur product of the covariance matrix with a matrix derived from a compactly supported correlation function.
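As a rough sketch of how tapering produces sparsity, consider the Schur (elementwise) product with a hypothetical Wendland-type taper; the taper form and function names below are illustrative, not the specific choices of Kaufman et al. (2008).

```python
def wendland_taper(d, d_taper):
    # Wendland-type compactly supported correlation: exactly zero beyond d_taper
    t = d / d_taper
    return (1.0 - t) ** 4 * (4.0 * t + 1.0) if t < 1.0 else 0.0

def tapered_cov(cov, dist, d_taper):
    # Schur (elementwise) product of the covariance matrix with the taper matrix;
    # entries for pairs farther apart than d_taper become exact zeros (sparsity)
    n = len(cov)
    return [[cov[i][j] * wendland_taper(dist[i][j], d_taper)
             for j in range(n)] for i in range(n)]
```

Pairs separated by more than `d_taper` contribute exact zeros, so sparse-matrix Cholesky algorithms can then be applied to the tapered matrix.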
Other methods use a different objective function whose optimization leads to estimators that keep satisfactory statistical properties. The composite likelihood (CL; Lindsay 1988) denotes a general class of likelihood-based approximations, namely pseudo-likelihoods built from the likelihoods of marginal or conditional events. In general, a loss of statistical efficiency is expected from CL estimation with respect to likelihood methods, but its computational tractability makes CL methods appropriate when working with large datasets. Known examples can be found in Vecchia (1988), Stein et al. (2004), Stein (2005), and Caragea and Smith (2006).
A particular form of CL based on differences was introduced by Curriero and Lele (1999) in a spatial framework. This kind of CL is the natural approximation of REML estimation, since both are based on the distribution of contrasts (differences of observations) and involve the covariance parameters only.
In this article, we modify and extend this approach to a spatiotemporal context. More precisely, we propose two estimation methods. The former relies on the maximization of a weighted version of the CL function where we use cutoff weights based on the distance in space and in time. We prove that the associated estimator is consistent and asymptotically Gaussian under increasing domain asymptotics (Cressie 1993). The latter is based on the solution of a weighted CL equation. The weights are obtained by minimizing an upper bound for the asymptotic variance of the estimator. The weighting scheme is completely general and can be applied to a nonspatial framework.
We then address an important issue, namely how to select an appropriate space-time model by taking into account the trade-off between goodness-of-fit and model complexity. The methods available in the literature (see, for instance, Hoeting et al. 2006) use information criteria such as the Akaike information criterion and the Bayesian information criterion, both based on the evaluation of the likelihood. We propose an information criterion for covariance model selection based on the weighted composite likelihood (WCL) function that exploits recent advances in model selection criteria (Varin and Vidoni 2005).
Moreover, a distinctive feature of the methods is that they are less reliant on strong distributional assumptions of the random fields.
The remainder of the article is organized as follows. In Section 2, we introduce our proposals for space and space-time covariance estimation while in Section 3 we describe the asymptotic properties of the WCL estimation method. The information criteria based on WCL is proposed in Section 4. In Section 5, we show through simulation experiments that our methods represent a satisfactory approximation to REML estimation. As a real data example, we apply our methods on a dataset of ozone hourly measurements across the eastern United States (Fuentes 2003). Finally, in Section 6, we give some conclusions.

CL ESTIMATION
Let {Z(s, t) : s ∈ R^d, t ∈ R} be a real-valued space-time random field with finite and constant variance.
The high order of complexity of space-time interactions calls for simplifying assumptions, such as intrinsic or weak stationarity, which have implications on the existence of the moments of the random field. A random field is weakly stationary if E[Z(s, t)] = µ for all (s, t), and the covariance between Z(s, t) and Z(s′, t′) depends only on h = s − s′ ∈ R^d and u = t − t′ ∈ R, the spatial and temporal separation vectors, namely cov[Z(s, t), Z(s′, t′)] = C(h, u). It is intrinsically stationary if E[Z(s, t)] = µ and the variance of the difference between Z(s, t) and Z(s′, t′) depends only on h and u, namely var[Z(s, t) − Z(s′, t′)] = 2γ(h, u). The function 2γ is called the variogram and, under weak stationarity, 2γ(h, u) = 2(C(0, 0) − C(h, u)).

In the sequel, we consider weakly stationary random fields with a covariance function C(h, u; θ) known up to the parameter vector θ ∈ R^p, which must be estimated on the basis of a finite number of n observations Z = (Z(s_1, t_1), . . . , Z(s_n, t_n))′.

Curriero and Lele (1999) proposed a special case of the CL estimator (Lindsay 1988) based on the differences between the observations. The reason for considering such differences is that, as in the REML estimation method, they allow us to remove the mean, which is a nuisance parameter. We follow that article and first consider all possible space-time differences

U_ij = Z(s_i, t_i) − Z(s_j, t_j), j > i.

Assuming independence between the differences, we get the composite (log) likelihood function

cl(θ) = Σ_{i=1}^{n} Σ_{j>i} l_ij(θ),    (1)

where l_ij(θ) is the log-likelihood of U_ij. The maximum CL estimator θ̂_cl locates the maximum of (1). If we assume that the differences are Gaussian distributed with zero mean, that is, U_ij ~ N(0, 2γ_ij(θ)), we get

l_ij(θ) = −(1/2) log γ_ij(θ) − U_ij^2 / (4 γ_ij(θ)) + const,    (2)

where γ_ij(θ) = γ(s_i − s_j, t_i − t_j; θ). A clear advantage is that evaluation of (2) does not require any matrix inversion, and the order of computation of the CL estimator is O(n^2). By construction, the associated composite score is unbiased, that is, E_θ[cl^(1)(θ)] = 0. Here, f^(1) denotes the vector of partial derivatives of a function f with respect to θ.
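The difference-based CL of (1)-(2) can be sketched in a few lines of Python, assuming one spatial dimension and the exponential semivariogram of the later examples, γ(h) = σ²(1 − exp(−3h/φ)) under the practical-range parameterization (an assumption of this sketch; names are illustrative). Note that no matrix is ever inverted; the double loop gives the O(n²) cost directly.

```python
import math

def semivariogram(h, sigma2, phi):
    # exponential model: gamma(h) = sigma^2 (1 - exp(-3 h / phi))
    return sigma2 * (1.0 - math.exp(-3.0 * h / phi))

def composite_loglik(z, sites, sigma2, phi):
    # cl(theta) = sum over pairs i<j of l_ij(theta), with U_ij ~ N(0, 2 gamma_ij);
    # additive constants are dropped, no matrix inversion is needed
    cl = 0.0
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):
            g = semivariogram(abs(sites[i] - sites[j]), sigma2, phi)
            u = z[i] - z[j]
            cl += -0.5 * math.log(g) - u * u / (4.0 * g)
    return cl
```

Maximizing `composite_loglik` over `(sigma2, phi)` gives the (unweighted) maximum CL estimate in this toy setting.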
Actually, the previous result does not depend on distributional assumptions about U_ij. Since the CL method results in an unbiased estimating function, in principle it is possible to improve the efficiency of the estimator by exploiting the theory of optimal estimating equations (see, e.g., Andersen 2004; Kuk 2007).
If we stack all the individual scores l_ij^(1)(θ), j > i, into a vector S(θ), then the optimal estimating function is (Heyde 1997)

E[S^(1)(θ)]′ cov[S(θ)]^{-1} S(θ),    (3)

provided cov[S(θ)] is nonsingular. Simple algebra shows that an entry of the matrix cov[S(θ)] depends on cor[U_ij^2, U_lk^2], whose expression, for a stationary zero-mean Gaussian random field, is

cor[U_ij^2, U_lk^2] = cov[U_ij, U_lk]^2 / (4 γ_ij γ_lk).    (4)

However, evaluation of cov[S(θ)] is a key issue because of its dimension, making it hard to improve the efficiency of the estimator in this way.

A simpler solution can be obtained by downweighting those pairs of observations that are far apart in space by a weight w_ij ≥ 0, obtaining the weighted composite (log) likelihood (WCL)

wcl(θ) = Σ_{i=1}^{n} Σ_{j>i} w_ij l_ij(θ).    (5)

There are many possible choices for the weights. The simplest case is w_ij = 1 if ‖s_i − s_j‖ ≤ d_S and |t_i − t_j| ≤ d_T, and 0 otherwise, for a fixed spatiotemporal lag d = (d_S, d_T)′. This choice has evident computational advantages, but can also improve the statistical efficiency. Let us illustrate this fact in the spatial case. We first introduce the Godambe information G_n(θ) associated with the WCL in Equation (5):

G_n(θ) = H_n(θ) J_n(θ)^{-1} H_n(θ),    (6)

where

H_n(θ) = E[−wcl^(2)(θ)]    (7)

and

J_n(θ) = var[wcl^(1)(θ)].    (8)

Here, f^(2) denotes the Hessian matrix of the function f with respect to θ. The inverse of G_n(θ) is an approximation of the asymptotic variance of the WCL estimator (see Section 3). Note that the Godambe information depends on d by means of w_ij, and in the sequel we shall stress this through the abuse of notation G_n(θ; d).
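The cutoff-weight scheme amounts to skipping distant pairs altogether, which is where the computational saving comes from. A purely spatial sketch (illustrative names, same assumed exponential semivariogram as before, one spatial dimension):

```python
import math

def semivar(h, sigma2, phi):
    # exponential semivariogram under the practical-range parameterization
    return sigma2 * (1.0 - math.exp(-3.0 * h / phi))

def wcl(z, sites, sigma2, phi, d_s):
    # weighted composite log-likelihood with cutoff weights:
    # w_ij = 1 if |s_i - s_j| <= d_s, else 0 -- distant pairs are skipped entirely
    total, used = 0.0, 0
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):
            h = abs(sites[i] - sites[j])
            if h > d_s:
                continue
            g = semivar(h, sigma2, phi)
            u = z[i] - z[j]
            total += -0.5 * math.log(g) - u * u / (4.0 * g)
            used += 1
    return total, used
```

Returning the pair count `used` makes the computational effect of shrinking `d_s` visible: the number of retained pairs, and hence the cost per evaluation, drops well below n(n−1)/2.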
For illustration purposes, we consider two examples of covariance models for a spatial random field observed at 64 points located over an 8 × 8 regular grid {0, 1, . . . , 7}^2: 1. the exponential model (9); 2. the wave model (10). The models are parameterized in terms of the practical range, that is, C(h)/C(0) ≈ 0.05 when ‖h‖ = φ. We use and compare these two models because the former is an example of a correlation function attaining only positive values and having a monotonic decay, while the latter can attain positive or negative values, with a decay that is not monotonic.
We consider σ^2 known and we estimate φ. In Figure 1, we show the asymptotic relative efficiency (ARE) of the estimator derived from (5) with respect to the REML estimator for different values of the distance d_S.
Note that, as expected, the efficiency decreases when the range φ increases; moreover, for the exponential model, the efficiency decreases as the distance d_S increases. In practice, if we use all the differences we may lose efficiency, since too many redundant pairs of observations can dilute the information contained in pairs of nearby observations. However, the same comments do not apply to the wave model: in that case, we need to consider the whole information in the pairs. Note also that for φ = 8 the ARE is very low.
The example suggests that choosing d in a suitable way under specific models can significantly improve the statistical efficiency of the estimate. Moreover, our findings agree with the results of Davis and Yau (2010) for time series models. These authors found that constructing the CL only from adjacent pairs can be preferable, in terms of efficiency, to the "full" CL.
Therefore, using a scalar equivalence result (Heyde 1997), a sensible strategy for improving the efficiency of the CL estimator is to seek the "optimal" d* such that

d* = argmin_{d ∈ D} tr(G_n^{-1}(θ; d)),    (11)

where D is a finite set of space-time lags. If the data are observed on a regular grid in space and time, the definition of D is simple; in general, an inspection of the variogram cloud can help with the choice of the lags in D.
The lag d controls both the statistical efficiency of the estimation and the computational efficiency. As highlighted in the examples, in general there is no trade-off between statistical and computational efficiency. In this sense, the method differs from other estimation methods for covariance functions (Stein, Chi, and Welty 2004; Kaufman et al. 2008).
Summarizing, evaluation of the WCL estimate involves three steps: 1. Get a preliminary and consistent estimate θ̃ of θ. In our implementation, we have used the weighted least squares estimates (Cressie 1993, p. 91) based on empirical estimation of the variogram function. This method is one of the most widely used estimation methods for space-time data (see, e.g., Cressie and Huang 1999; Gneiting, Genton, and Guttorp 2007); 2. Minimize tr(G_n^{-1}(θ̃; d)) with respect to d in order to get d*. An estimate of G_n is computed using subsampling techniques, as explained in Section 3.4; 3. Find the maximum of wcl(θ).
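Step 1 above can be sketched as a method-of-moments (binned) semivariogram followed by a weighted least-squares grid search for the practical range φ. The weights here are taken proportional to the pair counts only, a simplification of Cressie's WLS weights; all names are illustrative and one spatial dimension is assumed.

```python
import math

def empirical_semivariogram(z, sites, bins):
    # method-of-moments estimator: gamma_hat(b) = mean of (z_i - z_j)^2 / 2
    # over the pairs whose distance falls in bin b = [lo, hi)
    sums = [0.0] * len(bins)
    counts = [0] * len(bins)
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):
            h = abs(sites[i] - sites[j])
            for b, (lo, hi) in enumerate(bins):
                if lo <= h < hi:
                    sums[b] += 0.5 * (z[i] - z[j]) ** 2
                    counts[b] += 1
                    break
    return [(s / c if c else float("nan"), c) for s, c in zip(sums, counts)]

def wls_fit_phi(gamma_hat, h_mid, sigma2, grid):
    # weighted least squares fit of the practical range phi over a grid of
    # candidate values; weights proportional to the pair counts (simplified)
    best, best_obj = None, float("inf")
    for phi in grid:
        obj = sum(c * (g - sigma2 * (1.0 - math.exp(-3.0 * h / phi))) ** 2
                  for (g, c), h in zip(gamma_hat, h_mid) if c)
        if obj < best_obj:
            best, best_obj = phi, obj
    return best
```

The resulting preliminary estimate θ̃ is then plugged into step 2 before the final WCL maximization of step 3.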
The 0/1 weights can be a simple and effective solution, but we can also use smoother weights. Formula (4) also shows the need for adaptive weights, that is, weights depending on the covariance model. However, the evaluation of cov[S(θ)] remains the point at issue.
A possible strategy for circumventing this difficulty is the following. Assume that θ is scalar and consider the estimating function

w′S(θ) = Σ_{i=1}^{n} Σ_{j>i} w_ij l_ij^(1)(θ),    (12)

where w is the vector of the weights w_ij. The asymptotic variance of the estimator associated with this estimating function is w′cov[S(θ)]w / (w′h(θ))^2, where h(θ) is the vector with entries E[−l_ij^(2)(θ)]. This variance satisfies the bounds

λ_min(θ) B(w) ≤ w′cov[S(θ)]w / (w′h(θ))^2 ≤ λ_max(θ) B(w),    (13)

where λ_min(θ) and λ_max(θ) are the minimum and maximum eigenvalues of cov[S(θ)], respectively, and B(w) = w′w / (w′h(θ))^2. We can improve the upper bound by minimizing the function B(w). It is easy to show that the minimum of B(w) is achieved when w = c h(θ), c ≠ 0, and we put c = 1 without loss of generality. The weighted version of the CL with w = h(θ) can assure an effective improvement in efficiency with respect to the unweighted one (1) under a suitable condition on the eigenvalues of cov[S(θ)]; a less restrictive condition can also be given. This discussion suggests that we can use a weighted composite score (WCS)

Σ_{i=1}^{n} Σ_{j>i} w_ij(θ) l_ij^(1)(θ), where w_ij(θ) = E[−l_ij^(2)(θ)],    (14)

to build an estimator that can improve the statistical efficiency.
Going back to our previous example, Figure 1 shows the gain in efficiency using a WCS. Note that we have an effective improvement only for the exponential model; when we consider the wave model, the simple 0/1 weight scheme outperforms the adaptive one. The extension of the previous argument to the multiparameter case is overly complicated. A simple solution comes from using different weights for each composite score component, getting a simplified version of the function defined in Equation (14) and leading to the WCS with components

Σ_{i=1}^{n} Σ_{j>i} w_ij(θ_k) ∂l_ij(θ)/∂θ_k, where w_ij(θ_k) = E[−∂^2 l_ij(θ)/∂θ_k^2], k = 1, . . . , p.    (15)

The weight function depends on the parameters to estimate. For instance, if we consider only the variance parameter, then from (7) we obtain w_ij(σ^2) = E[−l_ij^(2)(σ^2)] = 1/(2σ^4). In this case, the WCS method does not modify the solution with respect to the unweighted version, since the weight function is constant.
If we consider a range parameter of an isotropic covariance function that is completely monotone on the positive real line, then we can prove that the weight function is monotonically decreasing with the distance. In this case, pairs of observations that are far apart in space are smoothly downweighted.
For the multiparameter case, our proposal is shown to be effective through some examples. Figure 2 compares the ARE when we estimate the range and variance parameters in models (9) and (10). As an overall summary, we use the square root of the ratio of the determinants of the asymptotic variance matrices. Again, the gain in efficiency is clear for the exponential model. Note that, since the score component relative to the variance is unweighted, the gain is mainly due to the improvement in estimating the range parameter. Summing up, WCL (5) and WCS (15) are two possible ways of estimating covariance models. If efficiency is not a primary concern, the WCL allows us to build an inference method based on a contrast function, which leads to simpler asymptotic results and to the derivation of a model identification criterion (see Sections 3 and 4). Moreover, WCL leads to optimization algorithms that are faster, by setting the maximum number of pairs allowed for each site, and that do not exhibit the multiple-solution problems that can arise when solving estimating equations (Heyde and Morton 1998).

ASYMPTOTIC PROPERTIES

Setup
Two types of asymptotic frameworks are considered in spatial statistics. In the increasing domain framework, the domain of observation expands while the distance between any pair of locations remains bounded away from zero. Instead, fixed domain asymptotics deals with a fixed region within which the number of sampling locations increases.
Note that the asymptotic behavior of spatial covariance parameter estimators can be quite different under the two frameworks (Stein 1999;Zhang 2004). For a comparison of these two frameworks, see Zhang and Zimmerman (2005). We focus on increasing domain asymptotics and we use the same setup as in Li et al. (2008).
In order to present the results of this section clearly, we need to adopt some abuse of notation, to be considered independent of the notation adopted throughout the remainder of the article. We suppose that the sampling region R_n is a region in R^{d+1} and we factorize it as R_n = I_n × F, where I_n ⊂ R^{d_1}, F ⊂ R^{d_2}, and d_1 + d_2 = d + 1. This notation reflects two space-time settings: (1) if d_1 = 1 and d_2 = d, we are considering a domain increasing in time, I_n, with the spatial domain, F, fixed;
(2) if d_1 = d and d_2 = 1, we are considering an increasing spatial domain with the temporal one fixed.
The set I_n = A_n ∩ Z^{d_1} is the part of the integer grid Z^{d_1} that lies inside a sampling region A_n obtained by "inflating" a prototype set A_0 by the scaling factor λ_n, that is, A_n = λ_n A_0. Here, A_0 ⊂ (−1/2, 1/2]^{d_1} is an open connected neighborhood of the origin and {λ_n}_{n=0}^∞ is a sequence of positive real numbers increasing to ∞. Setting (1) (a small number of locations and a large number of measurements in time) reflects a common situation in environmental studies, and this is the case of our real data example. Setting (2) is useful in a purely spatial setting (i.e., |F| = 1) when the sites are located on a regular lattice.
Next, we suppose that the set of data sites is given by R_n = {s_1, . . . , s_n} ⊂ R_n. When considering the 0/1 weights in (5), it is not necessary to include all distinct pairs of observations. We therefore introduce a finite set M = {h_1, . . . , h_q}, h_i ∈ R^{d+1}, not containing the origin, which determines the pairs of observations contributing to the product for each considered site s_i. The set M defines the neighborhood of s_i, and we denote by R̊_n = {s ∈ R_n : s + h ∈ R_n, ∀h ∈ M} the M-interior of R_n and by ∂R_n = R_n \ R̊_n the M-boundary of R_n. Note that the size |R_n| of the sampling region R_n is O(λ_n^{d_1}) and |R̊_n|/|R_n| → 1 as n → ∞.
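On a grid, the M-interior can be computed directly by checking, for each site, that all its M-shifted neighbors are also data sites; a small sketch for integer sites in Z^2, with illustrative names:

```python
def m_interior(sites, M):
    # the M-interior: sites s such that s + h is also a data site
    # for every lag h in the neighborhood set M
    S = set(sites)
    return [s for s in sites
            if all((s[0] + h[0], s[1] + h[1]) in S for h in M)]
```

For a 3 x 3 grid with M = {(1, 0), (0, 1)}, the M-interior is the lower-left 2 x 2 subgrid and the M-boundary consists of the remaining five sites, illustrating that the boundary becomes negligible as the region grows.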

Instead of considering the CL

cl_n(θ) = Σ_{s ∈ R_n} Σ_{h ∈ M : s+h ∈ R_n} l_{s,h}(θ),    (16)

where l_{s,h}(θ) denotes the log-likelihood of the difference Z(s) − Z(s + h), we consider a modified version of (16), namely,

Q_n(θ) = Σ_{s ∈ R̊_n} Σ_{h ∈ M} l_{s,h}(θ),    (17)

which differs from (16) by an edge effect and allows for interpreting (17) as a sum of identically distributed random variables, simplifying the proofs reported in the Appendices.

Consistency
In order to show consistency and asymptotic normality of the composite log-likelihood estimator, we use standard results on minimum contrast estimators (Guyon 1995, p. 119). We write the composite log-likelihood (17) as a contrast process, and the CL estimator is the value θ̂_n minimizing the associated contrast, that is,

θ̂_n = argmin_{θ ∈ Θ} {−Q_n(θ)}.

Proposition 1. Assume that:
(C1) Z is a stationary and ergodic random field with finite fourth-order moments;
(C2) Θ is a compact subset of R^p and the true value θ_0 belongs to Θ̊, the interior of Θ;
(C3) for each fixed h ∈ M, the mapping θ → γ(h; θ) has continuous partial derivatives with respect to θ, and inf_{θ ∈ Θ} γ(h; θ) > 0;
(C4) the function K(θ) = Σ_{h ∈ M} {log γ(h; θ) + γ(h; θ_0)/γ(h; θ)} has a unique global minimum over Θ at θ_0.
Then θ̂_n → θ_0, as n → ∞, in P_{θ_0}-probability.
Remark. (C4) is an identifiability condition. Under Gaussianity, for each h ∈ M, the function θ → log γ(h; θ) + γ(h; θ_0)/γ(h; θ) has a global minimum at θ_0 (essentially because it is a Kullback-Leibler divergence), but in the multidimensional case (p > 1), θ_0 fails, in general, to be the unique minimizer. A necessary condition for uniqueness is that p ≤ q, where q = |M|.

Asymptotic Normality
For the asymptotic normality, we need to quantify the strength of dependence in the random field. The strong mixing coefficient of a random field is defined, for a > 0 and b > 0, as

α(u; a, b) = sup {|P(A ∩ B) − P(A)P(B)| : A ∈ F(E_1), B ∈ F(E_2), |E_1| ≤ a, |E_2| ≤ b, d(E_1, E_2) ≥ u},

where the supremum is taken over rectangles E_1, E_2 in R^{d+1}, F(E) denotes the σ-algebra generated by the random variables {Z(s), s ∈ E}, and d(E_1, E_2) is the distance between the sets E_1 and E_2, defined as d(E_1, E_2) = inf_{x ∈ E_1, y ∈ E_2} sup{|x_j − y_j|, j = 1, . . . , d + 1}. With this notation and under the stated conditions, we are now ready to state our result about the asymptotic distribution of the CL estimator.
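The distance d(E_1, E_2) used above is a Chebyshev-type distance between sets; for finite sets of points it can be sketched as follows (illustrative name):

```python
def rect_distance(E1, E2):
    # d(E1, E2) = inf over x in E1, y in E2 of sup_j |x_j - y_j|:
    # the smallest Chebyshev (max-coordinate) distance between the two sets
    return min(max(abs(a - b) for a, b in zip(x, y))
               for x in E1 for y in E2)
```

For continuous rectangles the infimum is over all points of the sets, but the finite-set version above conveys the same geometry.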

Remarks:
1. Condition (N1) is satisfied if Z is a stationary Gaussian random field whose correlation function decays at a polynomial rate faster than d_1 and whose spectral density is bounded below, thanks to Corollary 2 of Doukhan (1994, p. 59). 2. The finite-sample approximation to the asymptotic variance is the inverse of the Godambe information associated with the estimating function Q_n^(1)(θ) and evaluated at θ_0. We can also prove that the edge effects are negligible, so that avar(θ̂_n) ≈ H_n^{-1}(θ_0) J_n(θ_0) H_n^{-1}(θ_0), where H_n(θ) and J_n(θ) are defined by (7) and (8), respectively.

Standard Error Evaluation
The evaluation of the standard errors requires consistent estimation of the matrices H_n(θ_0) and J_n(θ_0). However, the evaluations of the plug-in estimates H_n(θ̂_n) and J_n(θ̂_n) are of order O(n^2) and O(n^4), respectively; the latter becomes computationally unfeasible for large datasets. Note that asymptotic normality entails that W_n^{-1} J_n(θ_0) → J(θ_0) as n → ∞, where J(θ_0) is a positive definite matrix (see the Appendix for its definition) and W_n = Σ_{i=1}^{n} Σ_{j>i} w_ij. Our strategy here is to use subsampling techniques to obtain an estimate J̃_n(θ̂_n) of J(θ_0) and then estimate J_n(θ_0) by W_n J̃_n(θ̂_n).
For S_1, . . . , S_{m_n}, m_n overlapping subsets of {s_1, . . . , s_n}, Heagerty and Lumley (2000) proposed as an estimator of J(θ) the expression

J̃_n(θ) = m_n^{-1} Σ_{k=1}^{m_n} (W_n^{(k)})^{-1} s_k(θ) s_k(θ)′,

where s_k(θ) is the weighted composite score computed on the subset S_k and W_n^{(k)} = Σ_{(i,j) ∈ S_k} w_ij. Finally, the asymptotic variance of θ̂_n can then be estimated using the subsampling approximation

avar(θ̂_n) ≈ H_n^{-1}(θ̂_n) W_n J̃_n(θ̂_n) H_n^{-1}(θ̂_n).

In general, the choice of the optimal window size (in the sense of minimizing the mean square error (MSE) of the estimator) is not an easy task, since it depends on several aspects, such as the dimension of the lattice and the underlying dependence structure.
Assuming that the data are observed in the region R_n defined in Section 3.1, under an increasing-in-time domain the overlapping subsets are S_i = I_n^{(i)} × F, where the I_n^{(i)} are overlapping temporal subwindows. The size |I_n^{(i)}| of the ith subwindow can be determined following Carlstein (1986), who assumes that the statistic of interest is the sample mean and that the temporal correlation follows an AR(1) process with parameter β. An estimate of β is given by β̂ = Ĉ(0, 1)/Ĉ(0, 0). Li, Genton, and Sherman (2008) used this procedure to estimate the covariance matrix for testing some properties of space-time covariance functions.
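The AR(1) plug-in β̂ = Ĉ(0, 1)/Ĉ(0, 0) is simply the empirical lag-one autocorrelation of the temporal series at a site; a sketch with an illustrative name (Carlstein's window-size formula itself is not reproduced here):

```python
def beta_hat(series):
    # AR(1) coefficient estimate beta = C(0,1) / C(0,0): the empirical lag-one
    # autocovariance divided by the lag-zero autocovariance of the series
    n = len(series)
    m = sum(series) / n
    c0 = sum((x - m) ** 2 for x in series) / n
    c1 = sum((series[t] - m) * (series[t + 1] - m) for t in range(n - 1)) / n
    return c1 / c0
```

The resulting β̂ is then plugged into Carlstein's rule to pick the temporal subwindow length used in the subsampling estimator of J.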
Under an increasing spatial domain, for a set of overlapping spatial subwindows on a regular lattice (see, e.g., Sherman 1996; Lee and Lahiri 2002; Li et al. 2008), the optimal size of the subwindow is |S_n| = C n^{d/(d+2)}. The constant C is difficult to specify, and in this context a careful sensitivity analysis is required to establish its value.

A MODEL SELECTION CRITERION BASED ON THE WEIGHTED CL ESTIMATOR
Model selection is a major problem in spatial and space-time statistics, and a few criteria have been proposed in the literature (see, e.g., Huang and Chen 2007). Hoeting et al. (2006) proposed model selection criteria based on AIC and BIC in the spatial and space-time settings. Their proposals rely on the evaluation of the likelihood of a Gaussian random field. Unfortunately, in the case of massive datasets, selection criteria based on the full likelihood fail because of their computational burden. This drawback can be overcome by using the CL framework and recasting the approach proposed by Varin and Vidoni (2005).
We consider a parametric family of covariance functions F = {C(·; θ), θ ∈ Θ}, with Θ a compact subset of R^p. This family may or may not contain the true covariance function C(·; θ_0). Thus, we face two possible sources of misspecification: (a) we may choose a parametric class of covariance functions that does not contain the true covariance function; and (b) we are using a likelihood approximation, a misspecification that is endemic to the proposed methodology, which constitutes an approximation to the full likelihood.
Our goal is to compare different models in terms of their predictive ability at forecasting a new set of data Z̃. Varin and Vidoni (2005) suggested using the composite Kullback-Leibler divergence, defined as the linear combination of the Kullback-Leibler divergences associated with each component of the CL. In our setup, the composite Kullback-Leibler divergence of a model, identified with θ, with respect to the true distribution is given by

I(θ_0, θ) = Σ_{i=1}^{n} Σ_{j>i} w_ij E_{θ_0}[l_ij(θ_0) − l_ij(θ)].

Model selection can be performed by minimizing the expected composite Kullback-Leibler divergence of the model estimated by WCL relative to the true distribution, namely,

E_{θ_0}[I(θ_0, θ̂_n)],

where the inner expectation defining I is taken regarding θ̂_n as fixed.
A natural estimator of I(θ_0, θ) comes from the maximized composite log-likelihood wcl(θ̂_n), which is a biased estimator; the bias is of the order of the trace of J_n(θ*) H_n(θ*)^{-1}, where θ* ∈ Θ̊ is the pseudo-true value, that is, the value minimizing the Kullback-Leibler divergence of the models in F with respect to the true distribution. Following Varin and Vidoni (2005), the CL information criterion selects the model maximizing the composite likelihood information criterion (CLIC)

CLIC = wcl(θ̂_n) − tr(Ĵ_n Ĥ_n^{-1}),

where the routine for the estimation of J_n and H_n has been illustrated previously. The effectiveness of this criterion will be illustrated in the next section.
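Given the estimates of J and H from Section 3.4, computing the criterion is immediate. A 2 x 2 sketch of the penalty term and the criterion (illustrative names; the closed-form inverse is used only to keep the example self-contained):

```python
def trace_JHinv_2x2(J, H):
    # tr(J H^{-1}) for 2x2 matrices, via the closed-form inverse of H
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    Hinv = [[H[1][1] / det, -H[0][1] / det],
            [-H[1][0] / det, H[0][0] / det]]
    return sum(J[i][k] * Hinv[k][i] for i in range(2) for k in range(2))

def clic(wcl_max, J, H):
    # CLIC = wcl(theta_hat) - tr(J H^{-1}); the model with the largest CLIC wins
    return wcl_max - trace_JHinv_2x2(J, H)
```

When J = H (as would happen if the composite likelihood were a genuine likelihood), the penalty reduces to the number of parameters, mirroring the AIC penalty.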

Simulated Data
In this section, we compare the statistical and computational efficiency of the CL, WCL, and WCS estimation methods with respect to the classical REML estimation for two models, in space and in space-time, with different numbers of observations. In both cases, the number of observations was chosen so as to keep the evaluation of the restricted likelihood feasible. The estimates were then compared in terms of the MSE, evaluated by a Monte Carlo method: we simulated 1,000 replications from a Gaussian random field with a given covariance function.
All calculations were carried out on a 3.2 GHz dual processor with 16 GB of memory and the estimation procedures were implemented coupling R functions and C routines. Some of these routines are collected in the R package BLINDED.
We first considered a spatial example with the exponential covariance function (9) and parameters σ^2 = 1 and φ = 2, 4, 6. We assumed the random field to be observed over different spatial grids, R_n = {−n + 1, . . . , n}^2, n = 5, 8, 12. This setting reflects scenario (2) in Section 3. Table 1 shows the relative statistical efficiency of the CL, WCL, and WCS estimates with respect to the REML ones. On general grounds, the most critical part is estimating the parameter driving the spatial dependence (φ). The WCL estimates show better performance than the unweighted version. Moreover, WCS always outperforms WCL, giving efficiencies close to those of the REML method. All methods tend to get worse when the spatial dependence increases, although the loss in efficiency of WCS is slower, always keeping a good compromise.
We then used the spatial grid of the previous example but with N = 20, 25, . . . , 75, that is, 400, 625, . . . , 5,625 observations, to show an example of the practical computational performance of the WCL methods. Figure 3 reports the computational time (in seconds of real elapsed time) for evaluating the CL function and the WCL function, taking the differences whose distance is lower than half the maximum distance on the grid, and compares it with the computational time of the REML method. The computational gains are substantial and increase with the number of observations. For instance, for 900 observations, the ratio between CL and REML times is 0.1586, while the ratio between WCL and REML times is 0.0689. When considering 5,625 observations, the ratios are 0.0301 and 0.0212, respectively. We do not report the WCS time since it is the same as for the CL method.

Table 2. ARE in terms of the square root of the ratio of the determinants of the asymptotic variance matrices for the CL, WCL, and WCS methods. The estimates are the spatial and temporal range parameters, the variance, and the separability parameter of 1,000 Gaussian random fields with the Gneiting model (24), increasing the time and the space-time range.

In a further step, we performed a simulation study for the space-time covariance function (24) proposed by Gneiting (2002), with θ = (σ^2, a, c, α, β)′. Here, we set σ^2 = 1, c = 2, 3, 4, a = 5, 10, 20, and α = β = 0.5, mirroring a situation with increasing space and time correlation. In our experiment, α has been kept fixed.
We considered space-time grids increasing in time, R_n = I_n × F, with F = {−1, . . . , 2}^2 and I_n = {1, . . . , T_n}, T_n = 40, 80. This setting reflects scenario (1) in Section 3 and tries to mimic the real example presented in Section 5.2. Table 2 reports the ARE for the three methods when applied to this space-time context; for shortness, we report only the trace summary, but the crucial parameters are the spatial and temporal ranges. As in the former example, when increasing the spatial and/or temporal dependence the ARE gets worse, and WCS always outperforms the WCL and the unweighted version.
Again, parameter α has been kept fixed in models (2) and (3). Models (1) and (2) are separable models, that is, their covariance factorizes into a purely spatial and a purely temporal component, and models (2) and (3) are nested. Note that, when comparing models with the model selection criterion, the same common distance d must be used in the WCL estimation.
Estimation of J and H is performed as described in Section 3.4, and in estimating J we used overlapping subwindows in time.
The results are summarized in Table 3, reporting the number of cases in which a particular model is identified with respect to the model chosen for the simulations, that is, the "true" model.
These results point out that the CLIC is a promising identification criterion, because at least 70% of the models have been correctly identified even for a relatively small number of time observations, T_n = 100.

A Real Data Example
We consider maxima of hourly means over eight consecutive hours of ozone levels, measured at 513 stations in the United States (see Figure 4) from May 1, 1995, to October 31, 1999, giving 184 observations per site per year. The whole dataset can be obtained from www.image.ucar.edu/Data. The dataset was first analyzed by Fuentes (2003), who suggested a nonstationary spatial model for the 3-year average of the annual fourth-highest daily maximum 8-hour average ozone concentration. Gilleland and Nychka (2005) performed a spatiotemporal analysis on a subset of 72 monitoring stations in a study region centered on North Carolina. Very recently, Matsuo et al. (2011) fitted a nonstationary spatial model using data for 1997, assuming temporal independence.
We restrict our attention to the last two years (1998–1999) and set aside the last 60 days for assessing the prediction performance. Here, we do not want to carry out a new data analysis; instead, we want to offer an example with a dataset that is quite challenging for the WCL and WCS methods, because we deal with 158,004 observations. In our dataset, 8,845 values are missing, but this fact does not affect our estimation procedure.
Because of the seasonal dependence of the ozone, we standardized the data as in Gilleland and Nychka (2005) and supposed that the daily maximum 8-h ozone measurement at location s and day t, Z(s, t), can be decomposed into a large-scale spatiotemporal component µ(s, t), consisting of a site-specific intercept a(s) plus seasonal terms, and a small-scale stochastic component ε(s, t) with a variance that varies over space, namely Z(s, t) = µ(s, t) + σ(s) ε(s, t). The parameters in the seasonal component were estimated by ordinary least squares, and σ(s) was estimated from the residuals. Moreover, the large-scale components were smoothed over space using a singular value decomposition approach (see Gilleland and Nychka 2005, for more details). We modeled the small-scale component ε(s, t) as a Gaussian random field with zero mean, unit variance (i.e., σ² = 1), and correlation given by one of the following:

model 1. the double exponential model (25);
model 2. the separable model (24) with α = 0.5, β = 0;
model 3. the separable model (24) with β = 0;
model 4. the nonseparable model (24).
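The standardization step can be sketched as follows; the per-site sample mean and standard deviation below are crude stand-ins for the OLS seasonal fit and the SVD smoothing, which are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 5 stations x 100 days of made-up ozone-like values
n_s, n_t = 5, 100
z = 50.0 + 10.0 * rng.standard_normal((n_s, n_t))

# crude stand-ins for mu(s, t) and sigma(s): per-site mean and std
mu_hat = z.mean(axis=1, keepdims=True)
sigma_hat = z.std(axis=1, keepdims=True)

# small-scale residual field, with zero mean and unit variance per site
eps = (z - mu_hat) / sigma_hat
assert np.allclose(eps.mean(axis=1), 0.0)
assert np.allclose(eps.std(axis=1), 1.0)
```

The residual field eps is what the correlation models 1–4 are fitted to.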
We started from a preliminary weighted least squares (WLS) estimate (Cressie 1993), fixing a small number of relevant lags in space and in time in order to identify the distance d = (d_S, d_T) for the WCL estimate. However, because we want to select the best model on the basis of the CLIC, we used the most complex model, model 4, to identify a common distance d. According to the criterion (11), we found d = (400, 3). Moreover, practical implementation of WCS for a dataset of this size exploits the monotonic behavior of the weights of the different weighted estimating equations for the correlation parameters. The preliminary estimates of the weights, calculated again by the WLS method, are nearly zero for any pair of observations such that h > 800, |u| > 6. These pairs were therefore dropped in the estimation procedure, allowing a remarkable computational gain.
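The cutoff-based pruning of pairs can be sketched as follows; the helper is our own illustration, and the brute-force double loop would be replaced by spatial indexing for a dataset of 158,004 observations:

```python
import numpy as np

def retained_pairs(coords, times, d_s=800.0, d_t=6.0):
    """Indices of observation pairs kept in the weighted estimation:
    pairs beyond the spatial or temporal cutoff get (near-)zero weight
    and are dropped, which is where the computational gain comes from."""
    n = len(times)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            h = np.linalg.norm(coords[i] - coords[j])
            u = abs(times[i] - times[j])
            if h <= d_s and u <= d_t:
                pairs.append((i, j))
    return pairs

coords = np.array([[0.0, 0.0], [500.0, 0.0], [1200.0, 0.0]])
times = np.array([0.0, 2.0, 10.0])
# only the pair (0, 1) is within both cutoffs
assert retained_pairs(coords, times) == [(0, 1)]
```

Pruning before the optimization loop means the dropped pairs never enter the objective function at all, rather than entering with weight zero.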
The estimation results for models 1–4 are summarized in Table 4 for WLS, WCL, and WCS, where standard errors were estimated using subsampling techniques with a temporal window, as explained in Section 3.4. The estimates for the WCL and WCS methods are similar, and this fact is confirmed by Figure 5, where we compare the empirical variograms with the estimated ones for model 4.
In all cases, the estimated parametric variograms show good agreement with the empirical ones. The CLIC criterion (see the last row of Table 4) provides evidence in favor of the nonseparable covariance model. However, the estimated values for models 3 and 4 lie on the boundary of the parameter space, weakening inferential conclusions based on the standard errors and the CLIC.
Despite this remark, the predictive performances of the four models were in favor of model 4. We considered the ozone measurements during the last 60 days and calculated, for each station, one-day-ahead predictions Ẑ(s, t) using the simple kriging predictor (Cressie 1993), for a total of 60 × 513 = 30,780 predictions.
Kriging prediction is challenging for large datasets (Cressie and Johannesson 2008), and here we used as predictors the last four daily observations recorded at the 513 stations. In order to assess and rank the point forecasts, we used the root mean square error (RMSE), RMSE = {Σ_{i=1}^{513} Σ_{t=309}^{368} (Z(s_i, t) − Ẑ(s_i, t))² / 30,780}^{1/2}. Looking at the RMSE results in Table 5, the nonseparable model 4 showed the best results for both the WCL and WCS methods, with a slight advantage for the latter.
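The RMSE computation itself is straightforward; a minimal sketch with made-up numbers:

```python
import numpy as np

def rmse(z_obs, z_pred):
    """Root mean square error over all stations and forecast days,
    matching the definition used to rank the point forecasts."""
    z_obs = np.asarray(z_obs, dtype=float)
    z_pred = np.asarray(z_pred, dtype=float)
    return np.sqrt(np.mean((z_obs - z_pred) ** 2))

# toy check on a 2-station x 3-day block of made-up values
obs = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
pred = obs + 1.0  # every forecast off by one unit
assert rmse(obs, pred) == 1.0
```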

CONCLUSIONS
We proposed two methods for estimating covariance functions for datasets that are large in space or in space-time.
The first method is based on a particular instance of the CL of Curriero and Lele (1999). We gained statistical efficiency by considering a weighted version of the composite likelihood (WCL), where the weights are based on distances among the observed locations. Moreover, we dealt with the model selection task, suggesting a suitable criterion that fits in the framework of CL. The effectiveness of this criterion has been shown by simulations.
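The idea of a distance-weighted pairwise likelihood can be sketched as follows, for a zero-mean, unit-variance Gaussian field with 0/1 cutoff weights; this is our own minimal illustration, not the paper's implementation:

```python
import numpy as np

def wcl_pairwise(z, coords, rho_fn, cutoff):
    """Weighted (pairwise) composite log-likelihood: sum of bivariate
    Gaussian log-densities over pairs of observations, with 0/1 weights
    cutting off distant pairs."""
    n = len(z)
    ll = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            h = np.linalg.norm(coords[i] - coords[j])
            if h > cutoff:       # weight w_ij = 0 beyond the cutoff
                continue
            r = rho_fn(h)        # correlation at spatial lag h
            det = 1.0 - r * r
            q = (z[i] ** 2 - 2.0 * r * z[i] * z[j] + z[j] ** 2) / det
            ll += -np.log(2.0 * np.pi) - 0.5 * (np.log(det) + q)
    return ll

# sanity check: a single pair at the origin of an uncorrelated field
coords = np.array([[0.0, 0.0], [1.0, 0.0]])
z = np.array([0.0, 0.0])
ll = wcl_pairwise(z, coords, lambda h: 0.0, cutoff=2.0)
assert abs(ll + np.log(2.0 * np.pi)) < 1e-12
```

In practice, rho_fn would be a parametric correlation model and the WCL estimate would maximize wcl_pairwise over its parameters with a numerical optimizer.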
In the second method, we abandoned the construction of likelihood-type functions and resorted instead to weighted score-type functions (WCS). Here, the weights are chosen by minimizing an upper bound on the asymptotic variance of the estimates. In this regard, the proposed weighting scheme can be considered general and can even be applied in a nonspatial framework.
We have shown through examples and simulations that both methods give satisfactory results with respect to the REML estimates when we look for a reasonable trade-off between statistical and computational efficiency in dealing with large datasets. Our findings show that WCS performs better than WCL. However, WCS still requires more theoretical support, especially in multiparameter cases, and it is heavier from a computational point of view. Moreover, model selection based on an information criterion is still an open issue, even if solutions can be found in a different context (Pan 2001).
Finally, we remark that, even if our methods provide a feasible alternative for the estimation of the covariance function in large datasets, they do not tackle the prediction task; this will be a topic for future research.

APPENDIX B: PROOF OF PROPOSITION 2
We proceed by verifying conditions (H1)–(H3) of Theorem 3.4.5 in Guyon (1995).
(H1) If f has continuous partial derivatives of order 2 at θ ∈ R^p, we define f^(2)(θ) as the p × p matrix of second derivatives