Time-Resolved Sentinel-3 Vegetation Indices Via Inter-Sensor 3-D Convolutional Regression Networks

Sentinel missions provide widespread opportunities for exploiting inter-sensor synergies to improve the operational monitoring of terrestrial photosynthetic activity and canopy structural variations using vegetation indices (VI). In this context, continuous and consistent temporal data are required to rapidly detect vegetation changes across sensors. Nonetheless, the temporal limitations inherent in satellite orbits, cloud occlusions, data degradation, and many other factors may severely constrain the availability of data involving multiple satellites. In response, this letter proposes a novel deep 3-D convolutional regression network (3CRN) for temporally enhancing Sentinel-3 (S3) VI by taking advantage of inter-sensor Sentinel-2 (S2) observations. Unlike existing regression and deep learning-based methods, the proposed approach allows convolutional kernels to slide across the temporal dimension to exploit not only the higher spatial resolution of the S2 instrument but also its temporal evolution to better estimate time-resolved VI in S3. To validate the proposed approach, we built a database made of multiple day-synchronized S2 and S3 operational products from a study area in Extremadura (Spain). The conducted experimental comparison, including multiple state-of-the-art regression and deep learning models, shows the statistically significant advantages of the presented framework. The code of this work will be made available at https://github.com/rufernan/3CRN.


I. INTRODUCTION
The increasing demand for remote sensing (RS) data strongly encourages the expansion of different Earth observation (EO) missions and programs to cover current societal needs as well as future challenges [1]. In this sense, large investments are constantly made to guarantee the operational provision of high-quality RS imagery by means of different artificial satellites and constellations that seek to satisfy the technical requirements of numerous application domains [2]. As one of the most important intergovernmental organizations, the European Space Agency (ESA) aims at addressing present and future EO needs by providing continuous remotely sensed data useful for scientific purposes and operational applications that benefit citizens worldwide. To achieve this goal, the European Commission, in partnership with ESA, coordinates and manages the so-called Copernicus program and additional Earth Explorer missions that focus on different aspects of EO, such as the atmosphere, biosphere, hydrosphere, or cryosphere. Within the Copernicus program context, Sentinel-2 (S2) and Sentinel-3 (S3) share one of the most important synergies since both families of satellites make use of multispectral instruments. The S2 mission [3] comprises two satellites (S2A and S2B) carrying the MultiSpectral Instrument (MSI), which captures the Earth surface at a 10-60-m spatial resolution using 13 spectral bands (B01-B12) within the 443-2190-nm wavelength range. The S3 mission [4] also comprises two satellites (S3A and S3B) that carry, among other sensors, the Ocean and Land Color Instrument (OLCI), which provides a finer spectral resolution based on 21 bands (Oa01-Oa21) within the 390-1040-nm wavelength region but at a coarser spatial resolution of 300 m.
Despite the fact that both the MSI and OLCI instruments provide operational data related to vegetation, land, and water, their fundamental spatial-spectral differences make each mission particularly suitable for specific applications [5]. Whereas S2 is more convenient for land cover characterization tasks where spatial precision is important [6], OLCI's spectral features make S3 better at capturing global information from oceans, inland waterways, and coastal areas [7]. Nonetheless, the broad scope of these Sentinel missions still motivates the development of additional Earth Explorers focused on key scientific challenges and breakthrough technologies for advancing the understanding of Earth systems and their functional interactions.
Focusing on terrestrial vegetation, ESA is currently developing the Fluorescence Explorer (FLEX) [8], which has been designed to work synergistically with the S3 mission. FLEX consists of a single satellite that will fly in tandem with S3 to remotely measure the solar-induced chlorophyll fluorescence (SIF) emitted by plants as an accurate indicator of actual photosynthetic activity [9]. Nonetheless, detecting SIF emissions from space is very challenging since the fluorescence signal is very weak with respect to the
total acquired radiance. In response, FLEX will carry the fluorescence imaging spectrometer (FLORIS), which features an ultra-fine spectral resolution (focused on the 500-780-nm wavelength range) and a 300-m spatial resolution (like OLCI). Even though these specifications are convenient for decoupling SIF emissions from vegetation-reflected light, they are rather limited for characterizing other important aspects, such as atmospheric features. Precisely, FLEX will orbit only a few seconds ahead of one of the S3 satellites to take advantage of the OLCI instrument for supporting FLORIS products and providing additional value [10]. Certainly, the FLEX/S3 tandem mission will represent a major advance in assessing vegetation dynamics while providing widespread opportunities for exploiting inter-sensor synergies within the Copernicus program. In this regard, the growing development of vegetation products involving multiple satellites exemplifies this trend [11], [12]. Specifically, the majority of methods tackle this inter-sensor problem from a data fusion perspective, where the temporal availability of data becomes a major challenge. It is noted that, when accounting for the accurate monitoring of terrestrial vegetation, even relatively small temporal deviations may correspond to relevant vegetation changes. Therefore, continuous and consistent inter-sensor data can be essential to effectively detect rapid changes in terrestrial photosynthetic activity and canopy structures across sensors. Within the FLEX/Sentinel context, the particularly coarse temporal resolution of FLORIS (two weeks) together with the open availability of Sentinel data strongly motivate the development of time-resolved vegetation indices (VI) in S3 and the forthcoming FLEX, which can serve as SIF proxies [13]. In other words, the temporal prediction of VI can play a fundamental role in actual FLEX/S3 exploitation environments for analyzing the temporal dynamics of photosynthetic changes.
In the RS literature, different types of temporal prediction methods for VI can be found [14]. Whereas traditional temporal replacement and temporal filter methods are able to provide positive results under rather constrained scenarios (e.g., homogeneous landscapes, temporal stability, partially missing data, etc.), the higher generalization capabilities of learning-based models make this paradigm more suitable for time-resolving operational RS data. For instance, Zeng et al. presented in [15] a linear regression (LIN)-based framework for effectively reconstructing temporal information from the Moderate Resolution Imaging Spectroradiometer (MODIS). Zhao et al. [16] showed the advantages of modeling the land surface temperature of MODIS via random forest regression (RFR). Other authors opt for deep learning techniques to deal with the temporal reconstruction and product regression problems. This is the case of Zhang et al., who developed in [17] a novel convolutional neural network (CNN) able to reconstruct missing MODIS and Landsat 7 data by considering multisource spatial, temporal, or spectral information as input. Following a similar idea, Shao et al. [18] proposed a novel generative adversarial CNN for reconstructing missing RS data from multiple data sources. Aptoula and Ariman [19] also defined a CNN-based regression architecture for effectively estimating chlorophyll-a concentration from S2 data.
Despite the good results achieved by these and other related models [20], [21], standard CNNs generally exhibit limitations in exploiting temporal information since they mainly focus on the spatial-spectral dimensions, with the temporal data often stacked along the spectral axis. With the evolution of deep-learning technologies, additional models have been used for the temporal prediction of VI. Yu et al. [22] developed a deep recurrent neural network (DRNN) that exploits spatio-temporal features to predict S2 and MODIS VI based on the long short-term memory (LSTM) architecture. Although the authors of this recent work concluded that recurrent models can be considered a state-of-the-art technology in VI temporal prediction, the inherent temporal limitations of operational FLEX/Sentinel data may constrain the resulting performance in actual production environments, where discontinuous short-term data can be expected. It is noted that recurrent networks have been shown to be particularly effective for modeling long-term temporal dependencies. However, the relatively few years of available Sentinel data together with the short design lifetime of FLEX (3.5 years) motivate the investigation of other types of temporal models. In this scenario, this work presents a novel deep 3-D convolutional regression network (3CRN) specially designed for temporally enhancing S3 VI from an inter-sensor perspective. That is, our objective consists of taking advantage of Sentinel synergies to generate time-resolved OLCI VI as SIF proxies from the MSI sensor. In this way, the temporal resolution of S3 can be improved so that even small vegetation changes, which can potentially be captured by S2 thanks to its substantially higher spatial resolution, are rapidly detected.
To achieve this goal, we define in Section II the newly proposed inter-sensor regression model based on 3D-CNNs, which jointly exploits S2 spatial-spectral features together with their own temporal evolution to better estimate time-resolved VI in OLCI. In Section III, we build a database made of multi-temporal OLCI and MSI data from a study area in Extremadura (Spain). Then, we conduct an extensive experimental comparison, including multiple state-of-the-art regression and CNN-based VI prediction models, to validate the performance of the proposed approach. Finally, Section IV provides the main conclusions of this work.

II. METHODOLOGY
The proposed CNN architecture aims at mapping multi-temporal S2 image products onto their corresponding S3 VI values in order to predict inter-sensor vegetation data at unavailable timestamps in S3. Specifically, short sequences of S2 multispectral image patches around a target output time are considered as the input data, whereas pixel-wise S3 VI estimates at the target timestamp correspond to the model output. To effectively exploit short-term time series data, the proposed architecture provides a novel inter-sensor regression network based on 3D-CNNs. In more detail, 3-D convolutions are defined over the spatio-spectral and temporal domain of S2 data, so that 4-D kernels slide across the height, width, and time dimensions with the objective of finding temporal correlations between multispectral samples and allowing the extraction of dynamic characteristics not only in the spatial domain but also along the temporal dimension. It is noted that S3 VI, such as the OLCI terrestrial chlorophyll index (OTCI), may change over time periods shorter than the instrument's own temporal resolution. Thus, the multispectral evolution of S2 along relatively short time periods may also provide key information to generate better inter-sensor estimates.
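As a minimal illustration (not the authors' implementation), the following numpy sketch shows how a 4-D kernel slides across the height, width, and time axes of an (H, W, T, B) S2 volume while covering the full spectral depth at every position; the toy shapes follow the hyperparameters described below (e.g., 3 × 3 × 3 kernels, 12 S2 bands).

```python
import numpy as np

def conv3d_hwt(volume, kernels):
    """Valid 3-D convolution where each 4-D kernel slides across the
    height, width, and time axes of an (H, W, T, B) input volume,
    covering the full spectral depth B at every position."""
    H, W, T, B = volume.shape
    K, kh, kw, kt, kb = kernels.shape
    assert kb == B, "each kernel spans the whole spectrum"
    oh, ow, ot = H - kh + 1, W - kw + 1, T - kt + 1
    out = np.zeros((oh, ow, ot, K))
    for k in range(K):
        for i in range(oh):
            for j in range(ow):
                for t in range(ot):
                    patch = volume[i:i + kh, j:j + kw, t:t + kt, :]
                    out[i, j, t, k] = np.sum(patch * kernels[k])
    return out

# Toy S2-like input: 5x5 spatial patch, 5 timestamps, 12 bands,
# convolved with 32 kernels of size 3x3x3 (as in block H1).
x = np.random.rand(5, 5, 5, 12)
w = np.random.rand(32, 3, 3, 3, 12)
y = conv3d_hwt(x, w)
print(y.shape)  # (3, 3, 3, 32)
```

In practice these loops are handled by an optimized 3-D convolution layer; the sketch only makes explicit that the kernel traverses the temporal axis in addition to the spatial ones.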
Let X = {x_1, ..., x_N} be a multi-temporal collection of S2 images sequentially acquired over a particular region of interest with a fixed spatial-spectral size of (x1 × x2 × B). Let Y = {y_1, ..., y_N} be their synchronous inter-sensor S3 VI covering the same area with a lower spatial resolution of (y1 × y2), where R represents the corresponding scaling factor (i.e., x1 = y1·R and x2 = y2·R). For each jth pixel of the ith temporal sample, y_i^j ∈ R, it is possible to extract a multispectral S2 image patch x_i^j ∈ R^(P×P×B) that covers the same Earth surface extent, with P ≥ R. Additionally, a multi-temporal window of T samples centered at the ith timestamp can be considered, x_i^j(T) ∈ R^(P×P×T×B). Let Ŷ denote the whole set of S3 pixel values in Y (i.e., y_i^j) and X̂ their corresponding spatio-spectral and temporal volumes in X (i.e., x_i^j(T)). In this scenario, the proposed model approximates a function F : X̂ → Ŷ which takes as input multi-temporal S2 patches and produces as output their inter-sensor S3 VI values via 3D-CNN transformation blocks. It is noted that the considered 3-D convolutions allow the kernels to slide across the height, width, and temporal dimensions. Hence, the whole S2 spectrum is considered for uncovering spatio-temporal dynamics and generating S3 VI predictions. Since the network output is a pixel-wise prediction from limited temporal data, a sequential CNN structure is chosen to efficiently address the inter-sensor mapping, in contrast to recurrent schemes, which tend to under-fit the data when a small number of time samples is considered [23]. Taking advantage of these components, the proposed regression architecture is defined according to the following data flow stages (see Fig. 1).
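The pairing between an S3 pixel and its multi-temporal S2 volume can be sketched as below; the function name and its border handling are illustrative assumptions (pixels near the image or sequence borders would require padding in practice).

```python
import numpy as np

def extract_patch(X, i, row, col, P=5, T=5, R=3):
    """Extract the multi-temporal S2 volume x_i^j(T) paired with the S3
    VI pixel y_i^j at (row, col): a P x P spatial patch over T timestamps
    centered at timestamp i (interior pixels only in this sketch)."""
    hp, ht = P // 2, T // 2
    # Center of the S3 pixel footprint expressed in S2 pixel coordinates.
    r, c = row * R + R // 2, col * R + R // 2
    patch = X[i - ht:i + ht + 1, r - hp:r + hp + 1, c - hp:c + hp + 1, :]
    return patch.transpose(1, 2, 0, 3)  # (T, P, P, B) -> (P, P, T, B)

# Toy collection: N = 21 timestamps of 30x30 S2 pixels with B = 12 bands.
X = np.random.rand(21, 30, 30, 12)
x_patch = extract_patch(X, i=10, row=5, col=5)
print(x_patch.shape)  # (5, 5, 5, 12)
```

Each such (P × P × T × B) volume is one training input, while the scalar S3 VI value at (i, row, col) is the corresponding regression target.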
1) Head: The first part of the network consists of two building blocks (H1 and H2), each made of four sequential layers: 1) 3D-CNN; 2) batch normalization (BN); 3) rectified linear unit activation (ReLU); and 4) 3-D pooling (3D-Pool). This configuration aims at processing the input images and generating an initial low-level characterization of S2 multispectral data. In this sense, the first head block (H1) extracts local features from the original S2 image domain to produce an overcomplete characterization of elemental patterns. Then, the second head block (H2) increases the number of kernels to model a broader combination of patterns with the target of isolating the most relevant low-level features for predicting S3 VI values. In more detail, the number of filters (K) is set to 32 in H1 and 64 in H2. Besides, all 3D-CNN layers are defined with a kernel size of (3 × 3 × 3) together with a (1 × 1 × 1) stride. Finally, H1 and H2 consider 3D-Pool layers with a (2 × 2 × 1) pooling size and a (1 × 1 × 1) stride.
2) Body: The second segment of the proposed architecture is also made of two twin blocks (B1 and B2) that have the following structure: 1) 3D-CNN; 2) BN; 3) ReLU; 4) 3D-CNN; 5) BN; 6) ReLU; 7) 3D-Pool; and 8) residual connection (⊕). This block design has the objective of obtaining more complex and deeper spatio-temporal features from S2 data for progressively driving the uncovered deep embeddings toward pixel-wise S3 VI predictions. It is noted that the large spatial resolution differences between S2 and S3 may generate important ill-posed effects during the back-propagation process, since a large image patch in S2 corresponds to a single scalar value in S3. To mitigate the degradation of the uncovered features after pooling the activation maps, the proposed body block includes a final residual connection. Besides, two 3D-CNN layers are considered before each 3D-Pool operation with the goal of increasing the spatio-temporal receptive field within each block. Similar to the network head, all 3D-CNN layers are defined with a (3 × 3 × 3) kernel size and one-pixel stride, while considering K = 64. Additionally, 3D-Pool layers are defined with a (2 × 2 × 1) pooling size, a (1 × 1 × 1) stride, and zero-padding for keeping the spatial dimension of the residual feature volume.
3) Tail: The last part of the network aims at projecting the extracted deep features onto the corresponding S3 VI values to achieve the required inter-sensor mapping. The following layers are considered within this building block: 1) a fully connected (FC) layer; 2) BN; 3) ReLU; and 4) FC. On the one hand, the first FC layer has 256 neurons that globally associate all the spatio-temporal features uncovered before prediction. On the other hand, the last FC layer is connected to a single neuron which provides the estimated value of the target S3 pixel from the input S2 multi-temporal patch. Finally, the parameters of the proposed network are learned by minimizing the mean-squared error (MSE) reconstruction loss.
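The tail projection and the MSE reconstruction loss can be sketched in numpy as follows; this is a schematic forward pass with random weights (BN omitted for brevity), not the trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def tail_forward(features, w1, b1, w2, b2):
    """Sketch of the network tail: flatten the deep spatio-temporal
    features, apply FC(256) + ReLU, then FC(1) to predict a single
    S3 VI pixel value per sample."""
    z = features.reshape(features.shape[0], -1)   # flatten per sample
    h = np.maximum(z @ w1 + b1, 0.0)              # FC -> ReLU
    return h @ w2 + b2                            # FC -> scalar estimate

def mse_loss(pred, target):
    """Mean-squared-error reconstruction loss used for training."""
    return np.mean((pred - target) ** 2)

# Toy batch of 4 feature volumes with 64 channels over a 2x2x5 grid.
feats = rng.standard_normal((4, 2, 2, 5, 64))
d = 2 * 2 * 5 * 64
w1, b1 = rng.standard_normal((d, 256)) * 0.01, np.zeros(256)
w2, b2 = rng.standard_normal((256, 1)) * 0.01, np.zeros(1)
pred = tail_forward(feats, w1, b1, w2, b2)
print(pred.shape)  # (4, 1)
```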

III. EXPERIMENTS

A. Dataset
The dataset created for this work (https://github.com/rufernan/3CRN) consists of 21 coupled pairs of S2 MSI reflectance images and S3 OLCI vegetation products acquired on the same day along the year 2019. In more detail, we adopted the following three-step process: 1) study area selection; 2) operational data collection; and 3) final product generation. First, the Extremadura region was chosen for being the most forest-covered zone in Spain, with 65% of forested area. The large amount of dehesas, an agrosilvopastoral system consisting of grassland and oak forests, makes this region highly relevant from a vegetation monitoring perspective, including different sorts of natural grasslands and agroforestry land. The exact location of the selected area is a square of approximately 100 km per side bounded by the (38.740370, −5.188652) and (39.697509, −6.498977) latitude-longitude coordinates, which corresponds to the T30STJ S2 tile. Second, daily synchronous S2 and S3 data products from 2019 covering this region were selected and downloaded from the Copernicus Open Access Hub. To avoid cloud contamination, we initially filtered S2 products by considering only those images with less than 1% cloud coverage. Once the cloudless S2 products were obtained, an S3 product of the same day was selected for pairing, yielding a total of 21 coupled S2/S3 images with a maximum inter-platform sensing time difference of one hour. Then, S2 images were downloaded as MSI Level-2 bottom-of-atmosphere products and S3 images as OLCI Level-1 top-of-atmosphere products. Third, the downloaded images were processed to generate the final S2 and S3 collections (i.e., X and Y). On the one hand, S2 images were spatially re-sampled to 20 m to produce uniform data volumes of (5490 × 5490 × 12). On the other hand, S3 images were rectified via the Rayleigh correction, projected, and cropped to match the region of interest with a (366 × 366 × 21) data size.
Finally, the normalized difference vegetation index (NDVI) and OTCI were computed to build the target VI.
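As a pixel-wise sketch, NDVI follows the classical (NIR − red)/(NIR + red) ratio, while OTCI follows the MTCI-style formulation on the OLCI red-edge bands Oa10 (681.25 nm), Oa11 (708.75 nm), and Oa12 (753.75 nm); the exact band pairings used for the operational products are an assumption here.

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """NDVI = (NIR - red) / (NIR + red); for OLCI the NIR/red pair is
    commonly taken from Oa17 and Oa08 (an assumption in this sketch)."""
    return (nir - red) / (nir + red + eps)

def otci(oa10, oa11, oa12, eps=1e-9):
    """MTCI-style OTCI on the OLCI red-edge bands:
    (Oa12 - Oa11) / (Oa11 - Oa10)."""
    return (oa12 - oa11) / (oa11 - oa10 + eps)

nir = np.array([0.45, 0.50])
red = np.array([0.05, 0.10])
print(ndvi(nir, red))  # high values over dense vegetation
```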

B. Experimental Settings
To validate the performance of the proposed approach, we conduct multiple regression experiments based on the mapping between multi-temporal MSI patches and their associated OLCI VI values. Specifically, the experimental comparison includes the following models available in the RS literature: LIN [15], RFR [16], multilayer perceptron (MLP) [20], 2-D CNN (CNN2D) [20], CNN-based chlorophyll regression (CNNR) [19], DRNN [22], and 3-D CNN (CNN3D) [20]. All the methods use the configuration described in the corresponding letters, while considering analogous settings in terms of hidden units, filters, kernel sizes, and activation functions. All models have also been trained and tested using the same data partitions extracted from the created collection. For computational reasons, S2 images were downsampled so that the scaling factor with respect to S3 is R = 3. Besides, one coupled pair of S2/S3 products (out of the 21 available image pairs) was held out for generating a qualitative test output map. The rest of the data were divided into multi-temporal patches with size P = 5 and temporal window T = 5. Under this scheme, five different random partitions with balanced VI values were generated for cross-validating all the models, using 60% of the data for training, 20% for validation, and 20% for testing. All CNN-based models were trained using the ADAM optimizer with a 10^-3 learning rate with plateau decay, a batch size of 128, and 100 epochs. The experiments were carried out using Python 3.6, Keras, and Scikit-learn on an Ubuntu 16.04 x64 machine with an Intel(R) Core(TM) i7-6850K CPU, an NVIDIA GeForce GTX 1080Ti GPU, and 64 GB of RAM.
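A minimal sketch of the random 60/20/20 partitioning could look like the following; the VI-balancing step is omitted, so this only illustrates the split proportions and the five cross-validation runs.

```python
import numpy as np

def split_indices(n, seed, train=0.6, val=0.2):
    """Randomly partition n sample indices into 60/20/20
    train/validation/test subsets (VI balancing omitted here)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_tr, n_va = int(train * n), int(val * n)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

# Five random partitions, one per cross-validation run.
parts = [split_indices(1000, seed) for seed in range(5)]
tr, va, te = parts[0]
print(len(tr), len(va), len(te))  # 600 200 200
```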

C. Results
Table I presents the quantitative evaluation of the results based on the following regression metrics: root-mean-square error (RMSE) and coefficient of determination (R²). The tested models are arranged in rows, whereas the considered VI and metrics are reported in columns. In this sense, DRNN and CNN3D have shown to be the best-performing competitors since their architectures are tailored for effectively working with multi-temporal observations, with DRNN performing better on NDVI/OTCI and CNN3D on PSRI-NIR by a relatively small margin. Nonetheless, the proposed approach is able to provide consistent improvements across all the considered VI and metrics. In Table II, the corrected resampled t-test [24] reveals the statistical significance of the obtained average improvements with a significance level of α = 0.05. The qualitative results displayed in Fig. 2 also support these observations. According to the magnified details, the NDVI map estimated by the proposed model is the one that provides the most distinguishable predictions for dense vegetation areas while inferring inter-sensor estimates closer to the corresponding ground-truth values. With respect to the other methods, the presented architecture has two main advantages: 1) multi-temporal feature patterns and 2) short-term data adaptability. On the one hand, the use of spatio-temporal kernels that slide across time allows the proposed network to uncover dynamic patterns which become more meaningful than those convolutional features directly extracted from stacked S2 data. This can be noticed in the experimental comparison, where the best-performing competitors are precisely those able to exploit temporal features. On the other hand, the proposed network topology has been specially designed to avoid temporal over-fitting when accounting for the specific nature of S2 and S3 data.
In contrast to other successful temporal models, such as the LSTM used by DRNN, the presented regression framework embeds the multi-temporal data into a new convolutional dimension, which allows extracting discriminating temporal features even from short time series that may over-fit other schemes. Besides, the configuration of layers and blocks has been adapted to retain as much inter-sensor spatial information as possible, taking into account the important spatial resolution differences between the S2 and S3 instruments. All in all, these two advantages contribute to the better performance of the proposed model in time-resolving S3 VI from multi-temporal S2 imagery.
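For reference, the corrected resampled t-test of [24] (the Nadeau-Bengio correction for overlapping cross-validation splits) can be sketched as follows; the per-partition differences used below are illustrative values, not the letter's actual results.

```python
import math
import statistics

def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t-statistic over the per-partition metric
    differences between two models; the variance term is inflated by
    n_test / n_train to account for overlapping train/test splits."""
    k = len(diffs)
    mean = statistics.fmean(diffs)
    var = statistics.variance(diffs)  # sample variance, k - 1 dof
    return mean / math.sqrt((1.0 / k + n_test / n_train) * var)

# Illustrative RMSE differences (proposed minus competitor) over the
# five cross-validation partitions of a 60/20 train/test split.
diffs = [-0.010, -0.012, -0.009, -0.011, -0.013]
t = corrected_resampled_t(diffs, n_train=600, n_test=200)
# |t| is then compared against the t-distribution with k - 1 = 4
# degrees of freedom at the alpha = 0.05 significance level.
print(round(t, 3))
```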

IV. CONCLUSION
This letter proposes a novel 3D-CNN architecture (3CRN) to generate temporally enhanced S3 VI from multi-temporal S2 data. The proposed model takes advantage not only of inter-sensor spatial-spectral data but also of short-term temporal information, with the objective of finding temporal correlations that may help to uncover S3 VI as SIF proxies at unavailable timestamps. The experimental comparison, conducted over a collection of 21 coupled MSI and OLCI products, validates the performance of the presented architecture with respect to multiple traditional and CNN-based regression methods. The main conclusion arising from this work is the importance of considering multi-temporal data for generating more accurate inter-sensor predictions, and how the proposed 3D-CNN may help to mitigate important spatial differences between instruments. In the future, we plan to extend this work to other multimodal platforms.