A New Deep Generative Network for Unsupervised Remote Sensing Single-Image Super-Resolution

Super-resolution (SR) brings an excellent opportunity to improve a wide range of remote sensing applications. SR techniques are concerned with increasing the image resolution while providing finer spatial details than those captured by the original acquisition instrument. Therefore, SR techniques are particularly useful to cope with the increasing demand for remote sensing imaging applications requiring fine spatial resolution. Even though different machine learning paradigms have been successfully applied to SR, more research is required to improve the SR process without the need for external high-resolution (HR) training examples. This paper proposes a new convolutional generator model to super-resolve low-resolution (LR) remote sensing data from an unsupervised perspective. That is, the proposed generative network initially learns relationships between the LR and HR domains through several convolutional, downsampling, batch normalization, and activation layers. Then, the data are symmetrically projected to the target resolution while guaranteeing a reconstruction constraint over the LR input image. An experimental comparison is conducted using 12 different unsupervised SR methods over different test images. Our experiments reveal the potential of the proposed approach to improve the resolution of remote sensing imagery.

Remote sensing systems provide imagery that is useful to tackle new challenges and needs [1]. Nonetheless, the increasing demand for highly accurate remote sensing imaging applications, such as fine-grained classification, often requires a spatial resolution beyond that of the available acquisition instruments.

The general objective in super-resolution (SR) [11]-[14] is to improve the image resolution beyond the sensor limits, that is, to increase the number of image pixels while providing finer spatial details than those captured by the original acquisition instrument. Depending on the number of input images, it is possible to distinguish between two kinds of SR methods: single-image [15] and multi-image [16]. Whereas single-image SR techniques use a single image of the target scene to obtain the super-resolved output, multi-image SR methods require several shots of the scene simultaneously acquired at different positions. In remote sensing, the single-image approach is usually adopted because it provides a more general scheme to super-resolve the output of any kind of imaging sensor. Regarding the learning paradigm, supervised approaches rely on external high-resolution training examples, while unsupervised learning methods do not need any external data to train.

On the other hand, the CNN is a very flexible model that can be adapted to different learning schemes, such as convolutional autoencoders (AEs) [39], [40], convolutional deep belief networks (DBNs) [41], convolutional generative adversarial networks (GANs) [42], convolutional recurrent neural networks (CRNNs) [43], or fully convolutional networks (FCNs) [44], among others. In particular, we highlight the hourglass network [45], [46], whose symmetric topology is related to the convolution-deconvolution and encoder-decoder architectures. It is characterized by a first stage that pools down to a low resolution (composed of convolutional and max pooling layers) and a second stage that upsamples back to a higher resolution while combining features across multiple resolutions.
Following the hourglass approach, a new unsupervised neural network model is proposed in this work in order to super-resolve remote sensing images. The novelty of the proposed approach lies in using generative random noise to introduce a wider variety of spatial patterns, which can be promoted to a higher scale throughout the network according to a global reconstruction constraint.
Despite the relevance of generating new spatial variations when super-resolving remotely sensed data in an unsupervised manner, this is, to the best of our knowledge, the first time an unsupervised generative network model has been successfully formulated to super-resolve remote sensing imagery. Specifically, a convolutional generator network is adopted which, from a given image $X_{LR} \in \mathbb{R}^{C \times W \times H}$, generates a higher-resolution version $X_{HR} \in \mathbb{R}^{C \times t \cdot W \times t \cdot H}$ (with $W < t \cdot W$ and $H < t \cdot H$, where $t$ is the resolution factor).
In addition, the algorithm has been adapted to be efficiently executed in parallel on graphics processing units (GPUs)¹ and presents some methodological improvements to make the model more efficient and effective.
To summarize, the main contributions of this work can be highlighted as follows:
• An hourglass convolutional neural network model is developed to perform unsupervised super-resolution.
• In particular, a convolutional generator model has been implemented to super-resolve low-resolution remote sensing images.
• Starting from generative random noise, the model is able to reconstruct the image, promoting it to a higher scale according to a global reconstruction constraint.
• Experiments over three datasets, with 2 scaling factors and 12 different SR methods, reveal the competitive performance of the proposed model when super-resolving remotely sensed images.
The remainder of the paper is organized as follows.
Section II presents an overview of single-image SR methods and their limitations. Section III describes the methodology employed by the proposed convolutional generator model. Section IV validates the proposed approach by performing comparisons with different single-image SR methods. Finally, Section V concludes the paper with some remarks and hints at plausible future research lines.

A. Brief single-image SR overview
Broadly speaking, single-image SR algorithms can be categorized into three different groups [53], [54]: reconstruction-based (RE), learning-based (LE), and hybrid (HY) methods.

¹The use of high-performance computing (HPC) methods, including parallelization with accelerators such as field-programmable gate arrays (FPGAs) and GPUs [47]-[49], or distribution over clusters and clouds [50], [51], has demonstrated great utility for the classification of remote sensing images [52].

Within the RE family, representative approaches include iterative back-projection (IBP), gradient profile (GP) priors, and point spread function (PSF) deconvolution [59]. The rationale behind IBP is based on iteratively refining an initial interpolation result by minimizing the reconstruction error between the LR input image and a simulated low-resolution version of the super-resolved result. GP takes advantage of the fact that the shape of the gradient profiles tends to remain invariant across scales, so the LR gradient can be used to reconstruct the sharpness of the output image. PSF deconvolution methods tackle the upscaling problem from a deblurring point of view, that is, they initially estimate the PSF of the imaging model and then try to remove the blur of the interpolated image.
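For illustration, the IBP refinement loop described above can be sketched in a few lines of Python. The interpolation operators, step size, and iteration count below are illustrative assumptions, not the exact choices of the methods cited here; a single-band image with even dimensions is assumed.

```python
import numpy as np
from scipy.ndimage import zoom

def iterative_back_projection(x_lr, t=2, n_iter=20, step=1.0):
    """Illustrative IBP loop: refine an interpolated estimate by minimizing the
    reconstruction error between the LR input and a simulated LR version of the
    current HR estimate. x_lr is a 2D float array with even dimensions."""
    # Initial estimate: plain (cubic) interpolation of the LR input to the target scale.
    x_hr = zoom(x_lr, t, order=3)
    for _ in range(n_iter):
        # Simulate the LR image that the current HR estimate would produce.
        simulated_lr = zoom(x_hr, 1.0 / t, order=3)
        # Reconstruction error in the LR domain.
        error = x_lr - simulated_lr
        # Back-project the error to the HR grid and update the estimate.
        x_hr += step * zoom(error, t, order=3)
    return x_hr
```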
Regarding LE methods, these techniques are able to provide a more powerful SR scheme because they learn the relationships between the LR and HR domains from an external training set containing ground-truth HR images (see Fig. 2). In [66], the blur operator is estimated at the same time as the SR output is generated through an optimization process.

B. SR limitations in remote sensing
Each single-image SR methodology has been shown to be particularly effective under specific conditions [15], [54]. RE methods are able to reduce the noise as well as the blur and aliasing inherent to interpolation kernel functions. However, the lack of relevant high-frequency information in the LR input image limits their effectiveness to small magnification factors, which can be an important limitation for many of the currently operational (moderate-resolution) satellites [67].
LE-based techniques potentially overcome these drawbacks by learning the relationships between the LR and HR domains from an external training set. Nonetheless, the availability of suitable HR training examples can also be a serious constraint for many satellites. Note that ground-truth HR images are usually not available in real scenarios, which may lead to an unrepresentative training phase and a biased super-resolved result. As a result, the application of LE-based SR methods in actual ground-segment production environments is rather limited [68].
HY methods offer the advantage of not requiring any external training set to learn the LR/HR relationships, by taking advantage of the patch redundancy property across scales. However, the probability of finding patches satisfying this cross-scale redundancy decreases as the magnification factor grows. Unsupervised SR schemes therefore provide the opportunity to offer new super-resolved data products in satellite and airborne missions that use relatively inexpensive sensors, without the need for any external HR training set. Nevertheless, the number of works in the remote sensing literature dealing with the unsupervised SR problem is rather limited, and this is precisely the gap that motivates this work.
In [72], the authors propose an SR approach using a backpropagation neural network as a regression function, relying on (i) spectral unmixing, (ii) super-resolution mapping, and (iii) self-training, which exploits the embedding provided by the spectral unmixing process itself. However, this approach could be highly affected by the spectral simplex geometry of the input image [73]. In contrast, a hybrid (also called self-learning) SR scheme is proposed in this work to super-resolve remote sensing data from an unsupervised perspective, based on a new end-to-end convolutional generator model. The rationale behind the proposed approach is to learn the relationships between the LR and HR domains by down-sampling the original input image to a lower scale and then using the relations learned at that lower scale to project the LR input image to the target resolution. However, the amount of spatial information that can be retrieved from a down-sampled LR image may be limited, so random generative noise is additionally introduced, together with a global reconstruction constraint, to activate a larger amount of consistent spatial variations along the SR process. That is, random spatial variations are initially generated and injected into the self-learning process in order to mitigate the ill-posed nature of the SR problem. Regarding the global scheme of the proposed network, it provides an end-to-end framework similar to other deep learning-based approaches, e.g., [69]-[71], where the original LR image is used both to learn the down-sampling filters and to generate the super-resolved output.

III. METHODOLOGY
Traditionally, a generator network is an algorithm for image generation where, given a random variable $z$, the model learns internal relationships (represented by the model parameters $\theta$) to generate an image $X = f_{\theta}(z)$, i.e., a regression problem. This allows us to learn the distribution of the data and the correlations between $z$ and $X$. We follow this approach in order to perform SR over remote sensing images, where $z \in \mathbb{R}^{C \times W \times H}$ is random noise and $X \in \mathbb{R}^{3 \times W \times H}$ is the desired RGB high-resolution image.
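As a minimal illustration of this regression view (not the topology actually used in this work, which is detailed below), a convolutional generator can map a fixed random code $z$ to an RGB image of the same spatial size. The channel counts and layer choices here are arbitrary assumptions.

```python
import torch

# Fixed random code z; the generator parameters θ are what is learned.
z = torch.randn(1, 32, 64, 64)             # noise of size C × W × H (sizes are illustrative)
f_theta = torch.nn.Sequential(              # a toy convolutional generator f_θ
    torch.nn.Conv2d(32, 64, 3, padding=1),
    torch.nn.LeakyReLU(0.2),
    torch.nn.Conv2d(64, 3, 3, padding=1),
    torch.nn.Sigmoid(),
)
x = f_theta(z)                              # X = f_θ(z): an RGB image of the same spatial size
print(x.shape)                              # torch.Size([1, 3, 64, 64])
```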
Given an LR image $X_{LR} \in \mathbb{R}^{3 \times W \times H}$, the goal of SR is to improve the image resolution beyond the sensor limits, obtaining an HR version $X_{HR} \in \mathbb{R}^{3 \times t \cdot W \times t \cdot H}$ from $X_{LR}$, where $t$ is the resolution factor and $W < t \cdot W$, $H < t \cdot H$.
In order to do this, a deep model based on CNNs has been implemented. These networks are composed of layers that are applied over defined regions of the input data, i.e., they are locally connected to the input, transforming the input volume into an output volume of neuron activations which serves as input to the next layer. The fact that each layer is not fully connected to the previous one (but only to a patch/window defined by the receptive field) is a great advantage for data analysis, since it reduces the number of connections in the network; each layer composes a feature extraction stage working as a filter or kernel over patches of the input volume.
Depending on the treatment of the data, CNNs can be classified into three categories. Suppose that $x^{(i)} \in \mathbb{R}^{C}$ is a pixel with $C$ spectral bands of an image $X \in \mathbb{R}^{C \times W \times H}$, with $i = 1, 2, \ldots, W \cdot H$, while $P^{(j)} \in \mathbb{R}^{b \times p \times p}$ is a patch of $X$, where $p$ is the patch width and height (with $p \leq W$ and $p \leq H$) and $b$ the number of spectral bands of the patch (with $b \leq C$). 1D-CNN models take each pixel vector $x^{(i)}$ separately as input, extracting only spectral information [74]. On the other hand, 2D-CNNs extract spatial information, taking as input the entire image $X$ [75] or image patches $P^{(j)}$ [76], where $C$ and $b$ are set to small values, i.e., the spectral information is not very relevant compared to the spatial information. Finally, 3D-CNNs extract spectral-spatial information, normally taking as input patches $P^{(j)}$ of the original image $X$ [29], [30], where $C$ and $b$ are set to large values, i.e., the spectral information is very relevant and it is combined with the spatial information.
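The three input arrangements can be summarized with the following PyTorch sketch, where the band count and spatial sizes are purely illustrative assumptions.

```python
import torch
import torch.nn as nn

C, W, H = 8, 64, 64            # illustrative sizes
X = torch.rand(C, W, H)        # image with C spectral bands

# 1D-CNN: each pixel vector x^(i) ∈ R^C is processed independently (spectral only).
pixels = X.reshape(C, -1).T.unsqueeze(1)      # (W·H, 1, C)
spectral_conv = nn.Conv1d(1, 16, kernel_size=3, padding=1)
print(spectral_conv(pixels).shape)            # (W·H, 16, C)

# 2D-CNN: the whole image (or spatial patches) is processed, with bands as channels.
spatial_conv = nn.Conv2d(C, 16, kernel_size=3, padding=1)
print(spatial_conv(X.unsqueeze(0)).shape)     # (1, 16, W, H)

# 3D-CNN: spectral-spatial patches, convolving jointly over bands and space.
patch = X.unsqueeze(0).unsqueeze(0)           # (1, 1, C, W, H)
volume_conv = nn.Conv3d(1, 16, kernel_size=3, padding=1)
print(volume_conv(patch).shape)               # (1, 16, C, W, H)
```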
Usually, a 2D-CNN approach is taken for panchromatic and RGB remote sensing images, while 1D- and 3D-CNNs are usually employed for multi- and hyperspectral images. This paper works with RGB remote sensing datasets, so a 2D-CNN architecture has been implemented to take advantage of the spatial information contained in the images. It is composed of five different kinds of layers: convolutional (CONV), batch normalization (BATCH-NORM), activation, down-sampling, and up-sampling layers. The first three are described below, while the down-sampling and up-sampling layers are presented within the two steps of the network topology.
• CONV layer: each CONV layer applies a set of $n^{(i)}$ kernels over its input volume. As a result, $O^{(i)} \in \mathbb{R}^{n^{(i)} \times W' \times H'}$ forms a data cube whose depth is defined by the number of kernels $n^{(i)}$ (which indicates the number of output feature maps), and whose width and height are calculated as
$$W' = \frac{W - k + 2P}{S} + 1, \qquad H' = \frac{H - k + 2P}{S} + 1,$$
where $k$ is the kernel size, $P$ the padding, and $S$ the stride.
• BATCH-NORM layer: it normalizes the output of the previous layer as
$$\hat{O}^{(i)} = \gamma \cdot \frac{O^{(i)} - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta,$$
where $\mu$ and $\sigma^{2}$ are the mean and variance computed over the batch, $\gamma$ and $\beta$ are learnable parameter vectors, and $\epsilon$ is a parameter for numerical stability.
• Activation layer: after the CONV and BATCH-NORM layers, the activation (or non-linearity) layer embeds a non-linear function that is applied over the output of the previous layer, such as the rectified linear unit (ReLU) [77], [78]. In this case, the LeakyReLU function is implemented [79]:
$$\mathrm{LeakyReLU}(x) = \begin{cases} x, & x \geq 0, \\ \alpha \cdot x, & x < 0, \end{cases}$$
where $\alpha$ is a small non-zero parameter, normally 0.001.
The network topology is organized in two steps (a sketch of the corresponding blocks is given after their description). 1) The down-sampling step is composed of $N$ blocks of layers, called $d^{(i)}$ ($i = 1, 2, \ldots, N$), each one stacking CONV, BATCH-NORM, and activation layers together with a spatial reduction. In fact, each block $d^{(i)}$ reduces the spatial information, i.e., it generates low spatial resolution data that feeds the second, up-sampling step.
2) The up-sampling step is symmetric to the down-sampling one and is also composed of $N$ blocks of layers, called $u^{(i)}$ ($i = N, \ldots, 2, 1$), where the input of each block is the output of the previous one. Each $u^{(i)}$ is composed of several stacked layers: the first one is a BATCH-NORM layer, followed by a CONV layer and the remaining layers of the block. In addition, skip connections $s^{(i)}$ link the down-sampling and up-sampling paths; the output of $s^{(i)}$ is concatenated to the input of $u^{(i)}$. The chosen topology is depicted in Fig. 4. At the end of the topology, an output block is added, composed of a CONV layer followed by a sigmoid function.
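A possible PyTorch sketch of one down-sampling block $d^{(i)}$, one up-sampling block $u^{(i)}$ with its skip connection $s^{(i)}$, and the final output block is shown below. The exact kernel sizes, channel counts, the use of a strided convolution for the spatial reduction, the bilinear up-sampling mode, and the layer ordering after the initial BATCH-NORM/CONV pair are assumptions, since the precise topology is the one given in Fig. 4 and Tables I and II.

```python
import torch
import torch.nn as nn

def down_block(in_ch, out_ch, k=3):
    """One down-sampling block d^(i): CONV (stride 2 assumed for the spatial
    reduction) + BATCH-NORM + LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=2, padding=k // 2),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.001, inplace=True),
    )

class UpBlock(nn.Module):
    """One up-sampling block u^(i): the skip branch s^(i) is concatenated to the
    input, then BATCH-NORM, CONV, LeakyReLU and a bilinear up-sampling are applied."""
    def __init__(self, in_ch, skip_ch, out_ch, k=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch + skip_ch),
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=k, padding=k // 2),
            nn.LeakyReLU(0.001, inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        )

    def forward(self, x, skip):
        # Skip connection: concatenate s^(i) with the input of u^(i) along channels.
        return self.body(torch.cat([x, skip], dim=1))

def output_block(in_ch):
    """Final output block: a CONV layer projecting to 3 RGB channels + sigmoid."""
    return nn.Sequential(nn.Conv2d(in_ch, 3, kernel_size=1), nn.Sigmoid())
```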
As a result, an HR image $X_{HR_o} \in \mathbb{R}^{3 \times t \cdot W \times t \cdot H}$ is generated as the output of the network.
In particular, the goal of SR is to generate an HR image from an LR one, minimizing the following cost function:
$$\theta^{*} = \arg\min_{\theta} \left\| X_{HR_o} - X_{HR} \right\|^{2} = \arg\min_{\theta} \left\| f_{\theta}(z) - X_{HR} \right\|^{2}. \quad (4)$$
In fact, our remote sensing datasets are composed of HR images. However, we cannot use them because they cannot be considered as ground-truth to perform SR. In order to solve this, an LR version is generated from each HR image by a down-sampler $\phi : \mathbb{R}^{3 \times t \cdot W \times t \cdot H} \rightarrow \mathbb{R}^{3 \times W \times H}$, so $X_{LR} = \phi(X_{HR})$. In our case, the down-sampler $\phi$ has been implemented using Lanczos3 resampling [81], where pixels of the original image $X_{HR}$ are passed into an algorithm that averages their color/alpha using sinc functions. With this LR version we can perform the SR task. However, the model generates an HR image, which cannot be directly compared with the LR input. In order to solve this, the down-sampler function $\phi$ is applied over $X_{HR_o}$. At the end, equation (4) can be rewritten as
$$\theta^{*} = \arg\min_{\theta} \left\| \phi(X_{HR_o}) - X_{LR} \right\|^{2} = \arg\min_{\theta} \left\| \phi(f_{\theta}(z)) - X_{LR} \right\|^{2}. \quad (5)$$
The cost function defined by equation (5) is optimized iteratively by the model via the Adam optimizer [82]. The proposed method is summarized in Algorithm 1.

In order to test the proposed model, two networks have been implemented. The first one performs 2× SR over an LR image $X_{LR} \in \mathbb{R}^{3 \times W \times H}$, i.e., the resolution factor is set to $t = 2$, obtaining an HR image $X_{HR} \in \mathbb{R}^{3 \times 2 \cdot W \times 2 \cdot H}$; the second one performs 4× SR, i.e., $t = 4$, obtaining an HR image $X_{HR} \in \mathbb{R}^{3 \times 4 \cdot W \times 4 \cdot H}$. Following the scheme presented in Fig. 4, both models have been implemented with the topologies described in Tables I and II.
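A compact Python sketch of the optimization loop described above is given below, assuming a mean squared error for equation (5). The noise channel count is an arbitrary assumption, and an antialiased bicubic resize stands in for the Lanczos3 down-sampler $\phi$, since `torch.nn.functional.interpolate` does not provide a Lanczos mode; this is therefore an approximation of the procedure summarized in Algorithm 1, not its exact implementation.

```python
import torch
import torch.nn.functional as F

def downsample_phi(x_hr, t):
    """Stand-in for the Lanczos3 down-sampler φ (antialiased bicubic resizing)."""
    return F.interpolate(x_hr, scale_factor=1.0 / t, mode="bicubic",
                         align_corners=False, antialias=True)

def super_resolve(generator, x_lr, t=2, n_iter=4000, lr=1e-3):
    """Fit the generator to a single LR image of shape (1, 3, W, H): minimize
    || φ(f_θ(z)) - X_LR ||² over θ, with z a fixed random code at the HR size.
    The generator is assumed to preserve the spatial size of z (hourglass)."""
    _, _, w, h = x_lr.shape
    z = torch.randn(1, 32, t * w, t * h)          # fixed generative noise (32 channels assumed)
    optimizer = torch.optim.Adam(generator.parameters(), lr=lr)
    for _ in range(n_iter):
        optimizer.zero_grad()
        x_hr = generator(z)                        # candidate HR image X_HR_o
        loss = F.mse_loss(downsample_phi(x_hr, t), x_lr)  # reconstruction constraint, Eq. (5)
        loss.backward()
        optimizer.step()
    return generator(z).detach()                   # super-resolved output
```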

A. Metrics
In order to compare the properties of the obtained super-resolved images, the following quality metrics are considered.

The normalized root mean square error (NRMSE) measures the distance between the data predicted by a model, $X_o$, and the original data observed from the environment, $X$, that we want to model:
$$\mathrm{NRMSE}(X, X_o) = \frac{1}{X_{max} - X_{min}} \sqrt{\frac{1}{n_{samples}} \sum_{i=1}^{n_{samples}} \left( X^{(i)} - X_o^{(i)} \right)^{2}}, \quad (6)$$
where $n_{samples}$ is the number of pixels of $X$, and $X_{max}$ and $X_{min}$ are the maximum and minimum values of image $X$, respectively.
Peak signal-to-noise ratio (PSNR) measures the ratio between the maximum possible power of the signal and the power of the reconstruction error; the higher the PSNR, the higher the quality of the super-resolved image.
Spectral angle mapper (SAM) [84] calculates the angle between the corresponding pixels of the super-resolved image $X_o$ and the original image $X$ in the spectral domain.

The universal image quality index, also called Q-index, gathers three different properties in the image evaluation: (a) correlation, (b) luminance, and (c) contrast.
An extension of the Q-index is the structural similarity index (SSIM) [85], a well-known quality metric used to measure the similarity between two images. It combines three factors: loss of correlation, luminance distortion, and contrast distortion.
Erreur relative globale adimensionnelle de synthèse (ERGAS) [86] measures the quality of the obtained $X_o$, taking the scaling factor into account when evaluating the super-resolved image.
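A NumPy sketch of the first three metrics is given below; SSIM, Q-index, and ERGAS are omitted here (library implementations such as scikit-image's `structural_similarity` can be used for SSIM). The normalization choices are assumptions consistent with the definitions above.

```python
import numpy as np

def nrmse(x, x_o):
    """Normalized RMSE between the original image X and the prediction X_o, as in Eq. (6)."""
    rmse = np.sqrt(np.mean((x - x_o) ** 2))
    return rmse / (x.max() - x.min())

def psnr(x, x_o, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher values indicate higher quality."""
    mse = np.mean((x - x_o) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def sam(x, x_o, eps=1e-12):
    """Mean spectral angle (radians) between corresponding pixels of X and X_o,
    with images given as (C, W, H) arrays."""
    x = x.reshape(x.shape[0], -1)
    x_o = x_o.reshape(x_o.shape[0], -1)
    cos = np.sum(x * x_o, axis=0) / (
        np.linalg.norm(x, axis=0) * np.linalg.norm(x_o, axis=0) + eps)
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))
```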
IV. EXPERIMENTS

A. Experimental Configuration and Datasets
In order to test the performance of the proposed model, it is compared against several unsupervised SR methods, including deep learning-based approaches such as [69]-[71], [88]. The employed repositories are described below and are publicly available online².

C. Discussion
According to the quantitative assessment reported in Tables V-IV, Fig. 6(h) is certainly the most similar to its HR counterpart in Fig. 6(a). Even though the result provided by SRI (Fig. 6(d)) seems to obtain a slightly better contrast on some parts of the image, the proposed approach is able to introduce more high-frequency information in the boat structure. In addition, the proposed approach also introduces some fine shadow details which are not present in the other methods' results.
When considering a 4× ratio, the proposed approach shows an even better capability to recover high-frequency information while preserving HR details, avoiding undesirable visual artifacts in the super-resolved output. For instance, the result provided by SRI in Fig. 7(d) exhibits a remarkable sharpness on edges; however, it generates a kind of ghosting effect and also alters several shapes in the image. Although TSE (Fig. 7(g)) is able to overcome some of these artifacts, the proposed approach still provides a more consistent super-resolved result.

D. Advantages and limitations of the proposed approach
When comparing the performance of the proposed approach with the best results obtained in the experiments, we can observe the high potential of the proposed deep generative network to super-resolve remote sensing data.
To date, the hybrid approach used by SRI and TSE has proven to be one of the most effective ways to learn useful LR/HR patch relationships under an unsupervised SR scheme. However, this straightforward approach of searching for patches across scales is rather constrained by the quality of the spatial information appearing in the LR input image. That is, the super-resolved result often tends to suffer from ghosting artifacts and watering effects as the magnification factor increases (Fig. 7).
Even though TSE deals with this issue by allowing patch geometric transformations in the patch search criteria, i.e., patches can occur at a lower scale as they are or even transformed, this process does not actually introduce any new spatial information into the output result, which may eventually limit the SR process, especially in the remote sensing field. Note that remotely sensed images are usually a highly complex kind of data because they are typically fully-focused multi-band shots with plenty of different spatial details within the same image. As a result, the generation of consistent spatial variability becomes a key factor to improve the unsupervised remote sensing SR process.
Precisely, this is the objective of the proposed approach. In this regard, it also seems reasonable to put our results in the context of deep learning-based SR approaches such as [69]-[71]. Logically, our model has a different nature because of its unsupervised scheme; however, this consideration appears reasonable because the PSNR index, which is based on the MSE, is one of the most widely used metrics to assess SR performance.

On the other hand, the computational cost of the proposed approach may also become a limitation in some specific scenarios. According to the quantitative results shown in Table IV, the proposed approach takes over 300 and 150 seconds to process each input image considering 2× and 4× ratios, respectively. Even though the proposed approach has shown not to be one of the most computationally efficient methods, three important considerations have to be made in this regard. First, the computational burden is not only a drawback of the proposed approach but of any deep learning architecture, because this kind of technology usually provides a more powerful framework to cope with new challenges and tasks. Second, the implementation of our model has not been optimized to fully exploit the GPU hardware resources in order to substantially reduce the resulting computational time; that is, we make use of standard functions, but further efforts could be devoted to generating a much more optimized version of the code. Third, we use a general configuration of 4,000 iterations as a safety margin to guarantee a good network convergence; however, this value could be reduced in order to significantly improve the computational efficiency of the proposed approach.

Fig. 11 shows the evolution of the PSNR metric with respect to the number of iterations for the harbor, circular-farmland, industry, and road test images with a 4× ratio. As can be seen, the network is able to achieve a PSNR result that is very close to the optimal value after 2,000 iterations; therefore, it would be possible to reduce the number of iterations in order to significantly decrease the computational time of the proposed approach. In Fig. 12, we also show the PSNR evolution over time to highlight the fact that the proposed approach is able to rapidly converge to the optimal PSNR value. It should be noted that we use a single network configuration in this work; therefore, 4,000 iterations are used to guarantee a good general parameter convergence.