Single-frame super-resolution in remote sensing: a practical overview

ABSTRACT Image acquisition technology is improving rapidly in performance. However, there are physical restrictions that can only be overcome using software processing strategies. This is particularly true in the case of super-resolution (SR) methodologies. SR techniques have found a fertile application field in airborne and space optical acquisition platforms. Single-frame SR methods may be advantageous for some remote-sensing platforms and acquisition time conditions. The contributions of this article are twofold: (1) to present an overview of single-frame SR methods, making a comparative analysis of their performance in different and challenging remote-sensing scenarios, and (2) to propose a new single-frame SR taxonomy and a common validation strategy. Finally, we should emphasize that, on the one hand, this is the first time, to the best of our knowledge, that such a review and analysis of single-frame SR methods has been made in the framework of remote sensing and, on the other hand, that the new single-frame SR taxonomy is aimed at shedding some light on the classification of some types of single-frame SR methods.


Introduction
From the early years, improving image resolution has been one of the main concerns of the computer vision and remote sensing research communities in order to deal with acquisition technology limitations. Despite the fact that image sensors are constantly upgraded, there are several physical constraints that can only be addressed by signal and image processing algorithms. Decreasing the physical pixel size beyond a certain limit reduces the incoming light and generates more diffraction effects in the sensor, which eventually reduces the final image quality. Furthermore, hardware cost is usually an important constraint in many applications, because acquisition technology cost and complexity depend on sensor resolution. Therefore, algorithmic approaches are sometimes more suitable, because they allow improving the image resolution beyond the sensor limits.
The general concept of Super-Resolution (SR) refers to those algorithms aimed at increasing the image resolution, that is, increasing the number of pixels while providing fine details in the resulting image, as if a sensor with a higher nominal resolution had been used. The increasing demand for aerial and satellite imagery applications motivates the development of new data products in order to cope with the new challenges that are constantly appearing nowadays (Bioucas-Dias et al. 2013; Li et al. 2015a,b). Specifically, many of the most important commercial satellites are currently working on providing new super-resolved products at data level 2, i.e. images beyond the sensor resolution limit including all the level-0 pre-processing as well as the radiometric and geometric corrections involved in level-1 data processing.
Depending on the number of input images, two kinds of SR schemes can be distinguished: single-frame and multi-frame. In the former type, only a single image of the objective scene is used to obtain the super-resolved output. In the latter type, several images of the same scene acquired at different positions are used to generate the higher resolution result.
Multi-frame super-resolution methods have been widely applied in hyper-spectral remote sensing, with or without the multi-angular acquisition capability some of these sensors have (Bioucas-Dias et al. 2013; Simoes et al. 2015). For years, the data redundancy inherent in multiple images has caught the research community's attention as a means to enhance aerial imagery. However, there are many applications where multiple images of the same scene are unavailable or difficult to obtain. There are two main situations in remote sensing where single-frame super-resolution methods can be advantageous in relation to their multi-frame counterpart:
• In remote sensing airborne missions that use small platforms with cheap, low-resolution sensors, and where on-board image processing may also be an attractive possibility.
• When super-resolution is applied to satellites that have a long revisit time (for a particular application). The revisit time is the time a satellite takes to make two consecutive observations of the same point on Earth. According to the Committee on Earth Observation Satellites (CEOS) webpage (http://database.eohandbook.com/database/missiontable.aspx), 49 out of the 140 satellites listed in that database have a revisit period of 7 days or less. A revisit time of one week or more may be too long for some biophysical processes in agriculture, for instance, during the peak growth period (Moreno 2004).
For these reasons, several works in the literature (Loncan et al. 2015; Chavez-Roman and V. 2014; Demirel and G. 2011) advocate the use of single-frame super-resolution in remote sensing.

Limitations in the current literature
We can find several papers in the literature that provide a good overview of the existing SR algorithms; however, few of them are specifically aimed at remote sensing and most consider a rather limited number of methods. For instance, a good up-to-date overview is presented in (Nasrollahi and B. 2014), where most of the papers published on the SR topic up to 2012 are reviewed. Despite the high quality of that work, the lack of an experimental part, together with the generality of the application domain, makes it difficult to draw practical conclusions for the remote sensing field. Other quality works provide a comprehensive experimental part, but the description and categorisation of methods is unsatisfactory. Besides, their experiments are conducted using only standard test images, which makes it difficult to extrapolate conclusions to remote sensing. Other papers (Suganya, Mohanapriya, and Vanitha 2013; Zhang et al. 2014; Ramji, Punniakodi, and Praveen 2013) focus on super-resolving aerial imagery, but they show some shortcomings, especially when it comes to single-frame SR. In all these works, the number of tested methods and the image diversity are rather reduced and often constrained to multi-frame super-resolution.
In a sense, the lack of aerial imagery benchmarks and of standardised implementations of SR algorithms makes it difficult to draw practical conclusions for remote sensing. Some important questions, such as what are the state-of-the-art SR techniques? or what are the practical differences among methods?, remain unanswered in remote sensing. On the one hand, single-frame SR surveys tackle the SR problem from a general-application perspective despite the fact that SR performance highly depends on the specific field. On the other hand, remote sensing works tend to neglect single-frame SR methodologies even though these techniques are highly useful in remote sensing (Bioucas-Dias et al. 2013; Yang et al. 2015).
The evaluation criteria are another point of discussion in SR (Reibman, Bell, and Gray 2006). There is no universal criterion to assess the quality of the resulting images, which makes comparison among algorithms difficult. Some works use an experimental validation protocol based on ground-reference images, whereas others include no-reference metrics. This variation of protocols and metrics makes the cross-comparison among SR methods cumbersome, especially in the remote sensing field, where both spatial and spectral information are important.

Work Objectives and Paper Structure
Given the scenario described above, the present work has the following aims: (i) to study the rationale behind many of the state-of-the-art single-frame SR techniques, (ii) to present a clear taxonomy to simplify the categorisation and validation of single-frame SR methods, (iii) to conduct a comprehensive comparison of single-frame SR techniques in remote sensing and (iv) to provide the research community with access to a set of algorithms in order to make benchmarking and evaluation easier in the future. In a sense, this paper aims to shed light on single-frame SR research, especially as applied to the remote sensing field.
The rest of the paper is organized as follows. Section 2 presents the background of the work by introducing the concept of SR in remote sensing. Section 3 describes the taxonomy used to classify all the considered methods. Section 4 defines the validation protocol proposed to perform the quality assessment. Section 5 shows and discusses the obtained SR results. Finally, Section 6 draws the main conclusions arising from the work.

Image Super-Resolution in Remote Sensing

Imaging Model
In general, the image acquisition process can be understood as a sequential set of degradations in which the continuous scene signal is converted by the sensor into a discrete output image. In the context of single-frame SR, the imaging model is typically defined as the concatenation of three operators: (i) Blurring (B), which represents the blurring introduced by the optical system, i.e. a low-pass filter over the continuous scene, (ii) Decimation (D), which sub-samples the blurred continuous space according to the resolution of the sensor, i.e. down-sampling the high-resolution scene, and (iii) Noise (N), which introduces a final perturbation to the output image.
Equation (1) shows the imaging model considering the aforementioned operators. The starting point is the High-Resolution (HR) image I_HR corresponding to the continuous scene. Note that the term continuous scene refers to the image at the highest possible resolution. Then, I_HR is affected by each operator in turn to generate the final observed Low-Resolution (LR) image I_LR:

I_LR = D(B(I_HR)) + N    (1)
In the field of remote sensing, the blur operator is specified in the technical description of the instrument as the Point Spread Function (PSF). This function can be seen as a blurring mask that describes how a point of the HR domain is spread in the LR space during the acquisition process. The theoretical form of the blur for a diffraction-limited system is the Airy pattern, described in terms of a Bessel function. However, due to effects such as lens aberration or atmospheric turbulence, the blur is often adequately represented by a Gaussian distribution. In other words, it is broadly accepted to approximate the PSF by a Gaussian function whose parameters (µ, σ) depend on the characteristics of the acquisition instrument. Another important parameter appearing in the sensor technical specifications is the noise level and type. In this case, it is common to consider additive white Gaussian noise, so both the PSF and the noise are usually modelled by Gaussian functions.
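The imaging model of Eq. (1) can be sketched numerically as follows; the kernel size, σ and noise level are illustrative choices, not values tied to any particular instrument:

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """1-D Gaussian kernel normalised to sum to one."""
    x = np.arange(size) - size // 2
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(img, size=5, sigma=1.0):
    """Operator B: separable Gaussian blur (the PSF), with edge padding."""
    k, pad = gaussian_kernel(size, sigma), size // 2
    out = np.pad(img, pad, mode='edge')
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode='valid'), 1, out)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode='valid'), 0, out)
    return out

def degrade(I_HR, scale=2, size=5, sigma=1.0, noise_std=0.01, seed=0):
    """Imaging model of Eq. (1): I_LR = D(B(I_HR)) + N."""
    blurred = blur(I_HR, size, sigma)       # B: optical blur
    decimated = blurred[::scale, ::scale]   # D: sub-sampling to the sensor grid
    rng = np.random.default_rng(seed)
    return decimated + rng.normal(0.0, noise_std, decimated.shape)  # N: noise
```

Every super-resolution method discussed below can be read as an attempt to invert `degrade` for a particular choice of B, D and N.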

Super-Resolution as an inverse problem
The objective of SR is to invert the degradation introduced by the imaging model in order to obtain a super-resolved high-resolution output image (I_SR) from its corresponding low-resolution image (I_LR). This inverse problem is ill-posed because different HR images may yield the same LR image under a specific imaging model.

Super-Resolution vs. Interpolation
It is noteworthy that some authors refer to interpolation algorithms as SR even though there are differences between both concepts. Let us clarify this point. Interpolation methods project the initial LR image onto an HR grid and estimate the missing pixel values using an interpolation function. Although some of these functions effectively reduce the aliasing effect caused by the decimation operator of the imaging model, they are not able to recover high-frequency details, unlike SR methods.

Super-Resolution vs. Pansharpening
As noted above, SR surveys hardly ever target remote sensing imagery specifically but rather general applications. However, other remote-sensing-related topics have been widely addressed in the literature, and they can help us adapt the SR approach to remote sensing. This is the case of Pansharpening techniques. In a nutshell, Pansharpening is a fusion problem in which a high spatial resolution Panchromatic image (PAN) is combined with a lower spatial resolution Multi-Spectral image (MS) in order to obtain a final result that integrates the spatial resolution of PAN and the spectral resolution of MS. Although Pansharpening and SR have features in common, it is important to highlight their differences. The Pansharpening objective lies in reconstructing just the spectral information of the PAN image, because all spatial details are available. In contrast, SR is a more general problem in which both spatial and spectral information have to be estimated from the LR image.

Super-Resolution of Multi-Spectral images
Standard SR algorithms are oriented to super-resolving grey-scale images, that is, they are focused on enhancing LR images with a single band. In the case of RGB images, the typical super-resolving procedure is based on the transformation into the YCbCr colour space, where Y represents the luminance component, Cb the blue-difference chroma component and Cr the red-difference chroma component. Figure 1 shows the YCbCr decomposition of an example RGB image. The input RGB image is initially converted to the YCbCr colour space. Then, the luminance channel Y is super-resolved, while both the Cb and Cr bands are simply interpolated to the target resolution. Finally, the inverse YCbCr transformation is used to generate the final super-resolved output.
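The YCbCr pipeline described above can be sketched as follows; the conversion uses BT.601-style coefficients, and a nearest-neighbour upscaler stands in for an actual SR method on the luminance channel:

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """BT.601-style conversion for float RGB in [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = (b - y) * 0.564 + 0.5
    cr = (r - y) * 0.713 + 0.5
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    """Exact algebraic inverse of rgb_to_ycbcr."""
    r = y + (cr - 0.5) / 0.713
    b = y + (cb - 0.5) / 0.564
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return np.stack([r, g, b], axis=-1)

def upscale(channel, scale):
    """Placeholder nearest-neighbour interpolation."""
    return np.kron(channel, np.ones((scale, scale)))

def super_resolve_rgb(rgb, scale, sr_func):
    """Super-resolve Y with sr_func; only interpolate Cb and Cr."""
    y, cb, cr = rgb_to_ycbcr(rgb)
    return ycbcr_to_rgb(sr_func(y, scale), upscale(cb, scale), upscale(cr, scale))
```

In practice `sr_func` would be any of the single-frame SR methods discussed in Section 3; the chroma channels carry little high-frequency detail, which is why interpolation suffices for them.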
General-purpose satellites like GeoEye-1 or IKONOS usually provide two kinds of images: (i) PAN, a wide-spectral-bandwidth, monochromatic, high-resolution image and (ii) MS, a lower resolution multi-spectral image. The final user has to decide whether to super-resolve the PAN image, the MS image or both. In the case of the PAN image, SR methods can be used directly because PAN images consist of a single band. In the case of MS images, the SR algorithms have to be applied to the luminance component once a false-colour RGB has been created.
A Taxonomy of Single-Frame Super-Resolution Methods

There are different survey papers in the literature (Suganya, Mohanapriya, and Vanitha 2013; Nasrollahi and B. 2014) that classify single-frame super-resolution methods according to multiple criteria. For instance, the classification criteria may depend on the domain where the SR is performed (e.g. frequency domain or spatial domain), on the theoretical tools used to cope with the SR problem (e.g. Gradient Profiles, Neighbourhood Embeddings, Neural Networks, Bayesian Mixture Models, etc.) or even on the knowledge of the process parameters (e.g. blind or non-blind SR). However, many SR methods gather features from several of those groups, which eventually makes it difficult to place a particular algorithm within the current taxonomies.
In this paper, we propose a relaxed taxonomy based on the functional properties of SR methods, inspired by the taxonomy usually applied in Pansharpening (Vivone et al. 2015). Therefore, instead of a main division of methods based on particular features of how they implement their solutions, we rather consider the functional definition of the rationale behind the algorithms considered in this research. The following sections describe the three considered SR groups in detail: (i) reconstruction-based methods, (ii) learning-based methods and (iii) hybrid methods.

Super-Resolution based on Image Reconstruction
The SR methods based on image REconstruction (RE) aim to reproduce features appearing in the LR image at a higher resolution level. In other words, they are aimed at obtaining a final super-resolved image with the same perceptual properties as the LR image but at a higher scale. In a sense, RE methods try to replicate the quality of the LR domain after an initial interpolation in order to avoid the blurring and aliasing effects caused by interpolation algorithms. Figure 2 shows the general scheme of RE methods. Specifically, this sort of method is made up of three main steps. In the first one (stage 1), the LR image (I_LR) is upscaled to the target resolution using a regular interpolation algorithm, such as bicubic or Lanczos interpolation. The second stage is based on extracting certain physical properties from I_LR to figure out the quality of the details in the LR domain. Finally, in the third step (stage 3), both the interpolated image I_LRI and the extracted LR properties are aggregated to obtain the final super-resolved result (I_SR). In short, RE methods perform image reconstruction to provide the same level of detail we can observe in the input image.

All the existing RE algorithms can be described in terms of the three aforementioned stages; however, each particular method proposes a particular set of tools to reconstruct the result. Specifically, we can highlight three kinds of techniques as some of the most relevant ones inside the RE category: (i) Iterative Back Projection, (ii) PSF Deconvolution and (iii) Gradient Profile. The following sections present a description of those techniques according to the scheme proposed in Fig. 2.

Iterative Back Projection
The Iterative Back Projection (IBP) technique was first introduced in (Irani and S. 1991) as a general multi-frame super-resolution algorithm; it was later extended to the single-frame case, becoming one of the most popular SR methods in the literature. The idea behind single-image IBP (we focus on the single-image approach given the scope of this work) is to iteratively refine an initial guess of the super-resolved image. Throughout this iterative process, the reconstruction error between the LR image and the low-resolution version of the super-resolved image is minimised in order to obtain the final result. Eq. (2) provides the reconstruction error at a specific iteration (E_n) given the initial low-resolution image (I_LR) and the super-resolved result of the previous iteration (I_SR^(n-1)):

E_n = I_LR - D(B(I_SR^(n-1)))    (2)
where D and B are the decimation and blurring operators, respectively, as in Eq. (1). Three stages define the method: (i) initialisation, (ii) down-sampling and (iii) back-projection. In the initialisation stage (i), the super-resolved image is initialised with an interpolated version of the low-resolution input. After that, stages (ii) and (iii) are alternately applied until a convergence condition is reached (e.g. a maximum number of iterations or a threshold in the reconstruction error). In the down-sampling stage (ii), the previous estimation of the super-resolved image I_SR^(n-1) is blurred (B) and decimated (D) according to the corresponding imaging model to assess the reconstruction error E_n using Eq. (2). In the back-projection stage (iii), the current estimation of the super-resolved image I_SR^(n) is obtained by projecting the reconstruction error onto the previous estimation I_SR^(n-1). Depending on the desired properties of the resulting image, a back-projection kernel can be applied over the projected error E_n to obtain smoother or sharper images.
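A minimal sketch of the three IBP stages, assuming a toy imaging model where D∘B is plain block averaging (a real implementation would use the instrument PSF and a tuned back-projection kernel):

```python
import numpy as np

def downscale(img, scale):
    """Toy D(B(.)): block averaging stands in for the sensor PSF + decimation."""
    h, w = img.shape
    return img.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

def upscale(img, scale):
    """Nearest-neighbour expansion, used both for initialisation and as a
    simple back-projection kernel."""
    return np.kron(img, np.ones((scale, scale)))

def ibp(I_LR, scale=2, n_iter=30, step=1.0):
    I_SR = upscale(I_LR, scale)                 # (i) initialisation
    for _ in range(n_iter):
        E = I_LR - downscale(I_SR, scale)       # (ii) reconstruction error, Eq. (2)
        I_SR = I_SR + step * upscale(E, scale)  # (iii) back-projection
    return I_SR
```

With this particular downscale/upscale pair the iteration drives the error of Eq. (2) to zero, so the output is exactly consistent with the LR input under the assumed imaging model.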

PSF Deconvolution
The super-resolution methods based on PSF deconvolution tackle the upscaling problem from a deblurring point of view. As introduced in Section 2, the imaging model introduces a blurring effect in the acquisition process through the operator B. Specifically, this operator can be seen as the convolution of the sensor PSF with the HR scene. Under this view, the super-resolution problem becomes one of undoing the effect of the blurring kernel.
It is theoretically possible to deconvolve a blurred image if we have access to the exact blurring kernel. However, the noise inherent in any acquisition process, together with the lack of information about the kernel, hinders that task. The instrument PSF may be unavailable, or it may not be completely accurate. Eq. (3) formulates the convolution-based degradation:

I* = I * PSF + N    (3)

where I* and I represent the convolved and deconvolved images respectively, * is the convolution operator, PSF represents the point spread function and N the additive noise. This deblurring problem has the same ill-posed nature as the SR problem. Therefore, many of the techniques used in SR are suitable to tackle the deconvolution problem as well. Non-blind approaches assume a specific model for the PSF, such as a normal distribution or a low-pass filter in the frequency domain, whereas blind approaches (Michaeli and M. 2013) estimate the PSF from the convolved image by assuming some prior knowledge.
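A minimal non-blind example is Wiener deconvolution in the frequency domain; it assumes the PSF is known, circular (FFT) boundary conditions, and a hand-tuned noise-to-signal ratio k:

```python
import numpy as np

def wiener_deconv(blurred, psf, k=1e-6):
    """Non-blind Wiener deconvolution of the model of Eq. (3).

    k regularises the inversion: where |H| is small (high frequencies
    attenuated by the PSF), the filter backs off instead of amplifying noise.
    """
    H = np.fft.fft2(psf, s=blurred.shape)   # frequency response of the PSF
    G = np.fft.fft2(blurred)
    F_hat = np.conj(H) * G / (np.abs(H) ** 2 + k)
    return np.real(np.fft.ifft2(F_hat))
```

In practice the PSF estimate would come from the instrument specification (Section 2), and k trades residual blur against noise amplification.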

Gradient Profile
Gradient Profile (GP) SR methods take advantage of the fact that the gradient profile shape tends to be invariant across scales. Briefly speaking, a GP is a 1-D profile containing the gradient magnitude for each edge pixel along the gradient direction. Specifically, the general SR algorithm based on GP can be divided into three steps: (i) interpolation, (ii) GP extraction and (iii) reconstruction using a GP prior. In the first step (i), a regular interpolation algorithm is used to upscale the LR input image I_LR. The second step (ii) is in charge of calculating the GP of the low-resolution image (∇I_LR) as well as of the interpolated image (∇I_LRI) in order to obtain their sharpness. In the last step (iii), the GP of I_LRI is transformed according to the sharpness of ∇I_LR, and this transformed GP (∇I*_LRI) is used to reconstruct the super-resolved image. Eq. (4) shows the energy to be minimised, combining an image-domain term (E_i) and a gradient-domain term (E_g) weighted by a control parameter β:

E(I_SR) = E_i(I_SR | I_LR) + β E_g(∇I_SR | ∇I*_LRI)    (4)

where E_i penalises the mismatch between the degraded super-resolved image and I_LR, and E_g penalises the deviation of ∇I_SR from the transformed gradient profile ∇I*_LRI.
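The minimisation of Eq. (4) can be sketched with plain gradient descent; here the data term is simplified to keep the solution close to the interpolated image, and the target gradient field is just the interpolated image's gradients scaled by a hypothetical sharpness factor s (a real GP method transforms the profiles using learned statistics):

```python
import numpy as np

def grad(I):
    """Forward differences with periodic boundaries."""
    return np.roll(I, -1, axis=1) - I, np.roll(I, -1, axis=0) - I

def div(px, py):
    """Divergence: the negative adjoint of grad, periodic boundaries."""
    return px - np.roll(px, 1, axis=1) + py - np.roll(py, 1, axis=0)

def energy(I, I0, Gx, Gy, beta):
    """Simplified Eq. (4): image-domain term + beta * gradient-domain term."""
    gx, gy = grad(I)
    return np.sum((I - I0) ** 2) + beta * np.sum((gx - Gx) ** 2 + (gy - Gy) ** 2)

def gp_reconstruct(I0, s=1.5, beta=0.2, lr=0.1, n_iter=200):
    """Descend the energy so the result stays close to the interpolated image
    I0 while its gradients move towards the sharpened target field."""
    Gx, Gy = (s * g for g in grad(I0))
    I = I0.copy()
    for _ in range(n_iter):
        gx, gy = grad(I)
        dE = 2.0 * (I - I0) - 2.0 * beta * div(gx - Gx, gy - Gy)
        I -= lr * dE
    return I
```

The step size lr must stay below the stability limit set by the Laplacian term; the values above are safe for the chosen beta.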

Super-Resolution based on Image Learning
Super-resolution based on image LEarning (LE) is probably one of the most relevant research directions within the super-resolution field nowadays. The idea behind LE methods is to learn the potential relationships between the low-resolution and high-resolution domains from an external training set, and then to generate the final super-resolved image using this a priori knowledge. Figure 3 shows the three stages into which the general LE scheme can be divided. The first stage (learning) is aimed at learning the low-resolution to high-resolution pair relationships from a specific training set. In the second stage (estimation), the LR input image is used to estimate those HR components that are related to the low-resolution elements appearing in I_LR. Eventually, the third stage (image generation) combines the estimated HR components of the input image to generate the final super-resolved image. For the sake of clarity, we use the term generation for LE methods and the term reconstruction for RE methods.

The learnt relationships between HR and LR training details determine the quality of the super-resolved images. Therefore, the training data need to be relevant enough to learn useful structures for the kind of input LR image (I_LR). Authors usually refer to the structure containing the relationship between low-resolution and high-resolution information as a dictionary, although it does not necessarily have to be implemented as an associative array. In any case, it has to relate low-resolution and high-resolution components using a specific image characterisation. The most common characterisation approach is the patch-based representation, where pairs of HR/LR training images are cropped into patches in order to learn the functions that link them.
Each LE method uses its own strategies to learn the dictionary as well as to generate the final super-resolved output. Among the most popular techniques, the following three kinds of learning methods can be considered: (i) sparse coding, (ii) neighbourhood embedding and (iii) mapping.

Sparse Coding
Sparse Coding (SC) is a powerful image modelling technique that has been successfully used in several image restoration applications, SR being one of the most active domains. These methods take advantage of the fact that natural images tend to be intrinsically sparse when characterised as a linear combination of small patches. In this way, SC algorithms aim to obtain the sparse representation corresponding to the low-resolution input image and to generate the high-resolution output by replacing LR patches with their corresponding HR counterparts. In SC algorithms, we can highlight the following steps: (i) dictionary training, (ii) LR sparse code computation and (iii) SR output generation.
The first step is based on learning dictionaries for high-resolution (D_h) and low-resolution (D_l) image patches. These dictionaries are learnt by forcing the high-resolution training images (I_HR_TRA) and their low-resolution counterparts (I_LR_TRA) to share the same sparse codes. Equation (5) shows the formulation to train both dictionaries D_h and D_l:

arg min_{D_h, D_l, z} (1/h) ||I_HR_TRA - D_h z||_2^2 + (1/l) ||I_LR_TRA - D_l z||_2^2 + λ ||z||_1    (5)

In Eq. (5), z represents the sparse codes shared by both dictionaries, h and l are the dimensions of the high-resolution and low-resolution patches respectively, and λ is the Lagrange multiplier balancing the sparsity of the solution. Note that in the training step the sparse codes z are not useful by themselves: they are only used to obtain the dictionaries. The first two terms in Eq. (5) correspond to the optimisation fitting part and the last term acts as a regulariser. Regarding the norms appearing in Eq. (5), the l2-norm is typically used for the fitting terms and the l1-norm for the regularisation term, but other norms may be used depending on the particular SC method.
The second step in SC algorithms is the computation of the sparse codes of the low-resolution input image. Solving Eq. (6) provides the coefficients α over the trained dictionaries D_h and D_l:

arg min_α ||F D_l α - F I_LR||_2^2 + β ||P D_h α - w||_2^2 + λ ||α||_1    (6)
In this expression, F represents the feature extraction operator, typically implemented as a high-pass filter; P extracts the overlapping area between the target patch and the previously reconstructed image; β controls the compatibility of neighbouring patches and the fidelity to the initial low-resolution image; and w contains the values of the previously reconstructed high-resolution image on the corresponding overlapping area.
Finally, the high-resolution output is generated in the third step by applying the α sparse codes of the input low-resolution image to the high-resolution dictionary D_h, as shown in Eq. (7):

I_SR0 = D_h α    (7)
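The three SC steps can be sketched on toy dictionaries; greedy orthogonal matching pursuit stands in for the l1-regularised solver of Eq. (6), and the patch-overlap terms (F, P, w) are omitted for brevity:

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Orthogonal matching pursuit: greedily pick atoms, refit by least
    squares; a simple stand-in for the l1 solver of Eq. (6)."""
    residual, support = y.astype(float).copy(), []
    coef = np.zeros(0)
    for _ in range(n_nonzero):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    alpha = np.zeros(D.shape[1])
    alpha[support] = coef
    return alpha

def sc_generate(D_h, D_l, lr_patch, n_nonzero=3):
    """Encode the LR patch over D_l, then synthesise the HR patch as
    D_h @ alpha, the core of Eq. (7)."""
    alpha = omp(D_l, lr_patch, n_nonzero)
    return D_h @ alpha
```

The key assumption is the one stated around Eq. (5): both dictionaries were trained to share the same sparse codes, so a code computed on D_l is meaningful on D_h.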
Even though these three steps constitute the standard SC pipeline, an extra post-processing stage is usually applied to enforce a global reconstruction constraint. Due to the ill-posed nature of the problem, the generated solution I_SR is an approximation of the real high-resolution image I_HR. Therefore, the decimated version of the solution may differ from the LR input image. Eq. (8) eliminates this discrepancy by projecting the input image I_LR onto the solution space:

I*_SR = arg min_I ||D(B(I)) - I_LR||_2^2 + c ||I - I_SR0||_2^2    (8)
D and B are the decimation and blurring operators respectively, I_SR0 represents the initial guess of the solution provided by Eq. (7), and I*_SR is the final output after the optimisation process, which can be performed using a gradient descent approach. In a sense, the main aim of this global optimisation is to balance, by a factor c, the fitting of the solution to the initial LR image against its fitting to the initial high-resolution estimation.

Neighbourhood Embedding

Neighbourhood Embedding (NE) methods are inspired by recent manifold learning approaches, which is why they are also known as manifold-based methods. NE techniques assume that small image patches from low-resolution images and their high-resolution counterparts form low-dimensional non-linear manifolds with similar local geometries. This means that, as long as enough samples are available, patches in the HR feature domain can be reconstructed as a weighted average of local neighbours using the same weights as in the LR feature domain. As in the case of any learning-based super-resolution algorithm, NE approaches can be described in terms of three stages: (i) manifold learning, (ii) LR neighbourhood reconstruction and (iii) SR output generation.
In the first step (manifold learning), both HR and LR manifolds are generated from a specific training set by minimising their corresponding reconstruction errors. Initially, a neighbourhood is found for each patch and then the reconstruction weights are computed using that neighbourhood. The Euclidean distance is typically used to define the neighbourhood, although any other distance function may be used. Eq. (9) shows the reconstruction error for a patch p_i, where N_i represents the neighbourhood of p_i, p_j is an element of this neighbourhood and x_i,j is the array of reconstruction weights:

ε(p_i) = || p_i - Σ_{p_j ∈ N_i} x_i,j p_j ||^2    (9)

subject to the constraints Σ_j x_i,j = 1 and x_i,j = 0 for any p_j ∉ N_i. The second step (LR neighbourhood reconstruction) applies the same idea to the LR input image alone. That is, this step is aimed at finding the reconstruction weights of the LR input image in order to represent each of its patches as a combination of neighbours in the LR manifold by minimising Eq. (9).
Finally, the high-resolution embedding is computed in the third step (SR output generation) by using the low-resolution reconstruction weights over the high-resolution patches, preserving the local geometry of the reconstruction weights. In other words, the reconstruction weights x_i,j of the LR input image are used to combine the patches in the HR manifold, as shown in Eq. (10):

h_i = Σ_{p_j ∈ N_i} x_i,j h_j    (10)

where h_j denotes the HR counterpart of the LR neighbour p_j.
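Steps (ii) and (iii) can be sketched as follows; the sum-to-one constrained minimisation of Eq. (9) is solved through the local Gram matrix, as in locally linear embedding:

```python
import numpy as np

def lle_weights(p, neighbours, reg=1e-8):
    """Solve min ||p - sum_j x_j * neighbours[j]||^2  s.t.  sum_j x_j = 1
    (the minimisation of Eq. (9)) via the regularised local Gram matrix."""
    diff = neighbours - p                     # k x d matrix of (p_j - p_i)
    G = diff @ diff.T
    G = G + reg * np.trace(G) * np.eye(len(neighbours))
    w = np.linalg.solve(G, np.ones(len(neighbours)))
    return w / w.sum()                        # enforce the sum-to-one constraint

def ne_generate(lr_patch, lr_neighbours, hr_neighbours):
    """Reuse the LR weights on the HR counterparts, as in Eq. (10)."""
    w = lle_weights(lr_patch, lr_neighbours)
    return w @ hr_neighbours
```

The neighbourhood search itself (Euclidean distance over the LR training patches) is omitted; `lr_neighbours` and `hr_neighbours` are assumed to be the already-retrieved paired rows.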

Mapping
Mapping (MP) based methods (also known as regression-based methods) consider super-resolution as a regression problem between the high-resolution and low-resolution spaces. The underlying idea is to learn a mapping function from the input low-resolution images to the target high-resolution images based on a specific training set. Then, this learnt function can be used to generate the final super-resolved result from the low-resolution input image. In particular, MP methods are made up of three steps: (i) mapping function training, (ii) projecting the LR input image I_LR into the HR space and (iii) generating the final SR output.
In the literature, we can find all kinds of techniques to perform the aforementioned regression. Bayesian networks, kernel-based methods and even neural networks are some of the approaches that have been widely used to estimate the projection between LR and HR images. Even though many of these approaches use different theoretical tools, the bottom line of how the super-resolution process is performed is mainly the same. Because of that, we have decided to include all these methods in the same group, despite the fact that other works may present a different classification.

Let us review in this section some of the most important mapping functions. One of the first SR estimators appearing in the literature was the Nearest-Neighbour (NN) function, which works as follows. In the training stage, pairs of LR and HR patches are collected. The input low-resolution image patches are compared against the stored collection in order to select the high-resolution version of the nearest ones. Finally, the super-resolved output is generated while guaranteeing a certain spatial compatibility over the chosen HR patches.
Note the differences between a neighbourhood embedding approach (Section 3.2.2) and an SR method based on an NN mapping function. In the first case, there is a reconstruction process of the input low-resolution patches using a specific neighbourhood, and these reconstruction codes are then used in the high-resolution domain to generate the final super-resolved image. A NN-based mapping approach, however, does not involve this reconstruction. The rationale behind MP methods lies in having a function that directly connects low-resolution to high-resolution patches, and this function may be based on an NN approach. Depending on the embedding reconstruction process, it is possible to obtain the same result as with an NN mapping function, but both approaches start from different assumptions.
Although NN-based methods are able to provide reasonably good performance, these estimators tend to suffer from over-fitting in SR due to the inherent complexity of the problem. In general, one of the most effective ways to avoid over-fitting in machine learning is regularisation. In a sense, regularising the estimator allows us to discard those solutions which are not general enough to generate a suitable output. (Kim and Y. 2010) present an SR framework based on learning a mapping function via Kernel Ridge Regression (KRR), which uses the ℓ2-norm in both the fitting and the regularisation terms. Despite the effectiveness of this approach, KRR is one of the simplest mapping functions and more complex schemes have been developed (He and W.-C. 2011).
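A minimal kernel ridge regressor of this kind, written in closed form with a Gaussian kernel, could look as follows (names and hyper-parameter values are illustrative, not those of the cited work):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between two sets of row vectors."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def krr_fit(X, Y, lam=1e-3, gamma=0.5):
    """Closed-form KRR: both the data-fit and the penalty are l2 terms,
    giving alpha = (K + lam*I)^(-1) Y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), Y)

def krr_predict(alpha, X_train, X_new, gamma=0.5):
    """Evaluate the learnt mapping at new points."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```

In an SR setting, X would hold LR patch features and Y the corresponding HR patches (or their high-frequency residuals).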
More recent works, such us (Yang and M.-H. 2013; He and W.-C. 2011) train several simple mapping functions instead of using a single complex one. Eventually, SR is performed by applying the function associated to its closest cluster to each patch. Other works use a different kind of non-linear mapping. For instance, (Dong et al. 2015;Liu et al. 2016) use deep convolutional neural networks as mapping functions.

Hybrid Super-Resolution
Hybrid (HY) SR algorithms combine features of both RE and LE methods in order to address some of their shortcomings. In the case of RE methods, the unavoidable blur generated by the initial interpolation, together with the lack of important high-frequency details in the low-resolution input image, limits their effectiveness to small magnification factors (Baker and T. 2002). In the case of LE methods, the constraint is the availability of a suitable training set. Even though image learning-based methods may be able to learn spatial details impossible to recover by a reconstruction approach, LE methods usually require a significant amount of relevant high-resolution images, and this requirement cannot always be met. Hybrid methods try to reach a compromise between avoiding the use of external HR training images and performing a training process to learn LR/HR relations. All hybrid SR methods somehow exploit the redundancy property pervading natural images. This property refers to the fact that natural high-resolution images tend to contain repetitive structures, both within the same scale and across different scales. In general, HY algorithms can be summarised in the following three stages: (i) patch redundancy search, (ii) HR patch extraction and (iii) SR output generation. The first step (patch redundancy search) is aimed at both generating several lower-scale images and extracting those patches which tend to appear across scales. Each particular HY approach defines its own assumptions about the imaging model used to generate the lower-scale images and about the criteria to seek repetitive patches (Glasner, S., and M. 2009; Freedman and R. 2011).
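The three stages can be illustrated with a toy cross-scale search, where simple 2x mean-pooling stands in for whatever imaging model a particular HY method assumes (all names are ours):

```python
import numpy as np

def downscale2x(img):
    """2x mean-pooling: a stand-in for the blur + decimation imaging model."""
    h, w = img.shape
    return img[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def cross_scale_match(img, patch, size=3):
    """Stage (i): find where `patch` recurs in a coarser scale of `img`;
    stage (ii): return the co-located patch from the original scale,
    which acts as a plausible HR counterpart of the match."""
    coarse = downscale2x(img)
    best, best_pos = np.inf, (0, 0)
    for i in range(coarse.shape[0] - size + 1):
        for j in range(coarse.shape[1] - size + 1):
            d = np.sum((coarse[i:i + size, j:j + size] - patch) ** 2)
            if d < best:
                best, best_pos = d, (i, j)
    i, j = best_pos
    return img[2 * i:2 * i + 2 * size, 2 * j:2 * j + 2 * size]  # HR patch at original scale
```

Stage (iii), the SR output generation, would then stitch such HR patches together subject to the chosen compatibility constraints.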

Super-resolution quality assessment
The effectiveness of super-resolution algorithms relies on the quality of the super-resolved output images. Nevertheless, how to quantify the super-resolution result remains an open issue. It was widely accepted in the past that the best way to evaluate image quality was through the user's opinion. However, this kind of evaluation may be biased, slow and expensive, being often impractical in remote sensing. Besides, many applications are not user-oriented, i.e. images are consumed not by users but by programs in order to perform a specific task, for instance soil classification from remote-sensing imagery. As a result, over the past decades an important effort has been made to develop automatic tools to perform such an evaluation. Even though there are several image quality metrics available in the literature, each one tends to focus on a particular kind of visual feature, and therefore some metrics may be more useful than others depending on the final application of the super-resolved images.
In general, two kinds of quality metrics can be considered: (i) reference-based metrics and (ii) no-reference metrics. The former group uses ground-reference images to compare the super-resolved output against the true high-resolution image that SR algorithms aim to recover. The latter group, on the other hand, is able to assess super-resolved images without the use of any reference image. In the following sections, we review some of the most popular metrics of both groups.

Image quality metrics with reference
Image quality metrics with reference compare the properties of a given image I with respect to its distortion-free version R by measuring the deviation between both images. In particular, the computation of these deviations can focus on two different aspects, namely spatial and spectral distortions. Whereas spatial measures are more concerned with the similarity between the geometric structures of both images, spectral indexes try to detect radiometric differences.
One of the most well-known spectral indexes is the Spectral Angle Mapper (SAM) (Yuhas, H., and W. 1992), which is widely used in pan-sharpening spectral assessment (Vivone et al. 2015). Specifically, SAM considers each spectral band as a coordinate axis and then computes the average angle between the pixels of the I and R images. Eq. (11) shows the expression defining this index.
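In its standard form, with I_i and R_i denoting the spectral vectors of the two images at pixel i, the SAM index (Eq. (11)) reads:

```latex
\mathrm{SAM}(I, R) = \frac{1}{N} \sum_{i=1}^{N}
  \arccos\!\left( \frac{\langle I_i, R_i \rangle}{\lVert I_i \rVert \, \lVert R_i \rVert} \right)
```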
where N is the total number of pixels in each image. Note that the ideal value of SAM is 0. Other indexes measure spectral and spatial distortion simultaneously. A classical error measure which has been widely used in image quality assessment is the Root Mean Square Error (RMSE), defined as follows, where the index j identifies each one of the K image bands. However, the main drawback of this index is that it is very sensitive to outlier values, and other measures are usually preferred. Another widely used index is the Peak Signal-to-Noise Ratio (PSNR), which is computed according to Eq. (13).
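In their usual multi-band forms, RMSE (Eq. (12)) and PSNR (Eq. (13)) can be written as follows, where L denotes the image dynamic range (e.g. 255 for 8-bit data):

```latex
\mathrm{RMSE}(I, R) = \sqrt{ \frac{1}{KN} \sum_{j=1}^{K} \sum_{i=1}^{N}
  \left( I_j(i) - R_j(i) \right)^2 }
\qquad
\mathrm{PSNR}(I, R) = 10 \log_{10} \frac{L^2}{\mathrm{RMSE}(I, R)^2}
```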
A higher PSNR value represents better image quality, and its domain is the positive real numbers. Even though PSNR seems to be the standard index in general-purpose SR, it does not take the scaling factor into account when evaluating the super-resolved image. Other measures defined within the pan-sharpening context, however, try to take advantage of this information. This is the case of the Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) measure (Veganzones et al. 2016), defined in Eq. (14).
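Following the notation of the surrounding text, the ERGAS measure (Eq. (14)) is usually written as follows, where RMSE_j is the RMSE of band j and the overline denotes the band mean:

```latex
\mathrm{ERGAS}(I, R) = \frac{100}{S}
  \sqrt{ \frac{1}{K} \sum_{j=1}^{K} \left( \frac{\mathrm{RMSE}_j}{\overline{I}_j} \right)^2 }
```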
In this expression, S is the SR scaling factor and the overline operator represents the mean value of the band I_j. The best ERGAS value is 0, and an increasing value means more distortion in I. Although RMSE-based measures are appealing because of their simplicity, they are not able to capture the characteristics of the Human Visual System (HVS), and hence other quality measures have been developed. In particular, the Correlation Coefficient (CC) is one of them. Eq. (15) shows how it is computed.
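In its standard form, the CC index (Eq. (15)) is the covariance between I and R normalised by the product of their standard deviations:

```latex
\mathrm{CC}(I, R) = \frac{\sigma_{IR}}{\sigma_I \, \sigma_R}
  = \frac{\sum_{i=1}^{N} \left( I(i) - \overline{I} \right) \left( R(i) - \overline{R} \right)}
         {\sqrt{\sum_{i=1}^{N} \left( I(i) - \overline{I} \right)^2}
          \sqrt{\sum_{i=1}^{N} \left( R(i) - \overline{R} \right)^2}}
```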
The CC index is computed from the covariance between both images divided by the product of their standard deviations. In a sense, this index measures the linear relationship between two images while normalising their range values. The best CC value is 1, implying that both images are linearly correlated. In remote sensing, it is common to use a particular case of CC which only takes edge information into account: the spatial Correlation Coefficient (sCC) computes the CC index over the edges detected (with Canny, Sobel or any other edge detector) in both the I and R images. Another popular metric is the so-called Q-index (universal image Quality index), computed as follows.
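The Q-index (Eq. (16)) is commonly expressed as the product of three terms, where the bars denote means, σ_I and σ_R the standard deviations, and σ_IR the covariance:

```latex
Q(I, R) =
  \underbrace{\frac{\sigma_{IR}}{\sigma_I \sigma_R}}_{\text{(a) correlation}}
  \cdot
  \underbrace{\frac{2 \, \overline{I} \, \overline{R}}{\overline{I}^2 + \overline{R}^2}}_{\text{(b) luminance}}
  \cdot
  \underbrace{\frac{2 \, \sigma_I \sigma_R}{\sigma_I^2 + \sigma_R^2}}_{\text{(c) contrast}}
```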
This index gathers three different properties in the image evaluation: (a) correlation, (b) luminance and (c) contrast. The first term (a) corresponds to the Correlation Coefficient expression and is therefore in charge of measuring the linear relation between the I and R images. The second term (b) and the third one (c) compare luminance and contrast, respectively. Note that in the case of multi-spectral images, the Q-index can be computed as the average value over the spectral bands. An extension of the Q-index was proposed in (Wang et al. 2004) in order to avoid instability around null values. Eq. (17) shows the expression associated with the Structural SIMilarity (SSIM) metric.
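In its standard form (Wang et al. 2004), the SSIM metric (Eq. (17)) reads:

```latex
\mathrm{SSIM}(I, R) =
  \frac{\left( 2 \, \overline{I} \, \overline{R} + C_1 \right) \left( 2 \, \sigma_{IR} + C_2 \right)}
       {\left( \overline{I}^2 + \overline{R}^2 + C_1 \right) \left( \sigma_I^2 + \sigma_R^2 + C_2 \right)}
```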
According to (Wang et al. 2004), the C_1 and C_2 constants should be set to (K_1 L)^2 and (K_2 L)^2, respectively, where K_1 and K_2 are values close to 0 and L represents the image dynamic range. Similarly to the Q-index, the SSIM index takes values in [−1, 1] and it can be computed as the average over bands in the case of multi-spectral images.
Even though the availability of ground truth images is the ideal scenario for image quality assessment, this is not possible in many real-life applications. However, alternative protocols have been developed to enable the use of reference-based metrics when ground truth images are not available.
The reduced-resolution protocol (Wald, T., and M. 1997) was initially defined within the pan-sharpening context, but it can easily be extended to SR as well. In particular, this protocol tests two properties: (a) consistency and (b) synthesis (see Fig. 4). On the one hand, the consistency property is based on the fact that the low-resolution version of the super-resolved image has to be as similar as possible to the initial input image. In other words, the LR input image is the ground truth of the super-resolved image after applying the imaging model operators. On the other hand, the synthesis property performs the evaluation at a lower scale. That is, the initial LR image is reduced following the imaging model to generate a smaller low-resolution image. This smaller image is then super-resolved back to the initial LR scale, at which the corresponding reference image is available.
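Both properties can be sketched as follows, using a simple 2x mean-pool as a stand-in imaging model (the function names are ours):

```python
import numpy as np

def degrade(img):
    """Stand-in imaging model: 2x mean-pooling (blur + decimation, no noise)."""
    h, w = img.shape
    return img[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def consistency_check(sr_img, lr_input):
    """(a) Consistency: the degraded SR output should match the LR input (RMSE)."""
    return np.sqrt(np.mean((degrade(sr_img) - lr_input) ** 2))

def synthesis_pair(lr_input):
    """(b) Synthesis: degrade the LR input once more; super-resolving the
    result back to the LR scale can then be scored against lr_input."""
    return degrade(lr_input), lr_input  # (new SR input, its reference)
```

Any reference-based metric from this section can then be applied to the synthesis pair, since the reference is available at that scale.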

No-reference image quality metrics
The availability of ground truth images provides the ideal scenario for image quality assessment; however, several no-reference metrics have also been proposed to automatically predict image quality without any kind of ground truth. Broadly speaking, the existing no-reference metrics can be summarised in two different families (Nouri et al. 2013): (1) distortion-based and (2) training-based metrics. On the one hand, the metrics belonging to the first group use a specific distortion model to quantify image peculiarities, such as blocking artifacts, ringing or blur. On the other hand, training-based metrics require an initial training process to learn high-quality image statistics and, using this information, they are able to quantify the image quality. Over the past years, several no-reference metrics belonging to both families have been presented. However, training-based metrics have been shown to provide results that are more correlated with human judgements than distortion-based ones (A.C. 2010).
One of the most popular training-based metrics is the Blind/Referenceless Image Spatial QUality Evaluator (BRISQUE) (Mittal, K., and C. 2012). This measure is based on the observation that normalised luminance values tend to follow a unit Gaussian distribution for distortion-free natural images. Specifically, BRISQUE computes the normalised luminance values by local mean subtraction and contrast normalisation, and then fits these values to a Generalised Gaussian Distribution (GGD). This learnt GGD is finally used to predict the type of distortion as well as the perceptual quality. The implementation provided by the authors uses a Support Vector Regressor (SVR) to learn the quality scores from a given training set, but any other regressor could be used instead. Another similar and popular training-based metric is the Natural Image Quality Evaluator (NIQE). In this case, the NIQE index uses a distortion-free training set to fit a multivariate Gaussian (MVG) distribution. Image quality is then computed as the distance between the input image features and the previously learnt MVG distribution. Regarding the kind of features, the NIQE metric uses features similar to those used by BRISQUE, that is, normalised luminance values fitted to Gaussian distributions. The main difference between the BRISQUE and NIQE approaches lies in the fact that the latter does not require any regressor calibration.

Datasets
As outlined in the introduction, the main objective of this work is to provide a practical overview of the performance of single-frame super-resolution algorithms on typical remote-sensing imagery. Hence, the perfect scenario to highlight practical differences among methods would be to super-resolve images of the same area captured by sensors of different resolution. Besides, the availability of ground truth images is also important to enable a full-reference evaluation quantifying the deviation between the super-resolved images and the reference high-resolution ones. Nevertheless, an important share of the remote-sensing images currently acquired belongs to, or is available only through, private companies, which hinders the possibility of finding public-domain images containing the same area and, at the same time, acquired by different sensors. Moreover, ground truth images are often unavailable in the case of satellite images because of the sensors' own limitations. In fact, the lack of HR images is the main motivation behind the use of SR methods in remote sensing.
Taking these reasons into account, we have simulated different databases from a specific real set of high-resolution aerial images. Note that the difference among sensors lies in their imaging models. Therefore, it is possible to generate several databases from the same set of ground truth images by applying different imaging models. This approach gives us three main advantages. First, ground-reference images are available, which allows us to apply both full-reference and no-reference evaluation protocols. Second, we can easily generate as many databases as we need by using different imaging models, for instance by changing the noise level or the down-sampling scale. Third, the imaging model features are controlled, which allows us to analyse SR performance under acquisition model changes.
The reference images used in this work have been selected from the open-access orthoimages of the National Aerial Orthophoto Program (PNOA) of the Spanish Ministerio de Fomento (Arozarena, G., and N. 2005). These RGB images have a resolution of 0.25 mpp (meters per pixel), finer than those of satellites such as WorldView-3 (PAN: 0.31 mpp, MS: 1.24 mpp) or GeoEye-2 (PAN: 0.34 mpp, MS: 1.36 mpp), and they are publicly available from the National Geographic Institute (IGN) website (http://pnoa.ign.es/ ). Specifically, we considered a total of 20 high-resolution 512 × 512 images from Alicante (Valencian Community, Spain) belonging to four different area types: (i) agricultural, (ii) industrial, (iii) urban and (iv) forest. Figure 5 shows the reference images considered in the experiments.
From these HR images, we generated six different LR datasets by changing the imaging model parameters used in each case. Datasets are described below.
• 01LR2xbd: This database has been generated using a Gaussian PSF (µ = 0, σ = 1) as the blurring operator (B), a 2x size reduction in the decimation operator (D) and a zero noise level (N = 0). Note that the resulting LR images are 256 × 256 and their resolution is 0.5 mpp.
• 02LR4xbd: This database has been obtained using a similar configuration to the previous one but with a 4x decimation. In particular, we have applied the imaging model described in '01LR2xbd' twice in succession to obtain LR images with a size of 128 × 128 and a resolution of 1 mpp.
• 03LR2xbdn01: The generation of this database follows the same process described in '01LR2xbd', but in this case we have added white Gaussian noise with σ = 0.01. The final images have a size of 256 × 256 and a resolution of 0.5 mpp.
• 04LR4xbdn01: For this database, we have followed the process described in '02LR4xbd', adding σ = 0.01 white Gaussian noise to the final images. The images of this database have a size of 128 × 128 and a resolution of 1 mpp.
• 05LR2xbdn05: This database has been generated following the procedure described in '03LR2xbdn01' but using σ = 0.05 additive white Gaussian noise. The images belonging to this database are 256 × 256 and 0.5 mpp.
• 06LR4xbdn05: The imaging model used for this database is the same as that of '04LR4xbdn01' but with σ = 0.05 additive white Gaussian noise, resulting in 128 × 128 images with 1 mpp.
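A sketch of how such LR databases can be simulated from an HR image, using the scipy Gaussian filter as the PSF (parameter names are ours, mirroring the '01LR2xbd'-style settings; the 4x databases apply the 2x model twice):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def imaging_model(hr, sigma_blur=1.0, scale=2, noise_sigma=0.0, seed=0):
    """Simulate an LR acquisition from an HR image: Gaussian PSF (B),
    `scale`-fold decimation (D) and additive white Gaussian noise (N)."""
    blurred = gaussian_filter(hr, sigma=sigma_blur)       # B
    lr = blurred[::scale, ::scale]                        # D
    if noise_sigma > 0:
        rng = np.random.default_rng(seed)
        lr = lr + rng.normal(0.0, noise_sigma, lr.shape)  # N
    return lr
```

For instance, `imaging_model(imaging_model(hr))` reproduces the consecutive application used for the 4x databases.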
It should be noted that the parameters used to generate these databases have been chosen following the most common scenarios in remote-sensing imagery.

Super-Resolution Toolbox
This section provides a brief description of the SR methods considered in the experimental part of the work. Specifically, a total of 27 single-frame super-resolution methods (see Table 1) have been tested over the six aforementioned databases. In addition, the Lanczos-3 interpolation algorithm has been used as an up-scaling baseline. All these methods have been selected for this comparative work mainly because they are among the most popular single-frame SR methods in the literature and, besides, their implementations are publicly available either on the Internet or from the authors. Subsequently, we highlight the main features of the methods in terms of the information provided in Section 3.
• 00IntL3 (Lanczos-3 Interpolation) (Turkowski 1990): In general, Lanczos interpolation uses a normalised version of a windowed cardinal sine function. In the particular case of Lanczos-3, this sinc function is made of 5 lobes, and it has been shown to offer one of the best compromises in terms of reduction of aliasing, sharpness and minimal ringing (Turkowski 1990). For these reasons, we have selected Lanczos-3 interpolation as the baseline function for the experiments. Specifically, we use the implementation provided in the Matlab R2013a Image Processing Toolbox.
• 01IBP (Iterative Back Projection) (Irani and S. 1991): This algorithm follows the approach presented in Section 3.1.1. We used the implementation provided in (Yang, C., and M.-H. 2014), which uses a 5 × 5 Gaussian kernel with σ = 0.6 to perform the initial image interpolation and a total number of 100 back-projection iterations.
• 02FIU (PSF Deconvolution): This method integrates the scheme introduced in Section 3.1.2 with a feedback-control framework to constrain the image up-sampling process. Initially, the LR input image is interpolated to the desired resolution by bi-cubic interpolation. Then, this up-sampled image is deconvolved using the non-blind deconvolution method proposed in (Shan, J., and Agarwala 2008). The implementation we use for the experiments has been provided by . In particular, this code uses a total number of 10 iterations, and the Gaussian PSF has the following parameters: σ = 1.05 for a scaling factor of 2x and σ = 1.5 for a scaling factor of 4x.
• 03GPP (Gradient Profile) (Jian, Xu, and Shum 2008): This algorithm exploits the idea shown in Section 3.1.3. The implementation provided by the authors has the following characteristics. First, it uses bi-cubic interpolation to perform the initial image up-sampling process. Second, each gradient profile distribution is fitted to a Generalised Gaussian Distribution (GGD), which adds an extra shape-control parameter to the normal distribution. Third, the transformation of the gradient is performed by multiplying the gradient profile of the LR input image (∇I_LR) by a ratio r which depends on the scaling factor (see (Jian, Xu, and Shum 2008) for more details).
• 04SRSI (Hybrid) (Glasner, S., and M. 2009): The idea behind this method is based on the hybrid approach presented in Section 3.3. Specifically, the code provided by the authors assumes a Gaussian PSF and builds a 6-level resolution pyramid by blurring and down-sampling the initial LR image using a scaling factor of 1.25. Then, each patch of the input image is searched for within the scales, from higher to lower resolution, using the Nearest-Neighbour algorithm.
• 05SRP (Mapping) (Kim and Y. 2010): This approach is a mapping-based method with the following features. In the training stage, a Kernel Ridge Regressor (KRR) is trained using a specific set of LR/HR patch pairs. The implementation used for the experiments has been extracted from the source code of (Yang, C., and M.-H. 2014).
• 06VSR (Sparse Coding) (Yang et al. 2010): This method corresponds to the procedure explained in Section 3.2.1. The code provided by (Yang, C., and M.-H. 2014) uses 1st- and 2nd-order derivatives as a feature extraction operator (F) to represent 5 × 5 patches. Besides, the patch overlap is set to 4 pixels and the sparsity factor β to 0.2. Finally, the B operator in the post-processing step is a 5 × 5 Gaussian blurring mask with σ = 1.
• 07LSE (Hybrid) (Freedman and R. 2011): This work is based on an approach similar to the one detailed in 04SRSI. That is, it exploits the self-similarity property, but with a slight refinement to optimise the searching process. The implementation provided by the authors uses a custom non-dyadic filter bank as a blurring operator, and the desired magnification factor is approximated from above by small scaling steps compatible with that filter: 5/4, 4/3 and 3/2. Finally, a bi-cubic decimation is performed to achieve the target resolution.
• 08SDS (Sparse Coding) (Dong et al. 2011): This work presents an extension of the procedure explained in 06VSR. In particular, it introduces two improvements: (i) an initial training data clustering process and (ii) an adaptive regularisation in the generation step. In the code provided by the authors, the images are initially processed using a Gaussian high-pass filter, patches are 7 × 7 with 5 pixels of overlap, and the number of clusters as well as the number of Auto-Regressive functions per cluster is set to 200.
• 09DSF (Mapping) (Yang and M.-H. 2013): This mapping method is based on learning several simple functions, instead of a single complex one, to relate high-resolution and low-resolution structures. Eventually, the super-resolved image is obtained by applying to each input LR patch the mapping function associated with its closest cluster. The implementation provided by the authors considers a patch size of 7 × 7 and removes each patch mean in order to characterise them.
• 10ANR (Neighbourhood Embedding) (Timofte, V., and L. 2013): This technique is an extension of the neighbourhood embedding idea presented in Section 3.2.2. In fact, it aims at fusing the Neighbourhood Embedding (NE) and Sparse Coding (SC) approaches. The code provided by the authors uses a patch characterisation based on 1st- and 2nd-order gradients and, besides, applies Principal Component Analysis (PCA) to reduce the data dimensionality. The patch size is 3, with 1 pixel of overlap.
• 11GPR (Mapping) (He and W.-C. 2011): Even though we have classified this method as a mapping one because it uses a Gaussian Process Regression (GPR) function to perform the SR task, it could also be considered a reconstruction or even a hybrid method, since it extends the mapping approach in order to avoid the use of external training examples. The implementation used in the experiments has been provided by the authors of the method and has the following features: a patch size of 20 with a 2/3 overlap factor and a Gaussian PSF with σ = 0.6.
• 12BDB (Hybrid) (Michaeli and M. 2014): This is a de-blurring method that we have adapted to SR by carrying out an initial Lanczos-3 interpolation to the target resolution. We think it is important to test this technique because it follows a hybrid approach similar to the one proposed by (Michaeli and M. 2013), whose implementation is not available. In particular, this de-blurring method takes advantage of the cross-scale patch recurrence property presented in Section 3.3: patches in sharp natural images tend to co-occur across scales, whereas patches in blurred images do not. The code of this de-blurring method has been provided by the authors. Specifically, it uses a 5 × 5 patch and a 13 × 13 kernel. The image scaling pyramid is constructed using a 3/4 down-scaling factor and the sinc function until the kernel is smaller than the patch size.
• 13SIP (Sparse Coding) (Yang et al. 2008): This algorithm follows the sparse coding approach presented in Section 3.2.1 but simplifies the dictionary training step. In particular, this technique corresponds to an initial version of 06VSR. The main difference between 06VSR and 13SIP lies in the fact that each method learns the dictionaries in a different way. The implementation provided by the authors considers a patch size of 3 with 1 pixel of overlap. The kind of extracted features, the sparsity factor and the blurring mask in the post-processing step are the same as those considered in 06VSR.
• 14SIS (Sparse Coding) (Zeyde, M., and M. 2012): This sparse coding method is an extension of the method explained in 06VSR. Specifically, the most important improvements are the use of PCA in the image characterisation process and the use of an optimised dictionary training step. The code provided by the authors has the same configuration as that of 13SIP.
• 15GLR (Neighbourhood Embedding) (Timofte, V., and L. 2013): This approach is essentially the same as the one described in 10ANR but uses different neighbourhood constraints. In particular, this variant uses a global neighbourhood to represent each element of the dictionary as a combination of neighbours. The regularisation type and the method parameters are the same as those described for 10ANR.
• 16NELS (Neighbourhood Embedding) (Timofte, V., and L. 2013): This method is another variant of 10ANR. In this case, the regression, where each dictionary atom is represented as a linear combination of neighbours, is tackled as a Least-Squares (LS) problem without any kind of regularisation operator. The kind of neighbourhood and the method parameters are those of 10ANR.
• 17NENNLS (Neighbourhood Embedding) (Timofte, V., and L. 2013): This NE method is a variant of 16NELS where the LS problem is constrained to positive values. The rest of the parameters are the same as those explained in 16NELS.
• 18NELLE (Neighbourhood Embedding) (H., T., and Y. 2004): This algorithm is based on the most classical NE approach introduced in Section 3.2.2. This method does not train any dictionary to perform the embedding; that is, the dictionary consists of the training patches themselves. The rest of the parameters and settings of this method are the same as those of 10ANR.
• 19JOR (Mapping) (Dai, R., and L. 2015): The idea behind this mapping approach is similar to the one described in 09DSF. Specifically, this method is made up of two stages: (i) jointly-optimised regressor learning and (ii) adaptive HR generation. The implementation provided by the authors uses 1st- and 2nd-order derivatives as features, 3 × 3 patches with overlap and 20 Expectation-Maximisation iterations to cluster the training data.
• 20MDL (Sparse Coding) (Purkait and B. 2013): This method presents an extension of the classical SC approach introduced in Section 3.2.1. The rationale behind this algorithm is similar to that of 19JOR, but from a sparse coding point of view. The code used for the experiments was provided by the authors. It uses a 7 × 7 patch encoded using a Bag-of-Words (BoW) approach over 1st- and 2nd-order derivatives. The number of dictionaries is set to 20 by default, and the scaling factor is a constant value of 1.25 which has to be applied several times to reach higher up-scaling ratios. Finally, the super-resolved output is post-processed considering a 3 × 3 Gaussian PSF with σ = 0.2.
• 21CNN (Mapping) (Dong et al. 2014): This SR technique is inspired by the mapping approach explained in Section 3.2.3. In this case, a Convolutional Neural Network (CNN) is used to relate low-resolution structures to high-resolution ones.
In the training stage, the method up-scales the LR input images by bi-cubic interpolation, and then a 3-layer CNN is trained to learn an end-to-end mapping between these interpolated versions and their ground-reference counterparts.
• 22BSR (Sparse Coding) (Polatkan et al. 2015): This approach defines a probabilistic graphical model to perform the super-resolution task. Specifically, this Bayesian SR method can be summarised in four main steps: (1) coupled dictionary training, where high-resolution and low-resolution dictionaries are learned, (2) sparse low-resolution factor estimation, (3) super-resolved image generation, and (4) a final post-processing step following Eq. (8). The code provided by the authors uses an 8 × 8 patch size and a Gaussian PSF for the post-processing step.
• 23DLU (PSF Deconvolution) (Lucy 1974): In this method, we use the Richardson-Lucy (RL) deconvolution algorithm presented in (Lucy 1974) to tackle the SR problem from a de-blurring point of view (Section 3.1.2). For the experiments, we use Lanczos-3 interpolation (00IntL3) to up-scale the initial LR image I_LR to I_LRI and a Gaussian PSF with σ = 1 to run the RL deconvolution function available in Matlab R2013a.
• 24DRE (PSF Deconvolution) (Gonzalez and Woods 2006): This method follows an approach similar to 23DLU but uses a different deconvolution algorithm. In this case, we use the regularised deconvolution function provided in Matlab R2013a. This deconvolution technique is a linear method which uses a least-squares solution with Laplacian regularisation to preserve image smoothness. This kind of regularisation aims at eliminating noise artifacts. The configuration used for the experimentation is the same as the one described in 23DLU.
• 25FSR (PSF Deconvolution) (Zhao et al. 2015): This SR method is an extension of the classical deconvolution approach (Section 3.1.2) where the decimation (D) and blurring (B) operators in Eq. (1) (the imaging model) are unified. In (Zhao et al. 2015), three different regularisation strategies are tested with the proposed technique: (a) the ℓ2-norm (Tikhonov regularisation), (b) Total Variation (TV) and (c) the ℓ1-norm in the wavelet domain. We tested the code provided by the authors using the Total Variation regularisation (b) because it provides one of the best results in (Zhao et al. 2015).
• 26TSE (Hybrid) (Huang, A., and N. 2015): The idea of this method is essentially the same as the one presented in Section 3.3 but extends the self-similarity assumption. The implementation provided by the authors uses a 5 × 5 patch size and 20 post-processing iterations with a Gaussian PSF (σ = 1.2). The numbers of levels needed to achieve scaling ratios of 2x and 4x are 3 and 6, respectively.
• 27SCN (Mapping) (Liu et al. 2016): This technique is an extension of 21CNN that combines the strengths of deep neural networks and sparse coding. In particular, this approach defines a Convolutional Neural Network (CNN) layer structure similar to that described in 21CNN for patch extraction and reconstruction. The key difference lies in the kind of unit used to learn the non-linear mapping between interpolated LR patches and their HR counterparts. The code available on the authors' website normalises patches by their mean and variance and uses an HR patch size of 5 × 5.
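As an illustration of the back-projection idea behind 01IBP, the following toy sketch uses mean-pooling and nearest-neighbour up-sampling in place of the Gaussian kernel of the reference implementation (all operators and names are ours):

```python
import numpy as np

def mean_pool2(img):
    """Toy LR simulation: 2x mean-pooling."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(img):
    """Toy 2x up-scaling: nearest-neighbour pixel replication."""
    return np.kron(img, np.ones((2, 2)))

def ibp(lr, n_iter=100, step=1.0):
    """Iterative back-projection: refine an initial up-scaled estimate by
    pushing the simulated-LR residual back to the HR grid."""
    sr = upsample2(lr)                   # initial interpolation
    for _ in range(n_iter):
        err = lr - mean_pool2(sr)        # residual in the LR domain
        sr = sr + step * upsample2(err)  # back-project the residual
    return sr
```

Because the toy pooling/up-sampling pair is exactly inverse on averages, this sketch converges immediately; with a realistic Gaussian PSF the residual shrinks gradually over the iterations.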

Experimental Protocol
This section provides the details about how the images described in Section 5.1 have been super-resolved using the SR methods presented in Section 5.2 and assessed using the metrics in Section 4. For the experiments, a total of 6 simulated databases have been generated by applying 6 different imaging models (described in Section 5.1) to the ground-reference images shown in Fig. 5. Note that each database contains 20 high-resolution ground-reference images (512 × 512) and their low-resolution counterparts (256 × 256 or 128 × 128, depending on the database). Each database has been split into 5 folds, columns (a) to (e) in Fig. 5, in order to follow a cross-validation protocol for the results. That is, the low-resolution images of a specific fold have been super-resolved using the rest of the folds as training data, and the image quality metrics have been averaged over folds. Logically, only those SR methods requiring a training stage (image learning methods, Section 3.2) have been trained.
Regarding the scaling factor, the target resolution of each experiment has been set to reach the ground-reference resolution, that is, 512 × 512. Note that a scaling factor of 2x has been used in databases with 256 × 256 low-resolution images and a factor of 4x in databases with 128 × 128 images. Finally, the quality of the super-resolved images I SR has been evaluated using the 9 metrics presented in Section 4 with a full-reference assessment protocol, i.e. using the ground-reference images I HR for those metrics requiring a reference. In addition, the computational time of each super-resolution method has been reported to evaluate the efficiency of the methods. Training time has not been taken into account because this process can be carried out off-line.
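The generation of a simulated database follows the imaging model of Eq. (1): blur (B), decimate (D) and, optionally, add noise. The Gaussian blur and plain row/column decimation below are generic stand-ins for the six specific imaging models used in the paper; the parameter names are our own.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simulate_lr(hr_image, scale=2, blur_sigma=1.0, noise_sigma=0.0, seed=None):
    """Apply an Eq. (1)-style imaging model: blur (B), decimate (D), add noise."""
    blurred = gaussian_filter(hr_image.astype(np.float64), blur_sigma)
    decimated = blurred[::scale, ::scale]  # simple decimation operator D
    if noise_sigma > 0:
        rng = np.random.default_rng(seed)
        decimated = decimated + rng.normal(0.0, noise_sigma, decimated.shape)
    return decimated
```

With a 512 × 512 ground-reference image, `scale=2` yields the 256 × 256 inputs for the 2x databases and `scale=4` the 128 × 128 inputs for the 4x ones.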

Results
For each one of the datasets described in Section 5.1 ({01LR2xbd, . . .}), the mean value of each figure of merit was obtained using a 5-fold cross-validation strategy (folds (a) to (e) of Figure 5). Then, the final figure of merit was the mean over the 4 different types of images (types (i) to (iv) of Figure 5). Two different comparisons were made afterwards: (1) Effect of a change in the resolution factor on the reconstruction quality. We compared the super-resolved image quality results obtained with a 2x resolution factor against those obtained with a 4x resolution factor. In this comparison, no noise was added to the images.
(2) Effect of noise on reconstruction quality for a specific super-resolution factor.
Comparisons were made using three of the figures of merit proposed, i.e. SAM, RMSE and PSNR, as given by Eqs. 11, 12 and 13. As explained in Section 4.1, SAM is a measure of the spectral change between two images. On the other hand, RMSE and PSNR are two widely accepted measures for assessing image quality when a reference image is available.
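For reference, the three figures of merit can be computed as follows under their standard definitions (which we assume match Eqs. 11-13); the `peak` value and the (H, W, bands) layout for SAM are our own assumptions.

```python
import numpy as np

def rmse(ref, test):
    """Root mean squared error between reference and test images."""
    diff = ref.astype(np.float64) - test.astype(np.float64)
    return np.sqrt(np.mean(diff ** 2))

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given dynamic range."""
    e = rmse(ref, test)
    return np.inf if e == 0 else 20.0 * np.log10(peak / e)

def sam(ref, test, eps=1e-12):
    """Mean spectral angle (radians) between per-pixel spectral vectors.
    ref and test are (H, W, bands) arrays."""
    r = ref.reshape(-1, ref.shape[-1]).astype(np.float64)
    t = test.reshape(-1, test.shape[-1]).astype(np.float64)
    cos = np.sum(r * t, axis=1) / (np.linalg.norm(r, axis=1) *
                                   np.linalg.norm(t, axis=1) + eps)
    return np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))
```

Note that SAM compares spectral directions only, so it is insensitive to a uniform intensity scaling, whereas RMSE and PSNR are not.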
Tables 2 to 7 show the results of the best three SR methods for each type of image ((i) to (iv)), for each one of the datasets described ({01LR2xbd, . . .}) and for all the figures of merit analysed in this paper. The baseline method 00IntL3 (Turkowski 1990) is also included. The 08SDS (Dong et al. 2011) method is always amongst the best three methods in all the figures of merit that use a reference image for all the datasets, except in 05LR2xbdn05, where it appears amongst the best three only for image types (iii) and (iv). Tables 2 to 7 also show that the other two best SR methods are 25FSR (Zhao et al. 2015) and 07LSE (Freedman and R. 2011). Method 08SDS (Dong et al. 2011) is a sparse coding method, whereas 25FSR (Zhao et al. 2015) is of the PSF deconvolution type and 07LSE (Freedman and R. 2011) can be considered a hybrid method. As stated in Section 5.2, sparse coding methods currently attract considerable attention due to their promising results. On the other hand, PSF deconvolution methods also work relatively well but tend to depend on a good-quality PSF kernel characterisation.
Other interesting results in Tables 2 to 7 should be highlighted: (1) none of the best three SR methods, according to the figures of merit with reference image, appears among the best three in terms of computational efficiency; (2) none of them appears among the best three when using figures of merit that do not need a reference image (BRISQUE and NIQE); (3) method 12BDB (Michaeli and M. 2014) is one of the best three SR methods when considering the BRISQUE and NIQE figures of merit, yet it is not among the 10 best SR methods in terms of the figures of merit with reference. This fact is somewhat surprising and confirms the need to continue working on the development of new and effective quality measures for super-resolution.
In order to make these comparisons in a more meaningful and illustrative way, the data for each comparison case was normalised so that the interpolation method (00IntL3), for the lowest resolution and lowest noise value in each comparison, acted as the baseline measure and was assigned a value of zero. All methods better than 00IntL3 get values above zero for the corresponding figure of merit; the best one is set to +1 and the worst one to −1, and the values of the remaining methods are linearly rescaled to [−1, +1].
Figure 6(a)-(c) shows the bar plots for the SAM, RMSE and PSNR figures of merit for the comparison between a 2x and a 4x resolution factor. Notice that, as might be expected, an increase in the resolution factor lowers the reconstruction quality of all the methods. However, method 08SDS (Sparse Coding) (Dong et al. 2011) gives the best result for the three figures of merit (SAM, RMSE and PSNR). Figure 6(b)-(c) also shows that the next three best methods for the RMSE and PSNR figures of merit are 25FSR (Zhao et al. 2015), 07LSE (Freedman and R. 2011) and 03GPP (Gradient Profile) (Jian, Xu, and Shum 2008).
Figure 7(a)-(c) shows the effect of Gaussian noise addition on 2x super-resolution with different noise levels. In particular, the effect of the addition of σ = 0.01 and σ = 0.05 white Gaussian noise, respectively, is analysed. As we can see, the addition of noise lowers the reconstruction results for all methods and all three figures of merit (SAM, RMSE and PSNR). However, as also happens in the results shown in Figure 6, the 08SDS (Sparse Coding) (Dong et al. 2011) method provides the best results. Figure 7(b)-(c) also shows that the next three best methods for the RMSE and PSNR figures of merit are the same as those of Figure 6, i.e. 25FSR (Zhao et al. 2015), 07LSE (Freedman and R. 2011) and 03GPP (Gradient Profile) (Jian, Xu, and Shum 2008).
Figure 8(a)-(c) shows the effect of Gaussian noise addition on 4x super-resolution. In particular, the effect of the addition of σ = 0.01 and σ = 0.05 white Gaussian noise on the 4x super-resolution process is analysed. As we can see, the addition of noise decreases the reconstruction results for all methods and the three figures of merit (SAM, RMSE and PSNR). However, as in the results shown in Figure 6, the 08SDS (Sparse Coding) (Dong et al. 2011) method provides the best results. Figure 8(b) also shows that the next three best methods for the RMSE figure of merit are 25FSR (Zhao et al. 2015), 07LSE (Freedman and R. 2011) and 04SRSI (Hybrid) (Glasner, S., and M. 2009), while for the PSNR (Figure 8(c)) they are 25FSR (Zhao et al. 2015), 07LSE (Freedman and R. 2011) and 02FIU (PSF Deconvolution).
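The baseline-anchored normalisation used for these bar plots can be sketched as follows. The piecewise-linear mapping (baseline → 0, best → +1, worst → −1) is our reading of the text, and the `higher_is_better` flag is our own addition to handle error metrics such as RMSE and SAM, where lower values are better.

```python
import numpy as np

def normalise_scores(scores, baseline, higher_is_better=True):
    """Rescale figure-of-merit values so the baseline maps to 0,
    the best method to +1 and the worst to -1 (piecewise linearly)."""
    s = np.asarray(scores, dtype=np.float64)
    if not higher_is_better:        # e.g. RMSE or SAM: lower is better
        s, baseline = -s, -baseline
    out = np.zeros_like(s)
    best, worst = s.max(), s.min()
    above = s > baseline
    if best > baseline:             # methods better than the baseline -> (0, 1]
        out[above] = (s[above] - baseline) / (best - baseline)
    below = s < baseline
    if worst < baseline:            # methods worse than the baseline -> [-1, 0)
        out[below] = (s[below] - baseline) / (baseline - worst)
    return out
```

For example, PSNR scores `[10, 20, 30, 5]` with baseline 10 map to `[0, 0.5, 1, -1]`.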
Figures 7 and 8 also show that the methods 15GLR (Timofte, V., and L. 2013) to 19JOR (Dai, R., and L. 2015) work relatively worse than the rest of the methods (above all for the RMSE and PSNR figures of merit).
On the other hand, methods 11GPR (He and W.-C. 2011), 12BDB (Michaeli and M. 2014), 13SIP (Yang et al. 2008), 14SIS (Zeyde, M., and M. 2012) and 19JOR (Dai, R., and L. 2015) are all different from one another. However, their results are similar, and therefore model parameters might be playing a role in the results. The codes for methods 10ANR (Timofte, V., and L. 2013), 15GLR (Timofte, V., and L. 2013), 16NELS (Timofte, V., and L. 2013), 17NENNLS (Timofte, V., and L. 2013), 18NELLE (H., T., and Y. 2004) and 19JOR (Dai, R., and L. 2015) were part of the same software toolbox, and therefore the default parameter values are shared among them. On the one hand, the toolbox uses, for all of them, a 3 × 3 patch size with first- and second-order derivatives and Principal Component Analysis (PCA) on each input vector. On the other hand, method 13SIP (Yang et al. 2008) uses a 3 × 3 patch size, and method 14SIS (Zeyde, M., and M. 2012) also applies a dimensionality reduction through PCA.
It seems clear that the ideal patch size depends on the selected methods and images; each method uses a different patch size. For instance, method 08SDS (Dong et al. 2011) uses a 7 × 7 patch, 06VSR (Yang et al. 2010) and 25FSR (Zhao et al. 2015) use a 5 × 5 patch, 11GPR (He and W.-C. 2011) uses a 20 × 20 patch and method 22BSR (Polatkan et al. 2015) an 8 × 8 patch. We should bear in mind that, since we assume there is no access to the HR image, there is no way to know the best parameter values for each one of the different methods. That is why the complete experimental part was carried out using the default parameters provided by the authors of each method, which were assumed to work satisfactorily in the present case. The same patch size was used for those methods that include a dictionary training step, in order to make the comparison fairer.
Method 12BDB (Michaeli and M. 2014) may perform poorly because the kernel the method recovers is the one used in the interpolation stage, and in this case the de-blurring stage is carried out through an L2 fitting that minimises over the number of patches.
Method 09DSF (Mapping) (Yang and M.-H. 2013) is not shown in the plots of Figures 6 to 8 because its performance results were so poor that they made it difficult to visualise the performance of the rest of the methods. Figures 9 and 10 show some visual results on databases 01LR2xbd and 02LR4xbd in order to provide a qualitative super-resolution evaluation. Orthoimages are shown in rows, and columns are arranged as follows: (1st) HR reference image, (2nd) baseline interpolation, (3rd) best PSNR reconstruction result, (4th) best PSNR learning result and (5th) best PSNR hybrid result.
When considering a 2x up-scaling factor (Fig. 9), method 08SDS (Dong et al. 2011) is capable of recovering high-resolution details that are clearly missing in the rest of the SR results. Structured details, like the pedestrian crossings of images Fig. 9(i) and Fig. 9(n), are better defined and visually closer to their HR counterparts. Other details without regular structures, like the shrubs appearing in images Fig. 9(d) and Fig. 9(s), are better defined as well, although sometimes a slight noise increase can be perceived. This observation is also confirmed by the quantitative evaluation, where 08SDS (Dong et al. 2011) PSNR values tend to be higher for the (2nd) and (3rd) images than for the (1st) and (4th).
In the case of Fig. 10 (4x scaling factor), we can see how the visual differences among SR results become relatively small. Even though 08SDS (Dong et al. 2011) often seems to obtain the best result, other methods like 07LSE (Freedman and R. 2011) and 26TSE (Huang, A., and N. 2015) tend to generate slightly sharper edges and less noisy surfaces, as in the case of Fig. 10(e).

[Figure panels: (a) effect of noise on 2x super-resolution, SAM metric; (b) effect of noise on 2x super-resolution, RMSE metric; (b) effect of noise on 4x super-resolution, RMSE metric; (c) effect of noise on 4x super-resolution, PSNR metric.]

Figure 9.: SR results for fold (a) of the 01LR2xbd database. Orthoimages in rows and SR methods in columns: HR reference image in the first column, Lanczos-3 interpolation in the second, best PSNR reconstruction result in the third, best PSNR learning result in the fourth and best PSNR hybrid result in the fifth. For each method, PSNR values are given in brackets and, for each row, the best PSNR result is highlighted in bold. There are four groups of rows ((i) to (iv)) corresponding to the four different kinds of orthoimages; each group contains four rows with the following results: baseline Lanczos-3 interpolation (first row), best reconstruction result (second row), best learning result (third row) and best hybrid result (fourth row). The best results are highlighted in bold and the methods' identification codes are given in brackets.

Figure 10.: SR results for the 02LR4xbd database (same layout as Figure 9).

Conclusions
A practical overview and comparative analysis has been presented on the performance of single-frame super-resolution methods applied to remote-sensing images. From the set of 27 methods analysed over six simulated image collections, the best four methods are 08SDS (Dong et al. 2011), 25FSR (Zhao et al. 2015), 07LSE (Freedman and R. 2011) and 03GPP (Jian, Xu, and Shum 2008). These methods are of the sparse coding, PSF deconvolution, hybrid and gradient profile types, respectively. Note that the experiments have been conducted using the default settings of the SR codes in order to make the comparison as fair as possible given the high variety of techniques; however, tuning the parameters of some methods might improve their results. Although 08SDS (Dong et al. 2011) shows the best general performance, there are several important considerations to point out. In particular, three practical conclusions arise from this work: (i) learning-based methods tend to outperform the rest of the SR techniques in remote sensing when considering moderate up-scaling factors and reduced levels of noise, (ii) reconstruction-based and hybrid methods improve their relative results with higher scaling factors and levels of noise, although PSF deconvolution methods depend on the quality of the PSF kernel characterisation, and (iii) Lanczos-3 interpolation gives better results than expected, especially with noisy images.
Learning-based methods use external training sets to learn high-resolution structures, so it seems clear that these methods are the best option when relevant training examples are available. However, when not enough samples are available, reconstruction-based and hybrid methods are suggested instead. Among all learning techniques, sparse coding methods have proved to be the most effective, mainly because there is no neighbourhood or functional constraint in the learning process, as there is in the case of neighbourhood embeddings or mappings.
In general, the increase in the up-scaling resolution factor (from 2x to 4x) and in the amount of noise (from noiseless to σ = 0.05 additive white Gaussian noise) significantly decreases the performance of all the tested SR methods. Although learning-based methods have shown good performance under every imaging model, it is important to remark that reconstruction-based and hybrid methods tend to be more robust to high up-scaling factors and levels of noise because their performance drops are less pronounced. As the differences between the LR and HR domains become larger, it is more difficult to learn a model that effectively fills the gap between LR and HR. Considering both reconstruction-based and hybrid methods, the former type seems to be more robust to noise and the latter more robust to high scaling factors. Finally, it is worth noting that the baseline method (00IntL3, Lanczos-3 interpolation) provides better results than expected compared to most of the SR methods analysed in this work.