Deep Metric Learning Based on Scalable Neighborhood Components for Remote Sensing Scene Characterization

With the development of convolutional neural networks (CNNs), the semantic understanding of remote sensing (RS) scenes has been significantly improved thanks to their prominent feature encoding capabilities. While many existing deep-learning models focus on designing different architectures, only a few works in the RS field have investigated the performance of the learned feature embeddings and the associated metric space. In particular, two main loss functions have been exploited: the contrastive loss and the triplet loss. However, the straightforward application of these techniques to RS images may not be optimal for capturing their neighborhood structures in the metric space, due to the insufficient sampling of image pairs or triplets during the training stage and to the inherent semantic complexity of remotely sensed data. To solve these problems, we propose a new deep metric learning approach, which overcomes the limitations in class discrimination by means of two different components: 1) scalable neighborhood component analysis (SNCA), which aims at discovering the neighborhood structure in the metric space, and 2) the cross-entropy loss, which aims at preserving the class discrimination capability based on the learned class prototypes. Moreover, in order to preserve feature consistency among all the minibatches during training, a novel optimization mechanism based on momentum update is introduced for minimizing the proposed loss. An extensive experimental comparison (using several state-of-the-art models and two different benchmark data sets) has been conducted to validate the effectiveness of the proposed method from different perspectives, including: 1) classification; 2) clustering; and 3) image retrieval. The code associated with this article will be made publicly available for reproducible research by the community.


I. INTRODUCTION
With the ongoing development of different Earth observation missions and programs, the semantic understanding of remote sensing (RS) image scenes plays a fundamental role in many important applications and societal needs [1], including preservation of natural resources [2], urban and regional planning [3], contingency management [4], land-cover analysis [5], and global Earth monitoring [6], among others. From a practical perspective, the RS scene recognition problem consists of predicting the semantic concept associated with a given aerial scene based on its own visual content. In this way, scene-based recognition methods are expected to deal with high intraclass and low interclass variabilities since airborne and spaceborne optical data often comprise a wide variety of spatial structures that lead to a particularly challenging characterization for RS scenes [7].
In the literature, extensive research has been conducted, and a wide variety of scene recognition methods have been presented within the RS field [8], [9]. From handcrafted feature-based approaches [10], [11] to more elaborate unsupervised techniques [12], [13], the inherent complexity of the RS image domain often limits the performance of these traditional schemes when dealing with high-level semantic concepts [14]. More recently, deep-learning methods have shown great potential to uncover highly discriminating features in aerial scenes [15], with the so-called deep metric learning approach being one of the most prominent trends [16]-[18]. Specifically, deep metric learning aims at projecting semantically similar input data to nearby locations in the final
feature space, which is highly appropriate to manage complex RS data [19]. Nonetheless, there are multiple factors, e.g., large-scale archives, sensor types, or image acquisition conditions, that still make the semantic understanding of aerial scenes very challenging, thus motivating the development of new models to effectively learn discriminative CNN-based characterizations for unconstrained land-cover scenes [9].
In order to address all these challenges, this article proposes a new RS scene characterization approach, which provides a new perspective on the traditional deep embedding scheme typically used in land-cover recognition tasks [16], [17]. The main objective of the proposed method consists of learning a low-dimensional metric space that can properly capture the semantic similarities among all the RS scenes based on the CNN-based feature embedding of the whole data collection. Moreover, the feature embedding learned in such metric space has to generalize effectively to out-of-sample RS scenes. To achieve this goal, we first investigate the scalable neighborhood component analysis (SNCA) [20] and further analyze the limitations of this recent method on the discrimination of RS scenes. Then, we develop an innovative deep metric learning approach that has been specifically designed to manage the particular semantic complexity of the RS image domain. Specifically, two main components are involved in this new design: 1) SNCA, which aims at discovering the neighborhood structure in the metric space, and 2) the cross-entropy (CE) loss, which aims at preserving the class discrimination capability based on the learned class prototypes. In addition, a novel optimization mechanism (based on the momentum update for SNCA) is proposed to generate consistent features within each training epoch. In order to demonstrate the effectiveness of our contribution when characterizing RS scenes, we conduct a comprehensive experimental comparison, which reveals that our newly proposed RS scene characterization method provides competitive advantages with respect to different state-of-the-art models in three different RS applications (scene classification, clustering, and retrieval) over two benchmark data sets. The main contributions of this article can be summarized as follows.
1) To the best of our knowledge, this article investigates, for the first time in the literature, the suitability of the SNCA method for characterizing remotely sensed image scenes, while also analyzing its main limitations in RS.
2) We propose a new deep metric learning model specifically designed to characterize RS scenes. Our new approach is able to learn a metric space, based on CNN models, that preserves the discrimination capability for the highly variant RS semantic concepts.
3) In order to improve the consistency of the feature embeddings generated on the whole data set during training, we propose a novel optimization mechanism based on momentum update for minimizing the SNCA-based losses.
4) Based on three different RS applications, we demonstrate the superiority of our newly proposed method with respect to several state-of-the-art characterization methods over different data sets. The related codes will be released for reproducible research inside the RS community.

The rest of this article is organized as follows. Section II reviews related works and highlights their main limitations when characterizing RS scenes. Section III presents the proposed deep metric learning model for RS. Section IV reports extensive experiments conducted on several publicly available benchmark data sets. Finally, Section V concludes this article with some remarks and hints at plausible future research lines.

A. RS Scene Characterization
Broadly speaking, three different trends can be identified when characterizing remotely sensed scenes: 1) low-level feature-based techniques; 2) unsupervised approaches; and 3) deep-learning methods. A recent work published in [21] reviews the evolution of feature extraction approaches from shallow to deep by comprehensively evaluating both supervised and unsupervised approaches. The first group of techniques is focused on extracting salient features from the input images using straightforward visual descriptors, such as color, texture, spectral-spatial information, or a combination of descriptors. From the simplest low-level feature-based approaches, which make use of color histograms [10], [22], to more elaborate techniques that consider texture features as well as gradient shape descriptors [11], [23], [24], all these methods exhibit limitations when dealing with high-level semantic concepts due to the inherent complexity of the RS image domain [14], [25].
In order to enhance the visual characterization and generalization, unsupervised feature learning approaches have been proposed to classify airborne and space optical data. The rationale behind this kind of method is based on encoding the low-level features of the input scene into a higher-level feature space by means of unsupervised learning protocols. For instance, sparse coding [12], [13], topic modeling [26], [27], manifold learning [28], [29], and autoencoders [30], [31] are some of the most recent unsupervised paradigms that have been successfully applied to the RS field. Despite the fact that these and other methods are able to provide performance advantages with respect to traditional low-level feature-based techniques, the unsupervised perspective of the encoding procedure may eventually reduce the intraclass discrimination ability since actual scene classes are not taken into account.
Recently, deep-learning methods have attracted the attention of the RS research community due to their great potential to uncover highly discriminating features in aerial scenes [15]. More specifically, these approaches aim at projecting the input data onto the corresponding semantic label space through a hierarchy of nonlinear mappings and layers, which generates a high-level data characterization useful to classify remotely sensed imagery [32]. For instance, Yao et al. [33] proposed a stacked sparse autoencoder that extracts deep features used to effectively classify aerial images. Lu et al. [34] also presented an unsupervised representation learning method based on deconvolutional networks for RS scene classification. With the increasing popularity of CNNs, other authors advocate the use of more complex deep-learning architectures (e.g., AlexNet [35], VGGNet [36], and GoogLeNet [37]) to characterize and classify RS scenes. It is the case of Hu et al. [38], who presented two different scenarios to make use of VGGNet: 1) one directly using the last fully connected layers as image descriptors; and 2) another considering an encoding procedure over the feature maps of the last convolutional layer. Chaib et al. [39] also presented an RS classification method that employs the VGGNet model as a feature extraction mechanism. Specifically, the authors adopt a feature fusion strategy in which each layer is regarded as a separate feature descriptor. Zang et al. [40] defined a deep ensemble framework based on gradient boosting, which effectively combines several CNN-based characterizations. Analogously, Li et al. [41] proposed a multilayer feature fusion framework, which takes advantage of multiple pretrained CNN models for RS scene classification. Cheng et al. [42] also developed an RS classification approach using a bag of convolutional features obtained by different off-the-shelf CNN models. For fine-grained land-use classification, Kang et al. [43] exploited multiple CNN models and categorized different types of buildings based on street view images.
Despite the effectiveness achieved by these and other relevant methods in the literature [44], multiple research works highlight the benefits of using deep-learning embeddings to characterize aerial scenes [19]. In general, the so-called deep metric learning approach aims at projecting semantically similar input data to nearby locations in the final feature space by means of nonisotropic metrics [45]. As a result, this is a highly appropriate scheme to simplify complex topological spaces, which are often found in RS data. The unprecedented availability of airborne and space optical data, together with the constant development of the acquisition technology, is substantially increasing the complexity of RS data and, consequently, of its visual interpretation [1]. In addition, the probability of encountering unseen target scenes increases with the data complexity, which also makes the embedding strategy appropriate for transferring the knowledge from the training samples to broader semantic domains [46].
Several works in the most recent RS literature exemplify these facts. For instance, Gong et al. [16] adopted the lifted structured feature embedding approach [47], which defines a structured objective function based on lifted pairwise distances within each training batch. The authors introduced additional diversity-promoting criteria to decrease the redundancy of the metric parameters for RS scene classification. Cheng et al. [17] presented a simple but effective method to learn highly discriminative CNN-based features for aerial scenes. In particular, the authors imposed a metric learning regularization term on the CNN features by means of the contrastive embedding scheme [48], which intrinsically enforces the model to be more discriminative and to achieve competitive performance. Similarly, Yan et al. [18] proposed a cross-domain extension that aims at reducing the feature distribution bias and spectral shift in aerial shots, considering a limited amount of target samples. Whether the model is created using network ensembles [40] or more elaborate semantic embeddings [16], [17], the particularities of the RS domain still raise important challenges when classifying aerial scenes [9]. Specifically, the huge within-class diversity and between-class similarity of RS scenes motivate the development of new operational processing chains to effectively learn discriminative CNN-based characterizations that can obtain better semantic generalization for unconstrained land-cover scenes. Note that there are many factors (such as different sensing dates, instrument positions, lighting conditions, and sensor types) that also affect remotely sensed data and, hence, their semantic understanding.

B. Deep Metric Learning
Deep metric learning methods aim at learning a low-dimensional metric space based on CNN models, where the feature embeddings of semantically similar images should be close and those of dissimilar images should be separated. A metric space with such characteristics can be learned by applying proper loss functions. Most of the existing deep metric learning methods can be categorized based on two types of loss functions [17], [49]-[51]: 1) the contrastive loss [48] and 2) the triplet loss [52]. Some useful notations as well as the definitions of these two losses are given in the following. Let X = \{x_1, \ldots, x_N\} be a set of N RS images, and let Y = \{y_1, \ldots, y_N\} be the associated set of label vectors, where each label vector y_i is a one-hot vector, i.e., y_i \in \{0, 1\}^C, with C the total number of classes. If the image is annotated with the class c, the cth element of y_i is 1, and 0 otherwise. v_i \in \mathbb{R}^D denotes the feature of the ith image x_i obtained by a complex nonlinear mapping F(x_i; \theta) based on a CNN model, where the set \theta represents its learnable parameters, and D is the dimension of the feature. f_i = v_i / \|v_i\|_2 is the normalized feature on the unit sphere. To train the deep metric learning system, a set T with M images is extracted from X. According to this notation, the two aforementioned loss functions can be defined as follows.

1) Contrastive Loss:

\mathcal{L}_{\mathrm{con}} = \frac{1}{M} \sum_{i,j} \left[ l_{ij} \|f_i - f_j\|_2^2 + (1 - l_{ij}) \, h\left(m - \|f_i - f_j\|_2\right)^2 \right] \quad (1)

where h(\cdot) represents the hinge loss function, i.e., h(x) = \max(0, x), m is the predefined margin, and l_{ij} is the label indicator satisfying l_{ij} = 1 if y_i = y_j and l_{ij} = 0 otherwise. Given an image pair (x_i, x_j), the first term minimizes (during the training) the Euclidean distance of the two feature embeddings if they share the same class, and the second term is minimized to separate their distance by a certain margin m if they belong to different classes.
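As an illustration, the contrastive term for a single pair of normalized embeddings can be sketched as follows (a minimal NumPy sketch, not the authors' implementation; the function name and the default margin value are hypothetical):

```python
import numpy as np

def contrastive_loss(f_i, f_j, same_class, m=0.2):
    """Contrastive loss for one pair of L2-normalized embeddings:
    pull same-class pairs together, push different-class pairs apart
    by at least the margin m (hinge on the margin)."""
    d = np.linalg.norm(f_i - f_j)   # Euclidean distance between embeddings
    if same_class:                  # positive pair: minimize squared distance
        return d ** 2
    return max(0.0, m - d) ** 2     # negative pair: penalize only inside the margin
```

Note that a negative pair already farther apart than the margin contributes zero loss, so only "hard" negatives drive the gradient.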
2) Triplet Loss:

\mathcal{L}_{\mathrm{tri}} = \frac{1}{M} \sum_{i} h\left( \|f_i^a - f_i^p\|_2^2 - \|f_i^a - f_i^n\|_2^2 + m \right) \quad (2)

where f_i^a, f_i^p, and f_i^n denote the feature embeddings of the anchor, positive, and negative images of the ith triplet, respectively. During training, the triplet loss is minimized to push the negative image away from the anchor image so that their distance is larger than the distance of the positive pair by a certain margin.
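Analogously, the per-triplet term can be sketched as follows (an illustrative NumPy sketch under the squared-distance convention above; the function name is hypothetical):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, m=0.2):
    """Triplet loss for one (anchor, positive, negative) triple: the
    negative must be farther from the anchor than the positive by at
    least the margin m, otherwise a penalty is incurred."""
    d_ap = np.sum((f_a - f_p) ** 2)    # squared distance anchor-positive
    d_an = np.sum((f_a - f_n) ** 2)    # squared distance anchor-negative
    return max(0.0, d_ap - d_an + m)   # hinge h(.)
```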

C. Current Limitations in RS Scene Characterization
Most existing deep-learning-based methods for RS scene characterization focus on developing different CNN architectures to improve the classification performance based on the semantic labels predicted by the CNN models. However, only a few works in the RS field have addressed the problem of analyzing the performance of the learned feature embeddings and the associated metric space. One such pioneering work is [17], which introduced a novel loss function composed of the contrastive loss and the CE loss for learning discriminative features from RS images. The contrastive loss was also exploited in [53] for encoding synthetic aperture radar (SAR) scene images into low-dimensional features. In [54], an RS image retrieval method was proposed based on a metric space learned by utilizing the triplet loss. Normally, the optimization of CNN models with respect to the contrastive or triplet loss functions is conducted stochastically with minibatches. For the contrastive loss, negative and positive pairs are usually constructed for training the CNN models within each minibatch. Nonetheless, this scheme has an important limitation when considering the inherent semantic complexity of the RS image domain. For example, assume that each RS image is seen once during one training epoch and that x_i belongs to one minibatch of the current training iteration. The positive and negative images with respect to x_i in this minibatch can only be seen during the current iteration of training. However, CNN models cannot capture all the other positive and negative images with respect to x_i outside the current minibatch during this training epoch, which may lead to insufficient learning due to the particularly high intraclass and low interclass variability of RS images. For the triplet loss, one should build the whole set of possible triplets when training the CNN models, where the number of possible triplets is on the order of O(|X|^3) [55].
When considering a large-scale data set (which is often the case in RS problems), sufficiently training CNN models will inevitably lead to a practically unaffordable computational cost.

III. PROPOSED DEEP METRIC LEARNING FOR RS
Our newly proposed end-to-end deep metric learning model for RS scene characterization consists of three main parts. First, a backbone CNN architecture is considered in order to generate the corresponding feature embedding space for the input images. In this article, we make use of the ResNet [56] architecture due to its good performance in classifying RS scenes [57]. Second, a new loss function, which combines a CE term and an SNCA term, is used to optimize the proposed model in order to address the within-class diversity and between-class similarity inherent to RS scenes. Third, a novel optimization mechanism based on a momentum update is proposed. Our mechanism can preserve the feature consistency within each training epoch better than the memory-bank-based mechanism in [20]. Fig. 1 provides a graphical illustration of our newly proposed deep metric learning approach. In the following, we describe, in more detail, the newly defined loss function and the considered optimization algorithm.

A. Loss Function
The neighborhood component analysis (NCA) [58] is a supervised dimensionality reduction method that learns a metric space through a linear projection of the input data such that the leave-one-out KNN score is stochastically maximized in that space. The SNCA [20], built upon the NCA, aims to find a metric space that preserves the neighborhood structure well, based on deep models and scalable data sets. Given a pair of images (x_i, x_j) from the training set T, their similarity s_{ij} in the metric space can be modeled with the cosine similarity

s_{ij} = f_i^{\top} f_j. \quad (3)

The image x_i selects the image x_j as its neighbor in the metric space with a probability p_{ij} given by

p_{ij} = \frac{\exp(s_{ij}/\sigma)}{\sum_{k \neq i} \exp(s_{ik}/\sigma)}, \quad i \neq j \quad (4)

p_{ii} = 0 \quad (5)

where \sigma is the temperature parameter controlling the concentration level of the sample distribution [59]. The condition p_{ii} = 0 in (5) indicates that an image cannot select itself as its own neighbor in the metric space. For i \neq j, p_{ij} indicates the probability that the image x_j is chosen as a neighbor of the image x_i in the metric space, inheriting the class label from x_i: the higher the similarity between x_i and x_j, the higher the chance that x_j is selected as a neighbor of x_i compared with the other images x_k. This probability is often termed the leave-one-out distribution on T. Based on this, the probability that x_i can be correctly classified is

p_i = \sum_{j \in \Omega_i} p_{ij} \quad (6)

where \Omega_i = \{ j \mid y_j = y_i \} is the index set of training images sharing the same class with x_i. Intuitively, the image x_i can be correctly classified with a higher chance if more images x_j sharing the same class with x_i are located as its neighbors in the metric space. Then, the objective of SNCA is to minimize the expected negative log-likelihood over T

\mathcal{L}_{\mathrm{SNCA}} = -\frac{1}{M} \sum_{i=1}^{M} \log(p_i). \quad (7)

The gradient of \mathcal{L}_{\mathrm{SNCA}} with respect to f_i is given by

\frac{\partial \mathcal{L}_{\mathrm{SNCA}}}{\partial f_i} = \frac{1}{\sigma} \sum_{k} \left( p_{ik} - \tilde{p}_{ik} \right) f_k \quad (8)

where \tilde{p}_{ik} = p_{ik} / \sum_{j \in \Omega_i} p_{ij} for k \in \Omega_i (and \tilde{p}_{ik} = 0 otherwise) is the normalized distribution over the ground-truth class.
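The leave-one-out probabilities and the SNCA objective can be sketched at the batch level as follows (an illustrative NumPy sketch; the actual method computes the probabilities against the whole data set via a memory bank, and the function name is hypothetical):

```python
import numpy as np

def snca_loss(F, y, sigma=0.1):
    """SNCA loss over L2-normalized features F (M x D) with integer
    labels y: each image should select same-class images as neighbors
    under the temperature-scaled leave-one-out distribution."""
    M = F.shape[0]
    s = F @ F.T                       # cosine similarities (features are normalized)
    e = np.exp(s / sigma)
    np.fill_diagonal(e, 0.0)          # an image cannot be its own neighbor
    p = e / e.sum(axis=1, keepdims=True)
    loss = 0.0
    for i in range(M):
        same = (y == y[i])
        same[i] = False               # leave-one-out: exclude the image itself
        p_i = p[i, same].sum()        # probability of selecting a same-class neighbor
        loss -= np.log(p_i + 1e-12)
    return loss / M
```

A tightly clustered class structure yields a loss near zero, while scrambled labels inflate it, matching the intuition behind the leave-one-out distribution.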
Based on the gradient in (8), an optimal solution of (7) is reached when the probability p_{ik} of the negative images (i.e., k \notin \Omega_i) equals 0. In other words, the similarities between x_i and some of the positive images (k \in \Omega_i) can also be very low in the metric space, as long as there exist other positive images that are neighbors of x_i. On the one hand, this characteristic can be beneficial to discover the inherent locality structure among the images in the metric space, especially if there are intraclass variations in the data set. On the other hand, there is one limitation of SNCA for K-nearest neighbor (KNN) classification. Since some of the positive images (k \in \Omega_i) do not need to be close to x_i, their feature embeddings may be closer to those of negative images in the metric space. As illustrated in Fig. 2(a), the classes A and B are separated, and their intraclass variation can also be discovered, which is represented by the groups of light and dark points. However, given the presence of some out-of-sample images sharing similar features with images from both classes, these samples cannot be correctly categorized by the KNN classifier. One way to solve this problem is to separate the images from the two classes farther away from each other, as illustrated in Fig. 2(b). With the same feature embeddings as in Fig. 2(a), the out-of-sample images are now well recognized by the KNN classifier. To achieve this goal, we introduce the CE loss for learning a classwise prototype that aligns the images with respect to their associated classes.
The CE loss aims to measure the distance between the distribution of the model outputs and the real distribution. In terms of classification, the CE loss is defined as

\mathcal{L}_{\mathrm{CE}} = -\frac{1}{M} \sum_{i=1}^{M} \sum_{c=1}^{C} y_i^c \log(p_i^c) \quad (9)

where p_i^c denotes the probability that x_i is classified into the class c, formulated as

p_i^c = \frac{\exp(w_c^{\top} f_i)}{\sum_{c'=1}^{C} \exp(w_{c'}^{\top} f_i)} \quad (10)

where w_c are the learned parameters (prototype) of class c. Minimizing the CE loss (9) consists of aligning all the images within the same class with the same vector w_c; in that case, images from different classes are separated. At this point, by taking advantage of the two losses, we propose a new joint loss function for learning a low-dimensional metric space, which can preserve the neighborhood structure among the images and also distinguish the images from different classes. The proposed joint function, termed SNCA-CE, is defined as

\mathcal{L}_{\mathrm{SNCA\text{-}CE}} = \mathcal{L}_{\mathrm{SNCA}} + \lambda \mathcal{L}_{\mathrm{CE}} \quad (11)

where \lambda denotes a penalty parameter that controls the balance between the two terms.
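The prototype-based CE term can be sketched as follows (an illustrative NumPy sketch; the function name and the max-subtraction stabilization are our own additions, and the joint objective is then simply the weighted sum of the two terms):

```python
import numpy as np

def cross_entropy_loss(F, y, W):
    """CE loss with class prototypes: p_i^c is a softmax over the
    prototype similarities w_c^T f_i. F: M x D features, y: integer
    labels, W: C x D learned class prototypes."""
    logits = F @ W.T
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))

# The joint SNCA-CE objective is then the weighted sum of the two terms:
#   loss = snca_term + lam * cross_entropy_loss(F, y, W)
```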

B. Optimization via Memory Bank
By applying the chain rule, we can obtain the gradient of the joint loss function with respect to f_i

\frac{\partial \mathcal{L}_{\mathrm{SNCA\text{-}CE}}}{\partial f_i} = \frac{\partial \mathcal{L}_{\mathrm{SNCA}}}{\partial f_i} + \lambda \frac{\partial \mathcal{L}_{\mathrm{CE}}}{\partial f_i}. \quad (12)

From (12), we can infer that the feature embeddings of the entire data set are needed for calculating the gradient of the SNCA term. Following [20], we exploit a memory bank to store the normalized features, i.e., B = \{\bar{f}_1, \ldots, \bar{f}_M\}, and we assume that these are up-to-date with regard to the CNN parameters \theta trained at the tth iteration, i.e., \bar{f}_i \approx f_i^{(t)}. Under this assumption, the gradient of the joint loss function with respect to f_i can be computed by looking up the stored features in B. Then, \theta can be learned by using the backpropagation technique, and B can be updated by

\bar{f}_i \leftarrow m \bar{f}_i + (1 - m) f_i \quad (13)

where m is a parameter used for proximal regularization of f_i based on its historical versions. We term this optimization strategy SNCA-CE(MB). The associated optimization scheme is described in Algorithm 1.
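The bank update for the features of the current minibatch can be sketched as follows (an illustrative NumPy sketch; the function name and the final renormalization onto the unit sphere, which keeps the bank consistent with normalized features, are our own assumptions):

```python
import numpy as np

def update_memory_bank(B, idx, f_new, m=0.5):
    """Proximal memory-bank update for the minibatch features:
    mix each stored feature with its freshly computed version,
    then renormalize so the bank keeps unit-norm embeddings."""
    B[idx] = m * B[idx] + (1.0 - m) * f_new                 # proximal mixing
    B[idx] /= np.linalg.norm(B[idx], axis=1, keepdims=True) # back onto the sphere
    return B
```

Only the rows indexed by the current minibatch are touched; the rest of the bank keeps its (possibly stale) historical features, which is precisely the inconsistency discussed next.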


C. Optimization via Momentum Update
In the SNCA-CE(MB) optimization scheme, the features in B are assumed to be up-to-date during training. However, this assumption cannot be easily satisfied, especially for scalable data sets. Suppose that the image x_i is observed in the first iteration of one training epoch and that the associated feature f_i^{(1)}, generated by the CNN with the parameters \theta^{(1)}, is stored in B. Due to the training mechanism, this image cannot be observed again within the current epoch. Therefore, at the tth iteration, the feature f_j^{(t)} associated with the image x_j, generated by the CNN with \theta^{(t)}, would not be consistent with f_i^{(1)}, which was generated by a historical state of the CNN. Since the optimization of SNCA-CE requires a lookup of the whole set of stored feature embeddings in B at each iteration, such inconsistency may lead to a suboptimal training of the CNN.
To solve this issue, we propose a novel optimization mechanism based on momentum update [60] for the proposed SNCA-CE, termed SNCA-CE(MU). Instead of updating the feature embeddings stored in B, SNCA-CE(MU) progressively updates the state of the CNN in order to preserve the consistency of the features among all the minibatches of each training epoch. To achieve this, an auxiliary CNN with parameters \theta_{\mathrm{aux}} is adopted, and \theta_{\mathrm{aux}} is updated by

\theta_{\mathrm{aux}} \leftarrow m \theta_{\mathrm{aux}} + (1 - m) \theta \quad (14)

where m \in [0, 1) is a momentum coefficient. It is worth noting that only the CNN with \theta is updated by means of backpropagation. The auxiliary CNN with parameters \theta_{\mathrm{aux}} can thus evolve more smoothly than the CNN with \theta. Accordingly, the features in B are updated by

\bar{f}_i \leftarrow \hat{f}_i \quad (15)

where \hat{f}_i denotes the feature generated by the auxiliary CNN. In other words, the features in B are replaced by the features encoded by the auxiliary CNN after each training epoch. The associated optimization scheme is described in Algorithm 2.
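The parameter-level momentum update can be sketched as follows (an illustrative NumPy sketch in which dicts of arrays stand in for CNN weights; the function name and the default coefficient, a typical MoCo-style value rather than the one used in this article, are our own assumptions):

```python
import numpy as np

def momentum_update(theta, theta_aux, m=0.999):
    """Momentum update of the auxiliary network:
    theta_aux <- m * theta_aux + (1 - m) * theta.
    Only theta is updated by backpropagation; theta_aux therefore
    evolves as a smooth exponential moving average of theta."""
    for name in theta_aux:
        theta_aux[name] = m * theta_aux[name] + (1.0 - m) * theta[name]
    return theta_aux
```

With m close to 1, the auxiliary encoder changes slowly across iterations, which is what keeps the bank features mutually consistent within an epoch.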

A. Data Set Description
In this section, we use two challenging RS image data sets to validate the effectiveness of the proposed methods. In the following, we provide a detailed description of the considered data sets.
1) Aerial Image Data Set (AID) [61]: This data set is an important image collection, which has been specially designed for aerial scene classification and retrieval. In particular, it is made up of 10 000 RGB images belonging to the following 30 RS scene classes: airport, bare land, baseball field, beach, bridge, center, church, commercial, dense residential, desert, farmland, forest, industrial, meadow, medium residential, mountain, park, parking, playground, pond, port, railway station, resort, river, school, sparse residential, square, stadium, storage tanks, and viaduct. Fig. 3(a) shows some example scenes from this data set. All the images are RGB acquisitions with a size of 600 × 600 pixels. In addition, the number of images per class ranges from 220 to 420, and the spatial resolution varies from 8 to 0.5 m. The AID data set is publicly available. 2) NWPU-RESISC45 [9]: This is a large-scale RS data set that contains 31 500 images uniformly distributed over 45 scene types: airplane, airport, baseball diamond, basketball court, beach, bridge, chaparral, church, circular farmland, cloud, commercial area, dense residential, desert, forest, freeway, golf course, ground track field, harbor, industrial area, intersection, island, lake, meadow, medium residential, mobile home park, mountain, overpass, palace, parking lot, railway, railway station, rectangular farmland, river, roundabout, runway, sea ice, ship, snowberg, sparse residential, stadium, storage tank, tennis court, terrace, thermal power station, and wetland. Fig. 3(b) shows some sample scenes from this data set. All these aerial images are RGB shots with a size of 256 × 256 pixels and spatial resolution ranging from 30 to 0.2 m. This data set is also publicly available (http://goo.gl/7YmQpK).

B. Experimental Setup
In order to extensively evaluate the effectiveness of the proposed method, we carry out several experiments from different perspectives, including: 1) image classification based on the KNN classifier; 2) clustering; and 3) image retrieval.

1) Classification: Given an out-of-sample image x^*, its feature embedding f^* is obtained by applying F(\cdot) with the learned parameter set \theta. Based on the Euclidean distance between f^* and the stored embeddings in B, we can obtain its K nearest neighbors, and the predicted class y^* can be determined from their classes via majority voting. To evaluate the classification performance, we adopt the overall accuracy and the classwise F1 score as metrics.
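This KNN prediction step can be sketched as follows (an illustrative NumPy sketch; the function name is hypothetical, and ties in the vote are broken by neighbor order):

```python
import numpy as np
from collections import Counter

def knn_predict(f_star, B, labels, K=10):
    """Classify an out-of-sample embedding f_star by majority voting
    over its K nearest stored embeddings (Euclidean distance)."""
    dists = np.linalg.norm(B - f_star, axis=1)   # distances to all stored features
    nearest = np.argsort(dists)[:K]              # indices of the K closest
    votes = Counter(labels[i] for i in nearest)  # count class votes
    return votes.most_common(1)[0][0]            # majority class
```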
2) Clustering: With the provided set of out-of-sample images, we can generate their feature embeddings based on F(\cdot). Their quality can be assessed by applying a clustering task, such as K-means clustering. If the intraclass features are close and the interclass features are separated in the metric space, the images can be well clustered, and the cluster labels can accurately match the ground-truth semantic labels. For the evaluation of clustering performance, the first measure that we use is the normalized mutual information (NMI) [62], defined as

\mathrm{NMI}(Y, C) = \frac{I(Y; C)}{\sqrt{H(Y) H(C)}} \quad (16)

where Y represents the ground-truth class labels, and C denotes the cluster labels produced by the clustering method. I(\cdot;\cdot) and H(\cdot) represent the mutual information and the entropy function, respectively. This metric measures the agreement between the ground-truth labels and the labels assigned by the clustering method. We also calculate the unsupervised clustering accuracy as our second metric, formulated as

\mathrm{ACC} = \max_{M} \frac{1}{N} \sum_{i=1}^{N} \delta\left(l_i, M(c_i)\right) \quad (17)

where l_i denotes the ground-truth class, c_i is the assigned cluster of the image x_i, and \delta(\cdot,\cdot) represents the Kronecker delta function.
M is a function that finds the best mapping between the cluster-assigned labels and the ground-truth labels.
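Both clustering metrics can be sketched as follows (an illustrative NumPy sketch: the NMI uses the geometric-mean normalization of the entropies, and the best mapping M is found by brute force over permutations, which assumes equal numbers of clusters and classes with ids 0..k-1; practical evaluations use the Hungarian algorithm instead):

```python
import numpy as np
from itertools import permutations

def nmi(y, c):
    """Normalized mutual information I(Y;C) / sqrt(H(Y) * H(C)) between
    ground-truth labels y and cluster assignments c."""
    y, c = np.asarray(y), np.asarray(c)
    n = len(y)
    def entropy(a):
        p = np.bincount(a) / n
        p = p[p > 0]
        return -np.sum(p * np.log(p))
    mi = 0.0
    for u in np.unique(y):
        for v in np.unique(c):
            p_uv = np.mean((y == u) & (c == v))        # joint probability
            if p_uv > 0:
                mi += p_uv * np.log(p_uv / (np.mean(y == u) * np.mean(c == v)))
    return mi / np.sqrt(entropy(y) * entropy(c))

def clustering_accuracy(y, c):
    """ACC: best agreement over all mappings from cluster ids to class ids
    (brute-force search over permutations; fine for small cluster counts)."""
    y, c = np.asarray(y), np.asarray(c)
    best = 0.0
    for perm in permutations(np.unique(y)):
        mapped = np.array([perm[ci] for ci in c])      # relabel clusters
        best = max(best, np.mean(mapped == y))
    return best
```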
3) Image Retrieval: Image retrieval aims to find the most semantically similar images in the archive based on their distances with regard to the query images. Such distance is measured by evaluating the similarity of the feature embeddings between the query images and the full set of images in the archive in the given metric space. Given a query image, more relevant images can be retrieved when the feature embeddings are generated by a more effective metric learning method. To evaluate the performance in terms of image retrieval, we adopt the precision-recall (PR) curve, which reports the precision and recall metrics with respect to a variable number of retrieved images.
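The precision and recall for a single query at a cutoff k can be sketched as follows (an illustrative NumPy sketch; the function name is hypothetical, and sweeping k over the archive size traces the PR curve):

```python
import numpy as np

def precision_recall_at_k(query_f, query_label, archive_F, archive_labels, k):
    """Precision and recall over the top-k archive images retrieved for
    a query embedding, ranked by Euclidean distance in the metric space."""
    dists = np.linalg.norm(archive_F - query_f, axis=1)
    top_k = np.argsort(dists)[:k]                          # k closest archive images
    n_relevant = np.sum(archive_labels == query_label)     # relevant items in archive
    hits = np.sum(archive_labels[top_k] == query_label)    # relevant among retrieved
    return hits / k, hits / n_relevant
```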
For these tasks, we randomly select 70% of the benchmark data for training, 10% for validation, and 20% for testing. The clustering task is conducted on the feature embeddings of the test sets generated by the learned CNN model. For image retrieval, the test set is served for querying, and the training set is the archive. The proposed method is implemented in PyTorch [63]. The backbone CNN architecture is selected as ResNet18 [56] for all the considered methods. It is worth noting that other CNN architectures, such as ResNet50, can also be applied with the proposed loss and optimization mechanism. For the sake of simplicity, we utilize ResNet18 in this article. The images are all resized to 256 × 256 pixels, and three data augmentation methods are adopted during training: 1) RandomGrayscale; 2) ColorJitter; and 3) RandomHorizon-talFlip. The parameters D, σ , λ, and m are set to 128, 0.1, 1.0, and 0.5, respectively. The stochastic gradient descent (SGD) optimizer is adopted for training. The initial learning rate is set to 0.01, and it is decayed by 0.5 every 30 epochs. The batch size is 256, and we totally train the CNN model for 100 epochs. To validate the effectiveness of the proposed method, we compare it with several state-of-the-art methods based on deep metric learning, including: 1) D-CNN [17]; 2) deep metric learning based on triplet loss [52], [54], simply termed Triplet hereinafter; and 3) SNCA(MB) [20]. It is worth noting that the original SNCA algorithm is optimized with a memory bank, i.e., SNCA(MB). In order to validate the effectiveness of the proposed optimization mechanism, we also consider our new SNCA(MU) and compare its performance with the original SNCA [20]. For the triplet loss, the margin parameter is selected as 0.2, and the parameters in D-CNN are set to the same values as in the original article. All the experiments are conducted on an NVIDIA Tesla P100 graphics processing unit (GPU). Fig. 
4 plots the curves of classification accuracy versus the number of training epochs obtained for the different learning methods, using the KNN classifier (with K = 10) as a baseline, on the NWPU-RESISC45 data set. As shown in Fig. 4, SNCA(MU), SNCA-CE(MB), and SNCA-CE(MU) require fewer than 20 epochs to reach an accuracy of 90%, while the other tested methods require more than 20 epochs. As the learning curves converge, SNCA(MU), SNCA-CE(MB), and SNCA-CE(MU) reach an accuracy of about 94%, which is around 2% higher than that achieved by the other methods. Among them, SNCA-CE(MB) and SNCA-CE(MU) perform slightly better than SNCA(MU), and SNCA-CE(MU) achieves the fastest learning speed. Comparing SNCA-CE with SNCA, the introduction of the CE loss not only increases the learning speed but also improves the classification accuracy obtained by the KNN classifier.
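The KNN baseline used here can be sketched as follows: embeddings are ℓ2-normalized, neighbors are ranked by cosine similarity, and the label is assigned by majority vote over the K nearest training samples. This is a simplified illustration (the function name and interface are our own), not the evaluation code of the article.

```python
import numpy as np

def knn_predict(train_emb, train_labels, test_emb, k=10):
    """KNN classification in the metric space: L2-normalize the embeddings,
    rank training samples by cosine similarity, and take a majority vote
    over the k nearest neighbors of each test sample."""
    tr = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    te = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sims = te @ tr.T                                 # cosine similarities
    nn = np.argsort(-sims, axis=1)[:, :k]            # indices of k nearest
    preds = []
    for row in train_labels[nn]:                     # labels of the neighbors
        vals, counts = np.unique(row, return_counts=True)
        preds.append(vals[np.argmax(counts)])        # majority vote
    return np.array(preds)
```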

1) Classification:
By comparing the MB and MU optimization mechanisms, we conclude that updating the state of the CNN model leads to better results than updating the memory bank. We report the overall accuracy of all the methods on the considered test sets in Table I, using various values of K. Consistently with the validation results, SNCA-CE(MB) and SNCA-CE(MU) achieve the best classification performance on the two benchmark data sets. Compared with SNCA-CE(MB), the classification accuracy of SNCA-CE(MU) is slightly higher on the NWPU-RESISC45 data set, while it is slightly lower on the AID data set. Since the MU optimization mechanism aims at preserving feature consistency among all the minibatches throughout each training epoch, its advantage over MB is more obvious on a large data set, such as NWPU-RESISC45. For the AID data set, there are few minibatches within one training epoch, e.g., around 28 when the batch size is 256. The feature embeddings stored in B may therefore not vary severely within each training epoch, and the associated performance is comparable with that of the MU mechanism. In turn, SNCA-CE obtains more accurate results, with more than 1% improvement over SNCA and more than 2% over the other two methods. With the adoption of the momentum update, SNCA(MU) achieves an accuracy improvement of around 0.5% with regard to SNCA(MB). Moreover, Tables II and III show the classwise F1 scores achieved by the different learning methods (based on the KNN classifier, with K = 10) on the test sets of the AID and NWPU-RESISC45 data sets, respectively. For the AID data set, the F1 score of SNCA-CE(MB) on the Resort class is more than 5% higher than that of the other methods. For the NWPU-RESISC45 data set, SNCA-CE achieves the best performance on most classes.
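The momentum update at the core of the MU mechanism can be illustrated schematically: a stored state θ is blended with its freshly computed counterpart θ̂ as θ ← mθ + (1 − m)θ̂, with m = 0.5 as in the experimental setup. The sketch below operates on flat lists of floats for simplicity and is not the authors' exact MU implementation.

```python
def momentum_update(stored, fresh, m=0.5):
    """Momentum update: theta <- m * theta + (1 - m) * theta_hat.
    `stored` and `fresh` are flat lists of floats standing in for the
    maintained state and its newly computed values (illustrative only)."""
    return [m * s + (1.0 - m) * f for s, f in zip(stored, fresh)]
```

With m = 0.5, the updated state moves halfway toward the fresh values at each step, which smooths the evolution of the stored features across minibatches.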
In addition, Fig. 5(a) and (b) illustrates the similarities of the feature embeddings generated by D-CNN, Triplet, SNCA(MB), and SNCA-CE(MB) on the test sets of the AID and NWPU-RESISC45 data sets, respectively. The similarity is measured by the cosine similarity, i.e., the inner product f_i*^T f_j* of the ℓ2-normalized embeddings. As shown by the obtained similarity matrices, a higher color contrast between the diagonal blocks and the background indicates a higher dissimilarity between the images of one class and those of the others in the metric space. In terms of cosine similarity, both SNCA(MB) and SNCA-CE(MB) distinguish between different classes in the metric space better than D-CNN and Triplet.
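Similarity matrices of this kind can be built by grouping the samples by class so that same-class similarities form diagonal blocks. The following is a hypothetical sketch; the actual plotting code of Fig. 5 may differ.

```python
import numpy as np

def class_sorted_similarity(emb, labels):
    """Cosine-similarity matrix f_i*^T f_j* with samples grouped by class,
    so that same-class similarities appear as diagonal blocks."""
    order = np.argsort(labels, kind="stable")   # group samples by class label
    f = emb[order]
    f = f / np.linalg.norm(f, axis=1, keepdims=True)  # L2-normalize rows
    return f @ f.T                              # inner products = cosine sims
```

The returned matrix can be displayed directly as a heat map; stronger block-diagonal contrast indicates better class separation in the metric space.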
2) Clustering: Table IV shows the NMI scores obtained after applying K-means clustering (with different learning methods) to the feature embeddings of the considered test sets. It can be observed that SNCA-CE achieves the best matching between the ground-truth labels and the pseudolabels assigned by K-means clustering, which results in more than 5% performance gain with regard to D-CNN. Table V reports the associated ACC scores obtained with the different learning methods. Consistent with the NMI results, K-means clustering based on the features generated by SNCA-CE makes the best label assignment in an unsupervised manner. In order to obtain further insight into the feature embeddings in the metric space, we exploit the t-distributed stochastic neighbor embedding (t-SNE) to visualize their projections in a 2-D space. Fig. 6 shows the t-SNE scatter plots of the feature embeddings obtained for the AID test set using: 1) D-CNN; 2) Triplet; and 3) SNCA-CE(MB). As illustrated in Fig. 6, the intraclass features are more compact, and the interclass features are more isolated with the proposed method. As a result, clustering methods can more easily discover the inherent structure of the feature embeddings in the metric space produced by the proposed method, resulting in an NMI score that is higher than the one obtained by the other learning methods.
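For reference, the NMI score used above can be computed from two label assignments as in the sketch below, which uses the geometric-mean normalization (one common convention); the exact normalization variant used in the experiments may differ.

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two label assignments,
    normalized by the geometric mean of the two entropies."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = labels_true.size
    cu, iu = np.unique(labels_true, return_inverse=True)
    cv, iv = np.unique(labels_pred, return_inverse=True)
    cont = np.zeros((cu.size, cv.size))          # contingency table
    np.add.at(cont, (iu, iv), 1)
    pu = cont.sum(axis=1) / n                    # marginal of true labels
    pv = cont.sum(axis=0) / n                    # marginal of predicted labels
    puv = cont / n                               # joint distribution
    nz = puv > 0
    mi = (puv[nz] * np.log(puv[nz] / np.outer(pu, pv)[nz])).sum()
    hu = -(pu[pu > 0] * np.log(pu[pu > 0])).sum()
    hv = -(pv[pv > 0] * np.log(pv[pv > 0])).sum()
    return mi / np.sqrt(hu * hv) if hu > 0 and hv > 0 else 0.0
```

An NMI of 1 indicates that the clustering reproduces the ground-truth partition exactly (up to label permutation), while 0 indicates independence.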
3) Image Retrieval: Fig. 7 shows the PR curves describing the image retrieval results obtained when a given test set is used for querying, where Fig. 7(a) and (b), respectively, provides the results for the AID and NWPU-RESISC45 data sets. In order to facilitate the comparison, a zoomed-in subplot is also highlighted. It can be seen that both SNCA and SNCA-CE exhibit superior performance with regard to Triplet and D-CNN as the number of retrieved images increases. As shown in the zoomed-in subplots, the introduction of the CE loss further improves the precision and recall performance of SNCA. For the SNCA-based methods (SNCA and SNCA-CE), the similarities of the images within one minibatch are compared during training with all the other images in the data set so that the CNN model can be sufficiently optimized. In contrast, for the contrastive loss utilized in D-CNN, the negative and positive image pairs are only sampled within each minibatch. For the images outside this minibatch, the corresponding negative and positive image pairs cannot be constructed, leading to an insufficient training of the CNN model. A similar limitation applies to the triplet loss: to make the CNN model capture the similarity and dissimilarity of all the images, one should build a triplet set with about O(|T|^3) triplets, which is infeasible for a large-scale data set. Such a limitation of the CNN models trained with the contrastive and triplet losses may lead to the fact that some images cannot be well separated from images of other classes or cannot be effectively grouped together with their relevant ones. This phenomenon can be observed in Fig. 6, where some clusters shown in (a) and (b) are entangled with others.
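The cubic growth of the exhaustive triplet set can be made concrete with a small counting sketch, assuming (for illustration) a balanced data set with n_classes classes of n_per_class images each; the function and argument names are ours.

```python
def triplet_count(n_per_class, n_classes):
    """Number of (anchor, positive, negative) triplets in a balanced data set:
    every image is an anchor, every other same-class image a positive, and
    every different-class image a negative -- cubic in the data set size."""
    n = n_per_class * n_classes        # total number of images |T|
    anchors = n
    positives = n_per_class - 1        # same-class images other than the anchor
    negatives = n - n_per_class        # images from all other classes
    return anchors * positives * negatives
```

For NWPU-RESISC45 (45 classes with 700 images each), this already yields on the order of 10^13 triplets, which illustrates why exhaustive triplet sampling is infeasible.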
This limitation also explains the important phenomenon that the image retrieval performance achieved by both SNCA and SNCA-CE is superior to that of the methods based on the contrastive and triplet losses. With respect to SNCA, by introducing the CE loss, SNCA-CE can further improve the image retrieval performance due to its enhanced class discrimination capability. Fig. 8 gives some retrieval examples obtained with D-CNN, Triplet, and the proposed method. Given two images from the two test sets, we present their top-five nearest neighbors in the archive. As shown in Fig. 8(a), Park and School are confused with Resort in the Triplet retrieval from the AID data set. The Freeway class in NWPU-RESISC45 is confused with Overpass by D-CNN, as shown in Fig. 8(b).

4) Parameter Sensitivity Analysis of SNCA-CE:
There are three main parameters in the proposed method, i.e., D, σ, and λ, where D determines the dimensionality of the feature embeddings in the metric space, σ controls the compactness of the sample distribution, and λ balances the contributions of the two loss terms, i.e., SNCA and CE. Table VI reports the KNN classification results based on SNCA-CE(MB) for different values of D, with K = 10. As shown in Table VI, the classification performance is robust to different values of D on both data sets. This is greatly beneficial for embedding large-scale data sets since features of small dimensionality can also achieve a high-quality classification performance. Based on the KNN classification results with K = 10, we also report the performance of SNCA-CE(MB) with respect to σ in Table VII. Within a range of values from 0.05 to 0.2, the classification results are stable, which suggests that the proposed method is relatively insensitive to the choice of σ in this range for the two considered data sets. Fig. 9 shows a sensitivity analysis of λ in (11). It can be seen that the KNN classification performs worst on both data sets when λ is near zero, i.e., λ = 0.1. This indicates that the optimization of the SNCA term can indeed improve the metric learning performance. When λ is larger than 0.1, the proposed method is insensitive to the setting of λ.
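To make the roles of σ and λ explicit, the combined objective L = L_SNCA + λ·L_CE can be sketched as follows in NumPy. This is an illustrative reimplementation under our own simplifying assumptions (ℓ2-normalized embeddings and class prototypes, softmax over the prototypes for the CE term); the exact formulation is the one given in the article's equations.

```python
import numpy as np

def snca_ce_loss(feats, labels, prototypes, sigma=0.1, lam=1.0):
    """Sketch of L = L_SNCA + lam * L_CE.
    feats: L2-normalized embeddings, shape (n, D); prototypes: L2-normalized
    class prototypes, shape (C, D). sigma is the temperature controlling the
    compactness of the sample distribution; lam balances the two terms."""
    n = feats.shape[0]
    # SNCA term: neighborhood probabilities over all other samples.
    sims = feats @ feats.T / sigma
    np.fill_diagonal(sims, -np.inf)          # a sample is not its own neighbor
    p = np.exp(sims - sims.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)
    snca = -np.log((p * same).sum(axis=1).clip(1e-12)).mean()
    # CE term: softmax over the learned class prototypes.
    logits = feats @ prototypes.T / sigma
    m = logits.max(axis=1, keepdims=True)
    logp = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    ce = -logp[np.arange(n), labels].mean()
    return snca + lam * ce
```

A smaller σ sharpens both softmax distributions, pulling same-class samples closer, while λ trades the neighborhood term off against the prototype-based class discrimination term.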

V. CONCLUSION
In this article, we introduce a new deep metric learning approach for RS images, which improves scene discrimination by means of two different components: 1) SNCA, which aims at constructing the neighborhood structure in the metric space, and 2) the CE loss, which aims at preserving the class discrimination capability. Moreover, we propose a novel optimization mechanism based on momentum update for SNCA and SNCA-CE. This mechanism is intended to preserve the consistency among all the stored features during training, which represents a novel contribution to the characterization of RS scenes.
The conducted experiments validate the effectiveness of the proposed method from different perspectives, including RS scene classification, clustering, and retrieval. Compared with the state-of-the-art models, the newly defined SNCA-CE loss is able to group semantically similar RS images better than the other existing approaches due to the effective use of an off-line memory bank. Besides, SNCA-CE can further improve the class discrimination ability based on its learnable category prototypes. The proposed MU optimization mechanism also makes the features generated in each minibatch more consistent within one training epoch than those generated via the MB mechanism. Such a characteristic can be greatly beneficial when processing large-scale data sets.
In addition to characterizing RS scenes, our newly proposed deep metric learning framework also exhibits the potential to be used in other tasks, such as the dimensionality reduction of RS hyperspectral images and fine-grained land-use or land-cover classification. As future work, one can extensively analyze the influence of different backbone networks (e.g., VGG16, ResNet18, ResNet50, and ResNet101) on the performance of the proposed approach. In addition, we will explore the adaptation of our method to the aforementioned problems and further evaluate its capacity to perform scene classification with limited supervision.