Robust Normalized Softmax Loss for Deep Metric Learning-Based Characterization of Remote Sensing Images With Label Noise

Most deep metric learning-based image characterization methods exploit supervised information to model the semantic relations among the remote sensing (RS) scenes. Nonetheless, the unprecedented availability of large-scale RS data makes the annotation of such images very challenging, requiring automated supportive processes. Whether the annotation is assisted by aggregation or crowd-sourcing, the RS large-variance problem, together with other important factors [e.g., geo-location/registration errors, land-cover changes, even low-quality Volunteered Geographic Information (VGI), etc.] often introduce the so-called label noise, i.e., semantic annotation errors. In this article, we first investigate the deep metric learning-based characterization of RS images with label noise and propose a novel loss formulation, named robust normalized softmax loss (RNSL), for robustly learning the metrics among RS scenes. Specifically, our RNSL improves the robustness of the normalized softmax loss (NSL), commonly utilized for deep metric learning, by replacing its logarithmic function with the negative Box–Cox transformation in order to down-weight the contributions from noisy images on the learning of the corresponding class prototypes. Moreover, by truncating the loss with a certain threshold, we also propose a truncated robust normalized softmax loss (t-RNSL) which can further enforce the learning of class prototypes based on the image features with high similarities between them, so that the intraclass features can be well grouped and interclass features can be well separated. Our experiments, conducted on two benchmark RS data sets, validate the effectiveness of the proposed approach with respect to different state-of-the-art methods in three different downstream applications (classification, clustering, and retrieval). The codes of this article will be publicly available from https://github.com/jiankang1991.


I. INTRODUCTION
In recent years, the fast development of satellite sensor technology has created great opportunities to exploit remote sensing (RS) data in a wide range of important applications, such as target identification [1]–[6], land-cover analysis [7]–[14], ecosystem monitoring [15]–[18], agriculture [19]–[21], and demographics [22], [23]. In these (and many other relevant tasks [24]), the adequate characterization of RS scenes plays a key role for the semantic recognition of objects as well as the characterization of their spatial topology, due to the particular complexity of the RS image domain [25], [26].
In the literature, a large variety of methods have been developed to effectively characterize RS scenes and classify or retrieve their visual content in a satisfactory way [27], [28]. Among the most popular approaches, it is possible to find handcrafted feature-based methods [29]–[31], unsupervised characterization models [32]–[34], and deep learning-based techniques [35]–[38]. Traditional handcrafted and unsupervised methods typically exhibit limited performance within the RS field, owing to the inherent constraints of low-level descriptors and unlabeled data [39]. In contrast, the great potential of convolutional neural networks (CNNs) to uncover highly discriminating features from RS scenes makes deep learning one of the most prominent and successful trends [40]. Specifically, the so-called deep metric learning approach has recently shown excellent results in characterizing complex RS data [41]–[46]. In general, deep metric learning aims to project semantically similar images to nearby positions in the corresponding CNN-based metric space while separating dissimilar images according to their semantic annotations. Consequently, these techniques generally require vast amounts of supervised data to properly learn the complex semantic relationships associated with aerial scenes [42].
With the proliferation of different Earth Observation missions, the big data era of RS is a present-day reality [47]–[50]. However, the annotation of massive data becomes a major challenge in RS, since manually generating relevant ground-truth information is very expensive and time-consuming, which often makes the process unaffordable from an operational perspective and constrains the availability of labeled RS data for deep learning-based applications. In order to relieve this problem, two main annotation strategies have been adopted in the RS field: 1) aggregation and 2) crowd-sourcing. On the one hand, aggregation techniques [51], [52] make use of some sort of unsupervised methodology to group the data into a limited number of clusters. Then, each group of samples is manually annotated with the same semantic labels to reduce the final effort in processing the whole archive. On the other hand, crowd-sourcing methods [53], [54] take advantage of the geospatial semantic information available in different crowd-sourcing platforms [e.g., Google Maps, OpenStreetMap (OSM), CORINE Land Cover (CLC), etc.] to automatically generate the corresponding semantic annotations for the aerial scenes according to their geographic coordinates [55].
Whether the RS data are labeled using aggregation or crowd-sourcing, both procedures inevitably introduce label errors due to the RS large-variance problem, as well as other important factors. That is, the high intraclass and low interclass variability inherent to RS data [25] may cause semantically similar scenes to be grouped differently and consequently mislabeled. Besides, additional factors, such as geo-location/registration errors, land-cover changes, or even low-quality volunteered geographic information (VGI), can also introduce label noise, i.e., scenes which are annotated with semantic labels different from the real ones. All these deviations may potentially degrade any supervised RS image characterization scheme, and deep metric learning is no exception [42]. In fact, the frequent complexity of deep metric learning models (with many hyperparameters) makes them particularly prone to degradation by such noisy labels [56]. Although some efforts have recently been made in the literature to improve deep learning-based classifiers for some computer vision [57]–[59] and RS [60] applications, the lack of research on error-tolerant characterization methods within the RS field motivates the development of new deep metric learning models to effectively characterize complex RS archives with label noise.
In order to overcome these limitations, this article proposes a new RS image characterization approach, based on the newly defined robust normalized softmax loss (RNSL), which has been specially developed to deal with RS scenes with noisy annotations. Specifically, we first investigate the general application of deep metric learning to characterize airborne and spaceborne images with label noise. After analyzing the state of the art, we formulate the proposed RNSL by revisiting the normalized softmax loss (NSL) based on the negative Box–Cox transformation [61], with the objective of naturally reducing the contribution of those images with potential label noise when learning the corresponding class prototypes. Additionally, we also define an extension of the proposed loss, named the truncated robust normalized softmax loss (t-RNSL), to enforce learning the class prototypes based on the image features with higher similarities. In this way, intraclass feature variations can be further reduced and interclass features can be better separated in the resulting embedding space. To demonstrate the effectiveness of our contributions, we perform an extensive experimental comparison, involving two different RS benchmark archives and three downstream applications [K-nearest neighbor (K-NN) classification, clustering, and retrieval], that confirms the advantages of the proposed approach to characterize RS data with noisy annotations with respect to different state-of-the-art deep metric learning methods. Summarizing, the contributions of this article can be condensed into the following points: 1) To the best of our knowledge, we investigate, for the first time in the literature, the problem of deep metric learning-based RS image characterization with label noise, exposing that noisy annotations may have a high impact on state-of-the-art losses when characterizing RS scenes.
2) We propose a new loss function (RNSL) and its truncated extension (t-RNSL) to increase the noise-tolerance capability of the deep metric learning-based RS image characterization framework. 3) We widely explore how the presented approach performs in different downstream applications (K-NN classification, clustering, and retrieval) over several benchmark RS archives with label noise. This provides important insights about the working mechanism and advantages of the proposed losses with respect to other state-of-the-art functions, especially under heavy uniform label noise.

The remainder of this article is structured as follows. Section II describes some related works while highlighting the novelty of this work. Section III presents our deep metric learning-based RS image characterization scheme, including the two newly proposed loss functions. Section IV reports the conducted experimental comparison and discusses the results. Finally, Section V concludes the work and provides some hints at plausible future research lines.

II. RELATED WORK

A. RS Image Characterization
Generally, existing RS scene characterization methods can be categorized into three different types depending on the nature of the considered features [28]: 1) handcrafted; 2) unsupervised; and 3) deep learning-based. In handcrafted feature-based methods, low-level visual descriptors are extracted to represent RS images according to different elementary features, such as color, texture, and shape. For instance, some of the most popular descriptors used in RS are color histograms, local binary patterns, and the scale-invariant feature transform (SIFT) [29]–[31]. Despite their advantages, these straightforward methods are often unable to provide satisfactory results to characterize airborne and spaceborne optical data owing to the high semantic complexity of the RS image domain [25]. To improve the generalization capability, unsupervised methods make use of different unsupervised learning paradigms to encode the extracted features into a higher-level feature space. Among the most representative techniques used in RS, we can find sparse coding, topic modeling, and autoencoders [32]–[34]. Despite the positive results achieved by these and other unsupervised alternatives, the lack of supervised information generally reduces their intraclass discrimination ability, which may eventually become an important limitation when dealing with the large-variance problem in RS [39].
The recent development of deep learning technologies has attracted the attention of the RS research community owing to the excellent capabilities of CNNs to extract highly discriminating features from visual data [40]. In particular, the objective of these models is based on projecting the input data onto its corresponding label space using multiple nonlinear mappings and layers that are able to produce high-level characterizations very useful in RS [62]. For instance, it is the case of Li et al. who present in [63] an RS image classification approach which integrates multilayer features of different pretrained CNN models for characterizing the aerial scenes. Liu et al. also developed in [64] an image characterization method for RS that exploits multiscale CNN features based on spatial pyramid pooling. Similarly, Zheng et al. [65] proposed a deep scene representation technique that makes use of pretrained CNN features, multiscale pooling, and Fisher vectors to characterize RS images.
Notwithstanding the good performances of these and other related approaches [66], the so-called deep metric learning scheme has recently become a prominent trend to effectively represent RS scenes. Deep metric learning aims to project semantically similar images to nearby locations in the resulting CNN-based characterization space. As a result, this framework becomes highly suitable for modeling the complex semantic relationships inherent to large-scale variance RS data [28]. In general, it is possible to categorize most of the existing deep metric learning methods based on two formulations: 1) the contrastive loss and 2) the triplet loss. On the one hand, the contrastive embedding [67] is trained with paired data to minimize the distance between the two samples if they share the same class and to increase such distance (by a certain margin) if they belong to different classes. On the other hand, the triplet loss [68] considers triplets of samples (i.e., anchor, positive, and negative) with the objective of minimizing the distance between the anchor and its positive exemplar and also pushing the negative sample away from the anchor by a certain margin. Different works in the RS literature exemplify these two alternatives. For instance, Cheng et al. [42] developed the discriminative CNN (D-CNN), which imposes a contrastive-based metric learning regularization over an off-the-shelf CNN architecture. Following a similar inspiration, Yan et al. [43] presented a cross-domain adaptation based on hybrid color features to reduce the bias of the data distribution in the resulting embedding space. Cao et al. [69] made use of the triplet loss formulation to define a content-based RS image retrieval framework that considers both positive and negative aerial scenes when generating the embedding space.

B. Deep Metric Learning With Label Noise
In spite of the advantages of the deep metric learning scheme to characterize RS scenes, the task of sampling informative pairs or triplets from large-scale RS archives becomes very challenging because the probability of considering samples with inconsistent semantic annotations and relationships is logically affected by the data volume and noise [42]. Note that both factors (the data size and the presence of label noise) are important problems in RS due to the unprecedented availability of massive RS data, together with the operational limitations of annotating such large-scale image archives [47], [48]. Precisely, different strategies have been developed in the literature to alleviate these problems. For example, it is the case of the scalable neighborhood component analysis (SNCA) [70]. Specifically, this approach is built upon the neighborhood component analysis (NCA) [71] by including an augmented nonparametric memory for training the CNN model with a larger data scope to produce more general and robust features. Zhai and Wu [72] proposed the NSL which considers a classification-based metric learning approach to characterize and retrieve images by content. In more detail, NSL pursues to maximize the agreement between the corresponding class prototypes and the associated features of the same class in order to make the most challenging samples more relevant during training, which may certainly enhance the generalization capability of the model. Additionally, Deng et al. [73] presented the additive angular margin (ArcFace) to maximize the class separability while producing highly discriminating image features. In particular, ArcFace uses the arc-cosine function with an additive angular margin to optimize the distance between features and target weights with the objective of establishing the training process under the most complicated situations, e.g., intraclass large-variance and label noise. Yuan et al. 
[74] also defined an alternative distance metric based on the signal-to-noise ratio (SNR) to improve the discrimination ability and feature robustness. Despite all the conducted research, there is a lack of deep metric learning methods specifically designed for characterizing large-scale RS archives with label noise, which motivates the development of new error-tolerant models to account for the particular complexity of the RS image domain (where instrument types, sensing positions, or atmospheric effects may also generate important semantic deviations [28]).

C. Novelty of Our Work
In order to face these challenges, this article presents a novel deep metric learning scene characterization method which is particularly designed to deal with large-scale RS archives with noisy annotations. Whereas state-of-the-art deep metric learning models try to improve their robustness and generalization capability by exploiting data diversity and separability (e.g., [42], [68], [70], [72], [73]), even a few mislabeled RS scenes on class boundaries may have a strong impact on the final embedding space, since some decision boundaries could be easily modified by possible noisy label fluctuations. Note that the large-scale nature and inherent semantic complexity of RS data make the problem of label noise particularly challenging over these transitional regions. In this context, the proposed approach aims at reducing the negative effect of label noise over such critical regions by means of two novel loss functions: RNSL and its truncated extension t-RNSL. More specifically, we take advantage of the negative Box-Cox transformation [61] to enforce normal and symmetry conditions that allow the proposed RNSL loss to reduce the contribution of samples belonging to semantically uncertain regions that are particularly affected by noisy labels. Additionally, t-RNSL further improves the model robustness to label noise by thresholding the contribution of potentially mislabeled RS scenes.
Unlike other works available in the literature, the proposed approach gathers two important facets for the RS domain in an innovative manner: data scalability and label noise. On the one hand, we present an RS image characterization method which is built upon the rationale of the NSL function, owing to its prominent generalization capabilities with large-scale data [72], which is certainly a key factor in RS [47], [48]. On the other hand, we formulate two novel loss functions (RNSL and t-RNSL) in order to account for label noise when learning the corresponding RS image embeddings, since the presence of erroneous annotations is an important problem in the most challenging RS archives. In contrast to NSL [72], the proposed approach has been developed assuming that existing labels can be corrupted by noise. Hence, we integrate an effective mechanism to control the contribution of such noise to the gradient update with the aim of not misleading the process of learning the corresponding class prototypes. In other words, we reformulate the standard normalized version of the softmax loss in order to deal with the particular requirements of RS image collections with label noise. When compared to different state-of-the-art techniques, the presented RS image characterization model is able to provide remarkable performance improvements with respect to the methods in [42], [68], [70], [72], [73], [75], which also indicates the novelty and advantages of the proposed approach when dealing with label noise.

III. ROBUST NORMALIZED SOFTMAX LOSS
The proposed deep metric learning method for characterizing RS scenes with noisy labels mainly contains two parts: 1) a backbone CNN architecture for encoding the RS images into the associated features of a low-dimensional metric space and 2) a new loss function (RNSL) and its truncated extension (t-RNSL) for robustly learning the distance metrics of the RS images with label noise. Fig. 1 provides a graphical illustration of the proposed framework, where deep features, class prototypes, and noisy labels are involved in the defined loss formulation. We describe all the framework details, including the considered notations (Section III-A), a technical analysis of NSL (Section III-B) and the proposed losses (Section III-C).

A. Notations
Let X = {x_1, . . . , x_N} be an RS image data set consisting of N images with category labels, and Y = {y_1, . . . , y_N} be the associated set of labels, where each label is denoted by a one-hot vector, i.e., y_i ∈ {0, 1}^C and C is the total number of categories. When the image x_i is annotated with the cth class, the cth element of y_i is 1, i.e., y_i^c = 1, and the other elements are 0. In the context of deep metric learning, we denote F(·) as the CNN model which encodes the input image x_i into a low-dimensional feature f_i ∈ R^D with a dimension size of D. In this article, the features are L2-normalized, i.e., ||f_i||_2 = 1. Considering the label noise, we denote the noisy label set by Ŷ = {ŷ_1, . . . , ŷ_N}, where ŷ_i represents the noisy label vector. Here, we also assume that the noise is conditionally independent of the input images given the true labels [76], i.e., p(ŷ = k | y = c, x) = p(ŷ = k | y = c) = η_ck, where η_ck describes the noise rate, drawn as the (c, k)th component of a C × C probability transition matrix Q [77]. Two different kinds of noise are considered in this article: 1) uniform noise, where a true label is randomly flipped into any other label with equal probability η_ck = η/(C − 1), or preserved as the true label with probability η_cc = 1 − η; and 2) label-dependent noise, where a true label is more likely to be mislabeled as one particular class with probability η_ck = η, or preserved as the true label with probability η_cc = 1 − η. Fig. 2 illustrates an example of these two kinds of noise.
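As a concrete illustration of the uniform-noise case above, the transition matrix Q can be built directly from its definition. The following minimal plain-Python sketch (the class count and noise rate are arbitrary toy values, not the paper's settings) constructs Q and checks that each row is a valid probability distribution:

```python
def uniform_Q(C, eta):
    """C x C transition matrix for uniform noise:
    Q[c][c] = 1 - eta (label preserved), Q[c][k] = eta/(C - 1) otherwise."""
    return [[1.0 - eta if c == k else eta / (C - 1) for k in range(C)]
            for c in range(C)]

Q = uniform_Q(C=5, eta=0.3)
row_sums = [sum(row) for row in Q]   # every row must sum to 1
```

The label-dependent case would instead place the off-diagonal mass η on one semantically similar class per row.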

B. Details on NSL
One of the state-of-the-art methods for deep metric learning is based on NSL [72], which is formally described as
$$\mathcal{L}_{\text{NSL}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_i^c \log \frac{\exp\left(w_c^T F(x_i)/\sigma\right)}{\sum_{k=1}^{C} \exp\left(w_k^T F(x_i)/\sigma\right)}$$
where w_c ∈ R^D denotes the normalized weight vector of the class c (i.e., ||w_c||_2 = 1), and σ is the temperature parameter that controls the concentration of the sample distribution.
Given the true labels, minimizing L_NSL tends to maximize the agreement between w_c and the associated features of the same class. Therefore, w_c is often termed the class prototype. Let p_i^c represent the probability that the feature F(x_i) is aligned with the cth class prototype among all the others, i.e., p_i^c = exp(w_c^T F(x_i)/σ) / Σ_k exp(w_k^T F(x_i)/σ). By calculating the gradient of L_NSL with respect to w_c, we obtain
$$\frac{\partial \mathcal{L}_{\text{NSL}}}{\partial w_c} = -\frac{1}{N} \sum_{i=1}^{N} \frac{y_i^c}{p_i^c} \frac{\partial p_i^c}{\partial w_c}.$$
For the images belonging to the cth class, when their features are not well aligned with w_c, larger weights y_i^c/p_i^c are enforced on the term ∂p_i^c/∂w_c. In other words, during the learning phase of w_c, hard images receive more attention than images that can be easily discriminated, and they contribute more to the gradient update of w_c. However, when the true labels are corrupted by noise, the wrongly categorized images can dominate the gradient update, which misleads the learning of the associated class prototype. As illustrated in Fig. 3, the produced class prototype may be closer to the features of the wrongly categorized images in the feature space. This also limits the learning of the CNN models utilized for generating the features of the input RS images, since their features can be pushed toward features from different classes in the feature space.
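The weighting behavior described above can be observed in a small plain-Python sketch (toy 2-D prototypes and features, not learned embeddings): a feature poorly aligned with its labeled prototype yields a small p_i^c and hence a large 1/p_i^c gradient weight, which is exactly why mislabeled images can dominate the update:

```python
import math

def align_probs(w, f, sigma):
    """p^c = softmax(w_c . f / sigma): alignment probability of feature f
    with each class prototype in w."""
    logits = [sum(a * b for a, b in zip(wc, f)) / sigma for wc in w]
    m = max(logits)                              # stabilize the softmax
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

w = [[1.0, 0.0], [0.0, 1.0]]                     # two unit-norm class prototypes
p_easy = align_probs(w, [1.0, 0.0], sigma=0.5)   # feature aligned with class 0
p_hard = align_probs(w, [0.0, 1.0], sigma=0.5)   # feature far from class 0

# NSL gradient weight y^c / p^c for a sample labeled c = 0:
weight_easy, weight_hard = 1.0 / p_easy[0], 1.0 / p_hard[0]
```

If the "hard" feature is hard because its label is wrong, its inflated weight drags the prototype toward the wrong class.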

C. RNSL
To solve these problems, we propose a novel loss formulation for deep metric learning which is more robust to images with noisy labels. Inspired by the results achieved in other domains [76], we utilize the negative Box–Cox transformation [61], due to its normal and symmetrical properties, as the loss function with the expression
$$\mathcal{L}_{\text{RNSL}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \hat{y}_i^c \frac{1 - (p_i^c)^q}{q}, \quad q \in (0, 1).$$
In contrast to the generalized cross entropy (GCE) proposed in [76], RNSL has the capability of robustly learning the distance metrics of the images by enforcing the alignment between the class-wise prototypes and the associated image features in the feature space. By calculating the gradient of L_RNSL with respect to w_c, we obtain
$$\frac{\partial \mathcal{L}_{\text{RNSL}}}{\partial w_c} = -\frac{1}{N} \sum_{i=1}^{N} \hat{y}_i^c (p_i^c)^q \frac{1}{p_i^c} \frac{\partial p_i^c}{\partial w_c}.$$
Since p_i^c ∈ [0, 1] and q ∈ (0, 1), the factor (p_i^c)^q has a down-weighting effect on −(1/p_i^c)(∂p_i^c/∂w_c) for each image. We plot the values of p^q with respect to the variations of p, for q = 0.1, 0.3, 0.5, 0.7, in Fig. 4(a). As p decreases from 1 to 0, p^q decreases increasingly fast. In other words, for the images with noisy labels, smaller weights (p_i^c)^q are imposed on −(1/p_i^c)(∂p_i^c/∂w_c) compared with the other images. Thus, the optimization of w_c depends more on the gradients calculated on the images with true labels than on those with noisy labels. From the loss function perspective, we display the values of (1 − p^q)/q (q = 0.1, 0.3, 0.5, 0.7) and −log(p) with respect to the different values of p in Fig. 4(b). Compared to the loss function −log(p) utilized in NSL, the term (1 − p^q)/q of L_RNSL puts less emphasis on small values of p. Therefore, minimizing the loss values contributed by the images with noisy labels does not yield further gains, which improves the robustness of learning w_c and F(·) against label noise.
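The down-weighting effect of the negative Box–Cox term can be checked numerically. The sketch below (toy probability values; q = 0.7 matches the experimental setting later in the article) contrasts the per-sample NSL and RNSL losses and their gradient weights:

```python
import math

def nsl_term(p):
    """NSL per-sample loss: -log p."""
    return -math.log(p)

def rnsl_term(p, q=0.7):
    """RNSL per-sample loss: negative Box-Cox transformation (1 - p^q)/q."""
    return (1.0 - p ** q) / q

p_clean, p_noisy = 0.9, 0.05   # confident sample vs likely-mislabeled sample

# Gradient weight multiplying -(1/p) dp/dw: NSL uses 1, RNSL uses p^q < 1.
w_clean, w_noisy = p_clean ** 0.7, p_noisy ** 0.7
```

Note that (1 − p^q)/q is bounded above by 1/q, whereas −log p diverges as p → 0, which is the source of the robustness.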
Lemma 1: lim_{q→0} L_RNSL = L_NSL.
Proof: Based on L'Hôpital's rule, we have [76]
$$\lim_{q \to 0} \frac{1 - p^q}{q} = \lim_{q \to 0} \frac{-p^q \log p}{1} = -\log p.$$
As can be observed, the smaller the value of q, the closer L_RNSL approximates L_NSL. Therefore, we can consider the proposed loss function L_RNSL as a robust version of NSL against label noise for deep metric learning.
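Lemma 1 can also be verified numerically: as q shrinks, (1 − p^q)/q approaches −log p. A minimal check with an arbitrary probability value:

```python
import math

def rnsl_term(p, q):
    """Negative Box-Cox loss term (1 - p^q)/q."""
    return (1.0 - p ** q) / q

p = 0.3
# Gap to the NSL term -log p for progressively smaller q.
gaps = [abs(rnsl_term(p, q) - (-math.log(p))) for q in (0.5, 0.1, 0.01)]
```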
Although the loss values of L_RNSL for small values of p_i^c are suppressed, the images with noisy labels still contribute to the learning of the class prototypes w_c. In order to further improve the robustness of L_RNSL, it is better to prevent the update of w_c from being influenced by the gradient directions produced by the images with noisy labels. To achieve this goal, a truncated version of L_RNSL is also proposed
$$\ell_{t\text{-RNSL}}(p_i^c) = \begin{cases} \dfrac{1 - k^q}{q}, & p_i^c \le k \\[4pt] \dfrac{1 - (p_i^c)^q}{q}, & p_i^c > k \end{cases}$$
where k ∈ (0, 1) represents a threshold value. When p_i^c is less than the threshold k, the loss induced by p_i^c is clipped to a constant value. This leads to zero gradients of L_t-RNSL with respect to w_c, so the associated p_i^c cannot make any contribution to the learning of w_c. In other words, the images with noisy labels have a great potential to be "thrown out" during the learning phase based on L_t-RNSL. To practically optimize L_t-RNSL, it can be further formulated as
$$\mathcal{L}_{t\text{-RNSL}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \hat{y}_i^c \left( [p_i^c > k] \frac{1 - (p_i^c)^q}{q} + [p_i^c \le k] \frac{1 - k^q}{q} \right)$$
where [·] denotes the Iverson bracket, which takes 1 for those input values that make the argument statement true and 0 otherwise. In practice, one can store an indicator array representing whether p_i^c > k is triggered for each image x_i. However, at the initial state of the CNN models (less than a certain number of training epochs), the generated features are not discriminative enough to be exploited for "filtering out" plausible images with noisy labels. In this regard, the training of the CNN models based on L_t-RNSL is conducted in the following two steps: 1) within the first T epochs, the CNN models are trained with L_RNSL; 2) after T epochs, the loss function is switched to L_t-RNSL. The first step utilizes L_RNSL to train the CNN models until the generated features become discriminative, and then L_t-RNSL takes over to prune hard images, fine-tuning the CNN models so that they are not affected by noisy images.
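The truncation and the two-step training schedule can be sketched as follows (plain Python, with the article's settings q = 0.7, k = 0.5, T = 40; `loss_for_epoch` is a hypothetical helper added here only to illustrate the loss switch):

```python
def rnsl_term(p, q=0.7):
    """RNSL per-sample loss (1 - p^q)/q."""
    return (1.0 - p ** q) / q

def t_rnsl_term(p, q=0.7, k=0.5):
    """Truncated RNSL: clip the loss to a constant when p <= k,
    so such samples yield zero gradient."""
    if p <= k:
        return (1.0 - k ** q) / q   # constant -> no gradient contribution
    return rnsl_term(p, q)

def loss_for_epoch(p, epoch, T=40, q=0.7, k=0.5):
    """Hypothetical schedule: warm up with RNSL for T epochs, then switch."""
    if epoch < T:
        return rnsl_term(p, q)
    return t_rnsl_term(p, q, k)
```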

IV. EXPERIMENTS

A. Experimental Setup
We conduct extensive experiments based on two RS benchmark data sets: 1) the Aerial Image Data set (AID) [78] and 2) NWPU-RESISC45 [28]. We specifically select these two collections because they are both complex RS image archives in terms of data volume and semantic complexity, which makes them particularly challenging, yet reliable, benchmarks under label noise. For details about the data sets, we refer the readers to the associated articles. We randomly split the data sets into training, validation, and test sets with percentages of 70%, 10%, and 20%, respectively. For the training sets, the associated labels are corrupted by uniform and label-dependent noise with the noise rate η equal to 0.1, 0.3, 0.5, and 0.7. For example, when η = 0.5, we plot the probability transition matrix Q for the uniform and label-dependent noise based on the AID labels in Fig. 5(a) and (b), respectively. For uniform noise, the probability that the original labels are preserved is 1 − η, and they are randomly changed to other labels with the equal probability η/(C − 1). For label-dependent noise, we preserve the original labels with a probability of 1 − η; each noisy label is then flipped into another class according to the probability transition matrices detailed in the Appendix. Note that, in the case of label-dependent noise, we design Q in such a way that noisy annotations are based on the semantic similarities among the land-use or land-cover classes, in order to make the mislabeling process as realistic as possible. Besides, Fig. 6 shows the class-wise numbers of true and noisy labels of the AID data set under uniform noise with η = 0.3. The validation and test sets are exploited for the evaluation in the training and test phases, respectively.
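The uniform-noise corruption of the training labels can be sketched as follows (a toy 30-class label list stands in for the real AID annotations). With η = 0.3, roughly 70% of the labels should survive:

```python
import random

def corrupt_labels(labels, eta, C, seed=0):
    """Uniform label noise: keep a label with probability 1 - eta,
    otherwise flip it uniformly to one of the other C - 1 classes."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < eta:
            y = rng.choice([c for c in range(C) if c != y])
        noisy.append(y)
    return noisy

labels = [i % 30 for i in range(9000)]   # toy stand-in for AID-style labels, C = 30
noisy = corrupt_labels(labels, eta=0.3, C=30)
kept = sum(a == b for a, b in zip(labels, noisy)) / len(labels)
```

Label-dependent noise would instead draw the replacement class from the row of a hand-designed transition matrix Q.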
To evaluate the performance of the proposed method on the generation of distance metrics among RS images, we carry out several downstream tasks: 1) K-NN classification; 2) clustering; and 3) image retrieval.

1) K-NN Classification:
For the test images, their labels can be determined by majority voting over their K-NNs retrieved from the training set, as measured by the Euclidean distances in the feature space. Note that, in this evaluation phase, the true labels of the training set are exploited for K-NN classification. We evaluate the classification performance based on the overall accuracy (ACC).
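The majority-voting rule can be illustrated with a minimal plain-Python sketch (toy 2-D features and made-up class names stand in for the learned 128-D embeddings):

```python
from collections import Counter

def knn_predict(query, feats, labels, K=3):
    """Majority vote over the K nearest training features
    (squared Euclidean distance, which preserves the ranking)."""
    dists = [sum((q - x) ** 2 for q, x in zip(query, f)) for f in feats]
    nearest = sorted(range(len(feats)), key=dists.__getitem__)[:K]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

feats = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0]]
labels = ["forest", "forest", "forest", "urban", "urban"]
pred_forest = knn_predict([0.05, 0.05], feats, labels)
pred_urban = knn_predict([0.95, 1.0], feats, labels)
```

The experiments use K = 10 over the full training set; the principle is identical.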
2) Clustering: We apply K-means clustering on the extracted features from the test images. Then, the clustering results are evaluated by the normalized mutual information (NMI) [79] and the unsupervised clustering ACC, described as follows:
$$\text{NMI}(Y, C) = \frac{2\, I(Y; C)}{H(Y) + H(C)}$$
where Y represents the ground-truth class labels, C denotes the cluster labels produced by the clustering method, and I(·;·) and H(·) represent the mutual information and entropy function, respectively, and
$$\text{ACC} = \max_{M} \frac{\sum_{i=1}^{N} \delta\left(l_i, M(c_i)\right)}{N}$$
where l_i denotes the ground-truth class, c_i is the assigned cluster of image x_i, and δ(·, ·) represents the Kronecker delta, which equals 1 when its arguments coincide and 0 otherwise. M is a function that finds the best mapping between the cluster-assigned labels and the ground-truth labels. These two metrics measure the discrimination of the generated features in the feature space.
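The best-mapping ACC can be computed by brute force over label permutations for a toy example (in practice the optimal M is usually found with the Hungarian algorithm; this sketch assumes only a handful of clusters):

```python
from itertools import permutations

def clustering_acc(true_labels, cluster_ids):
    """ACC = max over mappings M of (1/N) * sum delta(l_i, M(c_i)).
    Brute-force search over permutations; feasible only for few clusters."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best = 0.0
    for perm in permutations(classes, len(clusters)):
        M = dict(zip(clusters, perm))
        hits = sum(M[c] == l for l, c in zip(true_labels, cluster_ids))
        best = max(best, hits / len(true_labels))
    return best

acc_perfect = clustering_acc([0, 0, 1, 1], [1, 1, 0, 0])  # permuted but exact clustering
acc_random = clustering_acc([0, 0, 1, 1], [0, 1, 0, 1])   # uninformative clustering
```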
3) Image Retrieval: Image retrieval aims to accurately and effectively find the most semantically similar images in a database for given query images, based on the similarities of their features. Such similarities are often measured by the Euclidean distance in the feature space. To evaluate the image retrieval performance, we present the precision-recall (PR) curve and calculate the mean average precision (MAP), whose per-query average precision (AP) takes the form
$$\text{AP} = \frac{1}{Q} \sum_{r} P(r)\, \delta(r)$$
where Q is the number of ground-truth RS images in the data set that are relevant with respect to the query image, P(r) denotes the precision for the top r retrieved images, and δ(r) is an indicator function specifying whether the rth retrieved image is truly relevant to the query. MAP is then obtained by averaging the AP values over all queries. For image retrieval, the test sets are exploited as queries, and the training sets are the databases to be retrieved.

The proposed method is implemented in PyTorch [80]. In this article, we make use of ResNet18 [81] as the CNN backbone architecture for feature extraction. Although other models could also be adopted, we utilize ResNet18 for the sake of simplicity, since it offers a reasonable trade-off between complexity and performance, considering the multitask nature of the work. Additionally, the images are resized to 256 × 256 pixels, and data augmentation strategies including 1) RandomGrayscale; 2) ColorJitter; and 3) RandomHorizontalFlip are utilized. The parameters D, σ, k, q, and T are set to 128, 0.05, 0.5, 0.7, and 40, respectively. The stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.01 is adopted for optimizing the loss. The learning rate is decayed by 0.5 every 30 epochs. The batch size is 256, and we train the CNN models for a total of 100 epochs. For validating the effectiveness of the proposed method, we compare it with several state-of-the-art deep metric learning methods including: 1) D-CNN [42]; 2) Triplet [68]; 3) SNCA [70]; 4) NSL [72]; 5) ArcFace [73]; and 6) MAE [75].
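The per-query AP defined above can be sketched in plain Python; the binary relevance list below is toy data standing in for a ranked retrieval result:

```python
def average_precision(relevant_flags):
    """AP = (1/Q) * sum_r P(r) * delta(r) over a ranked list of 0/1
    relevance flags, where Q is the total number of relevant items."""
    Q = sum(relevant_flags)
    if Q == 0:
        return 0.0
    hits, ap = 0, 0.0
    for r, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            ap += hits / r        # precision P(r) at each relevant rank
    return ap / Q

# One query whose ranked results are relevant at ranks 1 and 3: AP = (1/1 + 2/3)/2.
ap = average_precision([1, 0, 1, 0])
```

MAP is simply the mean of `average_precision` over all query images.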
For training all these methods, we fine-tune the associated learning rates (on the corresponding validation sets) while keeping the aforementioned parameters fixed, to ensure a fair experimental comparison. It is important to note that ResNet18 is used as the feature extractor for all the considered methods. Besides, given the multitask nature of our work, we only consider techniques that can be framed within the scheme of deep metric learning with label noise. All the experiments are performed on a single NVIDIA Tesla P100 graphics processing unit (GPU).
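The AP/MAP evaluation described above can be sketched as follows (the helper names and the boolean-relevance input format are ours; each ranked result list encodes δ(r)):

```python
def average_precision(relevant, Q):
    """AP = (1/Q) * sum_r P(r) * delta(r), where `relevant` is the
    boolean relevance of each ranked result and Q is the number of
    relevant images in the database."""
    hits, ap = 0, 0.0
    for r, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            ap += hits / r  # P(r): precision over the top r results
    return ap / Q

def mean_average_precision(ranked_relevance, counts):
    """MAP: average of the per-query AP scores."""
    return sum(average_precision(rel, q)
               for rel, q in zip(ranked_relevance, counts)) / len(counts)
```

For instance, a query with two relevant images retrieved at ranks 1 and 3 yields AP = (1/2)(1/1 + 2/3) = 5/6.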

1) K-NN Classification:
Based on the CNN models trained with noisy data at different noise rates, we conduct K-NN classification (K = 10) experiments on the test sets and report the results in Table I. For uniform noise, t-RNSL achieves the best performance under all noise rates on the two data sets. RNSL achieves the second-best performance on both RS data sets when η is less than 0.5, which indicates that the negative Box-Cox transformation indeed mitigates the effect of images with noisy labels on the learning of the CNN models. However, when the noise rate is large, e.g., η = 0.7, the large amount of noisy data can still confuse the learning of the class prototypes and the CNN models: although the losses induced by the images with lower p_{c_i} are down-weighted in RNSL, such images still contribute considerably to the learning of the class prototypes, so the trained CNN models cannot accurately capture the geometric structure of the features in the feature space. In this case, the truncated version (t-RNSL) further improves robustness through its pruning strategy; accordingly, when η = 0.7, the classification results of t-RNSL outperform those of RNSL by large margins. Although MAE is also a robust loss for classification with noisy labels, its gradients treat easily classified and hard-classified images equally. This is not beneficial for capturing discriminative features of RS scenes with complex semantics, i.e., hard positives. Therefore, MAE is robust across noise levels, but its overall performance is not comparable with that of t-RNSL. For the other losses except ArcFace, the classification performance significantly degrades as the noise level increases. ArcFace exploits an angular margin parameter that enforces the compactness of the intraclass features,
and it thus also demonstrates robustness against noisy labels compared with the other baseline losses. For label-dependent noise, t-RNSL achieves classification accuracies comparable to those of the best state-of-the-art methods, e.g., D-CNN and Triplet, when the labels are corrupted at different noise rates. In summary, for both types of label noise at different noise rates, the proposed method achieves superior K-NN classification performance compared with the other state-of-the-art methods.
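The loss behavior discussed above can be sketched with L2-normalized features and class prototypes. This is an illustrative sketch, not the official implementation: the exact formulation is paraphrased from the text, assuming the negative Box-Cox transformation (1 - p^q)/q in place of -log p, and a generalized-cross-entropy-style truncation that flattens the loss (removing its gradient) whenever p_{c_i} falls below the threshold k:

```python
import numpy as np

def rnsl(features, prototypes, labels, sigma=0.05, q=0.7, k=None):
    """Sketch of RNSL (k=None) and t-RNSL (k set to a threshold)."""
    # Project features and prototypes onto the unit hypersphere.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = f @ w.T / sigma                      # cosine similarity / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p_c = p[np.arange(len(labels)), labels]       # probability of the labeled class
    loss = (1.0 - p_c ** q) / q                   # negative Box-Cox replaces -log
    if k is not None:
        # t-RNSL: images with p_c <= k get a constant (gradient-free) loss,
        # pruning likely-noisy images from prototype learning.
        loss = np.where(p_c > k, loss, (1.0 - k ** q) / q)
    return loss.mean()
```

With q → 0 the Box-Cox term recovers -log p_c (the NSL case), while q = 1 recovers an MAE-like loss, which is why an intermediate q such as 0.7 trades noise robustness against fitting ability.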
2) Clustering: Tables II and III report the NMI and ACC scores of the considered methods on the two benchmark data sets with the two types of noise at different noise rates. For uniform noise, consistently with the above analysis, the K-means clustering results based on t-RNSL reach the best or second-best agreement between the produced pseudo-labels and the associated ground-truth labels. When the labels are corrupted by heavy noise, e.g., η = 0.7, t-RNSL presents much higher NMI and ACC scores than most of the other considered methods on the two RS data sets. Moreover, in Fig. 7, we extract the features of the two test sets via the CNN models trained with labels corrupted by uniform noise (η = 0.5) and project them into the 2-D space by t-distributed stochastic neighbor embedding (t-SNE). It can be clearly observed that the intraclass features are more compact and the interclass features are better separated with t-RNSL than with the other considered methods. Therefore, the features produced by t-RNSL can be better clustered by K-means in the feature space, which leads to higher NMI and ACC scores than the other methods. For label-dependent noise, t-RNSL also achieves the best performance compared with the other losses when η = 0.1, 0.3, and 0.5.
3) Image Retrieval: Fig. 8 shows the PR curves describing the image retrieval performance of all the methods, based on the CNN models trained on the training sets with uniform noise (η = 0.5): (a) AID and (b) NWPU-RESISC45. It can be observed that t-RNSL exhibits higher precision and recall scores than the other considered methods when the CNN models are trained on data sets with uniform label noise. Therefore, t-RNSL can be exploited for large-scale RS image retrieval tasks when the label annotations are not accurate. Table IV displays the MAP scores of the considered methods on the two benchmark data sets with the two types of noise at different levels when R = 20. Consistently with the above experiments, for uniform noise, the image retrieval results obtained by t-RNSL are superior to those of the other losses. When η = 0.1, SNCA achieves the best retrieval performance, and the proposed losses perform only slightly worse. By stochastically maximizing the leave-one-out K-NN score, SNCA can better discover the inherent neighborhood structure among the images in the feature space than the class-prototype-based deep metric learning losses, such as NSL. However, t-RNSL outperforms SNCA by large margins when the training data contain high proportions of uniform label noise. For label-dependent noise, t-RNSL preserves the best retrieval performance of all the considered methods when η is less than 0.7.

4) Hyperparameter Analysis: The two main parameters of the proposed losses are q and k, where q controls the power of p_{c_i} and k is the threshold above which the loss affects the learning of the class prototypes w_c. We analyze their performance sensitivity based on K-NN classification conducted on the features of the two test sets, extracted via the CNN models trained on the training sets under uniform noise with η = 0.5.
Regarding the temperature parameter σ, following the theoretical analysis in [82], 1/σ is the radius of the hypersphere onto which the features are projected. Larger values of 1/σ ensure sufficient space for feature learning with a large expected margin, so that the class-wise features can be discriminatively separated. Therefore, it is recommended to set σ to a relatively small value [82], [83], e.g., σ = 0.05, and we keep it constant in all the experiments of this article. Fig. 9 shows the parameter sensitivity analysis results. It suggests that the proposed losses favor a relatively large value of q: for both data sets, the highest performance is observed when q = 0.7. This is consistent with Fig. 4, since a larger q leads to a stronger down-weighting of the hard images in the learning of w_c. As for k, the effectiveness of the proposed methods is not much influenced by variations of its value.

C. Discussion
According to our experimental results, t-RNSL achieves the best performance on all three considered tasks (K-NN classification, clustering, and image retrieval) under the deep metric learning scheme when the training labels are corrupted by uniform noise, and the associated performance does not degrade much as the noise rate increases. Compared with RNSL, t-RNSL is more robust against label noise, since its truncated loss implements a pruning strategy that prevents wrongly labeled images from contributing strongly to the learning of the class prototypes, especially when the noise rate is high. For practical applications, both RNSL and t-RNSL can be adopted for robustly learning image features when only a small proportion of the images is wrongly labeled (i.e., η < 0.5); when the noise rate is high (i.e., η ≥ 0.5), t-RNSL is recommended. As indicated by the t-SNE results, most of the compared methods cannot robustly discover the locality structure of the features in the feature space, as their features cannot be separated class-wise. For class-dependent noise, we observe that all the considered methods perform stably when the noise level is not high, and the proposed method achieves the best accuracies. When η = 0.7, the D-CNN and Triplet losses achieve the best performance. A plausible reason is that D-CNN and Triplet are optimized with a relationship-based training mechanism, i.e., pairwise or triplet: as long as the pairwise or triplet relationships among images are correct, these losses guide the learning well, no matter whether the label of each individual image is correct. Compared with these relationship-based losses, the performance of t-RNSL is slightly lower in this setting.
Regarding the hyperparameter setting of t-RNSL, q and k can be kept constant (i.e., q = 0.7 and k = 0.5) for all the experiments with different noise rates, so that parameter tuning does not require much effort while the effectiveness of the proposed loss is preserved. The other parameter, T, which triggers the pruning procedure of noisy images, can also be set to a constant value, e.g., T = 30, when the noise level is low. Under heavy noise, e.g., η = 0.7, its value should be adapted so that most noisy images do not mislead the learning of the class prototypes in the early stages.

V. CONCLUSION
This article presents a novel loss formulation for robust deep metric learning of RS images annotated with label noise. To improve the robustness of the NSL commonly utilized for deep metric learning, we introduce the RNSL, which replaces the logarithm function with the negative Box-Cox transformation in order to down-weight the contributions of noisy images to the learning of the class prototypes. Moreover, by truncating the loss with a certain threshold, the proposed t-RNSL enforces the learning of class prototypes based on the features with high similarities to them, so that intraclass features are well grouped and interclass features are well separated. Compared with several state-of-the-art metric learning losses, the proposed losses demonstrate better performance on K-NN classification, clustering, and image retrieval conducted on the features extracted by the trained CNN models. In practice, the proposed losses can be utilized for feature learning and image retrieval with large-scale RS data without precise annotations. As future work, we would like to extend the proposed losses to the multilabel case by exploring the use of semantic prototype mixtures. In addition, noise rate estimation is worth further study in future developments.

APPENDIX
In order to simulate realistic cases of label-dependent noise for RS scenes, we manually create its probability transition matrix Q based on the visual semantic similarities among the different land-use and land-cover classes. For the two benchmark data sets, we flip each label to other labels with similar semantic content based on the probabilities shown in Tables V and VI for η = 0.5. Note that, when considering other noise rates, the original labels are preserved with probability 1 − η, and the remaining probability values are proportionally rescaled.
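The corruption procedure above can be sketched as follows. As an assumption for illustration, each row of Q here is rewritten as a distribution over the wrong labels only (zero on the diagonal), which is then scaled by η so that the original label is kept with probability 1 − η:

```python
import numpy as np

def corrupt_labels(labels, Q, eta, rng=None):
    """Flip each label according to a row-stochastic transition matrix.

    Q[c] is assumed to hold flip probabilities among the classes
    semantically similar to c (summing to 1, with Q[c][c] = 0); eta is
    the overall noise rate.
    """
    rng = np.random.default_rng(rng)
    noisy = []
    for c in labels:
        probs = eta * Q[c]        # scale off-diagonal mass to the noise rate
        probs[c] = 1.0 - eta      # keep the original label with prob 1 - eta
        noisy.append(rng.choice(len(probs), p=probs))
    return np.array(noisy)
```

With η = 0 all labels are preserved, and with η = 1 every label is flipped to one of its semantically similar classes, matching the boundary behavior described above.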