Rotation-Invariant Deep Embedding for Remote Sensing Images

Endowing convolutional neural networks (CNNs) with the rotation-invariant capability is important for characterizing the semantic contents of remote sensing (RS) images since they do not have typical orientations. Most of the existing deep methods for learning rotation-invariant CNN models are based on the design of proper convolutional or pooling layers, which aims at predicting the correct category labels of rotated RS images equivalently. However, few works have focused on learning rotation-invariant embeddings in the framework of deep metric learning for modeling the fine-grained semantic relationships among RS images in the embedding space. To fill this gap, we first propose a rule that the deep embeddings of rotated images should be closer to each other than those of any other images (including images belonging to the same class). Then, we propose to maximize the joint probability of the leave-one-out image classification and rotational image identification. With the assumption of independence, such optimization leads to the minimization of a novel loss function composed of two terms: 1) a class-discrimination term and 2) a rotation-invariant term. Furthermore, we introduce a penalty parameter that balances these two terms and propose a final loss for learning Rotation-invariant Deep embeddings of RS images, termed RiDe. Extensive experiments conducted on two benchmark RS datasets validate the effectiveness of the proposed approach and demonstrate its superior performance when compared to other state-of-the-art methods. The code of this article will be publicly available at https://github.com/jiankang1991/TGRS_RiDe.


I. INTRODUCTION
Remote sensing (RS) images have been widely used in multiple applications related to Earth observation, such as object detection and recognition [1]–[5], land-use or land-cover classification [6]–[12], disaster monitoring, and management of natural resources [13], [14], among others. All these tasks require an accurate characterization of RS scenes, from which semantic concepts should be precisely captured. Therefore, numerous methods for RS scene interpretation have been developed in recent years [15]–[17].
Generally, existing RS scene characterization methods can be categorized into two main groups: 1) handcrafted feature-based methods [18], [19] and 2) data-driven feature-based methods [20]. In contrast to the data-driven ones, handcrafted features are mainly constructed by applying color, texture, or histogram of oriented gradients (HOG) descriptors to the RS images [21]–[24]. Although they demonstrate prominent performance in scene interpretation tasks (e.g., classification), there is still room for improvement, especially for RS scenes with complex semantic contents. Data-driven feature-based methods aim at automatically learning or discovering the image descriptors by optimizing certain objective functions based on the training data, e.g., sparse coding [25], [26]. With the rapid development of deep learning methods, convolutional neural networks (CNNs) have been widely exploited for capturing the high-level semantic information of RS scenes in an end-to-end manner [27]. By enforcing intraclass compactness and interclass separation, deep metric learning has been recently adopted for accurately capturing the complex semantics of RS scenes into low-dimensional vectors, termed embeddings [28]–[31]. This approach has been used successfully for RS scene classification and image retrieval tasks. In order to train CNN models under deep metric learning assumptions, most deep metric learning methods require supervised information (such as image annotations) to construct image pairs or triplets with known semantic relationships, where images sharing the same label are considered semantically similar and images with different labels are considered dissimilar. Then, loss functions are designed to pull together the intraclass deep embeddings and push apart those from different classes. Similar deep embedding ideas have even succeeded in 3-D point cloud processing via a dimensionality reduction of point features originally encoded by a neural network [32]. However, these methods may not discover a fine-grained embedding space for RS scene characterization, since grouping intraclass deep embeddings may not be beneficial for accurately modeling their local relationships. Consequently, this may lead to image retrieval systems that cannot accurately rank the retrieved images based on their actual visual semantics with respect to the query. Moreover, unlike other kinds of images, RS scenes do not have typical orientations because they are captured by airborne and spaceborne sensors; that is, the same semantic scene may appear at different geolocations, the only difference between the instances being their orientations.
In the context of characterizing RS scenes via deep metric learning, all the images belonging to the same category are expected to produce similar representations (no matter whether they have been rotated or not). Therefore, the possibility of robustly learning the corresponding deep embeddings is highly beneficial to efficiently exploit these RS image characterizations in different downstream applications (e.g., classification and retrieval). Although equivariance and invariance are certainly two important aspects in this type of representation learning [33], the rotation-invariant alternative is more convenient for guaranteeing the locality structure among RS images in the metric space. Therefore, CNN models trained based on deep metric losses should be rotation-invariant. In order to solve the above issues, we introduce a novel loss function for learning a hierarchical structure of the deep embedding space (as shown in Fig. 1), which satisfies the following conditions.
1) Rotational Invariance: Given the embedding of a source image, its nearest neighbors should be the embeddings of its rotated versions.
2) Class Discrimination: The intraclass embeddings are grouped together and the interclass ones are separated.
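These two conditions can be summarized, for illustration, as a chain of inequalities on the cosine similarity $s(\cdot,\cdot)$ between normalized embeddings: for an anchor embedding $\mathbf{f}_i$, any of its rotated versions $\mathbf{f}_i^{\mathrm{rot}}$, any same-class embedding $\mathbf{f}_j$, and any different-class embedding $\mathbf{f}_k$ (this compact notation is ours, for illustration only),

$$s\left(\mathbf{f}_i, \mathbf{f}_i^{\mathrm{rot}}\right) \;>\; s\left(\mathbf{f}_i, \mathbf{f}_j\right) \;>\; s\left(\mathbf{f}_i, \mathbf{f}_k\right).$$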
To simultaneously achieve these two conditions, we maximize the joint probability of the leave-one-out image classification and rotational image identification. With the assumption of independence, we develop a new loss function composed of two terms: 1) class discrimination and 2) rotation invariance. To balance these terms, we introduce a penalty parameter and further propose a final loss function to learn Rotation-invariant Deep embeddings for RS images, termed RiDe. Overall, the main contributions of this article can be summarized as follows.
1) To the best of our knowledge, this is the first work that focuses on analyzing the rotation-invariant capability of CNN models when extracting deep embeddings for RS images.
2) We introduce a novel metric learning loss function, RiDe, which can endow CNN models with both rotation-invariant and class-discrimination capabilities. To achieve rotation-invariant deep embeddings, we aim at generating a hierarchical structure in the deep embedding space, which satisfies the following conditions: 1) the intraclass embeddings are grouped and the interclass ones are separated and 2) given the embedding of a source image, its nearest neighbors should be the embeddings of its rotated versions.
3) Our newly developed RiDe can be adapted to guide the training of any CNN model in a plug-and-play manner.
4) Based on our extensive experiments, we conclude that RiDe exhibits prominent performance in the generation of rotation-invariant deep embeddings compared with other state-of-the-art losses.

The remainder of this article is organized as follows. Section II introduces related works. Section III thoroughly describes the proposed rotation-invariant deep embedding for RS images. Section IV presents the conducted experiments and discusses the obtained results. Section V concludes this article with some remarks and hints at plausible future research lines.

II. RELATED WORK

A. Deep Learning-Based RS Scene Classification and Retrieval
Recently, deep learning techniques have drawn significant attention in the RS field, and extensive research has been carried out with the goal of characterizing the semantic contents of RS scenes. For example, Zheng et al. [34] proposed a new deep scene classification method that exploits pretrained CNN features, multiscale pooling, and Fisher vectors to generate invariant CNN features while enhancing their discriminative capability. Li et al. [35] presented a multilayer feature fusion method based on different pretrained CNN models for RS scene characterization. To classify complex RS scenes, spatial pyramid pooling (combined with multiscale CNN features) was exploited in [36]. Aiming at modeling the semantic relationships among RS images in the embedding space, deep metric learning has become an important trend for effectively capturing the semantic contents of RS scenes. Cheng et al. [29] exploited a pairwise loss as a regularizer, together with the cross-entropy (CE) loss, to improve the class-discrimination capability of CNN models. Yan et al. [30] adopted a deep metric learning strategy for reducing the data distribution bias in the embedding space and further proposed a domain-adaptation method for RS scene classification. In order to obtain an image retrieval system that is robust against variations of RS images, Yun et al. [37] introduced a novel triangular loss function within a coarse-to-fine training framework. Xu et al. [38] developed a sketch-based RS image retrieval (SBRSIR) framework for searching images in a scalable RS database based on hand-drawn sketches. Li et al. [39] proposed a meta-learning-based method for few-shot RS scene classification, where a balanced loss (which maximizes the distance between different categories) was developed. For more details about deep learning-based RS scene characterization methods, we refer readers to the comprehensive reviews in [16], [40], and [41].

B. Rotation-Invariant CNNs
Rotation-invariant CNN models aim at categorizing the original images and their rotated versions identically. In other words, the label inference of these models is not sensitive to image rotations. An efficient approach to learn such rotation-invariant CNNs is data augmentation [42], whose basic principle is to improve the rotation-invariant capability of CNN models by generating abundant rotated training images. Marcos et al. [43] presented a shallow CNN where rotational invariance is directly encoded by tying the weights of groups of filters to several rotated versions of the canonical filter in the group. Cheng et al. [44] proposed an effective method to train rotation-invariant and Fisher discriminative CNN models for object detection purposes. Laptev et al. [45] introduced a transformation-invariant pooling operator (TI-pooling), where a siamese network is first employed to extract features from multiple rotated images, which are then aggregated by a pooling operation before the first fully connected layer. Chen et al. [46] developed a recurrent transformer network (RTN) for learning transformation-invariant regions based on a so-called "transformer mechanism," so that the semantic gap of the subordinate-level feature representations could be reduced. Yang et al. [47] introduced a novel object detection method, termed SCRDet, for effectively detecting small, cluttered, and rotated objects in RS images. He et al. [48] presented the skip-connected covariance network (SCCov), which jointly exploits skip connections and covariance pooling to achieve highly representative feature learning. By applying channel attention to group convolutions, Chung et al. [49] proposed a rotation-invariant RS scene image retrieval method with group convolutional metric learning. To learn discriminative and invariant features of RS images, Wang et al. [50] adopted a siamese network to transfer the input images (using a finite transformation group consisting of multiple confounding orthogonal matrices) into a representation space, where an invariant representation can be derived.

C. Novelty and Advantages of the Proposed Method
From data augmentation techniques [42] to transformed convolutional and pooling layers [43], [45], different rotation-invariant CNN-based models have been shown to be effective in relieving the large variance problem inherent to the presence of rotated images. However, many of these methods rely on regular CNN classification schemes, where transformed images are projected onto their corresponding label spaces without accounting for the structure of the generated deep embeddings. In this way, rotated RS images belonging to the same class may have different internal characterizations, which eventually constrains the usability of such data representations in downstream applications beyond classification. Although some authors have been able to extend this rotation-invariant scheme to other tasks, e.g., retrieval [49], the invariance is often implemented as part of the CNN design, which may still limit the model generality due to the need for some sort of pretraining and arbitrary weight-sharing schemes that eventually undermine the end-to-end nature of the models.
With the objective of generating more meaningful RS image representations (with higher discrimination and generalization ability), we propose dealing with the rotation-invariance problem from a novel deep metric learning perspective. That is, instead of modifying the CNN model design, we aim at developing a new deep metric learning loss that produces rotation-invariant RS image characterizations, regardless of the considered feature extraction architecture. In this way, our newly developed RiDe aims to advance the development of rotation-invariant RS scene representations from a loss function perspective.

III. ROTATION-INVARIANT DEEP EMBEDDING (RIDE)
The proposed RS image characterization method consists of two main components: 1) a backbone CNN model for extracting the deep embeddings and 2) a new deep metric learning loss for training such a model in a rotation-invariant way. Fig. 2 shows a graphical illustration of the proposed framework. As can be seen, the backbone architecture is independent of the proposed loss since it is only used as a feature extractor. Besides, a memory bank mechanism is employed to compute all the elements required by the proposed loss. In the following, we describe all these components in detail.

A. Notations
Let $\mathcal{X} = \{x_1, \ldots, x_N\}$ denote the original training set of RS images, with the associated category label set $\mathcal{Y}^C$. To achieve the rotation-invariant capability of CNN models, we create a rotation-augmented dataset $\tilde{\mathcal{X}}$ based on $\mathcal{X}$, where each image in $\tilde{\mathcal{X}}$ is a rotated version of an original image in $\mathcal{X}$, that is,

$$\tilde{\mathcal{X}} = \bigcup_{k=0}^{3} \operatorname{rot90}^{k}(\mathcal{X}) \tag{1}$$

where $\operatorname{rot90}(\cdot)$ refers to the clockwise $90^{\circ}$ rotation operator.
Taking the NWPU-RESISC45 [15] dataset as an example, we show some images from the original dataset and their rotation-augmented versions in Fig. 3. By extension, the category label set of $\tilde{\mathcal{X}}$ is symbolized as $\tilde{\mathcal{Y}}^C$. In addition, we denote another label set $\tilde{\mathcal{Y}}^R$, termed the image label set, which indicates the index of the original image that generates each rotated version in $\tilde{\mathcal{X}}$, i.e.,

$$\tilde{y}^{R}_{i} = n \quad \text{if } \tilde{x}_{i} \in \left\{\operatorname{rot90}^{k}(x_{n})\right\}_{k=0}^{3}. \tag{2}$$
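As a minimal illustration (our own sketch, not the authors' released code), the rotation-augmented dataset and its two label sets can be materialized with torch.rot90, assuming each image is a (C, H, W) tensor; the function name and exact rotation direction are our choices:

import torch

def build_rotation_augmented_dataset(images, class_labels):
    """Builds the augmented set of (1) together with the label sets of (2).

    images: list of (C, H, W) tensors; class_labels: list of ints.
    Returns (aug_images, y_class, y_rot), where y_rot[i] is the index of
    the source image that generated aug_images[i].
    """
    aug_images, y_class, y_rot = [], [], []
    for n, (img, c) in enumerate(zip(images, class_labels)):
        for k in range(4):  # k 90-degree rotations; k = 0 keeps the original
            aug_images.append(torch.rot90(img, k, dims=(1, 2)))
            y_class.append(c)  # all rotations share the category label
            y_rot.append(n)    # image label: index of the source image
    return aug_images, torch.tensor(y_class), torch.tensor(y_rot)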

B. Neighborhood Component Analysis (NCA)
Neighborhood component analysis (NCA) [51] is a supervised dimension reduction method that maximizes the performance of K-nearest neighbor (KNN) classification in the embedding space. Given a training set $\{(x_1, y^C_1), \ldots, (x_N, y^C_N)\}$, the purpose of NCA is to learn a linear function $A$, which maps the input data into a new embedding space such that each point is more likely to select the ones sharing the same class as its neighbors. To achieve this, the probability that $x_i$ selects $x_j$ as its neighbor is defined as

$$p_{ij} = \frac{\exp\left(-\left\| A x_i - A x_j \right\|^2\right)}{\sum_{k \neq i} \exp\left(-\left\| A x_i - A x_k \right\|^2\right)}, \quad p_{ii} = 0. \tag{3}$$

Based on this, the probability that $x_i$ can be correctly classified is

$$p_i = \sum_{j \in \Omega_i} p_{ij} \tag{4}$$

where $\Omega_i = \{ j \mid y^C_i = y^C_j \}$ is the index set of the training images sharing the same class as $x_i$. The goal of NCA is to maximize the log likelihood that all images can be correctly classified, i.e., $\sum_{i} \log(p_i)$.

Due to the prominent feature modeling capability of CNNs, the embedding space can be better characterized by a CNN model than by a linear projection. Therefore, NCA can be extended to a deep version, the scalable neighborhood component analysis (SNCA) in [52], where the linear mapping $A$ is replaced by a nonlinear mapping based on a CNN model $F(\cdot)$. Moreover, the similarity in the embedding space is measured by the cosine between the normalized embeddings of images $x_i$ and $x_j$, i.e., $s_{ij} = \mathbf{f}_i^{T} \mathbf{f}_j$. Thus, $p_{ij}$ is formulated as

$$p_{ij} = \frac{\exp\left(s_{ij} / \sigma\right)}{\sum_{k \neq i} \exp\left(s_{ik} / \sigma\right)}, \quad p_{ii} = 0 \tag{5}$$

where $\sigma$ is a temperature parameter. Then, SNCA aims at minimizing the following loss:

$$\mathcal{L}_{\mathrm{SNCA}} = -\frac{1}{N} \sum_{i} \log(p_i). \tag{6}$$

In order to stochastically minimize the SNCA loss, a memory bank $\mathcal{B}$ is introduced to store the normalized embeddings of the training set, serving such contrastive learning.
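For concreteness, a minimal PyTorch sketch of the SNCA loss in (5) and (6) is given below, assuming the memory bank holds L2-normalized embeddings and idx gives each batch sample's position in the bank (all names are ours, not the authors' implementation):

import torch
import torch.nn.functional as F

def snca_loss(feats, labels, bank, bank_labels, idx, sigma=0.1):
    """SNCA loss: -mean_i log p_i, with p_i summing p_ij over same-class j.

    feats: (B, D) mini-batch embeddings; bank: (M, D) normalized memory bank;
    idx: (B,) positions of the batch samples inside the bank (for p_ii = 0).
    """
    feats = F.normalize(feats, dim=1)
    sims = feats @ bank.t() / sigma                        # scaled cosine similarities
    sims = sims.scatter(1, idx.unsqueeze(1), float('-inf'))  # enforce p_ii = 0
    log_p = F.log_softmax(sims, dim=1)                     # log p_ij over the bank
    same_class = labels.unsqueeze(1) == bank_labels.unsqueeze(0)
    # log p_i = logsumexp of log p_ij over the same-class index set
    log_p_i = torch.logsumexp(log_p.masked_fill(~same_class, float('-inf')), dim=1)
    return -log_p_i.mean()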
C. RiDe Loss

1) Limitations of the SNCA Loss: From (6), it can be observed that an optimal solution is reached when all the embeddings from the same class are identical. In other words, the normalized embeddings from the same class are perfectly aligned with each other in the embedding space. However, such an ideal case is hard to achieve in practice due to the complex semantics of RS scenes: there may be local structures within a set of images belonging to the same class, so that the class is represented by more than one point in the embedding space. On the other hand, as opposed to other kinds of images, RS scenes captured by Earth observation sensors have no typical orientations, which means that any rotated RS image is meaningful in reality. The same semantic scene may appear at different geolocations, the only difference between the instances being their orientations. Therefore, an embedding space for characterizing RS images should satisfy the following structural condition: given an anchor image, the distances between its embedding and the embeddings of its rotated versions should be smaller than the distances to any other images, including those belonging to the same class.
Since the SNCA loss only aims at grouping all the images that belong to the same class together in the embedding space, it cannot guarantee such a fine-grained structural condition. This means that the trained CNN models do not possess rotation-invariant capabilities for generating deep embeddings. To address this issue, we develop a new loss function that is rotation-invariant and preserves the class-discrimination capability.
2) Definition of the RiDe Loss: In order to achieve the aforementioned goals, we make use of the rotation-augmented dataset $\tilde{\mathcal{X}}$. Note that the use of an augmented dataset logically brings some computational burden to the training stage. However, it is important to highlight that the asymptotic cost of processing $\tilde{\mathcal{X}}$ remains the same (compared with the original dataset) since the increase in the number of samples is bounded by the number of considered rotations, a constant that does not depend on the database size. This situation is not exclusive to the proposed approach but common to any method using augmented data. Besides, it does not affect the operational exploitation of the proposed model.

Given a training set $\{(\tilde{x}_1, \tilde{y}^C_1, \tilde{y}^R_1), \ldots, (\tilde{x}_{4N}, \tilde{y}^C_{4N}, \tilde{y}^R_{4N})\}$, $p_{ij}$ denotes the probability that image $\tilde{x}_i$ selects $\tilde{x}_j$ as its neighbor, as defined in (5). Likewise, $p^C_i$ measures the probability that $\tilde{x}_i$ is correctly classified

$$p^{C}_{i} = \sum_{j \in \tilde{C}_i} p_{ij} \tag{7}$$

where $\tilde{C}_i = \{ j \mid \tilde{y}^C_i = \tilde{y}^C_j \}$ is the index set of the training images sharing the same class as $\tilde{x}_i$. In addition, we define another probability $p^R_i$ calculating the likelihood of its rotated counterparts lying nearby in the embedding space

$$p^{R}_{i} = \sum_{j \in \tilde{R}_i} p_{ij} \tag{8}$$

where $\tilde{R}_i = \{ j \mid \tilde{y}^R_i = \tilde{y}^R_j \}$ is the index set of the rotated training images coming from the same source image.

As discussed above, we aim at achieving both class discrimination and rotational invariance when extracting deep embeddings from RS images. Hereby, a joint probability $p_i(C, R)$ is introduced, which represents the likelihood that the image $\tilde{x}_i$ is not only correctly classified but also has its rotated versions located nearest to itself in the embedding space. To simplify the calculation of such a joint probability, we assume that both events are independent. Therefore, it can be formulated as follows:

$$p_i(C, R) = p^{C}_{i} \, p^{R}_{i}. \tag{9}$$

To maximize such a joint probability over the whole training set, we equivalently minimize the following negative log likelihood:

$$\mathcal{L} = -\frac{1}{4N} \sum_{i} \log\left(p_i(C, R)\right). \tag{10}$$

According to the properties of logarithms, (10) can be further expanded as

$$\mathcal{L} = -\frac{1}{4N} \sum_{i} \log\left(p^{C}_{i}\right) - \frac{1}{4N} \sum_{i} \log\left(p^{R}_{i}\right). \tag{11}$$

From a loss function perspective, there are two different terms in (11): 1) class discrimination, which is optimized for pulling intraclass embeddings together while pushing interclass embeddings apart, and 2) rotational invariance, which is optimized for grouping together the embeddings of the rotated images obtained from the same source image. In order to better balance these two terms, a penalty parameter $\lambda$ is introduced, and the final RiDe loss is formulated as

$$\mathcal{L}_{\mathrm{RiDe}} = -\frac{1}{4N} \sum_{i} \log\left(p^{C}_{i}\right) - \frac{\lambda}{4N} \sum_{i} \log\left(p^{R}_{i}\right). \tag{12}$$

In fact, RiDe can be considered as a rotation-invariant generalization of SNCA: when $\lambda = 0$, RiDe reduces to SNCA.
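Under the same assumptions as the SNCA sketch above, (12) only adds a second, $\lambda$-weighted term that masks the similarities by source-image labels instead of class labels. A possible sketch (again illustrative, not the authors' implementation):

import torch
import torch.nn.functional as F

def ride_loss(feats, y_class, y_rot, bank, bank_y_class, bank_y_rot,
              idx, sigma=0.1, lam=0.1):
    """RiDe loss of (12): class-discrimination term plus lam times the
    rotation-invariant term, both built from the same log p_ij."""
    feats = F.normalize(feats, dim=1)
    sims = feats @ bank.t() / sigma
    sims = sims.scatter(1, idx.unsqueeze(1), float('-inf'))  # p_ii = 0
    log_p = F.log_softmax(sims, dim=1)

    def neg_log_prob(batch_labels, bank_labels):
        # -mean_i log sum_{j in index set} p_ij, cf. (7) and (8)
        mask = batch_labels.unsqueeze(1) == bank_labels.unsqueeze(0)
        masked = log_p.masked_fill(~mask, float('-inf'))
        return -torch.logsumexp(masked, dim=1).mean()

    return neg_log_prob(y_class, bank_y_class) + lam * neg_log_prob(y_rot, bank_y_rot)

Setting lam = 0 recovers the SNCA loss of (6), matching the remark above.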

D. Optimization
Based on the backpropagation technique, the gradients of $\mathcal{L}_{\mathrm{RiDe}}$ with respect to the parameters of the CNN model can be obtained. To stochastically minimize $\mathcal{L}_{\mathrm{RiDe}}$, we exploit the memory bank $\mathcal{B}$ to store all the normalized embeddings of the training images. After each training iteration, $\mathcal{B}$ is updated in an empirical weighted averaging manner

$$\mathbf{b}_i \leftarrow m \, \mathbf{b}_i + (1 - m) \, \mathbf{f}_i \tag{13}$$

where $\mathbf{b}_i \in \mathcal{B}$ is the stored embedding of $\tilde{x}_i$ and $m$ is a parameter controlling the balance between the two embeddings. The associated optimization scheme is described in Algorithm 1.
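A possible sketch of the update in (13) is shown below; the L2 renormalization after the weighted average is our assumption (so that the stored embeddings stay on the unit hypersphere and the cosine similarities remain well defined):

import torch
import torch.nn.functional as F

@torch.no_grad()
def update_bank(bank, feats, idx, m=0.5):
    # Eq. (13): b_i <- m * b_i + (1 - m) * f_i for the mini-batch entries,
    # followed by renormalization (our assumption).
    feats = F.normalize(feats, dim=1)
    bank[idx] = F.normalize(m * bank[idx] + (1 - m) * feats, dim=1)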

E. Complexity Analysis
With an embedding size of $D$ and a total of $4N$ rotated training images, the memory bank $\mathcal{B}$ requires $O(DN)$ memory. Suppose that the batch size is $B$. In this case, the similarity matrix and the probability densities require $O(BN)$ memory, and the other intermediate variables occupy $O(BN)$ memory.

Algorithm 1 Optimization With the RiDe Loss
Require: Rotation-augmented training set $\tilde{\mathcal{X}}$ with label sets $\tilde{\mathcal{Y}}^C$ and $\tilde{\mathcal{Y}}^R$; CNN model $F(\cdot)$; memory bank $\mathcal{B}$
1: Initialize $\mathcal{B}$ with the normalized embeddings of the training images.
2: for each training iteration do
3: Sample a mini-batch of images from $\tilde{\mathcal{X}}$.
4: Extract the normalized mini-batch embeddings via $F(\cdot)$.
5: Calculate the similarities $s_{ij}$ based on the extracted mini-batch embeddings and those in $\mathcal{B}$.
6: Index the similarities based on $\tilde{\mathcal{Y}}^C$ and $\tilde{\mathcal{Y}}^R$.
7: Calculate the RiDe loss in (12).
8: Back-propagate the gradients.
9: Update $\mathcal{B}$ via (13).
10: end for
Ensure: $F(\cdot)$
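Putting Algorithm 1 together, one epoch could be sketched as follows, reusing the ride_loss and update_bank sketches given above; the data loader is assumed to yield the bank indices alongside the two label sets (all names are hypothetical):

def train_ride_epoch(model, loader, bank, bank_y_class, bank_y_rot,
                     optimizer, sigma=0.1, lam=0.1, m=0.5):
    """One epoch of Algorithm 1 (illustrative sketch)."""
    model.train()
    for images, y_class, y_rot, idx in loader:
        feats = model(images)                          # step 4: D-dim embeddings
        loss = ride_loss(feats, y_class, y_rot, bank, bank_y_class,
                         bank_y_rot, idx, sigma, lam)  # steps 5-7
        optimizer.zero_grad()
        loss.backward()                                # step 8
        optimizer.step()
        update_bank(bank, feats.detach(), idx, m)      # step 9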

IV. EXPERIMENTS

A. Experimental Setup

1) Dataset Configuration:
In our experiments, we use two RS scene benchmark datasets: 1) the aerial image dataset (AID) [40] and 2) NWPU-RESISC45 [15]. For additional details about the datasets, we refer the readers to the associated articles. As introduced above, the proposed method exploits rotation-augmented datasets. Thus, we expand the original datasets by adding the rotated versions of each image at $90^{\circ}$, $180^{\circ}$, and $270^{\circ}$, creating rotation-augmented versions (AID-R and NWPU-RESISC45-R) of the original datasets. From the original datasets, we first randomly select 70%, 10%, and 20% of the available images for training, validation, and testing, respectively. Then, we assign the rotated versions of each source image to the corresponding sets to construct the splits of AID-R and NWPU-RESISC45-R.

In order to evaluate the effectiveness of the proposed method, we carry out KNN classification and image retrieval tasks based on the extracted deep embeddings (a small sketch of these evaluation measures is given after the scenario list below).
1) KNN classification aims at classifying the input image based on its KNNs in the embedding space, the class being decided by a majority vote among the neighbors' classes. To evaluate the classification performance, we adopt the overall accuracy and the confusion matrix.
2) Image retrieval aims at effectively finding the most semantically similar images in a database given a query image, ranking them based on the similarities measured in the embedding space. The evaluation is based on the mean average precision (MAP) and Recall@k (R@k). The average precision (AP) of a single query is defined as

$$\mathrm{AP} = \frac{1}{Q} \sum_{r} P(r) \, \delta(r)$$

where $Q$ is the number of ground-truth RS images in the database that are relevant with respect to the query image, $P(r)$ denotes the precision for the top $r$ retrieved images, and $\delta(r)$ is an indicator function specifying whether the $r$th retrieved image is truly relevant to the query; MAP is the mean of AP over all queries. R@k is defined as the percentage of queries having at least one relevant image retrieved among the top $k$ results.

Since RiDe aims to generate deep embeddings of images with rotational invariance and class discrimination, we design two different scenarios for our experimental evaluation.
1) Rotated Image Identification: Given the test sets of AID-R and NWPU-RESISC45-R, we utilize the associated image label set $\tilde{\mathcal{Y}}^R$ to obtain the ground-truth labels.
2) Class-Wise Image Discrimination: Given the test sets of AID and NWPU-RESISC45, we utilize the category labels to obtain the ground-truth labels.
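The evaluation measures referenced above are straightforward to compute from a ranked retrieval list; below is a small, self-contained sketch (with our own function names) of majority-vote KNN classification, AP (averaged over queries to obtain MAP), and R@k:

from collections import Counter

def knn_classify(neighbor_labels):
    # Majority vote over the labels of the K nearest neighbors.
    return Counter(neighbor_labels).most_common(1)[0][0]

def average_precision(relevant, ranked):
    # AP = (1/Q) * sum_r P(r) * delta(r), with Q = |relevant|.
    hits, score = 0, 0.0
    for r, item in enumerate(ranked, start=1):
        if item in relevant:    # delta(r) = 1
            hits += 1
            score += hits / r   # P(r) at a relevant rank
    return score / len(relevant) if relevant else 0.0

def recall_at_k(relevant, ranked, k):
    # R@k: 1 if at least one relevant item appears in the top k results.
    return float(any(item in relevant for item in ranked[:k]))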
2) Implementation Details: We utilize ResNet34 and ResNet50 [53] as the CNN backbones to extract the deep embeddings of the input images. The spatial size of the input images is 256 × 256, and they are augmented by: 1) RandomGrayscale; 2) ColorJitter; and 3) RandomHorizontalFlip. The parameters $D$, $\sigma$, $\lambda$, and $m$ are set to 128, 0.1, 0.1, and 0.5, respectively. We utilize the stochastic gradient descent (SGD) optimizer to train the CNN models with an initial learning rate of 0.1 and a decay rate of 0.5 every 30 epochs. We train the networks for a total of 100 epochs. Due to memory limitations of the graphical processing unit (GPU) used in the experiments, the batch size is 256 for ResNet34 and 128 for ResNet50. To validate the effectiveness of the proposed method, we compare it to several state-of-the-art deep embedding methods from two perspectives.
1) Rotated Image Identification: For evaluating the rotation-invariant capability, we compare RiDe to: 1) SNCA [52]; 2) SNCA-aug, where rotation-based data augmentation is used for training the CNN models; and 3) SCCov [48].
2) Class-Wise Image Discrimination: For validating the preservation of the class-discrimination capability, we compare RiDe to: 1) triplet [54]; 2) normalized softmax loss (NSL) [55]; 3) ArcFace [56]; 4) SCCov [48]; and 5) TI-POOLING [45].
All the experiments are implemented in PyTorch [57] and carried out on an NVIDIA RTX3090 GPU.

B. Results

1) Rotated Image Identification: Table I presents the KNN classification results for the rotated images based on the deep embeddings extracted from the test sets. Since one source image generates four rotated versions, we report KNN classification results with K = 1, 2, and 3. This experiment is intended to analyze whether all the deep embeddings of rotated images are located close to each other in the embedding space. From Table I, it can be observed that, with the vanilla ResNets, RiDe achieves the best performance (with values near 100%) on the two considered benchmark datasets. Compared to SNCA, the inclusion of the proposed rotation-invariant term in RiDe significantly improves the rotation-invariant capability of the trained CNN models when generating deep embeddings: without this term, the KNN classification performance of SNCA drops by more than 10%. Although data augmentation is an efficient approach to improve the rotation-invariant capability, RiDe generally outperforms SNCA-aug by more than 5% in the KNN classification results. With the state-of-the-art CNN architecture for scene classification (SCCov), the use of RiDe greatly improves the performance of rotated image identification compared with the CE loss utilized in [48]. It is worth noting that TI-POOLING is not considered in this experiment since all the rotated images produce the same embedding under TI-POOLING; thus, its accuracy would trivially be 100%. Moreover, TI-POOLING is a rotation-invariant CNN architecture, while the proposed RiDe is a novel loss targeted at learning embeddings invariant to the rotation of the input images, which can be combined with any CNN architecture. However, one disadvantage of TI-POOLING is its increased computational cost due to the feature aggregation from multiple input images. Table II reports the computational cost of both RiDe and TI-POOLING for the training and inference phases; it can be observed that TI-POOLING spends more time learning and extracting deep embeddings than RiDe.
Table III shows the image retrieval results evaluated by both MAP and R@k for R = 1, 2, and 3 and k = 1, 2, and 3. Consistent with the KNN classification results, RiDe exhibits superior performance when compared to the other methods. It can be seen from the obtained results that all the deep embeddings of the rotated images are located close to each other in the embedding space generated by the proposed method. To visually verify this observation, given some query images, we display their three nearest neighbors retrieved from the test sets of AID-R and NWPU-RESISC45-R in Fig. 5. Without the penalty on learning rotation-invariant deep embeddings, the nearest neighbors retrieved by SNCA are not always from the same source image (images marked in red in Fig. 5).

Although their class labels are the same, their semantics may exhibit large divergences (this can be seen, for instance, in the first row of Fig. 5). By utilizing a data augmentation strategy, SNCA-aug indeed increases the semantic similarity of the nearest neighbors of the query compared to SNCA. However, SNCA-aug cannot perfectly retrieve all the rotated images from the test sets given the query (marked in green in Fig. 5). In this regard, RiDe ranks all the rotated images in the database with the highest similarity with respect to the input query image. Similar behavior can also be observed when SCCov is utilized as the CNN architecture. Therefore, we conclude that RiDe not only guides CNN models to learn rotation-invariant deep embeddings but also better models the semantic similarities among the images.
2) Class-Wise Image Discrimination: Table IV presents the KNN classification results for class-wise image discrimination based on the deep embeddings extracted from the test sets of AID and NWPU-RESISC45 when K = 1, 5, and 10. Compared to the other losses, RiDe achieves the best accuracies for the different values of K. When the CNN backbone changes from ResNet34 to a more powerful network, i.e., ResNet50, the other methods improve their classification performance on the NWPU-RESISC45 dataset. For example, Triplet, NSL, and ArcFace exhibit about a 1%–3% increase in accuracy when the network is changed from ResNet34 to ResNet50. In contrast, the associated accuracy differences for RiDe are less than 1%. This suggests that RiDe can generate high-quality deep embeddings based on both lightweight and powerful CNN architectures. Compared to the state-of-the-art deep embedding method ArcFace, RiDe can better uncover the local neighborhood structure of the input images for class-discrimination purposes. Compared with the CE loss exploited in SCCov [48] and TI-POOLING [45], the adoption of RiDe leads to higher KNN-based classification accuracy. In order to evaluate the image retrieval performance, we calculate the MAP scores based on the deep embeddings extracted from the test sets when R = 20, 50, and 100 and display them in Table V. It can be observed that the semantic relations among the images are best captured by RiDe, regardless of the exploited CNN architecture, and that the retrieval performance is preserved as the number of retrieved images increases. Therefore, we conclude that our newly proposed method can be applied for accurately indexing large RS databases.
3) Generalization to More Rotation Angles: Fig. 6 shows the KNN-based classification results for rotated image identification when K = 1, 2, ..., 8. It can be observed that RiDe generalizes to more rotation angles while the nearest neighbor performance is well preserved.

4) Hyperparameter Analysis: Two parameters, $\lambda$ and $\sigma$, should be carefully tuned to achieve good performance with RiDe. $\lambda$ controls the balance between the class-discrimination and rotation-invariant terms in RiDe, and $1/\sigma$ represents the radius of the hypersphere on which the embeddings are projected. With a larger value of $1/\sigma$, the embedding hypersphere can be more appropriate for class discrimination [58], [59]. Following [52], [60], we empirically set $\sigma$ to a small value, i.e., $\sigma = 0.1$, and keep it constant in our experiments. In order to analyze the influence of $\lambda$ on the performance of RiDe, we conduct KNN classification for both rotated image identification and class-wise image discrimination with $\lambda \in \{0.05, 0.1, 0.5, 1\}$. We adopt the ResNet50 architecture and plot the classification results in Fig. 7. As $\lambda$ increases, the performance of rotated image identification improves, since a larger penalty is placed on the rotation-invariant term of RiDe. Conversely, when $\lambda$ decreases, more emphasis is given to the class-discrimination term, so the performance of class-wise image discrimination gets better. Therefore, to achieve a balanced performance in terms of both rotational invariance and class discrimination, $\lambda$ should be neither too small nor too large. In our experiments, we empirically set $\lambda$ to 0.1, which achieves good performance.

5) Discussion: We conducted extensive experiments from two perspectives (rotated image identification and class-wise image discrimination) to validate the performance of RiDe. These experiments are specifically designed to test the rotation-invariant and class-discrimination capabilities of the trained CNN models. Based on the experimental results, we observe that RiDe significantly improves the rotation-invariant capability of the trained CNN models, since all the rotated images are located near each other in the embedding space. Moreover, unlike other rotation-invariant deep learning methods (e.g., TI-POOLING), RiDe is a loss function designed for learning rotation-invariant image embeddings, which can be applied to any CNN architecture. The experiments also indicate that RiDe can be applied with arbitrary rotation angles. Furthermore, RiDe achieves better class-discrimination results than the other losses, indicating that the class-discrimination capability of the CNN models trained with RiDe is also preserved. Therefore, RiDe can actually discover the hierarchical structure of the embedding space for RS images, where the nearest neighbors of a query image are its rotated versions, the next nearest neighbors are the images from the same class, and the images from different classes are well separated. Such a hierarchical structure of the semantic relationships among RS images is very important for downstream tasks. For example, an image retrieval system needs to accurately and effectively find the most semantically similar images within the database given the query images. However, semantic similarities sometimes cannot be precisely modeled by category labels.
Without the fine-grained semantic information required to distinguish the images within the same class, the retrieved ranking order with respect to the input query image may not reflect the actual order of semantic similarities. In this case, by better modeling the similarities of rotated images, RiDe exploits an auxiliary task to model the fine-grained structure of the embedding space, which better fulfills the requirements of image retrieval systems.

V. CONCLUSION
In this article, we introduce a new loss function for learning rotation-invariant deep embeddings of RS images. Specifically, we first review the limitations of the SNCA loss function when constructing the hierarchical structure of the images in the embedding space. Instead of just maximizing the leave-one-out classification probability, we introduce a joint probability for correctly classifying each image based on its neighbors while identifying its rotated versions as the nearest ones. To maximize such a probability, we assume that both events are independent, which further leads to a loss function composed of two terms: 1) a class-discrimination term and 2) a rotation-invariant term. To balance these two terms, we introduce a penalty parameter and finally propose the new RiDe loss function. We carry out extensive experiments on two RS benchmark datasets and compare RiDe to other state-of-the-art losses. Our experimental results validate the effectiveness of RiDe in ensuring that CNN models exhibit both rotation-invariant and class-discrimination capabilities. As future work, we plan to reformulate the maximization of the joint probability within a Bayesian framework, as well as to extend the proposed approach to other kinds of transformations and data modalities.

ACKNOWLEDGMENT
The authors would like to thank the Associate Editor and the two anonymous reviewers for their outstanding comments and suggestions, which greatly helped them to improve the technical quality and presentation of this work.

APPENDIX
In Fig. 8, we display the normalized confusion matrices for the KNN classification (K = 10) of the test sets of the two benchmark datasets, based on the deep embeddings obtained by RiDe. It can be observed that most classes are correctly discriminated in the two considered datasets. Nevertheless, several classes are misclassified, e.g., Resort and Park in AID.