Unsupervised Remote Sensing Image Retrieval Using Probabilistic Latent Semantic Hashing

Unsupervised hashing methods have attracted considerable attention in large-scale remote sensing (RS) image retrieval, due to their capability for massive data processing with significantly reduced storage and computation. Although existing unsupervised hashing methods are suitable for operational applications, they exhibit limitations when accurately modeling the complex semantic content present in RS images using binary codes (in an unsupervised manner). To address this problem, in this letter, we introduce a novel unsupervised hashing method that takes advantage of the generative nature of probabilistic topic models to encapsulate the hidden semantic patterns of the data into the final binary representation. Specifically, we introduce a new probabilistic latent semantic hashing (pLSH) model to effectively learn the hash codes using three main steps: 1) data grouping, where the input RS archive is clustered into several groups; 2) topic computation, where the pLSH model is used to uncover highly descriptive hidden patterns from each group; and 3) hash code generation, where the data probability distributions are thresholded to generate the final binary codes. Our experimental results, obtained on two benchmark archives, reveal that the proposed method significantly outperforms state-of-the-art unsupervised hashing methods.


I. INTRODUCTION
THE fast development of satellite technologies has resulted in the availability of massive remote sensing (RS) image archives, which calls for efficient and effective strategies for image search and retrieval. Traditional methods often exploit exact nearest neighbor search approaches that exhaustively compare the query image with each image in the archive.
This approach, which is also called exhaustive linear scan, is time-consuming and, thus, inappropriate for large-scale image search and retrieval problems. To overcome this issue, approximate nearest neighbor search strategies based on hashing techniques have recently proven effective in reducing the cost of content-based image retrieval (CBIR) in terms of both processing time and storage requirements [1]. Hashing methods aim at learning hash functions that map the original high-dimensional image descriptors into low-dimensional binary codes, such that the similarity within the original image feature space is well preserved.
Hashing methods can be divided into two main categories: 1) data-independent hashing methods and 2) data-dependent (also known as learning-based) hashing methods. Data-independent methods, such as locality-sensitive hashing (LSH) [2], define hash functions by random projections that guarantee a high probability of collision for similar input images. Thus, they remain unaware of the data distribution and require long codes to achieve a high retrieval performance [1], [3]. Data-dependent hashing methods can learn more compact binary codes by utilizing a set of data samples from the considered archive and can be roughly divided into two subcategories. The first one includes supervised hashing methods, in which supervised information (i.e., annotations of images) is necessary for learning the hash functions. For instance, Demir and Bruzzone [1] presented and tested a supervised kernel-based hashing method adapted to RS data properties, while Li et al. [4] introduced a deep hashing neural network (DHNN) to address CBIR in RS. The DHNN jointly learns semantically accurate deep image features and binary hash codes by employing a high number of annotated images. Due to the use of such annotated images, supervised methods produce discriminative hash codes that satisfy the requirement of semantic similarity between the images. However, it is time-consuming and expensive to obtain a sufficient number of high-quality annotated RS images, particularly for large-scale CBIR problems. The second subcategory comprises unsupervised hashing methods that do not require annotated images for learning hash codes.
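The data-independent idea described above can be illustrated with a minimal sketch of random-projection hashing. This is not the implementation of [2]; the function names `make_lsh`, `lsh_code`, and `hamming`, the Gaussian hyperplanes, and the toy vectors are assumptions chosen for illustration only.

```python
import random

def make_lsh(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (data-independent, illustrative)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_code(x, planes):
    """One bit per hyperplane: 1 if x lies on its non-negative side, else 0."""
    return [1 if sum(p_k * x_k for p_k, x_k in zip(p, x)) >= 0 else 0
            for p in planes]

def hamming(a, b):
    """Number of differing bits between two equal-length codes."""
    return sum(u != v for u, v in zip(a, b))

planes = make_lsh(dim=4, n_bits=16)
query = [1.0, 0.5, -0.2, 0.3]
near = [1.1, 0.4, -0.1, 0.3]      # small perturbation of the query
far = [-1.0, -0.5, 0.2, -0.3]     # the query with every sign flipped

code_q, code_near, code_far = (lsh_code(v, planes) for v in (query, near, far))
print(hamming(code_q, code_near), hamming(code_q, code_far))
```

Because the hyperplanes ignore the data distribution, many bits are typically needed before Hamming distances become a reliable proxy for similarity, which is the limitation that learning-based methods target.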
In this letter, we focus on learning-based unsupervised hashing methods due to their relevance for operational RS image retrieval scenarios. In the RS community, there are few unsupervised hashing strategies available. As an example, the kernelized unsupervised locality-sensitive hashing (KULSH) [5], which defines hash functions for high-dimensional nonlinearly separable RS image descriptors, was adopted for RS-based CBIR problems in [1]. The KULSH is defined based on the LSH, formulating the random projections in the kernel space by using a small set of images from the considered archive. Li and Ren [3] presented the partial randomness hashing (PRH) method that uses random projections to produce an initial estimation of the hash codes and then learns a linear model to reproject these codes onto the original feature space. Finally, the transpose of the projection matrix is used to generate the binary codes. Reato et al. [6] proposed a multicode hashing method that initially characterizes the images by using descriptors of primitive-sensitive clusters and then constructs the multihash codes from these descriptors using the KULSH. It is worth noting that hashing has been studied more extensively in the computer vision and multimedia communities than in the RS community. The anchor graph hashing (AGH) [7], the isotropic hashing (IsoH) [8], the compressed hashing (CH) [9], the harmonious hashing (HamH) [10], and the density sensitive hashing (DSH) [11] methods are examples of widely used unsupervised hashing methods in that context. Although the above-mentioned unsupervised hashing methods are relevant for operational applications, the hash codes produced by these algorithms might not be discriminative and descriptive enough to model the high-level semantic content of images under complex RS image retrieval tasks. To address this problem, this letter presents a new unsupervised hashing technique based on probabilistic topic models. These models [12] have been recently used to provide RS imagery with a higher level of semantic understanding [13]. This letter takes advantage of the generative nature of probabilistic topic models to produce highly descriptive binary codes using the latent semantic patterns pervading the RS data. Specifically, we define a new topic model called probabilistic latent semantic hashing (pLSH) that learns the binary representation of RS images with complex semantic content.
1545-598X © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
Our experiments on two benchmark archives demonstrate that the proposed method outperforms state-of-the-art unsupervised hashing methods presented in both the RS and computer vision communities.
II. PROBABILISTIC LATENT SEMANTIC HASHING
Let $X = \{X_1, \ldots, X_N\}$ be a complete RS archive with $N$ images that are characterized according to a specific $K$-dimensional original feature space, i.e., $X_i \in \mathbb{R}^K$, and let $H = \{H_1, \ldots, H_N\}$ be the corresponding set of binary codes, where $H_i \in \{0, 1\}^L$ is the hash code of $X_i$. Given a query image $X_q$, the objective is to find the most similar images from the archive (using their hash codes) in order to reduce search time and storage requirements. The proposed unsupervised hashing method (see Fig. 1) is defined based on the three steps described in Sections II-A to II-C.

A. Data Grouping
This step aims at dividing the RS archive into different groups, according to the similarities between the images in the original feature space. To this end, we cluster the initial RS image archive $X$ into $G$ disjoint groups, such that $X = \bigcup_{i=1}^{G} C_i$. On the one hand, this approach allows us to process the whole archive using a minibatch scheme, which substantially reduces the number of images that are handled simultaneously. Note that this scheme provides important advantages in operational RS frameworks. On the other hand, it also seeks to detect data similarities in the original feature space, making the topics highly specialized in the hidden feature patterns associated with each group. Since each group contains similar images in the original feature space, the extracted topics can accurately characterize the semantic differences among potentially ambiguous images. In this letter, we use the well-known k-means clustering for grouping the images of the archive, although any clustering algorithm could be used.
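The grouping step can be sketched with a plain Lloyd's k-means in NumPy. This is only an illustration under assumptions: the helper name `kmeans_groups`, the random initialization, the fixed iteration count, and the toy 60 x 8 feature matrix are not taken from the letter.

```python
import numpy as np

def kmeans_groups(X, G, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns a group index in [0, G) for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=G, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every image to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for g in range(G):
            members = X[labels == g]
            if len(members):           # keep the old center if a group empties
                centers[g] = members.mean(axis=0)
    return labels

rng = np.random.default_rng(1)
X = rng.random((60, 8))                # 60 toy "images", K = 8 features
labels = kmeans_groups(X, G=3)
# Disjoint index sets C_1, ..., C_G whose union is the whole archive.
groups = [np.flatnonzero(labels == g) for g in range(3)]
```

Each index set in `groups` can then be processed independently, which is what makes the minibatch scheme possible.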

B. Topic Computation
The aim of this step is to extract the hidden feature patterns (topics) of each image group and to represent the whole archive into a set of uncovered topics. To this end, we introduce the pLSH topic model [see Fig. 1(b)]. This model has been designed to sequentially extract $Z = [L/G]$ topics from each group as $\Theta^* = \{\Theta^*_1, \ldots, \Theta^*_L\}$ and to represent the images into these topics as $\Lambda = \{\Lambda_1, \ldots, \Lambda_N\}$. The pLSH consists of three observable random variables $d$, $z^*$, and $w$, one hidden random variable $z$, and one regularization parameter $\delta$. In detail, $d$ represents the images of a particular group, $z^*$ denotes the topics extracted from the previous groups, $w$ symbolizes the features of the original feature space, and $z$ is the set of topics extracted from the current group. In addition, four direct connections relate images to topics and topics to features. Unlike the regular probabilistic latent semantic analysis (pLSA) [12], our model design is able to simultaneously express the whole RS image archive in terms of two different sets of hidden patterns, given by the diverging random variables $z^*$ and $z$. In addition, the $\delta$ regularizer also promotes those topics that exhibit a significant contribution within the RS archive. Note that these two properties are key factors in hashing since they allow us to disregard redundant feature patterns while encapsulating the most relevant semantic content through a sequential processing scheme.
Our pLSH estimates two conditional probability distributions: 1) $\theta \sim \{p(w|z)\}$ that represents the description of topics in features and 2) $\lambda \sim \{p(z^*|d), p(z|d)\}$ that denotes the description of images in observed and hidden topics. In this letter, we estimate the $\theta$ and $\lambda$ distributions by maximizing the complete log-likelihood via the expectation-maximization (EM) algorithm [12]. Initially, we define the log-likelihood expression according to the $z^*$ and $z$ random variables. Then, we apply Jensen's inequality and insert three different Lagrange multipliers: 1) two for maintaining $\theta$ and $\lambda$ within the probability simplex and 2) another one for maximizing the Kullback-Leibler divergence between $\lambda$ and the uniform distribution, weighted by $\delta$. Note that the parameter $\delta$ acts as a sparsity regularizer to enhance the dominant topics in $\lambda$ because the smallest probability values in the topic space logically become uninformative in the context of a binary characterization. Finally, we compute the partial derivatives, set them to zero, and isolate the model conditional probability distributions to obtain the final expressions for the E-step [see (1) and (2)], which computes the posteriors $p(z^*|w, d)$ and $p(z|w, d)$, and for the M-step [see (3)-(5)], which updates $p(w|z)$, $p(z^*|d)$, and $p(z|d)$. In these expressions, $n(w_k, d_n)$ represents the number of times that the feature $w_k$ appears in the image $d_n$ according to the original feature space. Note that, for the sake of simplicity, we omit, in (1)-(5), the summation indices of the random variables. Given an image group $C_i$ in $n(w, d)$, the number of topics per group $Z$, a sparsity factor $\delta$, and the set of previously extracted topics $\Theta^*$ in $p(w|z^*)$, the EM process is performed according to Algorithm 1 as follows. Initially, $p(w|z)$, $p(z^*|d)$, and $p(z|d)$ are randomly initialized. Then, the E-step [see (1) and (2)] and the M-step [see (3)-(5)] are alternated iteratively. This EM optimization is embedded into Algorithm 2 to extract the topics of the complete archive in $\Theta^* \in \mathbb{R}^{L \times K}$ and to represent all the images into this topic space in $\Lambda \in \mathbb{R}^{N \times L}$.
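As a point of reference, the following sketch writes out standard pLSA-style EM updates for a model with a fixed set of previously extracted topics $z^*$ and a new set of topics $z$. It assumes $\delta = 0$, i.e., it omits the sparsity term, so it should be read as an illustration of the general structure of such updates and not as the letter's expressions (1)-(5). The shared denominator $D(w, d)$ is a symbol introduced here for compactness.

```latex
% Shared normalizer over both topic sets (introduced here for compactness)
D(w, d) = \sum_{z} p(w \mid z)\, p(z \mid d) + \sum_{z^*} p(w \mid z^*)\, p(z^* \mid d)

% E-step: posteriors of the new and the previously extracted topics
p(z \mid w, d) = \frac{p(w \mid z)\, p(z \mid d)}{D(w, d)}, \qquad
p(z^* \mid w, d) = \frac{p(w \mid z^*)\, p(z^* \mid d)}{D(w, d)}

% M-step: p(w | z^*) stays fixed; only the remaining distributions are updated
p(w \mid z) = \frac{\sum_{d} n(w, d)\, p(z \mid w, d)}{\sum_{w'} \sum_{d} n(w', d)\, p(z \mid w', d)}

p(z \mid d) = \frac{\sum_{w} n(w, d)\, p(z \mid w, d)}{\sum_{w} n(w, d)}, \qquad
p(z^* \mid d) = \frac{\sum_{w} n(w, d)\, p(z^* \mid w, d)}{\sum_{w} n(w, d)}
```

With these updates, $\sum_{z} p(z \mid d) + \sum_{z^*} p(z^* \mid d) = 1$ for every image, because the two posteriors sum to one for each pair $(w, d)$. A nonzero $\delta$ would modify the updates of $p(z \mid d)$ and $p(z^* \mid d)$ to suppress small probability values.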
From the image archive $X$ (divided into $G$ groups), the number of hash bits $L$, and the sparsity factor $\delta$, Algorithm 2 sequentially learns the set of observable topics (lines 2-6) and the representation of the archive into these topics (lines 7-11). Note that $\Theta^*$ is set to the void distribution for the first image group (line 1). In addition, $n(w, d)$ and $p(w|z^*)$ are fixed to the original feature representation of $C_i$ and the distribution of previously observed topics $\Theta^*$ (lines 3 and 8). Finally, $p(w|z)$ is fixed to the zero distribution in the second loop of Algorithm 2 (line 9) to allow representing the whole image archive in the complete set of extracted topics. Note that the complexity of the topic computation algorithm is $O(IKNZ)$.
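The sequential scheme can be sketched in NumPy as follows. This is a simplified illustration and not the authors' Algorithms 1 and 2: the sparsity term is omitted (equivalent to assuming $\delta = 0$), the function names `plsh_em` and `extract_topics` are hypothetical, the second pass is emulated by running EM with zero new topics, and the toy counts and group labels are made up.

```python
import numpy as np

def plsh_em(n_wd, theta_prev, Z, n_iter=100, seed=0):
    """EM for Z new topics given fixed previous topics (sparsity term omitted).

    n_wd: (N, K) counts n(w, d); theta_prev: (Zp, K) rows p(w|z*), Zp may be 0.
    Returns theta (Z, K) = p(w|z) and lam (N, Zp + Z) = [p(z*|d), p(z|d)].
    """
    rng = np.random.default_rng(seed)
    N, K = n_wd.shape
    Zp = theta_prev.shape[0]
    theta = rng.random((Z, K))
    theta /= theta.sum(axis=1, keepdims=True)
    lam = rng.random((N, Zp + Z))
    lam /= lam.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        topics = np.vstack([theta_prev, theta])        # (Zp + Z, K)
        # E-step: posterior over all topics for every (image, feature) pair.
        post = lam[:, :, None] * topics[None, :, :]    # (N, Zp + Z, K)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        weighted = n_wd[:, None, :] * post             # n(w, d) * posterior
        # M-step: only the new topics are re-estimated; previous ones stay fixed.
        if Z > 0:
            theta = weighted[:, Zp:, :].sum(axis=0)
            theta /= theta.sum(axis=1, keepdims=True) + 1e-12
        lam = weighted.sum(axis=2)
        lam /= lam.sum(axis=1, keepdims=True) + 1e-12
    return theta, lam

def extract_topics(X, labels, G, Z, n_iter=50):
    """First pass: learn Z topics per group. Second pass: represent the archive."""
    theta_all = np.empty((0, X.shape[1]))              # empty topic set for group 1
    for g in range(G):
        theta, _ = plsh_em(X[labels == g], theta_all, Z, n_iter)
        theta_all = np.vstack([theta_all, theta])
    # With no new topics, EM only fits the image-topic distributions.
    _, lam_all = plsh_em(X, theta_all, 0, n_iter)
    return theta_all, lam_all

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(40, 12)).astype(float)    # toy feature counts
labels = np.arange(40) % 2                             # two toy groups
theta_all, lam_all = extract_topics(X, labels, G=2, Z=3, n_iter=20)
```

Because the topics are fixed in the second pass, the image representations are independent of one another, so processing the whole archive at once gives the same result as processing it group by group.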

C. Hash Code Generation
The aim of this step is to generate the final binary codes of the archive as $H = \{H_1, \ldots, H_N\}$, where $H_i$ represents the binary code of $X_i$. In order to achieve this objective, the previously uncovered probability distributions $\Lambda = \{\Lambda_1, \ldots, \Lambda_N\}$, where $\Lambda_i = \{\Lambda_i^1, \ldots, \Lambda_i^L\}$, are thresholded as follows. Initially, we associate each topic with a particular hash bit (since the total number of extracted topics is $L$). Then, we compute the marginal probabilities of all the topics in $\Theta^*$, as shown in (6). Finally, we obtain the hash code for a given image $X_i$ by thresholding each element of its corresponding $\Lambda_i$ distribution according to the piecewise function defined in (7). The target of this function is to transfer the most discriminating semantic patterns of $X_i$ to its final binary characterization. Therefore, $H$ only activates the hash bits associated with those topics that exhibit a significant contribution in $X_i$ with respect to their occurrence frequency.
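One possible reading of this thresholding step is sketched below. It rests on two assumptions that are not confirmed by the text: the marginal probability of a topic is taken as its average over all images (a uniform prior over images), and a bit is activated when the image uses the topic more than that average. The function name `hash_codes` and the toy distributions are illustrative.

```python
def hash_codes(lam):
    """lam: list of N rows, each an L-dimensional topic distribution of one image."""
    n = len(lam)
    n_bits = len(lam[0])
    # Assumed marginal topic probability: average over images (uniform p(d)).
    marginal = [sum(row[l] for row in lam) / n for l in range(n_bits)]
    # Bit l is active when the image uses topic l more than the archive average.
    return [[1 if row[l] > marginal[l] else 0 for l in range(n_bits)]
            for row in lam]

lam = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.05, 0.45, 0.45],
]
codes = hash_codes(lam)
print(codes)  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]
```

Comparing against a per-topic reference rather than a single global threshold keeps frequent topics from switching on the same bit for almost every image.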

III. DATA SET DESCRIPTION AND EXPERIMENTAL DESIGN
In this letter, the UCMerced [14] and the EuroSAT [15] data sets have been utilized because they are two important benchmark RS archives. On the one hand, the UCMerced archive contains 2100 RGB aerial images with a size of 256 × 256 pixels and a spatial resolution of 0.3 m. To evaluate the performance of the proposed method in the UCMerced archive, we have used the multilabel annotations of each image available at http://bigearth.eu/datasets. These annotations include 17 different semantic classes. Each UCMerced image is associated with a number of labels that varies between 1 and 7. To characterize the UCMerced archive, we have used a bag-of-visual-words (BOVW) representation of the local invariant features extracted by the scale-invariant feature transform (SIFT). To obtain the BOVW representation, the images have first been converted to gray scale. Then, the SIFT descriptors of each image have been obtained. Subsequently, k-means clustering with k = 512 has been applied to 100 000 randomly selected SIFT descriptors. Finally, each image has been encoded as a histogram of visual words, normalized by the L2-norm.
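The final encoding step of the BOVW pipeline can be sketched as follows, assuming the codebook and the local descriptors are already available. Random arrays stand in for the k-means centers and for the SIFT descriptors of one image, the codebook has 16 words instead of 512 to keep the example small, and the helper name `bovw_histogram` is hypothetical.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Encode local descriptors as an L2-normalized histogram of visual words."""
    # Squared distances between every descriptor and every visual word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                     # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

rng = np.random.default_rng(0)
codebook = rng.random((16, 128))       # stand-in for the k-means centers
descriptors = rng.random((200, 128))   # stand-in for one image's SIFT descriptors
hist = bovw_histogram(descriptors, codebook)
```

The resulting vector plays the role of the $K$-dimensional image descriptor, with $K$ equal to the codebook size.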
On the other hand, the EuroSAT archive includes 27 000 Sentinel-2 images with a size of 64 × 64 pixels. In this letter, we have used the RGB bands that have a spatial resolution of 10 m. The retrieval assessment in the EuroSAT has been conducted using the single-label annotations available at https://github.com/phelber/EuroSAT, which comprise ten classes. The number of samples per class in this archive varies from 2000 to 3000. To characterize the EuroSAT archive, we have used the deep features extracted by the pretrained ResNet-18 convolutional neural network [15]. Specifically, the images have been initially scaled to 224 × 224 × 3. Then, the ResNet-18 feature maps have been extracted and normalized by the softmax function to characterize each input image as a 1 × 512 feature vector. Note that the input feature space is independent of the presented unsupervised hashing approach.
The proposed method has been compared with seven state-of-the-art unsupervised hashing methods: the AGH [7], the CH [9], the DSH [11], the HamH [10], the IsoH [8], the KULSH [5], and the LSH [2]. These methods have been selected because they are relevant single-hash-code methods that have also been employed in other related works [1], [3]. The standard pLSA [12] has also been included as a baseline. The experimental parameters have been set according to the suggestions made in the corresponding articles. In the case of the proposed method, we have considered a general configuration with $Z = 8$, $\delta = 1/Z^2$, and $I = 100$. The retrieval results are provided in terms of the average precision and recall metrics, obtained when considering the top-20 retrieved images in both archives. For the UCMerced, we select each image from the archive as a query, whereas 100 random image queries per class are selected for the EuroSAT. Moreover, five Monte Carlo runs have been conducted to obtain the average and standard deviation results, and two different code lengths have been tested, $L \in \{16, 32\}$.
TABLE I: AVERAGE PRECISION, RECALL, AND TIME OF THE PROPOSED METHOD AND SOME STATE-OF-THE-ART COMPETITORS FOR THE UCMERCED
Table I reports the multilabel average precision and the average recall scores obtained by the proposed pLSH and the compared state-of-the-art unsupervised hashing methods for the UCMerced archive with $L = 16$ and $L = 32$. From Table I, one can see that the proposed pLSH provides the highest precision and recall compared to the other methods for both code lengths. In addition, the IsoH and the HamH achieve the second- and third-best average performances, respectively. When $L = 16$, the improvement of the pLSH with respect to the second-best method is 6.24% in terms of average precision and 6.64% in terms of average recall. When $L = 32$, the pLSH outperforms the second-best method by 3.51% in terms of precision and 2.05% in terms of recall.
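The top-k evaluation can be illustrated with a small single-label sketch. It assumes that a retrieved image is relevant when it shares the query label, that precision is the fraction of relevant images among the k retrieved ones, and that recall is the fraction of all relevant archive images that were retrieved; the multilabel precision used for the UCMerced may be defined differently. The function names and the toy codes are illustrative.

```python
def hamming(a, b):
    """Number of differing bits between two equal-length codes."""
    return sum(u != v for u, v in zip(a, b))

def precision_recall_at_k(query_code, query_label, codes, labels, k=20):
    """Rank the archive by Hamming distance and score the top-k list."""
    order = sorted(range(len(codes)), key=lambda i: hamming(query_code, codes[i]))
    hits = sum(labels[i] == query_label for i in order[:k])
    relevant = sum(lab == query_label for lab in labels)
    return hits / k, (hits / relevant if relevant else 0.0)

codes = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1], [1, 1, 1, 0], [0, 0, 1, 1]]
labels = ["forest", "forest", "river", "river", "forest"]
p, r = precision_recall_at_k([0, 0, 0, 0], "forest", codes, labels, k=2)
print(p, round(r, 3))  # 1.0 0.667
```

Averaging these two values over all queries, and then over the Monte Carlo runs, gives figures comparable in spirit to the reported averages and standard deviations.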
Analyzing Table I in more detail, one can also see that the standard deviations associated with the pLSH are always among the three lowest values, all of them being below 0.5%. These results show that the proposed method provides competitive advantages with respect to other state-of-the-art methods in the experiments with the UCMerced archive. Regarding the retrieval results of the methods used for comparison, the IsoH obtains (on average) the second-best performance, followed by the HamH, LSH, CH, KULSH, DSH, and AGH. As an illustration, it is possible to see, in Table I, that the performance improvement of the IsoH with respect to the HamH is 1.92% in terms of precision and 1.35% in terms of recall (when $L = 32$). The average processing times of the three best methods (pLSH, IsoH, and HamH) are 4.14, 0.88, and 1.62 s, respectively. Fig. 2 shows an example of the images retrieved by the two best hashing methods, i.e., the proposed pLSH and the IsoH, for the UCMerced archive. Specifically, Fig. 2(a) shows the selected query image, while Fig. 2(b) and (c) show the top-5 retrieved samples of the IsoH and pLSH, respectively. Note that the retrieval order is given above the corresponding images, and the land-cover class labels associated with each image are given below such images. By analyzing these visual results, one can observe that the proposed method is able to retrieve images that are semantically more similar to the query. For instance, the fourth- and fifth-retrieved images by the pLSH contain cars, grass, and pavement, which are all present in the query image, whereas the images retrieved by the IsoH contain buildings, which are unrelated to the concept of the query. From Fig. 2, we can also observe that the proposed method tends to retrieve images containing land-cover classes that are more closely related to the query. As an example, the third-retrieved image by the pLSH contains the bare-soil and tree classes, which are not present in the query image but are closely related to the query classes: pavement and grass. The same behavior is observed in the retrieval results of many other query images.
TABLE II: AVERAGE PRECISION, RECALL, AND TIME OF THE PROPOSED METHOD AND SOME STATE-OF-THE-ART COMPETITORS FOR THE EUROSAT ARCHIVE

B. Results: EuroSAT
Table II provides the single-label average precision and the average recall obtained in our experiments with the EuroSAT archive. From Table II, one can see that the pLSH achieves the highest precision and recall scores with respect to the other hashing methods. In detail, the IsoH and the CH obtain the second- and third-best performances, respectively. The improvement provided by the pLSH with respect to the IsoH is 9.07% in terms of precision and 0.07% in terms of recall. In addition, the pLSH outperforms the CH by 16.19% in terms of precision and 0.12% in terms of recall. The average times of the three best methods (pLSH, IsoH, and CH) are 13.85, 3.36, and 9.65 s, respectively. The results confirm that the proposed method provides significant advantages for RS CBIR problems. Nevertheless, the considered EM-based optimization is a computationally demanding process, and further research could be developed in this regard.
At this point, it is worth noting that accurately modeling the complex semantic content present in RS images using binary hash codes is a major challenge in operational retrieval applications. In this regard, the obtained results show that the proposed method is suitable for operational RS image retrieval scenarios, where the images are expected to contain highly complex semantic content. Many existing unsupervised hashing methods try to characterize this semantic complexity using some sort of projection or clustering mechanism. For instance, the IsoH learns different projection functions that allow equal variance across the projected space dimensions. In this way, the number of bits used for each projection can be balanced with respect to the data variance in the original feature space. However, our retrieval results show that the IsoH (as well as the other methods used for comparison) may provide a limited retrieval performance in RS problems since they are unable to effectively extract and exploit the hidden relationships among the different feature patterns present in RS imagery. In contrast, the proposed method aims at enhancing the semantic information carried by each single hash bit by taking advantage of the generative semantic nature of probabilistic topic models. The obtained quantitative and qualitative results demonstrate the effectiveness of our newly developed unsupervised hashing approach and its relevance for operational RS image retrieval scenarios.
V. CONCLUSION
In this letter, a novel unsupervised hashing method based on topic models has been presented for large-scale CBIR problems using RS imagery. Taking advantage of the generative semantic nature of probabilistic topic models, the proposed method defines a new model (pLSH) to learn the binary hash codes of images in large archives in a fully unsupervised manner. The proposed method enables detailed modeling of the semantic content of an image without requiring annotated images. Our experiments demonstrate the potential of topic models to extract and exploit semantic information present in RS images through binary codes. In the future, we plan to extend this work to deep probabilistic models with additional experiments and efficient parallel implementations.