RSS Fingerprinting Dataset Size Reduction Using Feature-Wise Adaptive k-Means Clustering

Modern IoT devices, including smartphones and wearables, usually have limited resources. They require efficient methods to optimize the use of internal storage, provide computational efficiency, and reduce energy consumption. Device resources should be used appropriately, especially when employed for time-consuming and energy-intensive computations such as positioning or localization. However, reducing computational costs usually degrades positioning accuracy. Therefore, the goal of this article is to propose and compare mechanisms, based on adaptive k-means clustering, that compress fingerprinting datasets to save energy without losing relevant information. As a result, we achieved a compression ratio of up to 15.97 with a small decrease (1%) in position error.


I. INTRODUCTION AND MOTIVATION
Mass-market wearables are steadily developing as one of the many future markets of Internet of Things (IoT) applications. Most wearables are typically power-constrained, size-constrained, and cost-constrained devices that integrate several mass-market sensors with various capabilities, ranging from measuring physiological parameters to ensuring low-cost wireless communications and positioning solutions. Many wearables do not have Global Navigation Satellite System (GNSS) chipsets embedded and must perform indoor localization based on non-GNSS sensors, such as Bluetooth Low Energy (BLE), WiFi, ZigBee, Ultra Wide-Band (UWB) chipsets, accelerometers, gyroscopes, and/or barometers [1].
Indoor Positioning Systems (IPSs) based on wearables are attracting growing attention from the research community. For instance, Belmonte et al. [1] compared several IPS solutions for smart homes in Ambient Assisted Living (AAL) in terms of cost, scalability, obtrusiveness, connectivity, interoperability, and extensibility. WiFi-based positioning using fingerprinting with Received Signal Strength (RSS) measurements was found to offer the best trade-off in terms of the considered criteria. However, energy consumption was not included in their study.
The authors in [2] discussed the most common wireless technologies on wearables and pointed towards BLE as a lower cost solution than WiFi. Wireless positioning was only briefly addressed and was assumed to rely primarily on GNSS chipsets.
Wearable-based positioning with BLE sensors and RSS measurements has also been addressed recently in [3], by using Machine Learning (ML) techniques. A 5-layered Artificial Neural Network (ANN) was 8.5 times faster than k-NN at the expense of decreasing the accuracy of the label-based (classification) positioning system by 5%.
The above-mentioned recent research efforts show the increased interest in building more accurate, more energy-efficient, and more robust IPSs involving wearable devices. Our paper focuses on fingerprint-based methods and, in particular, on the representation of the RSS values in the radio map. The authors propose a novel compression method to reduce the data storage requirements for the RSS values, while benefiting the IPS positioning accuracy in wearable devices.
The main contributions of this paper are:
• A novel approach to applying clustering to RSS datasets.
• A new method for data compression utilizing k-means clustering and the substitution of RSS measurements with a reduced alphabet representation.
• The validation of the method on 16 different RSS datasets.
• The source code for its implementation, offered in open access to the research community.
This article is divided into the following sections. Section II gives a general overview of the related work. Section III describes the method developed for fingerprinting data compression. Section IV presents the experiments and results. Finally, Section V provides the main conclusions of this work.

II. RELATED WORK
Given that computing efficiency, dimensionality reduction, and data compression are highly demanded in IPS, different authors have proposed multiple methods based on clustering [4], radio-map reduction [5], [6] and other complex-search algorithms [7]. These methods may be executed in dedicated servers, smartphones, and even in low profile devices. Additionally, the combination of IoT, ML, and wearable devices requires efficient algorithms in order to reduce energy consumption [8].
WiFi fingerprinting is commonly used for indoor positioning due to the fact that WiFi is widely deployed in multiple environments (indoor and outdoor). However, this method requires having one or more datasets (radio maps) that are necessary to estimate the user's position. In many cases, they are large datasets with thousands of samples which are not suitable for some IoT devices. Moreover, computing the distances to all fingerprints in the radio map might be too inefficient [9], especially in large operational areas.
To reduce the size of WiFi radio maps, some authors have proposed dimensionality reduction. For instance, Abed et al. [10] exploit the ability of some APs to transmit more than one signal (Multiple Service Set Identifiers, MSSID), and they propose a new dimensionality-reduction technique based on it. Their objective is to identify the most relevant APs, improve computational efficiency, and reduce the effects of multipath propagation. Their approach is divided into two phases: an offline phase and an online phase. The offline phase is devoted to combining the MSSID vectors for each AP, reducing the multipath effect. Additionally, in this step, a location-based clustering process groups samples into small zones. In the online phase, the operational fingerprint is compared with the centroids of each cluster to find the most similar one. Finally, the authors estimate the position by using the k-nearest neighbour algorithm (or k-NN) with the reference fingerprints falling into the selected cluster.
Other researchers focus on the minimal description length of the data (data compression), for instance, by using Symbolic Aggregate ApproXimation (SAX) [11], [12], XOR-based compression, Simple-8b, etc. [13]. These algorithms also provide computational efficiency. For example, Baldini et al. [14] study the use of the SAX approach in RF fingerprinting, demonstrating that this algorithm is more computationally efficient, reducing the execution time to 30% of that required for the original time series.
Doan et al. [15] proposed a new framework based on lossless compression in order to provide efficient data storage and data indexing. This framework was divided into six blocks which are: data encoding, splitting, zigzag encoding, bit conversion, aggregation, and padding aggregate record. As a result, they saved 97% of the storage space, which is almost 3% more than the other techniques used for data compression.
Azar et al. [16] studied the effects of using lossy data compression techniques on time series by using deep learning. Their main approach is the combination of an error-bound compressor (Squeeze) and the Discrete Wavelet Transform (DWT) lifting scheme, obtaining a high data compression ratio.
Based on the related work, it is clear that applying data compression or dimensionality reduction to the datasets brings benefits. In most cases, these techniques provide high computational efficiency while extending the lifetime of IoT devices. However, when lossy techniques are used, the compressed data cannot be restored to its original form, and therefore, some amount of data is lost in the compression or dimensionality reduction process. This may lead to decreased positioning accuracy of the IPS.
The approach of utilizing clustering for the purpose of data compression or improving the performance of the system was explored in the past, e.g., in [17]. Traditionally, clustering on the fingerprinting data is realized by finding similarities in the fingerprints across all features and assigning a cluster to each fingerprint sample. The assigned clusters are then utilized to speed up the localization of the user by narrowing the search for similar fingerprints. This paper explores the utilization of clustering on each measurement separately, thus substituting the actual measured RSS value with the cluster index. As a result, the size of the whole dataset is significantly reduced without reducing the number of measurements and without significant degradation of the dataset quality for localization purposes. Additionally, the proposed method is able to operate online, efficiently shifting cluster centroids with each newly measured fingerprint.

III. PROPOSED DATA COMPRESSION ALGORITHM
The symbols and notations used in this paper are captured in Table I.
The method proposed in this paper aims to reduce data storage requirements, and it is based on clustering. The unsupervised learning method k-means was chosen due to its low complexity and its ability to find patterns in the non-complex data included in RF fingerprinting datasets. The novelty of our method comes from the fact that it reduces storage requirements by substituting the measured values with a reduced "alphabet", which represents the centroids assigned to each feature of each sample. This is different from the traditional approaches, which typically reduce the number of features in the dataset using principal component analysis (PCA), autoencoders (AEs), or other dimensionality reduction machine learning approaches.
This section includes a short introduction of k-means clustering, followed by the description of the proposed model for data compression.
Lloyd's algorithm, or k-means, is the most commonly used clustering algorithm [18]. There, each cluster is represented only by the coordinates of its centroid. The method requires the initial dataset, which is clustered in two repeating steps similar to the expectation-maximization algorithm used in more complex, stochastic methods. The algorithm is initiated by selecting the initial centroids of the clusters, either at random, using given coordinates, or using e.g. the k-means++ algorithm [19]. In the first step, each sample is assigned to the nearest cluster centroid, based on the chosen distance metric, usually Euclidean. The second step consists of shifting each centroid's coordinates to better represent the assigned data, as the mean coordinate of the assigned samples. These two steps are repeated until the samples no longer change their assigned clusters, or until a maximum number of iterations is reached.
In the first stage, our proposed method applies k-means with k-means++ initialization [19] on the reference dataset, namely the radio map with access point (AP) measurements. In the second stage, which is designed to operate online, new samples are assigned to the existing clusters and the cluster centroids are adjusted based on the new sample coordinates. Because of that, the clusters always represent the whole dataset at the given time and adjust accordingly with each new sample. The following paragraphs describe the proposed approach.
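The two alternating steps described above can be sketched for scalar values such as individual RSS readings. This is a minimal illustration under stated assumptions rather than the implementation used in this paper: the function name `lloyd_kmeans_1d` is hypothetical, and the initial centroids are drawn at random from the distinct input values instead of using k-means++.

```python
import numpy as np

def lloyd_kmeans_1d(values, k, max_iter=100, seed=0):
    """Cluster scalar values with Lloyd's algorithm (random initialization)."""
    values = np.asarray(values, dtype=float).ravel()
    rng = np.random.default_rng(seed)
    # Initialization (assumption): k distinct observed values serve as centroids.
    centroids = rng.choice(np.unique(values), size=k, replace=False)
    labels = np.full(values.shape, -1)
    for _ in range(max_iter):
        # Step 1: assign every value to its nearest centroid.
        distances = np.abs(values[:, None] - centroids[None, :])
        new_labels = np.argmin(distances, axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the centroids are final
        labels = new_labels
        # Step 2: move each centroid to the mean of its assigned values.
        for j in range(k):
            members = values[labels == j]
            if members.size:
                centroids[j] = members.mean()
    return centroids, labels

rss = [-90, -89, -91, -50, -51, -49]
centroids, labels = lloyd_kmeans_1d(rss, k=2)
```

With two well-separated groups of readings, as in the usage lines, the centroids settle on the two group means regardless of which values are drawn at initialization.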

A. First Stage
The data in the fingerprinting dataset consist of individual samples. Each sample consists of a feature vector (power level measurements from the considered APs), and a target vector (a set of values, usually spatial coordinates). A feature refers to the power level measurement from a single, specific AP across all samples. Here, we assume that the data is stored per AP. Alternatively, the data can also be stored per measurement point [17].
At the beginning of the first stage, the multiple features are either merged, if they represent the same physical entity, e.g. RSS measurements from separate APs or antennas, or are kept separate if they represent differing attributes, for example, time of arrival and angle of arrival. The merged features then share the same cluster centroids. In this paper, we consider datasets with only RSS measurements. Under the assumption of equivalent APs, all features can be merged for the clustering.
Next, the number of clusters is calculated from the initial RSS data. In this paper, we calculate the number of clusters for each dataset (or for each group of features in case of more than one group of merged features) by linearising the two dimensions of the radio map (samples and features) and applying the formula shown in Eq. 1 to all individual RSS values.
where $K$ refers to the number of clusters, $X(:)$ refers to the whole radio map reshaped into a single vector, and unique() is a function returning the number of unique values in its input. Based on the required compression ratio (CR) and tolerance of the method, the number of clusters can be adjusted. The k-means clustering is then applied to the dataset, resulting in each feature of each sample (every single measurement) being assigned to a single cluster. The dataset is then stored as the set of cluster indexes, instead of the measured values.
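The substitution of measured values by cluster indexes can be illustrated with a short NumPy sketch. It assumes that the centroids have already been obtained by k-means on the linearised radio map; the function names are hypothetical, and the number of clusters is simply taken as the length of the supplied centroid array, since the formula of Eq. 1 is not reproduced here.

```python
import numpy as np

def count_unique_rss(radio_map):
    """Number of distinct values in the linearised radio map, i.e. unique(X(:))."""
    return int(np.unique(np.asarray(radio_map).ravel()).size)

def encode_radio_map(radio_map, centroids):
    """Replace every RSS measurement by the index of its nearest centroid."""
    radio_map = np.asarray(radio_map, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    flat = radio_map.ravel()  # samples x features -> single vector
    indices = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    # uint8 storage is an assumption that holds for at most 256 clusters.
    return indices.astype(np.uint8).reshape(radio_map.shape)

def decode_radio_map(indices, centroids):
    """Recover approximate RSS values from the stored cluster indexes."""
    return np.asarray(centroids, dtype=float)[indices]

radio_map = np.array([[-90.0, -50.0], [-88.0, -52.0]])
centroids = np.array([-89.0, -51.0])
codes = encode_radio_map(radio_map, centroids)
recovered = decode_radio_map(codes, centroids)
```

The encoded matrix keeps the shape of the radio map, so no samples or features are dropped; only the alphabet of stored values shrinks.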
Along with the clustered dataset, the centroid coordinates and the number of RSS measurements in each cluster are recorded and stored. Due to the limited number of clusters, each dataset entry can be stored using a significantly reduced number of bits, as shown in Eq. 2, instead of using e.g. 64 bits, as is the standard for the double-precision floating-point format. The table that converts cluster indexes to coordinates and the array with the number of measurements assigned to each cluster must be stored as the necessary overhead.
where $n_{bits}$ refers to the minimum number of bits required to store one feature of the sample, and the ceil function rounds a number up to the nearest integer, if necessary.
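The storage arithmetic can be checked with a small worked example. The sketch below assumes that Eq. 2 takes the form ceil(log2(K)), which is the minimum number of bits needed to index K clusters, and that the overhead consists of one 64-bit centroid and one 32-bit count per cluster; both the function names and the overhead sizes are assumptions made for illustration.

```python
def bits_per_entry(k):
    """Minimum bits to index k clusters; equals ceil(log2(k)) for k >= 2."""
    return max(1, (k - 1).bit_length())

def compression_ratio(n_samples, n_features, k, original_bits,
                      overhead_bits_per_cluster=64 + 32):
    """Original dataset size divided by compressed size plus table overhead."""
    n_entries = n_samples * n_features
    original_size = n_entries * original_bits
    compressed_size = n_entries * bits_per_entry(k)
    overhead = k * overhead_bits_per_cluster  # centroid table and counts (assumed sizes)
    return original_size / (compressed_size + overhead)

# 7-bit integer RSS values stored with 9 clusters need 4 bits per entry.
cr_integer = compression_ratio(10_000, 100, k=9, original_bits=7)
# 64-bit real-valued RSS values stored with 16 clusters also need 4 bits per entry.
cr_real = compression_ratio(10_000, 100, k=16, original_bits=64)
```

For a radio map of this size the overhead is negligible, so the ratio is dominated by the quotient of the original and the reduced bit widths.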

B. Second Stage
In the first stage, the initial dataset was clustered and compressed. The second stage of the method is fed with a set of independent fingerprints not used in the first stage, e.g. samples from the testing set or newly measured ones. In the second stage of the algorithm, the new sample is obtained and processed, to be added to the existing dataset. After the assignment of the features to clusters based on the given distance metric, the sample is added to the existing compressed dataset. Next, the centroid and the count of each cluster assigned to the sample are updated as shown in Eq. 3 and Eq. 4.
where $i$ and $i+1$ refer to the current and the following iteration, respectively, $C_n$ refers to the $n$-th cluster's centroid coordinate, $N_n$ refers to the $n$-th cluster's count in the dataset, and $s^{t}_{m}$ refers to the $t$-th sample's value of the $m$-th feature. In other words, each feature of a new sample updates the centroid coordinate of its assigned cluster based on its distance from the previous centroid coordinate and on the number of feature values assigned to that cluster.
The centroid updates enable the dataset to best represent all the assigned data, rather than only the initial dataset, as would be the case without the updates. For computational efficiency, the updates can be performed in batches, instead of with each sample. The algorithms are given in Algorithms 1 and 2, whereas the workflow is depicted in Fig. 1.
To summarize, the proposed method enables efficient dataset compression using a reduced "alphabet" of values, without reducing the number of samples or features themselves. The method considers the features representing the same physical entities together, reducing the required overhead of the conversion table. The trade-off between the degree of compression and data distortion due to the compression can be adjusted by increasing or reducing the number of clusters.
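One update rule consistent with this description is the incremental (running) mean, in which the count of the assigned cluster is increased by one and the centroid moves towards the new value by the distance divided by the new count. The sketch below assumes this form for Eq. 3 and Eq. 4; the function name `online_update` is hypothetical.

```python
import numpy as np

def online_update(centroids, counts, sample):
    """Assign the features of one new sample and update the assigned clusters."""
    centroids = np.array(centroids, dtype=float)  # copies, inputs stay untouched
    counts = np.array(counts, dtype=np.int64)
    sample = np.asarray(sample, dtype=float)
    # Assignment: nearest centroid for every feature value of the sample.
    indices = np.argmin(np.abs(sample[:, None] - centroids[None, :]), axis=1)
    # Update (assumed running-mean form): count first, then centroid shift.
    for n, s in zip(indices, sample):
        counts[n] += 1
        centroids[n] += (s - centroids[n]) / counts[n]
    return indices, centroids, counts

indices, new_centroids, new_counts = online_update(
    centroids=[-90.0, -50.0], counts=[3, 1], sample=[-86.0, -52.0])
```

With this rule each centroid remains the exact mean of all values ever assigned to it, so only the centroid table and the counts need to be kept, not the original measurements.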

IV. EXPERIMENTS AND RESULTS
Research results should fulfill three main considerations: repeatability, replicability, and reproducibility. They are essential in the research area, and repeatability is also mentioned in ISO/IEC 18305:2016 [20]. In this section, we provide all of the information required to reproduce the experiment. Also, the source code is available online on Zenodo [21].
The datasets consist of Wi-Fi RSS measurements in dBm in different environments. Each dataset is separated into training and testing datasets. For the purposes of this work, the training dataset was used for the first stage, including k-means clustering. The testing dataset served as the source of individual samples for the second stage of the algorithm. Additionally, all datasets include position references for each sample, containing the coordinates, building and floor indexes.

B. Evaluation metrics
To evaluate the performance and viability of the proposed method, the following metrics are considered. First, the mean squared error (MSE) between the original dataset samples and the recovered dataset was calculated in two instances. $MSE_{S1}$ evaluates the MSE between the original and recovered data from the initial dataset after the first stage. $MSE_{S2}$ evaluates the MSE between the original testing dataset and its recovered version after the second stage of the algorithm. Second, $\delta_{MSE}$, representing the difference between $MSE_{S1}$ and $MSE_{S2}$ as shown in Eq. 5, is utilized to evaluate the capability of the method to adapt to the new data.
The impact of the compression on the data quality and the amount of information it contains was evaluated by comparing the positioning accuracy based on the k-Nearest Neighbor (k-NN) classifier. Each dataset was evaluated using a 10-Nearest Neighbour (10-NN) classifier both before and after compression, and the mean 3D positioning error ratio $\xi_{KNN}$ before and after compression was calculated, as shown in Eq. 6. The same classifier (i.e., 10-NN) was used to evaluate each dataset under the same conditions and hyperparameters. Finding the optimal value for the number of considered neighbors for each dataset is outside of the scope of this paper, as the classifier only compares the performance of the uncompressed and the compressed data.
where $MSE_{original}$ refers to the mean positioning error of the original dataset and $MSE_{reconstructed}$ refers to the mean positioning error of the recovered dataset, using the 10-NN classifier after the second stage. A 3D positioning error ratio larger than 1 represents an increase of the positioning error, while $\xi_{KNN}$ lower than 1 means the positioning error decreased due to the compression.
Finally, the compression ratio (CR) of the method was evaluated to reflect how efficiently the proposed model reduces the storage requirements. It was calculated as the ratio between the original dataset size and its reduced size using optimum coding (see Eq. 2). Smartphones usually provide quantized RSS values within the range [−105, ..., −30] dBm, which can be represented with 7 bits. In case of RSS post-processing, such as averaging the measurements over a specified area in datasets TUT 1, TUT 2, TUT 5 and MAN 2, the resulting RSS values are stored in double (64 bits) format.
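The reconstruction error described above can be computed directly from the stored cluster indexes and the centroid table. The following sketch uses a hypothetical helper name and toy values; it evaluates the MSE for one dataset and leaves the sign convention of Eq. 5 open.

```python
import numpy as np

def reconstruction_mse(original, indices, centroids):
    """Mean squared error between the measured RSS values and the values
    recovered from cluster indexes and the centroid table."""
    original = np.asarray(original, dtype=float)
    recovered = np.asarray(centroids, dtype=float)[np.asarray(indices)]
    return float(np.mean((original - recovered) ** 2))

original = [[-90.0, -50.0], [-88.0, -52.0]]
indices = [[0, 1], [0, 1]]
centroids = [-89.0, -51.0]
mse = reconstruction_mse(original, indices, centroids)
```

Applying the same function to the training set after the first stage and to the testing set after the second stage gives the two values that are compared.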
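The positioning error ratio can be sketched as follows, assuming that Eq. 6 divides the mean error obtained with the recovered dataset by the mean error obtained with the original dataset, which matches the statement that values above 1 indicate an increase of the error. The position estimate used here, the average of the coordinates of the k nearest fingerprints, is also an assumption, as are the function names.

```python
import numpy as np

def mean_knn_error(train_rss, train_pos, test_rss, test_pos, k=10):
    """Mean Euclidean positioning error of a k-NN estimator in RSS space."""
    train_rss = np.asarray(train_rss, dtype=float)
    train_pos = np.asarray(train_pos, dtype=float)
    test_rss = np.asarray(test_rss, dtype=float)
    test_pos = np.asarray(test_pos, dtype=float)
    # Pairwise Euclidean distances between test and training fingerprints.
    dist = np.linalg.norm(test_rss[:, None, :] - train_rss[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    # Assumed estimator: average the coordinates of the k nearest fingerprints.
    estimates = train_pos[nearest].mean(axis=1)
    return float(np.linalg.norm(estimates - test_pos, axis=1).mean())

def error_ratio(error_original, error_reconstructed):
    """Ratio above 1: error grew after compression; below 1: it shrank."""
    return error_reconstructed / error_original

train_rss = [[-40.0, -80.0], [-80.0, -40.0]]
train_pos = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
err_k1 = mean_knn_error(train_rss, train_pos, [[-42.0, -79.0]], [[1.0, 0.0, 0.0]], k=1)
err_k2 = mean_knn_error(train_rss, train_pos, [[-42.0, -79.0]], [[1.0, 0.0, 0.0]], k=2)
```

In practice the function would be called twice with the same test samples, once with the original training fingerprints and once with the recovered ones, and the two results passed to `error_ratio`.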

C. Numerical results
In our experiment, k-means clustering was independently executed over 16 selected datasets to create the new alphabets. We used Eq. 1 to set the value of k; the remaining k-means hyperparameters were, for all datasets, the Euclidean distance metric, a maximum of 100 iterations, 100 replicates, and the initialization proposed in k-means++ [19]. Then, the 10-NN algorithm was executed using the original datasets and the reduced ones to evaluate the proposed radio map reduction. The results are reported in Table II.
The table shows the varying $MSE_{S1}$ and $MSE_{S2}$ values across the datasets, which are in all cases but two below 1 dB. The results also show that $\delta_{MSE}$ is less than -0.015 dB, demonstrating that the method adapts well to new data. $\xi_{KNN}$ is 0.989 on average, which corresponds to a 1% decrease in positioning error across the datasets due to compression. The results show that although the compression reduces the number of distinct values in the dataset, the quality of the dataset for positioning purposes actually increases. On average, a CR of 12.55 was achieved across all real-valued datasets (64-bit representation) and 2.04 across the integer-valued datasets (7-bit representation). The repeating values of compression ratios in integer-valued datasets are caused by the constant ratio between the original and the reduced bit representation, as the overhead is negligible (e.g. a CR of 1.75 is achieved by compressing 7-bit values into a 4-bit representation). The trade-off between the CR and $\xi_{KNN}$ (as well as all MSE metrics) can be controlled by adjusting the number of clusters of the method.

D. Discussion
The authors have presented the proposed dataset compression method and validated its usability on 16 different datasets. In comparison to e.g. SAX, the method does not require the assumption of a Gaussian distribution of the data, nor any prior knowledge about the data statistics. However, this may lead to underfitting of the dataset if the number of clusters is chosen too low, resulting in significant information loss due to the compression. The method is also vulnerable to changes in the environment, which is a common problem of fingerprinting datasets, as changes in sample distributions will lead to a decrease in accuracy. In such cases, a new initial dataset should be created, as the original samples from the first dataset no longer reflect reality.
The significant differences in the number of clusters between the datasets are caused mostly by the postprocessing of the TUT 1, TUT 2, TUT 5 and MAN 2 datasets, which results in a larger number of unique values in them [22]. It is also worth mentioning that the datasets LIB 1&2 contain measurements from the same area and device, measured 10 months apart. Despite this, each of the datasets was assigned a different number of clusters, probably caused by the rounding of the ceil() function in the first stage. Also, the presence of outlier devices providing untrusted RSS values might degrade the IPS. Regulating the automatic selection of the number of clusters will be studied and improved in the future.
The authors acknowledge the lack of the validation set in the datasets, which will be added later as the paper presents the results of the preliminary study.
Future work will also concentrate on a thorough comparison of the method with the current state-of-the-art methods, as well as on combining feature-reduction methods such as PCA or AE with the proposed one, to further increase the compression efficiency without losing positioning accuracy.

V. CONCLUSIONS
This paper explores a novel approach to clustering of RSS datasets for RSS-based indoor positioning on wearables, towards more energy-efficient solutions. It introduces a new and efficient method of data compression based on k-means clustering and substitution of RSS measurements with a reduced alphabet representation.
The developed compression method allows for optimizing the storage space used by the WiFi fingerprinting datasets, resulting in reduced computational load in the online phase of this positioning technique. The proposed method achieved significant dataset compression and slightly improved the accuracy of the position estimation (see Section IV-C). As a result, the proposed method achieved a CR of 12.55 in the real-valued datasets and 2.04 in the integer-valued datasets, and the positioning error was reduced by 1% on average.
Finally, the paper discusses the shortcomings of the current method, highlighting the challenge of automatic selection for the number of clusters. In future work, this method will be compared with other existing compression methods in order to test the efficiency and robustness of the proposed work.