A Comprehensive and Reproducible Comparison of Clustering and Optimization Rules in Wi-Fi Fingerprinting

—Wi-Fi ﬁngerprinting is a well-known technique used for indoor positioning. It relies on a pattern recognition method that compares the captured operational ﬁngerprint with a set of previously collected reference samples (radio map) using a similarity function. The matching algorithms suffer from a scalability problem in large deployments with a huge density of ﬁngerprints, where the number of reference samples in the radio map is prohibitively large. This paper presents a comprehensive comparative study of existing methods to reduce the complexity and size of the radio map used at the operational stage. Our empirical results show that most of the methods reduce the computational burden at the expense of a degraded accuracy. Among the studied methods, only k -means, afﬁnity propagation, and the rules based on the strongest access point properly balance the positioning accuracy and computational time. In addition to the comparative results, this paper also introduces a new evaluation framework with multiple datasets, aiming at getting more general results and contributing to a better reproducibility of new proposed solutions in the future.


INTRODUCTION
L OCATION information bridges the gap between the physical and the digital worlds, creating new opportunities and challenges. As smart devices and pervasive mobile connectivity are increasingly penetrating our daily lives, more and more applications and location-based service (LBS) are built on the location awareness.
Outdoors, Global Navigation Satellite System (GNSS) technologies successfully provide position estimates, even in low satellite coverage situations, as long as they are combined with other well-known technologies such as inertial sensors, cellular networks, or IEEE 802.11 Wireless LAN (Wi-Fi) [1,2,3]. However, people spend about 80 % of their time indoors [4,5], so indoor positioning and tracking is of high relevance for LBS. The difficulty of achieving a model that fits every indoor environment and which can deal with particularities such as signal multipath and heterogeneity of devices and building structures, makes the indoor location estimation a challenge [6]. Despite that, Wi-Fi fingerprinting (FP) is among the preferred indoor positioning technologies.
• J. Torres-Sospedra, is with UBIK Geospatial Solutions, Spain. E-mail: torres@ubikgs.com • P. Richter is with u-blox, Finland. E-mail: philipp.richter@u-blox.com In contrast to other approaches using Wi-Fi as the main positioning technology (e.g. proximity or ranging), Wi-Fi fingerprinting does not require information about the position of the APs. Wi-Fi fingerprinting relies on a set of fingerprints taken at well-known positions for the position estimation; i.e., Wi-Fi FP requires a reference dataset (or radio map) to operate. Different well-known methods tackle this problem, including the Nearest Neighbour (NN) algorithm k-NN [7], Gaussian kernels [8], Bayesian models [9], Neural Networks [10] and, even, Deep Learning [11,12].
Among the techniques mentioned above, the ones based on advanced Machine Learning (ML) are the most accurate ones [13]. The high accuracy comes at the expense of high complexity; which is often prohibitive for smartphone implementation [14]. The other methods balance better positioning accuracy and computational complexity. Especially k-NN stands out, because it is simple but able to achieve very good positioning performance. For these reasons, k-NN has been widely adopted [7,15,16] -even on international competitions [17,18,19]-and we adopt it as base algorithm in this study. Nevertheless, the computational complexity can be an issue and requires further attention; especially because of the inherent trade-off between positioning accuracy and computational complexity.
Most FP methods (except the advanced ML methods) share the drawback that in the on-line phase the computational costs increase the with number of fingerprints [20]. This operation is the computationally most demanding operation [21], because for each operational fingerprint the distance to all reference fingerprints are calculated. How-ever, the computational load in the on-line phase can be reduced if the distance calculation is restricted to a subset of reference fingerprints relevant to the operational fingerprint.
The approaches to alleviate the computational burden in the operational phase can be divided in two categories, namely clustering and optimization rules. Some of them focus on creating groups (clusters) of similar fingerprints off-line, and then apply a two-level search in the on-line (operational) phase, for instance, using k-Means clustering to split the radio map and create the cluster centroids [22]. For each operational fingerprint, the distances to the centroids are first computed and, then, the distances to all the reference fingerprints in the nearest cluster are calculated to estimate the final position (see Figure 1a). Other works correspond to optimization rules (heuristics) based on signal propagation characteristics, where the distance calculation is restricted to the reference fingerprints that are relevant (see Figure 1b). For instance, keeping the reference fingerprints whose strongest AP matches the operational one [23]. Given the large amount of different approaches, either clustering methods [8,22,24,25] or optimization rules [23,26,27], it would be interesting to know which of the proposed methods achieves the best trade-off between positioning accuracy and computational complexity. Most of the studies focus on the positioning accuracy only. Moreover, these studies are not comparable, as they differ in their experimental setups, the area of evaluation, the localization devices, the strategy to collect data, or they are tailored to specific environments [28]. A few studies address the computational effort [8,27]. Similar issues prevail here, the methodology for evaluating the computational complexity and the used metrics are so different that these methods are not comparable either. Having a comprehensive evaluation framework will enable the research community to get general results and allow for direct results comparisons.
This paper aims to fill this gap and introduces a comparison, through experimental evaluation, of existing clustering and optimization rules for Wi-Fi fingerprinting based on k-NN . Moreover, it also provides the tools to reproduce and extend this work, which might be useful for the research community when evaluating a new method. The major contributions of this paper are: • Identification of the best strategy (clustering or optimization rule) depending on the scenario features.
• Identification of the best general methods to reduce computational costs in fingerprinting.
• Independent evaluation of surveyed methods in terms of positioning error and execution time.
• Evaluation procedure to normalize results and extract general conclusions from different perspectives.
• Supplementary materials for research reproducibility and to allow this work extension.

BACKGROUND AND RELATED WORK
In fingerprinting (FP) positioning systems, the position estimate is frequently computed from a fingerprint representing the RSS from wireless signals, such as Wi-Fi or Bluetooth Low Energy (BLE), at unknown locations and a reference set with previously collected fingerprints. FP methods typically have two well differentiated phases: the off-line phase (also known as training or learning phase) and the on-line phase (also known as operational or test phase).
During the off-line phase, some known locations are selected for system calibration. In each of those locations, also known as Reference Points (RPs), multiple fingerprints are usually collected to capture the inherent temporal diversity of radio signals due to reflections, refraction, diffraction, scattering and interference. However, there is no standard procedure to carry out the off-line phase which depends on the deployment, the developer's strategies and other characteristics of the environment. The data-collection strategy includes the spatial distribution of RPs, the number of repeated measurements (fingerprints) per RP, the height for the sensing device [26], the user orientations [7], the devices and users [29,30], and also the collection times [29,31]. A common notion, though, is that the more dense the dataset, the lower the positioning error [32,33]. However, this is not always the case, as shown by comparing the density values (Table 2) to the position errors (Table 4). For instance, TUT 6 provides the lowest error with low density values, and UJI 1 provides high error with higher density values.
During the on-line phase, one fingerprint collected at an unknown location is further processed using the selected positioning algorithm to compute a position estimate. A FP algorithm only relies on the radio map to estimate the position. The core idea behind FP is that a pair of similar fingerprints are physically close to each other and, therefore, similar samples in the radio map can be used to compute the location of the unknown fingerprint. Selecting the best model at this stage is a crucial step for FP.
Wi-Fi FP was introduced by Bahl et al. [7] in 2000. RADAR was presented as a system for locating and tracking users inside a building using only the RSS traces from a Wi-Fi network; and it had the above-mentioned two-phase procedure. In the off-line phase, multiple fingerprints were collected at each RP and user's orientation. In the on-line phase, the position was estimated using multiple Nearest Neighbour (k-NN). Twenty years later, Wi-Fi FP and methods based on k-NN are still very popular [19,34,35,36].
There is an intrinsic trade-off when generating the radio map, because the accuracy of the FP-based positioning system increases typically with the density of the radio map, i.e. the number of fingerprints per area, and the number of RSS values per fingerprint [37,38]. However, generating the radio map is a demanding task. Some works have applied crowdsourcing [39,40,41,42,43], interpolation [37,44,45], signal-propagation models [34,46] or Simultaneous Localization and Mapping (SLAM) [47,48] to reduce data granularity and manual site surveying. However, the methods that augment the radio map to reduce the positioning error (e.g. applying Universal Krigging to generate a denser radio map as in [37]) also increase the computational costs in the on-line phase as they depend on the radio map size [49].
Although reducing computational costs is not identified as the main objective in previous works, computing the distances to all the reference fingerprints for every location request might be too inefficient [50]. A literature review on FP in Scopus and Web of Science raised many attempts to improve the accuracy of FP methods and reduce their computational load, being the relevant reproducible works implemented in this paper. Some of them apply optimization rules to reduce the radio map in the operational stage, while others apply clustering to group similar fingerprints.
The Horus indoor positioning system [51] addressed this problem in 2003, where a multi-level clustering process was proposed to estimate the position by means of a probabilistic approach. Only the detected APs in the on-line phase were used for computation and the search is restricted to the RPs covered by the strongest AP -the AP with the largest detected RSS value in the operational fingerprint.
Kushki et al. [8] proposed a kernel-based FP system where spatial filtering was done in the on-line phase. The idea behind the filtering is that the Wi-Fi coverage is similar in adjacent locations. First, a coarse estimation is done to get the RPs which have similar coverage (as the number of common APs) as the on-line fingerprint. Then, the reduced radio map only contains the fingerprints of these RPs.
Gallagher et al. [50] investigated a simple approach for radio-map reduction before computing the fingerprint distances [52,53]. It keeps only the RPs which contained the strongest AP of the operational fingerprint. Gallagher et al. proposed some variants where the number of matching APs from the operational fingerprint was higher -without clearly specifying which APs were added-and by filtering by similar RSS values (±15 or ±20 dB).
Shin et al. [22] applied k-Means clustering (renamed in this paper as c-Means to avoid confusion k-NN) to extract and organise spaces from the radio map. However, c-Means has also been used for coordinate-based clustering [54], floor-wise fingerprint clustering [55] and, even, to cluster the positions of the nearest neighbors obtained with k-NN [56].
Marques et al. [26] filtered the radio map based on the notion that fingerprints are dominated by just one or two APs. For an on-line fingerprint, the reduced radio map only contains the reference samples whose strongest AP matches the strongest AP of the operational fingerprint, or the two strongest operational APs, if their RSSs values are close.
Chen et al. [24] and Caso et al. [57] decomposed the radio map in multiple clusters using the Affinity Propagation algorithm and applied the traditional two step, coarse and fine, location strategy. The novelty in [24] was to select the samples from multiple clusters to avoid the edge problem, whereas the novelty in [57] was the adoption of different metrics in different steps of the estimation procedure.
Razavi et al. [55] proposed a floor estimation method based on c-Means clustering. Their method clusters the data of each floor separately, thus relies on a preliminary search of the database's fingerprints according to their floor label. Subsequently, the mean of the clusters of each floor is computed. The reduction of data and communication overhead is achieved because only the cluster heads of each floor are used to estimate the floor.
Yu et al. [27] proposed to filter the radio map on the fly during the on-line stage. First, the non-detected APs in the on-line fingerprint are removed from the radio map. Then, the fingerprints without any detected RSS values are also removed. Finally, the filtered radio map, which contained the reference fingerprints in the same region where the unknown one was collected, is used for the localization.
Moreira et al. [23] proposed some rules to create subsets of the radio map. To estimate the building, the APs from the on-line fingeprint were sorted from the strongest to the weakest. The reference samples whose strongest AP matched the first AP of the sorted list were selected for the reduced radio map. If the reduced radio map was empty, they moved to the next AP in the list. This was repeated until reaching a valid radio map. For the floor estimation, they restrict the radio map to those reference fingerprints where the strongest AP corresponds to the first, second or third strongest AP of the on-line fingerprint.
Chen et al. [25] reduce the radio map by selecting the fingerprint with the strongest RSS for an AP as a cluster center, thus having as many clusters as APs, in an scenario with eight APs. In the operational stage, they used weighted k-NN upon the most similar cluster selection. No guides were given for contexts with larger amounts of APs, let alone for cases where there are more APs than samples.
Liu et al. [58] explored location-based clustering, where clusters are determined using the minimal radius circles that enclose all RPs. The circles are then used to cluster the RP by their distance to the circles' centers. In the operational stage, they used k-NN or a variant of weighted k-NN upon the most similar cluster selection.
The difficulty in comparing the performance of the methods introduced above is that they have not been evaluated using a common comprehensive evaluation setup. Therefore, from the existing literature, it is impossible to compare their relative merit no matter the considered perspective. Table 1 introduces our notations used further to compare and analyse different radio-map FP methods.

Fingerprinting with reduced radio maps
The approaches to reduce the computational costs can be roughly divided into clustering and optimization rules, where the later commonly refers to some kind of thresholding to decide if a fingerprint deemed relevant and therefore is considered or not. Algorithm 1 shows the processing stages of FP with k-NN over a reduced radio map.
The clustering and optimization rules trade-off on-line computation for mostly off-line computation, but as well TABLE 1 Symbols and notation used in the paper.
T radio map, set of fingerprints and associated reference positionŝ T reduced radio map after clustering or filtering,T ⊆ T V Evaluation dataset, set of labelled fingerprints for testing P set of reference positions/labels, p ∈ P | · | cardinality (e.g., |C| number of clusters, |C i | number of samples in i-th cluster) C set of clusters C = {C 1 , ..., C |C| } C i i-th cluster, C i ⊆ T or C i ⊆ P A set of access points, APγ ∈ A δ average number of fingerprints per m 2 in a 5 m radius circle s t fingerprint of radio map or reduced radio map s v evaluation fingerprint s RSS value γ identifier associating sγ to the APγ emitting the signal for j = 1 to |T | do 6: Compute distance between s v i and s t j 7: end for 8: Sort distances in RSS space 9: Select the k closest candidates 10: Estimate building, floor and position 11: end for 12: return Estimated positions, floors, buildings for some additional on-line computation. Both approaches pre-process the training datasets in the off-line phase. That is, training data (i.e., the radio map) is clustered or certain statistics of the training data -frequently thresholds for subsequent filtering -are computed (cf. line 2 in Alg. 1). During the on-line phase, the clustering methods perform a coarse search over the found clusters, to find the closest cluster (cf. line 4 in Alg. 1) and then match the fingerprints of that cluster with the operational fingerprint (cf. line 6 in Alg. 1). The optimization rules compute the respective statistic of the operational fingerprint, filter this fingerprint according to the computed statistic, intersect the filtered operational fingerprint and the filtered training dataset to obtain the reduced training data (cf. line 4 in Alg. 1) and then match the reduced dataset with the operational fingerprint (cf. line 6 in Alg. 1). The estimation of the position equals that of the baseline algorithm without radio map reduction and is common for clustering and optimization rules (cf. lines 8 to 10).

Methods implemented
For the experimental evaluation described here, some of clustering and optimization-rules introduced in Section 2 were implemented, their code is available in [69]. The ones with lack of key implementation details, i.e. not following reproducible research principles, were discarded.

Clustering methods
The method based on Kushki et al. [8] is Kushkix. In Kushkix, x refers to the threshold value to filter the RPs. We have considered threshold values in the range of [1 . . . 15].
The methods based on c-Means clustering [70] are cMeansTrad and cMeansAlt. For the two methods, c refers to number of intended clusters. The particular values that correspond to c = |T | and c = |T | 25 are identified by 'rfp1' and 'rfp2', respectively. cMeansTrad corresponds to the traditional method used in several indoor positioning works [22,55]. cMeansAlt variant uses the Manhattan distance and the centroid initialization proposed in [71].
The methods based on Affinity Propagation clustering [72] are APCSpaTrad and APCSpaAlt. APCSpaTrad uses the sparse implementation provided in [73]. APC-SpaTrad computes the pairwise similarities for distances among all fingerprints. The alternative version considers the pairwise similarities as the distances among all fingerprints that have at least one AP in common, to reduce the memory and computational cost of the clustering stage.
The methods based on grid-based clustering [74] using fix-sized square cells are Gridx and GridOverlx. The gridbased clustering was suggested for RSS-based clustering by Liu et al. [58]. For the two methods, x refers to the size of the cell in meters. GridOverlx adds new cells that uniformly overlap each original set of four neighboring cells.

Optimization rules
The method inspired by Gallagher et al. [50] and Machaj et al. [75] is Prcntilx. The operational RSSs values are ranked from the strongest to the weakest. The strongest APs falling in the x percentile of that rank are used to find all the fingerprints in the radio map which contain these APs in the x percentile of the corresponding ranks.
The method proposed by Marques et al. [26] is Mar-ques10. It uses the 1 st and 2 nd strongest APs of the on-line fingerprint if their difference is 10 dBm or lower.
The methods based on Yu et al. [27] are FengYu, FengYuOpt and FengYuOptx%. FengYu follows the original method. FengYuOpt additionally pre-computes the reduced radio map for each AP off-line. FengYuOptx% modifies FengYuOpt by considering only fingerprints from the radio map that have at least x % of the total number of APs detected in the on-line fingerprint, similar to [50]. FengYuOptx% does not apply its modification if the reduced radio map is empty.
The methods based on Moreira et al. [23] are Moreira1st, Moreira3st, MoreiraS06 and MoreiraS12. Moreira1st applies the basic filtering described for building estimation. Moreira3st applies the filtering described for the variant 1 of floor estimation. MoreiraS06 and MoreiraS12 apply the filtering described for the variant 2 of floor estimation using 6 or 12 dB as maximum RSS difference respectively.  2 Features of the selected databases. Fingerprint density is computed as the number of fingerprints per reference point (δ fp ). Local density is computed as the average number of fingerprints per m 2 in a circle with a radius of 5 m from each reference and evaluation point (δ T and δ V ). The scenario size is represented by its dimensions or approximate area (Dimension/Area), number of floors (#f), and number of buildings (#b). The average number of APs detected in the fingerprints is shown (Valid APs). The number of devices used to collect the dataset is also shown (Dev.).

Description of data sets
The experiments carried out in this paper include 16 datasets (see Table 2) from 12 different data sources. They have been selected for the experiments because they have diverse characteristics: small, medium and large scenarios; single-floor and multiple-building scenarios; unprocessed RSS data and averaged RSS data per reference point or grid cell; single device collection and device diversity; systematic and crowdsourced data collection; and spatially-sparse but dense radio maps. Moreover, two datasets were collected in the same place following the same data collection strategy with a time interval of 10 months. The strategies used to collect the datasets are summarized in Table 3. Dataset DSI 1 was collected at the Department of Information Systems of the Universty of Minho (Portugal). The dataset DSI 2 is a version of DSI 1 where the repeated instances of the same fingerpring -same RSS values in the same RP -have been removed. Dataset LIB 1 was collected on two floors of Universitat Jaume I Library (Spain) in June 2016, whereas LIB 2 was collected in the same conditions in April 2017. MAN 1 is a dataset that covers the corridors of the second floor of an office building on the campus of the University of Mannheim (Germany), the evaluation set has been reduced by randomly picking 10 fingerprints per evaluation point with respect to the original dataset [60]. In the MAN 2 dataset, the fingerprints from the original dataset have been averaged in 10 blocks of 10 fingerprints for the training and evaluation sets to have one dataset with averaged RSS and fingerprints. We include an artificial dataset, SIM, based on simple the path-loss model with additive Gaussian noise (eq.1) as done in [62,76,77,78].
where s 0 = −40 dB, α = 2, d refers to the distance to the AP ap, d 0 = 1 and X(t) corresponds to the noise modelled as a Gaussian random process with null mean and σ = 2 dB (as it is usually between 2 and 3 [54,76,77,79]). Datasets TUT 1, TUT 3 and TUT 6 were all independently collected in a five-floor building at Tampere University (Finland), but different actors and data collection strategies were used (see Table 3). TUT 4 is identical to TUT 3, but we used the training points as the test points and vice versa. TUT 3 and TUT 4 were collected by crowdsourcing means. TUT 2, TUT 5 and TUT 7 were all independently collected in a three-floor building of Tampere University using different actors and data collection strategies (see Table 3). UJI 1 was collected on three multi-storey buildings of the School of Technology in Universitat Jaume I. UJI 2 contains all UJI 1 data as training set and a new blind test data set collected in a 12-month interval for the IPIN 2015 competition [68].
Although the selected datasets mainly come from four different research teams, the data collection strategy, data structure and the data formats are diverse, cf. [29,60,63]. Due to the different dataset sources and formats, we had to normalize the datasets and apply a common simple format.
The suggested common data format includes training and evaluation data in separated structures, where the inputs (RSS values) and outputs (positions) are also separated.
The input values (RSS) for the training data are stored in a |T | × |A| matrix, where the non-detected APs are expressed with the value +100. The output values (position and labels) for the training data are expressed in a |T | × 5 matrix, This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication.
The final version of record is available at http://dx.doi.org/10.1109/TMC.2020.3017176 where each row contains: the x, y-coordinates (in meters, either in a local or global reference system), the height, the floor identifier and the building identifier. Similarly, the evaluation data include a |V| × |A| matrix with the inputs and a |V| × 5 matrix with the ground truth. For single-floor and/or single-building datasets, the floor and building identifiers were set to 0. For the datasets where the floor height was undefined, it has been calculated as the product of the integer floor identifier and a default height value (3.7 m). The supplementary materials include a copy of the datasets or the scripts to generate them [69].

THEORETICAL ANALYSIS AND EXPERIMENTS
One of the objectives of this work is to compare the computation burden associated to each one of the evaluated methods. This section approaches such comparison from a theoretical, when possible, and experimental points of view.

Time complexity of fingerprinting on-line stage
Before we present the experimental results, we provide an analysis of the asymptotic time complexities of the FP on-line stage, including the clustering and optimization rules. The on-line stage consists of three main processes: i) reduction of the radio map, ii) matching of the training and operational fingerprints and iii) the computation of the position, floor, and building (cf. Alg. 1). For the analysis, we exclude the position estimation stage from the analysis, because it is common for all methods based on k-NN.

Reduction of the radio map
The reduction of the training dataset differs for clustering methods and for optimization rules. Furthermore, both the clustering and optimization rules process either RPs, or RSSs or simply the APs identifiers.
Clustering-based methods obtain the reduced training dataset by finding the cluster that is closest to the operational fingerprint. To that end, the minimum distance between cluster heads and the operational fingerprint is computed. This operation has linear complexity O(c), assuming a simple distance measure is used and where c = |C| is the number of clusters.
In the optimization rules, the RSSs of the operational fingerprints are processed and intersected with the training dataset to obtain the reduced dataset. This operation is linear O(m) for many methods too, where m is the number of elements in the training dataset (m = |T |) or the set of reference positions (m = |P|). However, some optimization rules use the strongest(s) APs to filter the radio map or sort the APs to compute the quantiles on the operational fingerprint. In these cases the worst case complexity of reducing the training database is at best O(m log(m)), with m = |A|, in common implementations 1 . The methods FengYu (and variants), Prcntilk, Marques10, Moreira1st and Moreira3st employ algorithms with that asymptotic complexity.
In worst case, the number of clusters may reach theoretically the number of training fingerprints, c ≤ |T |, or RPs, 1. Octave implements 'Timsort' which has a worst-case complexity of O(n log(n)). Octave's computation of the median exploits nt_element and partial_sort of the standard template library of C++, both also with asymptotic complexity of O(n log(n)).
c ≤ |P|, depending on the clustered quantity. Also the the optimization rules may process all RSSs or labels, so that m ≤ |A|. Unreasonable parameter choices or degenerative distributions of RSSs or RPs may cause such situation. In practise however, these are rather unlikely situations as the number of clusters is usually much smaller than the number of reference fingerprints. For instance, the number of clusters was set to small fractions (ρ = 0.01, 0.05, 0.1) with respect to the reference fingerprints [55]. Similarly, the number of processed APs in optimization rules is much lower than the set of APs identified in the dataset. Figure 2 supports the notion of a moderate average computational complexity also for the methods that require sorting to reduce the training set. It exemplifies the histogram of the number of valid APs per reference fingerprint using all datasets. The histogram shows that two thirds of all reference fingerprints considered in this work contain less than 30 valid APs. That is, the number of elements to sort is indeed much smaller than the total number of APs, and therefore, it has usually has not a large computational cost attached and is not critical. However, the number of detected APs in an operational fingerprint is variable and also depends on the dataset, the location of the operational sample, and the device used to collect the fingerprint, as reported in Table 2.

Matching of the training and operational fingerprints
Once the radio map is reduced, the worst-case time complexity of the fingerprint matching is determined by the choice of the distance measure. The majority of distance measures (see [80,81]) have in fact the same worst-case time complexity, namely linear complexity O(n), where n = |T | is the number of fingerprints of the reduced radio map. The reduced radiomap depends on the clustering and optimization rule. Theoretically, there exist scenarios for all clustering and optimization rules in which the training dataset will not be reduced, so that the maximal value of fingerprints is n = |T |. Not only unreasonable parameter choices may lead to that case, a degenerative distribution of fingerprints or APs may induce such a situation too. For example, if the environment is small and only one dominant AP is observed (the methods and variants proposed by Moreira et al. [23], Marques et al. [26], and Yu et al. [27] are vulnerable to this), if the RPs are all located in a small part of the environment and end up in a single grid, or if the number of fingerprints per RP is one for the method This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication.
The final version of record is available at http://dx.doi.org/10.1109/TMC.2020.3017176 Copyright (c) 2020 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.
proposed by Kushki et al. [8]. However, the inclusion of optimization rules and clustering methods in commercial [82] and competing systems [23] somehow indicate that they efficiently reduce the computational costs when they are properly configured for a particular area.
The computational cost for the clustering methods can be simplified with f (C) = |C| + |T | |C| , ∀|C| ∈ N : |C| ≤ |T |, which shows the vector comparisons required for the coarse (|C| vector distance calculations) and fine-grained ( In contrast, the optimization rules apply some knowledge-based rules to decide whether a reference fingerprint is included in the reduced radio map or discarded. The rules based on the strongest(s) APs, for instance, keep those fingerprints that are near the dominant APs. Thus, the area covered in the reduced radio map not only depends on the location of the operational and reference fingerprints but also on the the APs distribution. Again, the reduced radio map size is tightly coupled to the operational fingerprint.
The worst-case complexity analysis helps to understand the trade-offs to be made, but it does not provide a realistic comparison between the considered methods. A theoretical assessment of the average computational complexity would provide a more accurate picture and possibly a guideline to alleviate the computation burden of FP methods. However, such an assessment depends on multiple dimensions, including the distributions of RSSs, RPs and/or APs of the datasets and, finally, the parameter choices for the clustering and optimization rule(s). This analysis is not feasible in this paper due to number of analyzed alternatives to reduce the radio map and the inner diversity of the considered indoor environments (i.e, the 16 datasets). Instead, for the remaining of this study, we opted for an experimental approach and we assess the average computational complexity through measured execution times, including as well the effect on the positioning accuracy. We encourage that further fingerprint models based on optimization rules or clustering include a theoretical assessment as done, for instance, in [8].

Empirical experiments
The purpose of this analysis is to explore the weaknesses and strengths of the selected methods to reduce the complexity of the radio map and identify hidden general patterns with respect to the datasets. All the experiments were executed on a cluster based on Intel Xeon E5-2670 processors, with 128 GB of RAM and GNU Octave 3.8.2. The results have been confirmed using distinct hardware and software.

Evaluation Framework
According to [81], the three main parameters of this model are: the data representation for the RSS values, the distance metric and the value of k. We selected two parameter configurations to test how the clustering models perform on different datasets with different parametrization of k-NN: i) the Simple Configuration with 1-NN, Manhattan distance (also known as City Block or L1 distance) and positive data representation; and the Best Configuration, which resulted from exploring multiple combinations of the main parameters. For setting the Best Configuration, we have considered three data representations, namely positive, exponential and powed [81], eight distance metrics, namely Euclidean, Manhattan, Euclidean 2 , Neyman, Sørensen, LGD, PLGD10, PLGD40 [80,81,83], and k = {1, 3, 5, 7, 9, 11}. We apply this evaluation setup on the 16 datasets introduced in Section 3.3. The results on plain k-NN are shown in Table 4 for each dataset. Table 4 provides the absolute error 3D and cost τ DB for all the datasets, but it also includes the normalized values (˜ 3D andτ DB ). The Simple Configuration on the plain k-NN has been used as the baseline for normalization, so that the normalized results showed on this paper are all relative to it. The last row shows the average over the 16 datasets for extracting general conclusions and further comparisons.
The table also shows three relevant outputs. First, the optimal value of k seems to depend on the local density (δ T in Table 2). For an operational fingerprint, the number of relevant very similar reference fingerprints highly depends on the fingerprint density of the radio map as already suggested in [84]. Therefore, the value of k should be set accordingly (e.g. k = 1 for datasets with low δ T ). Second, selecting the best performing configuration can have a significant impact in both the accuracy and the computational costs. The normalized positioning accuracy,˜ 3D , is halved for some datasets and the normalized computational times, τ DB , drop about 5 %, increase about 15 % and reaches 3 times the value of the normalized computational times for data sets where the Sørensen-, the Euclidean 2 -and Probabilistic Log-Gaussian (namely PLGD10 and PLGD40) [83] distances were used, respectively. That happened under different configurations of k and data representation, which seems to indicate that the impact of the distance metric is constant. i.e., the computational costs are independent to k and the data representation, as clearly shown in database MAN 1. Third, the normalized aggregated metrics provided in the last row show that, in general, selecting the Best Configuration improves the accuracy by 26 % at the expense of increasing the computational burden by 76 %. This computational increase is mainly caused by the six cases where a PLGD distance metric was selected, and it could have been avoided by selecting an alternative configuration with similar accuracy but lower costs (e.g. Sørensen). PLGD includes complex penalty terms and exponential operations [83], whereas Sørensen just adds a dynamic normalization term to Manhattan distance [80]. Despite DSI 2 is the reduced version of DSI 1, the computational cost of DSI 2 is much higher than DSI 1 in the best configuration.

Dataset-wise Analysis
In order to analyse and present the results over the multitude of methods and datasets, we introduce a visualization that allows to depict the four relevant dimensions on once: the datasets, the method to reduce the computational costs, the positioning accuracy and the computation time. This visualization shows the normalized aggregated metrics for each combination of dataset and method, as colored ellipses. The color indicatesτ DB compared to the baseline, where dark green stands for 0, white for 1 and the darker the red the higher the computation time. The shape stands for the˜ 3D , a horizontal ellipse represents values closer to 0, a circle identifies an error of 1 and a vertical ellipse stands for an increased error, compared to the baseline. The methods based on grid clustering, Kushki and c-Means, depend on an additional parameter; which are set for each dataset according to the best error (BE) and best time (BT) as shown in the example Figure 3. This reduces the reported methods and enhances the clarity of the full results shown in  This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TMC.2020.3017176 Copyright (c) 2020 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org. similar behavior is found in datasets LIB 1 and LIB 2, where the evaluation environment had 8 APs which were detected in almost all the RP. Those cases include relative small areas with low attenuation where an important subset of APs are received over all the area and, therefore, the rule proposed by Yu et al. [27] does not reduce the computations. For the Best Configuration, the visual results of the plain k-NN indicate that the alternative distance metrics are, in general, more time-consuming than the Manhattan distance (there are reddish ellipsoids for almost all datasets). The choice of notably expensive distance metrics also affected the computational burden for clustering methods (reddish ellipsoids on Feng Yu and Percentile rules).
BE-Kushki reports an accuracy similar to plain k-NN. However, Figure 4 shows that BE-Kushki is not suitable for TUT databases and UJI 2. They contain many reference points with just 1 fingerprint. In those cases, the number of clusters is so high that the coarse grained search has a computational load similar to the matching stage of the plain k-NN. Moreover, it is not suitable for the SIM dataset, in which all APs are detected in all fingerprints. Therefore, the coarse search never filters the radio map. However, it has the expected behavior in the other datasets, reducing the computational load while keeping the accuracy.
The rules proposed by Marques and Moreira gave good results in both dimensions for many cases. The cases include datasets LIB 2, TUT 1 and TUT 2 for the Best Conf., which reported very high computational cost in plain k-NN. For datasets TUT 6 and TUT 7, it seems that only Moreira3st, MoreiraS03 and MoreiraS06 are suitable. Their low positioning error in the baseline was really hard to improve.
Finally, the results reflects the heterogeneity of datasets. TUT 1 and TUT 2, where the samples lie in a regular 1 m grid, provided the worst accuracy for the baseline. The methods based on grid clustering report either very large error or very large computational time in most of cases. In contrast, TUT 6 provided the lowest accuracy for the baseline and further reductions on its computational cost usually result in large increases in the error. TUT 6 applied a strict systematic data collection without post-processing, in contrast to TUT 1, TUT 3 and TUT 4 that applied grid average or crowdsourcing in the same environment. Furthermore, methods based on grid-clustering never provide good positioning error for UJI 1, which has fingerprints from at least two devices at every reference point. Table 5 shows the aggregated normalized results for all the clustering methods and optimization rules described in Section 3.2. It contains the average normalized error and computational time over the sixteen datasets. The values in parenthesis show the standard deviation of the averaged values. These two metrics have been calculated as shown in the two last rows of Table 4 for the plain k-NN method. Most of the methods have more than one entry in the table since different parameters were tested, BE and BM denotes the best error and best time for each dataset considering different parameter values (see Figure 3).

Method-wise analysis
As before, the baseline corresponds to the results of Simple Configuration in Table 4. Thus, the plain k-NN algorithm without any radio map reduction provides an  This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. averaged normalized error and computational costs of 1 for the Simple Configuration, whereas the mean normalized error is 0.74 (26 % lower than the baseline) and the mean normalized cost is 1.76 (76 % higher than the baseline) for the Best Configuration in Table 5. Figure 5 visualizes these results in a scatter plot. Selecting optimal parameters tends to decrease the positioning error, usually at the expense of higher computational costs. This negative slope can be clearly seen in three blocks: Kushki, methods based on grid-based clustering and other methods. The last groupincluding c-Means, Affinity Propagation and the majority rules-contains the models that achieve a good trade-off between the positioning error and the computational time.
Only c-Means, Affinity Propagation Clustering and the methods based on the strongest AP consistently achieved a normalized accuracy below 1.25 and a normalized computational burden below 0.25 (marked with * in the Table 5).
The computational cost of c-means decreases as the value of c increases, but the accuracy also decreases as shown in Figure 6. The trends reported in the figure also show that the parameters of c-Means (Traditional/Alternative generation of clustering and the value of c) have an impact in the normalized error and computational time. In general, the normalized accuracy of the traditional c-means is worse than the plain k-NN algorithm, whereas the results of the alternative c-means are similar to the plain k-NN method. Among all the methods based on c-means clustering, rfp1-MeansAlt provides a remarkably good compromise between accuracy improvement and time reduction, while also having a low variability.  Table 5. The Affinity Propagation Clustering, while providing a notable reduction of the computational cost, may require significant memory and computational resources for the clustering stage. We had to use the Sparse implementation provided by its developers because of memory shortage problems for large datasets. Even using the Sparse implementation, the time required to compute the centroids was very high, lasting many hours for the largest databases.
The methods based on the strongest AP proposed by Marques and Moreira provided remarkable time reductions while commonly improving the positioning accuracy. The method proposed by Moreira that filters by one AP (Mor-eira1St) provides the best overall computation time, at the expense of slightly increasing the normalized positioning error, to 1.15 and 1.00 for the Simple and Best Configuration, respectively. The more APs are used in the filtering stage (1 or 2 in Marques10, 3 in Moreira3St), the more the positioning error reduces at the expense of a higher computation time. For the version of Moreira that is based on the similarity to the strongest AP, the positioning error and the computation time are well balanced when the threshold value is set to 6 dB (MoreiraS06). The accuracy is close to that of the plain k-NN but the normalized time is reduced by factor 8.
Other methods miscarry either in the error or time improvement. FengYu failed because the radio map reduction was entirely done during the operational phase without any off-line pre-processing. The methods based on the percentile rule kept the positioning error but did not accomplished a notable time reduction. The grid-based clustering methods performed very poorly in terms of error. The clustering proposed by Kushki led to an unexpected high computational time. The possible causes are explained in Section 4.2.2.

DISCUSSION
Some attempts to reduce computational costs were identified in a literature review on FP in the major research data sets. Although most of them relied on k-Means clustering (c-Means in this paper) and Affinity Propagation, other interesting rules were found. We implemented and evaluated the reproducible ones, those that provided enough implementation details. The proposed evaluation framework allowed a deep analysis, which led to the following observations.
The election of some parameters of the FP method, like the distance function to compare the fingerprints, are usually only based on the positioning accuracy. According to our experience and the empirical results, we encourage to ponder the computational cost when selecting the most appropriate distance metrics. The Sørensen distance provides good accuracy with a reasonable increase of the computational costs. In contrast, the metrics based on probabilistic log-Gaussian distance provide the best accuracy in some cases but they have three times the computational cost of the Manhattan distance.
The computations in the on-line stage of clustering and optimization rules mainly depend on the complexity required to reduce the radio map and the resulting number of fingerprints. In some cases, the operations required to reduce the training set can be done off-line, which alleviates the computational costs during the on-line phase as we demonstrated with the optimized FengYu method.
Although some methods promise to reduce the computational load while barely affecting the accuracy, their assumptions are not applicable to many situations. Examples of these assumptions are, for instance, having multiple fingerprints in every reference point or a uniform distribution of fingerprints over the space. Usually, the methods' original evaluation fits the requirements and the system does not present any anomaly. However, they might provide a poorer accuracy when the evaluation is more realistic and includes features breaking those assumptions.
Grid-based clustering is not suggested. Averaging fingerprints only works when the distance between RPs is larger than the cell size and every RP has several fingerprints taken in the same conditions. Averaging should never be used neither when the local fingerprint density is very low not when it includes samples from different sources.
The methods based on the strongest AP proposed by Marques and Moreira, c-Means clustering and Affinity Propagation Clustering (APC) appear to be universally eligible to reduce the computational cost in the on-line stage of FP positioning while keeping an accuracy similar to plain k-NN. However, the computational costs and resources required to generate the clusters with APC might be prohibitive in large datasets.
Regarding c-Means clustering, the variant presented with the alternative initialization creates effective clusters that better group similar fingerprints with respect the traditional initialization. The two-stage search is effective as in the coarse search, the right cluster is selected and, then, the fine-grained search provides the right fingerprints to compute the position estimation. As explained in Section 4.1, the computational cost decreases as the number of clusters increases, which find the best scenario around the proposed heuristic rf p1 (square root of reference samples). However, as a side effect, the positioning error increases. As the number of clusters increases, the probability that the operational fingerprint falls near a boundary between clusters also increase. Selecting the wrong cluster and having reference fingerprints from the same reference position scattered in different clusters are the main causes of this side effect of c-Means with large c values.
In general, none of the methods guarantees that the reduced training sets have all the same size. The generated clusters are generated to group similar fingerprints in the feature (RSS) space, but they may not be equally distributed as the groups depend on distribution of APs, localization of the RPs, the devices used to collect the data and the noise in signal propagation. Similarly, in the optimization rules, the size of the reduced radio map depends also on the coverage of the APs detected on the operational phase. In other words, the computational costs vary depending on the reduced radio map linked to the operational fingerprint.
To sum up, most of the analyzed methods apply an off-line pre-processing stage [85]. It is devoted to create supporting data, e.g. clusters and reduced radio maps, for the on-line phase. That pre-processing is highly relevant for production systems, since it might avoid unnecessary calculations in the on-line phase that degrade a system's scalability. However, this pre-processing stage is not negligible, like in the case of Affinity Propagation Clustering where it took several hours for large radio maps on our hardware.
Finally, some of the analyzed methods could not be implemented due to the lack of details in the publications introducing them. Lack of details or procedures to set some parameters are the most common issues we faced when implementing the methods found in the literature. The Indoor Positioning community should promote the diffusion and communication of new methods ensuring reproducible research, e.g., publishing the code and data in a public repository. Also, different scenarios (e.g., through datasets) should be considered for generalization purposes.

CONCLUSION
This paper presents a comparison of different clustering and optimization rules to reduce the computational burden of Wi-Fi fingerprinting (FP) methods. Although researcher have already published partial results, they cannot be compared as the evaluation scenarios and metrics differ. Also, the results are usually restricted to one deployment (research facilities or small area) and cannot be generalized. The Indoor Positioning community needs an evaluation framework similar to the one used in Machine Learning with multiple datasets.
To the best of our knowledge, no other work has used an evaluation framework as comprehensive as the one we have presented here, which includes two aggregated normalized metrics and 16 datasets. We implemented 15 methods, testing several parameter values for some of them, to perform an empirical comprehensive comparison. An evaluation framework with heterogeneous datasets not only allowed us to generalise better on the evaluation metrics, but also enables the research community to compare new methods against a large set of deployments through datasets.
Balancing the general accuracy and the general computational costs, MoreiraS06 could be appointed as the best model within the analysed methods. Moreover, this model somehow benefits from the fact that two fingerprints sharing the same strongest AP with similar RSS value should be close in the space, which reduces the computational costs of FP without degrading its accuracy. In general, the methods based on the strongest AP proposed by Marques and Moreira, c-Means clustering and Affinity Propagation Clustering appear to work well in all considered scenarios, and thus are likely to be universally eligible. However, the cluster generation of Affinity Propagation is the most demanding one, requiring several hours or facing some execution issues in the largest datasets. The problem of averaging fingerprints with different features makes the methods based on clustering to be the best choice for single-device datasets with large local density, whereas the methods based on the strongest AP are mostly suited for datasets with low local density or collected by diverse devices.
The efficacy of reducing the computational costs of the on-line stage depends not only on the clustering or optimization method itself but also on the number APs and the spatial distribution of the fingerprints. A developer of a FP system should keep this in mind. Bold assumptions and requirements during the evaluation stage are not encouraged. Moreover, the way of implementing an algorithm may affect the computational costs. We should move as much as possible computation to the off-line pre-processing stage.
As further work we will proceed on the implementation of the FP models in other computer languages, including C (assembler) which is used for efficient implementations in embedded devices with few resources, to test, for instance, their feasibility for mobile apps. Moreover, our long-term objective is to settle the best practices for evaluating indoor positioning systems with multiple scenarios and datasets in order to ensure reproducible research. Our contribution to this goal starts here by making available the datasets and open-source code used in this work. The community can extend the proposed evaluation setup with their datasets that consider more realistic propagation models, devices not included in this work (new smartphones or computing devices) or, even, disruptive data collection strategies.