UJIIndoorLoc-Mag: A new database for magnetic field-based localization problems

Indoor localization is a key topic for mobile computing. However, it is still very difficult for the mobile sensing community to compare state-of-the-art Indoor Positioning Systems due to the scarcity of publicly available databases. Magnetic field-based methods are becoming an important trend in this research field. Here, we present the UJIIndoorLoc-Mag database, which can be used to compare magnetic field-based indoor localization methods. It consists of 270 continuous samples for training and 11 for testing. Each sample comprises a set of discrete captures taken along a corridor with a period of 0.1 seconds. In total, there are 40,159 discrete captures, each of which contains features obtained from the magnetometer, the accelerometer and the orientation sensor of the device. The accuracy results obtained using two baseline methods are also presented to show the suitability of the database for further comparisons.


I. INTRODUCTION
Many real-world applications need to know the location of a user to provide their services. Automatic user localization consists of estimating the position of the user by using an electronic device, usually a mobile phone. The outdoor localization problem can be solved very accurately thanks to the inclusion of GPS sensors in mobile devices. However, GPS has severe problems in indoor environments. Many different approaches have tried to solve the problem of indoor positioning in recent years. They can be categorized, according to [1], as infrastructure-based (RFID, infrared, ultrasound, Bluetooth) and infrastructure-less (Wi-Fi [5], FM radio frequencies [12], magnetic field [3][4]) technologies.
The use of the Earth's magnetic field for indoor localization is an interesting infrastructure-less method that has attracted the attention of many researchers in recent years. Indoor environments contain structures (ferrous structural materials, pipes, wires, etc.) that alter the Earth's magnetic field. Even the presence of everyday objects, such as metallic stoves or speakers, may alter the magnetic flux density in the surrounding areas [8]. In fact, the measured magnetic field can vary substantially between two points that are very close in space. Thanks to that, sub-meter accuracy can theoretically be achieved for indoor location. The variations in the magnetic field in indoor environments can be measured and recorded with the sensors available in smartphones [3,7].
Although there are many papers in the literature that try to solve the indoor localization problem using a magnetic field-based method, there is still one important drawback in this field: the lack of a common database for comparing methods. Each approach presents its estimated results using its own database. Under these conditions, it is not possible to compare different methods, since the particularities of each experiment are hardly reproducible. In the Pattern Recognition and Machine Learning research fields, the common practice is to test the results of each proposal either using a well-known dataset or providing the dataset used. In this way, researchers are able to fairly compare different methodologies in the literature. The UCI Machine Learning Repository [6] is a well-known example in this sense. In fact, there is an available database for comparing WLAN fingerprint-based indoor localization methods [2]. However, no such database exists for magnetic field-based indoor localization.
The main contribution of this work is the creation and the introduction of the UJIIndoorLoc-Mag database, which is the first publicly available database that could be used to make comparisons among different methods in this field. It has been published on the UCI Machine Learning Repository [6]: http://archive.ics.uci.edu/ml/datasets/UJIIndoorLoc-Mag .
The database consists of 281 continuous samples (270 for training and 11 for testing) taken in our 260 m² (approx. 15x20 m) laboratory. Each sample comprises a set of discrete captures taken along the 8 corridors (including intersections) of the laboratory with a period of 0.1 seconds. There are just over 40,000 discrete captures (40,159) obtained from the magnetometer, accelerometer and orientation sensor of a mobile phone.
Two basic baselines are also presented to test the suitability of UJIIndoorLoc-Mag, and they can be considered a simple starting point for further comparisons. We do not expect to obtain high accuracy with the baselines, since our aim is to test the suitability of the database rather than to introduce a new indoor positioning system. The rest of the paper is organized as follows. Section II presents the related work. Section III shows some prior tests we performed on magnetic field indoor positioning. Section IV introduces the main elements of the database and how it was made. Section V explains the two baseline algorithms tested and the results obtained. Finally, Section VI presents the most important conclusions drawn from this work.

II. RELATED WORK
As mentioned above, there are many papers in the literature dealing with magnetic field-based methods for indoor localization problems. Some of them are reviewed in this section [3,4,7,8,9,10]. We focus on the datasets used for testing the proposed algorithms and we also indicate whether they are publicly available or not.
Four experiments were done in [3] to demonstrate the feasibility of using the magnetic field for positioning. In the first one, data were collected at one specific location in six different environments. In the second one, data were collected in five overlapping corridors. In the third one, data were collected at the intersections of two different square, regular grids. In the last one, magnetic field changes in the vertical direction were studied with a resolution of 5 cm.
Although the experiments and results were detailed, some basic information about the databases was not reported.
The experiment presented in [7] took place in a rectangular, 67x12 m² corridor whose surroundings included spaces such as a lab, an office, and a library. They therefore considered an environment of 4 linear corridors, where the distance between parallel corridors was large (12 m and 67 m). Moreover, data were collected statically at 45 cm intervals, with 10 seconds spent at each location. Their training database consisted of approximately 350 samples with 5 features: location (x, y) and magnetometer values in the three axes. However, information about the collected data and their magnitudes was not described.
In [8], the authors demonstrated that geomagnetic localization performs reasonably well when the three components of the magnetic field (X, Y, and Z axes) are considered. They tested their positioning system in three different environments: a suburban house, a city-center apartment and a university lab. Data were collected as the magnetic flux density at 1 m spacing. Moreover, they also conducted a magnetic fingerprint test in a 3.5x3.5 m² bedroom. However, they did not detail the number of samples.
In [9], the authors selected a corridor of a multi-level building to evaluate the performance of using geomagnetic field information for positioning with four different devices. The corridor was about 36 m in length and 2 m wide. Samples were taken along the corridor at three different positions: 1) centered, 2) 60 cm to the left of the corridor center and 3) 60 cm to the right of the corridor center. A total of 20 points were used for testing purposes. Their scenario was narrow and realistic, because it was composed of three different parallel paths in a 2 m wide corridor.
An indoor location system based on a wearable device was successfully introduced in [10]. The system was tested in two very different environments: a 187 m corridor loop scenario (37200 training samples and 310 test data points) and an atrium scenario (40800 training samples and 408 test data points). They also examined the fingerprint difference between floors using a dataset with 60 points from each floor. They used a special device with four magnetometer sensors for sampling the magnetic fingerprints, so vectors consisted of 12 elements.
Table I summarizes the databases used in the previously reviewed works [3,7,8,9,10]. We have identified three different types of databases (groups 1, 2 and 3 in the table) according to how samples have been taken: 1) continuous samples taken in a linear environment (such as a corridor), 2) discrete samples taken in a linear environment, and 3) discrete samples taken in a two-dimensional space. Please note that a single continuous sample corresponds to a sequence of consecutive discrete samples taken in a linear environment.
Notes to Table I. In group 1 (continuous samples), length corresponds to the number of individual measures taken in a single continuous sample.
* 3 components of the magnetometer.
** 3 components of the magnetometer + 3 components of the orientation + 3 components of the accelerometer + timestamp.
*** m stands for the number of corridors or samples. For each corridor/segment in the trajectory we store the XY coordinates of initial and final points and the indexes of the initial and final samples.
**** The authors commented that there were 100 datasets (not samples).
Although the soundness of the results and conclusions presented in all those contributions is high, the databases employed were not fully detailed and their access was restricted (not public) in all the studied cases. For instance, the information about how locations are stored is not always provided. This information is described in some works (such as [7]), but it is omitted in the majority of contributions (such as [3,8,9,10]). To denote that this information was not provided, we used loc to refer to location in Table I.
Although the number of continuous samples used in the experiments seems to be low (5 in [3] and 3 in [9]), the length of the vectors was high enough to perform the experiments. However, our database contains more information than theirs: it includes 270 continuous samples (35,779 discrete samples) for training and 11 complex continuous samples (4,380 discrete samples) for testing. In our case, we consider not only corridors but also combinations of two connected corridors (turns from one corridor into another).

III. PRIOR TESTS
Prior to generating the database, we performed some basic tests to determine the feasibility of using the Earth's Magnetic Field for indoor positioning using mobile phones. Moreover, we also gathered information about the features to be stored.

A. Is it feasible to use the magnetic field for location?
First, we selected two simple trajectories in our laboratory (see Fig. 1). The first one consists of two segments: the user comes into the laboratory and goes straight on until reaching the top-side windows, then turns right and goes straight on until reaching the right-side windows. The second one is a simpler scenario where the user goes straight on through a corridor.
The first test consisted of recording the values provided by the magnetometer. This first test was repeated 5 times. It is important to mention that the sensor provides a vector that corresponds to the strength and direction of the magnetic field. This vector is relative to the mobile device, as shown in Fig. 2, and the values are measured in microtesla (µT). The example vector shown in the figure represents a magnetic field with a strength of 46.669 µT pointing at 45 degrees to both the Y-axis and the Z-axis of the device.
The sampling frequency was set to 10 samples per second, which balances computational cost, energy consumption and time series resolution. At first sight, the plots of the first trajectory show that the magnetometer values and curves are similar for the five runs. However, when the results are shown in more detail (second trajectory), we can see that the magnetic values are not exactly the same across the five runs, although their differences are low, about 5 µT.
However, it is not trivial to detect a change in the user's orientation (a turn) with the information provided by the magnetometer. Therefore, we decided to record raw data from the orientation sensor too. The orientation sensor provides the direction vector, and its values are measured in degrees. This vector is also relative to the mobile phone (see Fig. 2). Fig. 4 shows the orientation of the device for the two trajectories. In this case, both plots also have different scales. Moreover, we show a simplified orientation instead of the vector values for visualization purposes. According to Fig. 4 (left), there was a significant change in the user's orientation in the first trajectory (about 90º) because the user made an L-turn, whereas the changes in the user's orientation in the second trajectory (see Fig. 4, right) should be considered insignificant (about 5º) and may be due to the user's movement.
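As a small worked check of the example above, the following sketch computes the strength of a magnetometer reading from its three components. The component values (0, 33, 33) µT are hypothetical: they are not given in the text and were chosen only because they reproduce a 46.669 µT field at 45 degrees to the Y-axis and Z-axis.

```python
import math


def field_magnitude(mx, my, mz):
    """Return the strength (in µT) of a magnetometer reading given in µT."""
    return math.sqrt(mx * mx + my * my + mz * mz)


# Hypothetical reading: 33 µT on the Y and Z axes, nothing on the X axis.
example = (0.0, 33.0, 33.0)
magnitude = field_magnitude(*example)

# Angle between the field vector and the device Y axis, in degrees.
angle_to_y = math.degrees(math.acos(example[1] / magnitude))

print(f"{magnitude:.3f} µT at {angle_to_y:.1f} degrees to the Y axis")
```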

B. Is it necessary to include accelerometer values?
The second test consisted of storing the magnetic field values provided by the magnetometer along with the values provided by the accelerometer. The latter sensor provides a vector with the acceleration values expressed in m/s². Those values have been processed to remove the gravity force and therefore to obtain an estimation of the user's real movement.
In particular, we recorded the magnetometer and processed accelerometer values along a corridor. We repeated this test under three different speed conditions. In the first one, the user was walking slower than usual. In the second one, the user was walking at a normal speed. In the third one, the user was walking faster than usual, without reaching running speed. Fig. 5 shows the combination of magnetic values and the processed accelerometer values on the Y-axis. We found that this axis was representative enough to detect the user's steps, and therefore to estimate the speed.
First of all, the shape of the magnetic curves in the three axes may be considered similar for the three different cases. However, the horizontal scale (time) varies significantly across the three configurations. In the first case (slow speed), the time required to capture values along the trajectory was approximately 12 seconds (121 samples), whereas the time was halved in the third case, with the fastest speed.
We consider that there may be two alternatives to deal with the user's speed in indoor positioning. The first one consists of resampling the training or the operational samples to allow their comparison. Resampling is the procedure of dilating or compressing the sequence of discrete captures so that both have the same spatio-temporal resolution. The other alternative consists of mapping the scenario under several different speed configurations and using an advanced method to determine the speed configuration at the operational stage. The appropriate training samples could then be selected from the full training/reference set depending on the user's speed.
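The first alternative can be illustrated with a minimal sketch. It assumes linear interpolation between neighbouring captures, which is only one possible resampling scheme and is not prescribed by the text; the function name and the short input sequence are hypothetical.

```python
def resample(values, target_len):
    """Linearly resample a 1-D sequence of captures to target_len points."""
    if target_len < 2 or len(values) < 2:
        raise ValueError("both sequences need at least two points")
    # Distance, in source indices, between two consecutive output points.
    scale = (len(values) - 1) / (target_len - 1)
    out = []
    for j in range(target_len):
        pos = j * scale
        lo = min(int(pos), len(values) - 2)  # left neighbour in the source
        frac = pos - lo                      # weight of the right neighbour
        out.append(values[lo] * (1.0 - frac) + values[lo + 1] * frac)
    return out


# Hypothetical one-axis readings from a fast walk (4 captures) dilated to
# the length of a slower walk over the same path (7 captures).
fast_walk = [0.0, 2.0, 4.0, 6.0]
stretched = resample(fast_walk, 7)
compressed = resample(stretched, 4)
```

Applying the same function to each of the three magnetometer axes would bring a test sequence and a reference sequence to a common length before they are compared.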

C. Lessons learnt from prior tests
After performing some prior tests, including the ones shown in this section, we decided that data from magnetometer, accelerometer and orientation sensors should be included in the proposed public database. Researchers may combine all this information in order to improve the indoor positioning systems. For example, the user's speed, turns, and other common situations could be estimated, and this new information could benefit Indoor Positioning Systems' (IPS) accuracy.
Moreover, we also considered it important to record the exact moment at which each discrete sample was taken. We detected that some minor delays could be introduced between two consecutive samples. This timing information may also be useful for further spatio-temporal analysis, such as 'is time a factor to consider for magnetic field-based indoor location?'. However, this kind of question is out of the scope of this work. Although the analysis of this information is complex, some information can be extracted by interpreting the different plots. For instance, the user turned to the left and then turned to the right in testing trajectories number 5, 7 and 10 (see Fig. 8). The user reduced the speed between the first and second turns, possibly to avoid an "obstacle", because there were some people in the middle of the corridor. Moreover, the two consecutive turns produced an abrupt change in the magnetic field in the three axes.
Most of the situations that may occur in an indoor environment (e.g. the presence of people and other obstacles in a corridor) should be considered while mapping it. Turns, including L-Turns and U-Turns, should be mapped to have a complete reference database, because the IPS's accuracy may depend on the situations recorded in the reference database. If turns were not considered in the mapping procedure (generation of the reference databases), we would be unable to detect them only with the magnetic data at the operational stage.
Thus, the most important lesson, which we learned from the prior testing experiments, was that having a good reference dataset was essential to develop an accurate Indoor Positioning System based on Earth's Magnetic Field. Therefore we planned to map our laboratory considering all possible natural turns (see Section IV).
Here we publish a dataset in which the time and the values from the magnetometer, accelerometer, and orientation sensor have been recorded. The mapping procedure took some time to plan and develop. Our principle while collecting data was therefore to record as much information as possible according to our current knowledge. Data that are not useful can be removed or omitted by the location algorithm.

IV. THE UJIINDOORLOC-MAG DATABASE
This section introduces the main features of the UJIIndoorLoc-Mag database. All the samples were taken in our 260 m² laboratory, which is composed of 8 corridors.
In this office, bookcases and desktop tables are the elements that separate the corridors, as shown in Fig. 1. The laboratory is located on the fifth floor of the Espaitec-2 building on the Universitat Jaume I campus.

A. General description
The database contains mapping samples along the 8 corridors and all the intersections between two corridors. We consider that mapping "intersections" could make a more robust reference database, so we recorded the sensor values while the user was turning from one corridor into another. The 8 corridors and 19 intersections were mapped in two different directions with a Google Nexus 4 running Android 5.0.1. As a result, there were 54 different alternative paths. Sampling on every path was repeated 5 times, so the database designed for training purposes is composed of a total of 270 different continuous samples.
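The counts above can be verified with a few lines of arithmetic. The split into 80 single-corridor files and 190 two-corridor files is the one reported in Section IV-B; the constant names are only illustrative.

```python
CORRIDORS = 8
INTERSECTIONS = 19
DIRECTIONS = 2    # every path is walked in both directions
REPETITIONS = 5   # every path is sampled 5 times

paths = (CORRIDORS + INTERSECTIONS) * DIRECTIONS
line_files = CORRIDORS * DIRECTIONS * REPETITIONS       # "lines" group
curve_files = INTERSECTIONS * DIRECTIONS * REPETITIONS  # "curves" group
training_samples = paths * REPETITIONS

print(paths, line_files, curve_files, training_samples)
```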
We used Android devices since they allow full access to sensors and they dominate the mobile phones market with, approximately, 78% of share.
Our mapping process captured the data coming from three different sensor sources: magnetometer, accelerometer and orientation sensor. The first source provided the raw data of the magnetometer in the three axes (X, Y, and Z). The second source provided the raw data of the accelerometer, also in the three axes, minus the gravity force. The last one represented the orientation as the angle of rotation in the three axes. The user was moving from a starting point to an ending point while capturing data, and data were collected every 0.1 s, so continuous magnetic fingerprints were stored. Each continuous sample contains the coordinates of the initial and end points, and also the coordinates of all turning points when capturing intersections. Moreover, it contains n discrete captures, each one with the 9 above-mentioned features plus the timestamp. With the initial/turning/end positions and the timestamps, it is possible to calculate the position of the discrete samples, since the user's speed was almost constant while capturing the magnetic field values.
The mapping process was performed with an Android application that has direct access to the sensors' data. The user's role in the application is to indicate the zones in which the data capture will be performed. Initially, the application shows a map centered on the user's approximate current location, as provided by the GPS sensor. Then, the user draws the trajectory that he/she wants to follow to capture the data (see Fig. 7-A). This trajectory can consist of a path in a single corridor or in several ones. The user needs to stand at the starting point of the route and then, after clicking the "Start Recording" button, the app collects data until the user reaches the ending point and clicks the "Tap-at-End" button (see Fig. 7-C). In the case of a multi-corridor path, the user has to press the "Tap at Turning i" button to indicate that he/she is at the i-th intersection (see Fig. 7-B).
Fig. 7. Map of the lab where samples were taken and three screenshots of the Android application used to capture the data. A: the path along which the data capture is going to be done and the "Start Recording" button. B and C: the current segment where the user is walking is highlighted in red; the user has to press the button (Turning in B or End in C) when she/he arrives at the first intersection (B) or the final destination (C).
For testing purposes, 9 complex routes (see Fig. 8) along the laboratory were mapped. Each of these routes traverses different corridors and follows a different trajectory. Two of them were mapped with two different mobile devices, the above-mentioned Nexus 4 and an LG G3 smartphone with Android 5.0. So, a total of 11 complex continuous samples are available for testing purposes.
Our approach provides a geomagnetic database, which contains information about continuous recordings from one or two corridors (training) and multiple corridors (testing). The amount of data stored in each sample is proportional to the time needed to complete an established path, due to the sampling period of 0.1 seconds. So, the data provided by the accelerometer, the magnetometer and the orientation sensor of the device are stored 10 times per second. For example, if it takes 12 seconds to map a corridor, the corresponding continuous sample will have 1200 values (12 s × 10 discrete captures per second × 10 features).
Please note that the 11 testing trajectories are complex and were taken in more than one corridor. Although the 8th and 9th trajectories are placed in a single-corridor scenario, they may also be considered multi-corridor since a U-turn (180º) is made in them. In those two trajectories, two different directions in the corridor are considered, so they cannot be considered pure single-corridor trajectories.
Due to the complexity of data recorded, each training and testing continuous sample has been stored as an independent text file, whose description is detailed in Section IV-B.
The continuous mapping we have performed may provide accurate positioning. All the paths, intersections and turnings have been mapped with very high precision. Moreover, since a person walking under normal conditions covers about 1.39 m per second, our approach captures data approximately every 0.139 m, which means that the spatial resolution along the path is very high. In UJIIndoorLoc-Mag, the users walk at a normal speed through single- and multi-corridor trajectories without any obstacle. Although the research group members and other researchers were present in the office, nobody stood in the corridors.

B. Description of database files
The database consists of 281 continuous samples: 270 for training and 11 for testing. They have been stored as independent text files. The training files are grouped into two main categories, "lines" and "curves".
The "lines" group has 80 files, which correspond to the single-corridor case. The filename format is "lXX_ZZ.txt", where XX stands for the corridor number and orientation (n or r) and ZZ stands for the repetition. Example: l3r_03.txt
The "curves" group has 190 files, which correspond to all possible trajectories considering two connected corridors only. The filename format for that group is "cXXYY_ZZ.txt", where XX and YY stand for the corridor number and orientation of the first and second corridors in the two-corridor trajectory, and ZZ stands for the repetition. Example: c5n1r_05.txt
The filename format of the testing files is "ttPP.txt", where PP stands for the complex testing trajectory number (see Fig. 8).
In the file structure, n is the number of samples collected in the trajectory with a period of 0.1 seconds and m is the number of segments (corridors) in the trajectory. Each sample contains the timestamp ts and the values from the magnetometer, accelerometer and orientation sensors in the three axes, which are denoted by mx, my, mz, ax, ay, az, ox, oy and oz. According to this structure, the text files are composed of two well-differentiated parts separated by the row indicating the number of segments in the trajectory: 1) the sequence of discrete samples taken during the trajectory mapping, and 2) the configuration data.
In the example file, latitude and longitude coordinates have been truncated to 5 decimals for representation purposes. Three segments compose this particular example, so the trajectory is defined by four points (the initial point, the two intersections and the final point). The mapped lengths of the first and second segments are similar, and the third segment is slightly shorter.
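The naming scheme can be decoded programmatically. The sketch below is a hypothetical helper, not part of the database distribution. It assumes single-digit corridor numbers (there are 8 corridors) and two-digit repetition and trajectory numbers, as suggested by the examples l3r_03.txt and c5n1r_05.txt. The orientation letter is returned as-is because its meaning is not spelled out in the text.

```python
import re

# Patterns derived from the filename formats described in the text.
# The digit counts are assumptions based on the two published examples.
LINE_RE = re.compile(r"l(\d)([nr])_(\d{2})\.txt")
CURVE_RE = re.compile(r"c(\d)([nr])(\d)([nr])_(\d{2})\.txt")
TEST_RE = re.compile(r"tt(\d{2})\.txt")


def parse_filename(name):
    """Return a dict describing a database filename, or None if unknown."""
    m = LINE_RE.fullmatch(name)
    if m:
        return {"kind": "line",
                "corridors": [(int(m.group(1)), m.group(2))],
                "repetition": int(m.group(3))}
    m = CURVE_RE.fullmatch(name)
    if m:
        return {"kind": "curve",
                "corridors": [(int(m.group(1)), m.group(2)),
                              (int(m.group(3)), m.group(4))],
                "repetition": int(m.group(5))}
    m = TEST_RE.fullmatch(name)
    if m:
        return {"kind": "test", "trajectory": int(m.group(1))}
    return None


print(parse_filename("c5n1r_05.txt"))
```

Such a helper makes it straightforward to select, for instance, only the "lines" files when reproducing the baselines of Section V.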

V. BASELINE
Two very simple baseline methods have been developed and tested to provide a starting point that any more sophisticated indoor localization algorithm should be able to overcome.
The first one uses a discrete method to obtain the position of the discrete test points obtained from the continuous test samples. The second one uses a continuous method that obtains the position of the user taking into account 5 seconds of data instead of simple discrete samples. Both algorithms only use the training samples taken on the 8 corridors and from the magnetometer. The 190 two-corridor continuous samples were not used for training purposes in the baselines.

A. Discrete method
For each continuous sample, the location of each discrete capture can be easily estimated, since the coordinates of the initial and final points of the path are known, the timestamps were recorded and the user's velocity was almost constant.
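This estimation amounts to a linear interpolation in time between the two known end points. A minimal sketch for a single-corridor sample follows. The function name, the millisecond timestamps and the coordinates are hypothetical, and the constant-speed assumption is the one stated in the text.

```python
def interpolate_positions(timestamps, start, end):
    """Assign a (lat, lon) pair to every capture, assuming constant speed.

    timestamps: capture times in any consistent unit, in increasing order.
    start, end: (lat, lon) of the initial and final points of the path.
    """
    t0 = timestamps[0]
    span = timestamps[-1] - t0
    positions = []
    for t in timestamps:
        # Fraction of the path covered when this capture was taken.
        frac = (t - t0) / span if span else 0.0
        positions.append((start[0] + frac * (end[0] - start[0]),
                          start[1] + frac * (end[1] - start[1])))
    return positions


# Hypothetical sample: 5 captures, one every 100 ms, along one corridor.
timestamps_ms = [0, 100, 200, 300, 400]
start_point = (39.99300, -0.06900)
end_point = (39.99310, -0.06880)
positions = interpolate_positions(timestamps_ms, start_point, end_point)
```

For a multi-corridor sample, the same function could be applied segment by segment, using the turning points and the sample indexes stored in the configuration part of each file.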
All the discrete captures extracted from the continuous training samples of the corridors are used as the training dataset, where each element consists of 5 features: the location where the capture was taken [lat, lon] and the measurement obtained by the magnetometer in this location [mX, mY, mZ]. The same procedure has been performed to extract the discrete captures from the test paths. In total, there are 8943 samples for training and 4380 for testing.
The k-NN algorithm [11] with k = 1 has been used to estimate the location of each test sample, so the location assigned to a test sample is that of the most similar sample in the training set. Although other distance or similarity metrics could have been used [12,13], the distance between two samples, $m_1 = [m_{X,1}, m_{Y,1}, m_{Z,1}]$ and $m_2 = [m_{X,2}, m_{Y,2}, m_{Z,2}]$, is the Euclidean distance, computed as follows:
$$d(m_1, m_2) = \sqrt{(m_{X,1} - m_{X,2})^2 + (m_{Y,1} - m_{Y,2})^2 + (m_{Z,1} - m_{Z,2})^2} \tag{1}$$
Table II shows the baseline results for the discrete method. The positioning error corresponds to the mean distance between the actual position and the predicted position. The distance between two positions is not computed as the Euclidean distance, because the points are given as latitude (lat) and longitude (lon) coordinates in decimal degrees and are not expressed in linear meters. So, the haversine formula, eq. (2), is used instead. The standard error of the mean is also shown in the table.
$$d_{hav} = 2 R \, \operatorname{atan2}\left(\sqrt{a}, \sqrt{1 - a}\right) \tag{2}$$
where R is the radius of the Earth, approximately 6373 km, and:
$$a = \sin^2\left(\frac{lat_2 - lat_1}{2}\right) + \cos(lat_1) \cos(lat_2) \sin^2\left(\frac{lon_2 - lon_1}{2}\right)$$
with all angles expressed in radians.
In general, the mean positioning error using the discrete method is 7.23 ± 0.38 m. This general error has been calculated considering the mean results in the 11 testing paths of Table II.
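The discrete baseline can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the authors' code. The training list and the query reading are hypothetical, the haversine distance is written in its standard atan2 form, and the Earth radius is the 6373 km value quoted in the text.

```python
import math

EARTH_RADIUS_M = 6373000.0  # 6373 km, the radius quoted in the text


def euclidean(m1, m2):
    """Euclidean distance between two 3-axis magnetometer readings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(m1, m2)))


def haversine_m(p1, p2):
    """Distance in meters between two (lat, lon) points in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def predict_1nn(train, reading):
    """Return the location of the training capture closest to `reading`.

    train: list of ((lat, lon), (mx, my, mz)) pairs.
    """
    return min(train, key=lambda item: euclidean(item[1], reading))[0]


# Hypothetical training set with two captures, and one test reading.
train = [((39.99300, -0.06900), (10.0, 20.0, 30.0)),
         ((39.99310, -0.06900), (40.0, 5.0, -12.0))]
predicted = predict_1nn(train, (38.0, 6.0, -10.0))
error_m = haversine_m(predicted, (39.99309, -0.06900))
```

Averaging `error_m` over all test captures of a path would give a per-path mean error comparable in form to the values reported in Table II.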

B. Continuous method
For the continuous case, each continuous training sample is divided into several subsamples of 5 seconds each. For instance, if a sample is 10 seconds long and has 100 discrete samples, then it is divided into 6 overlapping continuous subsamples, which cover the discrete captures 1-50, 11-60, ..., 51-100. Each overlapping subsample includes information about the location of the initial and final points of the sub-path, and the 50 captures of the three measured components of the magnetic field.
All the subsamples extracted from the training samples of the corridors are used as the training dataset. The test samples are also divided into subsamples of 5 seconds. All the subsamples extracted from the test paths are used as the test dataset. In total, there are 540 subsamples for training and 231 for testing. For each test subsample, a 1-NN-based method (similar to the one introduced for the discrete case) is used to look for the most similar training subsample.
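The subdivision can be sketched as a sliding window. The window length of 50 captures follows from the 5-second duration and the 0.1-second period. The step of 10 captures (1 second) is an assumption inferred from the example, since it is the uniform step that yields 6 subsamples from 100 captures.

```python
def sliding_windows(captures, window=50, step=10):
    """Split a sequence of captures into overlapping fixed-length windows.

    The step of 10 captures is an assumption inferred from the example in
    the text (100 captures give 6 subsamples of 50 captures each).
    """
    last_start = len(captures) - window
    return [captures[i:i + window] for i in range(0, last_start + 1, step)]


# A 10-second sample, with captures represented by their index.
sample = list(range(100))
subsamples = sliding_windows(sample)
print(len(subsamples), subsamples[0][0], subsamples[-1][-1])
```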
The distance between two continuous subsamples, $vm_1 = [vm_{X,1}, vm_{Y,1}, vm_{Z,1}]$ and $vm_2 = [vm_{X,2}, vm_{Y,2}, vm_{Z,2}]$, is also based on the Euclidean distance, and it is given by the following equation:
$$D(vm_1, vm_2) = \sum_{i=1}^{N} d(vm_1[i], vm_2[i]) \tag{3}$$
where vm[i] is the i-th element of the vector vm, d corresponds to the Euclidean distance (see eq. (1)), and N is the number of discrete captures in each continuous subsample. In our case, N = 50, since each continuous subsample contains 50 discrete samples. Table II also shows the baseline results for the continuous method, in the same way as for the discrete method. In this case, the mean positioning error (considering the 11 different testing paths) is lower: 6.05 ± 0.43 m.
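A minimal sketch of the continuous matching step follows. It assumes that the per-capture Euclidean distances are simply summed over the N captures. Any constant normalization, such as dividing by N, would not change which training subsample is nearest. The labels and the three-capture subsamples are hypothetical and kept short for readability.

```python
import math


def capture_distance(m1, m2):
    """Euclidean distance between two 3-axis magnetometer captures."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(m1, m2)))


def subsample_distance(vm1, vm2):
    """Sum of capture-wise distances between two equal-length subsamples."""
    if len(vm1) != len(vm2):
        raise ValueError("subsamples must contain the same number of captures")
    return sum(capture_distance(a, b) for a, b in zip(vm1, vm2))


def nearest_subsample(train, query):
    """Return the label of the training subsample closest to `query`.

    train: list of (label, subsample) pairs; a label could hold the
    initial and final locations of the sub-path.
    """
    return min(train, key=lambda item: subsample_distance(item[1], query))[0]


# Hypothetical subsamples with N = 3 captures instead of 50.
train = [("corridor A", [(10.0, 0.0, 0.0), (11.0, 0.0, 0.0), (12.0, 0.0, 0.0)]),
         ("corridor B", [(30.0, 5.0, 0.0), (31.0, 5.0, 0.0), (32.0, 5.0, 0.0)])]
query = [(30.0, 5.0, 1.0), (31.0, 5.0, 1.0), (32.0, 5.0, 1.0)]
label = nearest_subsample(train, query)
```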