Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data

Data plays a key role in the design of expert and intelligent systems and therefore, data preprocessing appears to be a critical step to produce high-quality data and build accurate machine learning models. Over the past decades, increasing attention has been paid to the issue of class imbalance, which is now a research hotspot in a variety of fields. Although the resampling methods, either by under-sampling the majority class or by over-sampling the minority class, stand among the most powerful techniques to face this problem, their strengths and weaknesses have typically been discussed based only on the class imbalance ratio. However, several questions remain open and need further exploration. For instance, the subtle differences in performance between the over- and under-sampling algorithms are still poorly understood, and we hypothesize that they could be better explained by analyzing the inner structure of the data sets. Consequently, this paper investigates and illustrates the effects of the resampling methods on the inner structure of a data set by exploiting local neighborhood information, identifying the sample types in both classes and analyzing their distribution in each resampled set. Experimental results indicate that the resampling methods that produce a higher proportion of safe samples and a lower proportion of unsafe samples also achieve better classification performance.


Introduction
Organizations are nowadays focused on exploiting, for competitive advantage, the vast amounts of data generated from many sources and in multiple formats. To this end, expert and intelligent systems are developed to make decisions based on insights extracted from the data sets. Since the potential of these systems relies on the quality of data, preprocessing becomes one of the most critical and labor-intensive stages in their development.
In many real-life applications, the data sets are typically imbalanced, which has been described as a challenging problem and the subject of several research efforts. A binary data set is said to be imbalanced if one of the classes is represented by a very small number of examples compared to the other class. By convention, the examples of the minority class are labeled as positive and those of the majority class are called negative.
It has been observed that class imbalance may cause an important deterioration of the performance attainable by most standard classifiers because they are strongly biased towards the classification of the negative examples and are not competent enough to classify the minority class correctly (Branco et al., 2016). However, the poor accuracy of existing models on positive examples could be attributed not only to class imbalance but also to a variety of factors, such as noisy data, class overlapping, lack of density and small disjuncts (He & Garcia, 2009; López et al., 2013). This means that class imbalance may not be a problem by itself and countering class imbalance will not always lead to an improvement in performance (García et al., 2008; Japkowicz, 2003).
A large number of strategies have been proposed to deal with the class imbalance problem, which can be mainly grouped into three categories (Haixiang et al., 2017;Krawczyk, 2016). One is to assign distinct costs to the misclassifications on each class in such a way that an error made on the minority class will be more costly than an error made on the majority class. The second strategy is to preprocess the imbalanced data, either by enlarging (over-sampling) the minority class and/or shrinking (under-sampling) the majority class until the classes are approximately equally represented. The third group consists in internally biasing the discrimination-based process to compensate for the class imbalance.
The resampling techniques have probably been the most investigated because they are independent of the underlying classifier and can be easily implemented for any problem (Estabrooks et al., 2004; García et al., 2016; Weiss, 2004). The level of imbalance is reduced in both over- and under-sampling algorithms, with the assumption that a more balanced set should provide better classification results. However, these methods also present some weaknesses due to the artificial alteration of the original distribution of classes. For instance, under-sampling may throw out potentially useful data (leading to information loss) and augment the variance of the classifier, while over-sampling increases the population of the data set by generating synthetic examples and increases the likelihood of overfitting and the computational burden of any learning model (Kang et al., 2017; López et al., 2013; Vuttipittayamongkol & Elyan, 2020; Wong et al., 2018).
Though conclusions about what is the most efficient resampling strategy for the class imbalance problem are divergent, many studies have reported that over-sampling usually performs better than under-sampling (Bach et al., 2017; Batista et al., 2004; Prati et al., 2015; Van Hulse et al., 2007; Yin & Gai, 2015). These conclusions have been drawn from mere experimental comparisons of a collection of resampling techniques to evaluate their performance, while the reasons why over-sampling is generally superior to under-sampling have not been properly investigated. Moreover, many of those studies have considered the imbalance ratio (the ratio of the majority class size to the minority class size) as the unique data difficulty factor, thus neglecting other relevant data characteristics that could help to explain the behavior of each of the three resampling strategies.
Taking into account the limitations just mentioned, the motivation of this paper is to provide further insight into the underlying causes of the apparent superiority of over-sampling. In pursuing this objective, the contribution of this paper is a large-scale experimental analysis with 22 resampling methods across six artificial data sets and 73 real-life data sets to understand the superiority of over-sampling based on the distribution of safe and unsafe samples. To this end, we address the following questions: (i) What effect do the resampling algorithms have on the inner structure of the class-imbalanced data sets? (ii) Can the superiority of over-sampling algorithms be explained in terms of safe and unsafe samples? (iii) Does there exist a close link between the amount of safe and unsafe samples and the performance of the strategies?
Unlike the common procedure that focuses only on the minority class, we assume that the majority one also deserves to be analyzed because the distribution of negative samples may provide meaningful information. Our hypothesis is that over-sampling often outperforms under-sampling because the former leads to a distribution of sample types with more safe examples and fewer unsafe cases than the latter. Hopefully, this will allow us to expand our understanding of how the performance of the resampling strategies is related to their effects on the structure of a data set. The findings of this study can serve as a valuable guideline to design expert and intelligent systems for many real-life applications that have to deal with class-imbalanced data, such as fraud detection, cancer malignancy grading, fault detection in industrial machinery and software defect prediction, among many others.
Henceforward, this paper is organized as follows. Section 2 summarizes a pool of works concerned with analyzing the possible relationships between class imbalance and other data difficulty factors. Section 3 provides a summary of representative resampling techniques, which will be further used for the experimental analysis. Section 4 presents a neighborhood-based categorization of the different sample types that can be found in an imbalanced data set. Next, Section 5 describes the research methodology that we have adopted to conduct this study and presents the thorough experimentation carried out. Finally, Section 6 remarks on the main findings and outlines possible directions for further research.

Class imbalance and other data difficulty factors
As already remarked, the imbalanced distribution of classes itself is not the only data difficulty factor, but there exist other intrinsic data characteristics that combined with class imbalance can be even more critical and lead to a severe loss of classification performance, especially for the minority class. Das et al. (2018) proposed a categorization of the intrinsic data characteristics into two groups: (i) distribution-based data irregularities, and (ii) feature-based data irregularities. The first group covers class imbalance, outliers and noisy data, class overlapping, small disjuncts, data set shift and small data set size, whereas the second group includes missing, noisy, irrelevant and redundant features. Next, we summarize a representative collection of recent publications where the class imbalance appears as the intersection factor between both groups.

Distribution-based data irregularities
One of the first papers that intended to discover any links between class imbalance and data complexity is the one by Japkowicz & Stephen (2002), in which the authors concluded that imbalance is a relative problem that depends on both the difficulty of the data and the overall size of the training set. After this seminal work, numerous studies have explored the influence of other complexity factors in class-imbalanced data. For instance, Prati et al. (2004b) investigated how class imbalance and error-prone small disjuncts are related to each other, whereas other authors claimed that the degradation of classification accuracy is due more to the presence of small disjuncts than to the class imbalance problem. A similar conclusion was drawn by Weiss (2010), who also showed that class imbalance is partly responsible for the problem with small disjuncts. Prati et al. (2004a) showed that there exists a strong correlation between the degree of class overlapping and class imbalance. Similarly, the experimental results in two papers by García et al. (2006, 2007) suggested that the local imbalance in the overlap region has a stronger impact on the performance of classifiers than the global imbalance, especially when there exists strong overlap and synthetic examples are generated with SMOTE. On the other hand, García et al. (2008) stated that the nearest neighbor classifier was more sensitive to the size of the class overlap than to the overall imbalance ratio. Vorraboot et al. (2015) proposed some modified hybrid algorithms to improve the classification performance of highly imbalanced large data sets with overlapped regions.
Dal Pozzolo et al. (2015) showed that the benefits of using an under-sampling algorithm strongly depend on the number of samples, the variance of the classifier, the degree of imbalance and the value of the posterior probability. García et al. (2015) compared the behavior of three linear classifiers modeled on both the feature space and the dissimilarity space when the class imbalance of data sets interweaved with small disjuncts and noise; they showed that small disjuncts could be much better overcome on the dissimilarity space than on the feature space, whereas noise in imbalanced data sets cannot be completely solved through the dissimilarity-based representation. Luengo et al. (2011) evaluated the behavior of three resampling methods (SMOTE, SMOTE-ENN, and an evolutionary under-sampling algorithm) by using three data complexity measures (F1, N4, and L3) (Ho & Basu, 2002) computed over the imbalanced data sets and then derived two descriptive rules to identify the data sets in which the C4.5 and PART decision trees could perform well. Napierala et al. (2010) analyzed how the noisy and borderline positive examples hindered the classification performance and concluded that focused preprocessing methods outperformed both random and cluster-based over-sampling algorithms. Stefanowski (2013) observed that the degradation of classification performance was more related to the decomposition of the minority class into small sub-groups than to the class imbalance, and also that the amount of borderline and rare examples in the minority class had an even stronger influence on the classifiers. Sáez et al. (2016) proposed a general methodology to decide which types of positive samples should be processed by an over-sampling algorithm when facing multi-class imbalanced distributions; the types of samples were characterized by using the local neighborhood-based procedure that will be further introduced in Section 4.
Following the same line, Skryjomski & Krawczyk (2017) analyzed the structure of the minority class to transform the SMOTE algorithm into a selective over-sampling method focused on certain types of positive examples. Using two artificial data sets with different dimensions and imbalance ratios, Wojciechowski & Wilk (2017) found out that the critical factor affecting the true-positive rate was the distribution of sample types, while the impact of dimensionality and imbalance ratio was limited. Similarly, Stefanowski (2016) concluded that the performance of the most representative preprocessing approaches depends on the dominating type of minority examples.

Feature-based data irregularities
Bak & Jensen (2016) studied the imbalance problem concerning the classification of high-dimensional binary data. Blagus & Lusa (2013) observed that SMOTE (Synthetic Minority Oversampling TEchnique) did not alleviate the bias towards the classification in the majority class when the imbalanced data set was also high-dimensional. Wasikowski & Chen (2010) showed that feature selection could tackle the class imbalance problem better than some preprocessing algorithms in high-dimensional data sets. Tomasev & Mladenić (2013) suggested that minority class hubs might be responsible for most misclassifications of the majority class in high-dimensional imbalanced data sets. Zheng et al. (2004) investigated the usefulness of common feature selection metrics (information gain, chi-square, correlation coefficient, and odds ratios) to handle imbalanced data. Van Hulse & Khoshgoftaar (2009) discussed the effect of noise resulting from the corruption of positive examples, which was the type of noise causing the most deterioration of the classification performance; moreover, they observed that simple classifiers such as naive Bayes and nearest neighbor were often more robust than more complex models such as support vector machines or random forests. Zhang et al. (2017) argued that the problems of high-dimensional data and imbalance are intertwined, and therefore they should not be solved separately. Lin & Chen (2013) reported the benefits of using some feature selection algorithm as a previous step to the application of the SMOTE over-sampling technique. Other authors, however, proposed first to resample the data set and then apply a feature selection procedure (Lachheta & Bawa, 2016). Yin et al. (2013) studied the difficulties of feature selection when applied to high-dimensional imbalanced data with Bayesian learning, and proposed two new algorithms to overcome the drawbacks: one is based on the decomposition of the majority classes into relatively smaller sub-classes, whereas the other one uses the Hellinger distance. Maldonado et al. (2014) proposed a feature selection technique using support vector machines and backward elimination in the context of high-dimensional imbalanced data sets. Viegas et al. (2018) developed a feature selection strategy for high-dimensional skewed data using genetic programming. Shahee & Ananthakumar (2019) introduced a distance-based feature selection method to tackle the simultaneous occurrence of between-class and within-class imbalance.

The resampling techniques
This section presents the resampling algorithms that will be used in the experiments. As pointed out in Section 1, the resampling methods can be grouped into two main categories: under-sampling and over-sampling. In addition, some hybrid techniques combine the general ideas of under- and over-sampling to transform the skewed class distribution into a more balanced distribution. Table 1 summarizes these algorithms, which are briefly described in Appendix A.
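The two basic strategies can be illustrated with a minimal sketch, assuming Euclidean distance and a binary NumPy data set. The helpers `random_undersample` and `smote_oversample` are illustrative names written for this example, not implementations taken from any of the packages cited in this paper; a full SMOTE implementation includes further details omitted here.

```python
import numpy as np

def random_undersample(X, y, majority_label, rng):
    """Randomly drop majority examples until both classes have equal size."""
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    keep = rng.choice(maj, size=len(mino), replace=False)
    idx = np.concatenate([keep, mino])
    return X[idx], y[idx]

def smote_oversample(X, y, minority_label, k, rng):
    """SMOTE-style over-sampling: each synthetic example is interpolated
    between a minority sample and one of its k nearest minority neighbors."""
    Xmin = X[y == minority_label]
    n_needed = int((y != minority_label).sum()) - len(Xmin)
    synthetic = []
    for _ in range(n_needed):
        i = rng.integers(len(Xmin))
        d = np.linalg.norm(Xmin - Xmin[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # k nearest minority neighbors
        j = rng.choice(nn)
        gap = rng.random()                   # random point on the segment
        synthetic.append(Xmin[i] + gap * (Xmin[j] - Xmin[i]))
    X_new = np.vstack([X, np.asarray(synthetic)])
    y_new = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_new, y_new
```

Note the asymmetry discussed in Section 1: the first function discards real majority examples, whereas the second fabricates new minority examples, which is precisely why their effects on the inner structure of the data may differ.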

Exploiting local neighborhood for the identification of sample types
When dealing with imbalanced data sets, a remarkable issue that deserves some special attention is the identification of the dominating types of examples because it can support interpretations of performance differences between the application of different resampling algorithms and can be useful in evaluating the data difficulty (Napierala & Stefanowski, 2012; Napierala et al., 2010; Stefanowski, 2016).
Several authors have proposed to distinguish two main types of samples according to their neighborhood: safe and unsafe (Kubat & Matwin, 1997;Napierala & Stefanowski, 2016;Sáez et al., 2016). The safe samples are placed in homogeneous regions with data from a single class and are sufficiently separated from examples belonging to any other classes, whereas the remaining samples are deemed unsafe. Most models classify the safe samples correctly, but the unsafe samples may make their learning especially difficult and more likely to be misclassified.
The common property of the unsafe samples is that they are located close to examples that belong to the opposite class. However, the unsafe samples can be further divided into three subtypes: borderline, rare and outlier (Krawczyk et al., 2014;Napierala & Stefanowski, 2016). The borderline samples are located closely to the decision boundary between classes. The rare samples form small data structures or clusters located far from the core of their class. Finally, the outliers are single samples that are surrounded by examples from the other class.
A straightforward method to identify each sample type consists of analyzing the local distribution of the data, which can be modeled either by computing their k-neighborhood or through a kernel function (this consists in setting a local area around the example and estimating the number of neighbors and their class labels within it). It has been claimed that analyzing a local distribution of examples is more appropriate than using global approaches because the minority class is often formed by small sub-groups with difficult, nonlinear borders between the classes (Napierala & Stefanowski, 2016;Sáez et al., 2016).
Suppose we have a data set Z = {z_1, ..., z_n}, where each sample z_i = (x_i, y_i), x_i is a vector of attributes describing the i-th example and y_i is its class label. The type of a sample z_i is often determined by comparing the number of its k nearest neighbors that belong to the class of z_i with the number of neighbors of the opposite class. Following the procedure described in Algorithm 1, which is a generalization for multi-class data of the procedure proposed by Stefanowski & Wilk (2008), a safe sample is characterized by having a neighborhood dominated by examples that belong to its same class, rare samples and outliers are mainly surrounded by examples from different classes, and the borderline samples are surrounded by examples both from their same class and also from a different class.
Here we have introduced two functions: computeNeighbors and countSameClass. The first one searches for the k nearest neighbors of a sample z i and stores them in a vector named neighbors, while the second function counts how many of the k nearest neighbors belong to the class of z i .
Most authors choose a fixed size of k = 5 because smaller values may poorly distinguish the nature of examples and higher values would violate the assumption of the local neighborhood (Bagherpour et al., 2018; Błaszczyński & Stefanowski, 2015; Fernández et al., 2018a; Krawczyk et al., 2014; Napierala et al., 2010; Napierala & Stefanowski, 2012; Ren et al., 2019; Sáez et al., 2016; Skryjomski & Krawczyk, 2017; Stefanowski, 2016; Tomasev & Mladenić, 2013). Moreover, Napierala & Stefanowski (2016) carried out a sensitivity analysis to check whether or not the parameter k could affect the results of assigning a sample type to the minority examples, and they observed that the proportion of each sample type was quite stable while changing the value of k. Thus, using k = 5, an example z_i will be considered as: (i) safe if at least 4 neighbors are from the class y_i; (ii) borderline if 2 or 3 neighbors belong to the class y_i; (iii) rare if only one neighbor belongs to the class y_i, and this neighbor has no more than one neighbor from its same class; and (iv) outlier if all its neighbors are from the opposite class. A simple, illustrative example of this categorization is displayed in Figure 1.
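The k = 5 rule above can be sketched as follows. A brute-force neighbor search is used for clarity, and the tie-break in the rare branch (falling back to borderline when the single same-class neighbor is itself well connected) is our reading of the rule, so treat this as an illustrative sketch rather than a transcription of Algorithm 1.

```python
import numpy as np

def sample_types(X, y, k=5):
    """Label every example safe/borderline/rare/outlier from its
    k nearest neighbors under Euclidean distance."""
    types = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]              # exclude the sample itself
        same = int((y[nn] == y[i]).sum())        # neighbors from its own class
        if same >= 4:
            t = "safe"
        elif same >= 2:
            t = "borderline"
        elif same == 1:
            # rare additionally requires the single same-class neighbor
            # to have at most one same-class neighbor of its own
            j = nn[y[nn] == y[i]][0]
            dj = np.linalg.norm(X - X[j], axis=1)
            nnj = np.argsort(dj)[1:k + 1]
            t = "rare" if (y[nnj] == y[j]).sum() <= 1 else "borderline"
        else:
            t = "outlier"
        types.append(t)
    return np.array(types)
```

For instance, a minority point dropped in the middle of a dense majority cluster has five opposite-class neighbors and is labeled an outlier, while the surrounding majority points remain safe.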
The identification of the different sample types has mainly been applied to the minority class because this often constitutes the most important class for most applications with imbalanced data sets. However, the percentage of samples in each category for the majority and minority classes may differ heavily from each other and therefore, we believe that it could be useful and more informative to analyze the true distribution of sample types for both classes present in class-imbalanced data. In this sense, the computation may resemble a means of data set evaluation that characterizes the overlap in terms of a scalar value. Considering that class overlapping is defined as the data space where there exists a similar quantity of training samples of both classes (Chen et al., 2018; López et al., 2013), we argue that the presence of borderline samples (2 or 3 out of the 5 nearest neighbors belong to the same class) is closely related to the concept of overlapping and therefore, it seems possible to estimate the size of the overlapping regions by computing the proportion of borderline samples in a data set.

Figure 1: Example of sample types using the procedure given in Algorithm 1

Experiments
Two groups of experiments on binary problems were carried out to investigate the effect of each of the three resampling strategies on the distribution of sample types in both classes, and also to discover any possible link between such distribution and the classification performance. The experiments in the first block were performed on artificial data sets taken from the paper by Napierala et al. (2010) because using synthetic data allows us to know their characteristics a priori and analyze the effects of resampling in a fully controlled environment. The second group of experiments was conducted on a well-known benchmark suite of real-life databases widely used for class imbalance problems (Chen et al.).

In binary classification problems, the most common method for evaluating the predictive performance is based on a 2 × 2 confusion matrix as shown in Table 2. Here, columns represent the predicted class and rows indicate the actual class, whereas the main diagonal contains the number of correct predictions. For estimating the effectiveness of a classifier on the positive and negative classes separately, two plain metrics can be easily obtained: the true positive rate, TPR = TP/(TP + FN), which is the proportion of positive examples correctly classified, and the true negative rate, TNR = TN/(TN + FP), which is the proportion of negative examples correctly classified.
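Both rates can be computed directly from the label vectors without building the full confusion matrix; `rates` is a hypothetical helper name introduced for this sketch.

```python
import numpy as np

def rates(y_true, y_pred, positive=1):
    """TPR and TNR computed directly from label vectors."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos, neg = y_true == positive, y_true != positive
    tpr = (y_pred[pos] == positive).mean()   # TP / (TP + FN)
    tnr = (y_pred[neg] != positive).mean()   # TN / (TN + FP)
    return tpr, tnr
```

For example, `rates([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])` yields TPR = 0.5 (one of two positives recovered) and TNR = 2/3 (two of three negatives recovered).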
In the context of the class imbalance problem, the performance evaluation is carried out using more powerful metrics derived from these straightforward indexes. Some examples are the geometric mean (Kubat & Matwin, 1997; Branco et al., 2016; Fernández et al., 2018b), the F_β-measure (Rijsbergen, 1979; Branco et al., 2016; Fernández et al., 2018b), and the area under the receiver operating characteristic curve (AUC) (Bradley, 1997; Branco et al., 2016; Fernández et al., 2018b). Although these performance metrics are used extensively in imbalanced domains, several studies have shown their limitations. García et al. (2014) documented that the geometric mean is invariant under the exchange of TP with TN and FN with FP; therefore, different combinations of TPR and TNR may produce the same value of the geometric mean. The F_β-measure combines into a single scalar value both TPR and precision (precision = TP/(TP + FP)), where the β parameter favors precision when β > 1, and TPR otherwise. Even though β allows adjusting the importance of TPR or precision, the studies of Daskalaki et al. (2006), Japkowicz (2006), Sokolova & Lapalme (2009), and Landgrebe et al. (2006) have shown that precision ignores the relative size of the negative class and displays a strong dependence upon the imbalance ratio; hence, in heavily imbalanced problems (e.g., 1% positive samples), any rise of FP will result in low precision and consequently in a low F_β-measure, even with high TPR values (Forman & Scholz, 2010). In the case of AUC, there may exist situations that produce the same AUC value but different accuracies (Huang & Ling, 2005). Hand & Till (2001) and Hand (2009) have also reported some limitations of the AUC, such as the fact that it ignores misclassification costs and assumes that these costs depend on the classifier.
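The invariance of the geometric mean noted by García et al. (2014) is easy to verify numerically: two classifiers with swapped TPR and TNR obtain exactly the same score, even though they behave very differently on each class.

```python
import math

def g_mean(tpr, tnr):
    """Geometric mean of the per-class rates."""
    return math.sqrt(tpr * tnr)

# Two very different classifiers, identical geometric mean:
a = g_mean(0.9, 0.4)   # strong on positives, weak on negatives
b = g_mean(0.4, 0.9)   # the reverse
assert math.isclose(a, b)
```

This symmetry is precisely why, in this paper, TPR and TNR are reported separately rather than collapsed into a single scalar.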
Bearing in mind that this paper aims to analyze the effects of the resampling methods on each class and that each performance measure evaluates different properties, here we will use both the straightforward TPR and TNR indexes.

Experiments with artificial data sets
The experiments on artificial data were conducted on three databases with different shapes of the minority class (subclus, clover, and paw) whose examples are randomly and uniformly distributed in a two-dimensional feature space. In all cases, the examples of the minority class are uniformly surrounded by the majority class.
In subclus, the positive examples are located inside rectangles that form small disjuncts. Clover represents a more complex, non-linear situation, where the minority class resembles a flower with elliptic petals. In the paw database, the minority class is decomposed into three elliptic sub-regions of varying cardinalities, where two sub-regions are located close to each other and the remaining smaller sub-region is separated.
From the multiple data sets that were generated with different settings in the original paper (Napierala et al., 2010), we chose a group of databases with 800 examples, an imbalance ratio of 7, and two different levels of noise (0% and 70%). This means that the experiments were carried out over a total of 6 artificial data sets (3 shapes × 1 imbalance ratio × 2 levels of noise), which are illustrated in Figure 2.
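The exact generators of Napierala et al. (2010) are not reproduced here, but a subclus-like data set (uniform 2-D points, positives confined to small rectangles) can be sketched by rejection sampling. The rectangle coordinates below are illustrative choices for this example, not those of the original paper.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_subclus_like(n=800, ratio=7, rng=rng):
    """Uniform 2-D points; positives fall inside small rectangles (small
    disjuncts), everything else is negative. IR = 7 gives 100 positives."""
    n_pos = n // (ratio + 1)
    n_neg = n - n_pos
    boxes = [(0.1, 0.3), (0.45, 0.65), (0.8, 1.0)]   # x-ranges of the disjuncts
    in_box = lambda p: any(lo <= p[0] <= hi for lo, hi in boxes) and 0.4 <= p[1] <= 0.6
    pos, neg = [], []
    while len(pos) < n_pos:                           # rejection-sample positives
        p = rng.random(2)
        if in_box(p):
            pos.append(p)
    while len(neg) < n_neg:                           # rejection-sample negatives
        p = rng.random(2)
        if not in_box(p):
            neg.append(p)
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    return X, y
```

Adding noise, as in the 70% variants of the experiments, would amount to relabeling a fraction of points near the rectangle boundaries; that step is omitted here.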
The experiments consisted of applying the resampling techniques described in Section 3 to the original data sets and recording the proportion of each sample type for both the minority class and the majority class. This will allow us to analyze how each strategy affects the distribution of sample types in a data set, which may contribute to gaining some insight into the behavior of these techniques when they are used in imbalanced data sets that are also characterized by other data difficulties, such as the presence of noisy samples that can largely impair the predictive results of classifiers.

These scatterplots reveal that the three resampling strategies increased the proportion of safe samples and decreased the percentage of unsafe samples in the positive class when compared to the imbalanced data sets. What is more interesting, though, is that for the databases with 70% of noise (S70, C70 and P70), the over-sampling and hybrid techniques achieve a higher (lower) proportion of safe (unsafe) samples than under-sampling. While all over-sampling algorithms augmented the number of safe samples and diminished the number of unsafe samples very substantially, the proportions of safe, borderline and rare samples produced by some under-sampling methods were even worse than those in the original data sets. Regarding the hybrid techniques, these and over-sampling were not distant, except in the case of the proportion of safe and borderline samples given by SPIDER, whose results were similar to those achieved by the under-sampling strategy.
When analyzing the proportion of each sample type in the negative class, the graphs in Figures 5-6 show that the proportion of safe samples after resampling the data sets using the over-sampling and hybrid algorithms was not far from that in the original data sets. However, the proportion of unsafe samples produced by these methods increased, especially in the case of the rare and outlier types. The under-sampling techniques usually exhibited an unstable behavior, with a serious decrease of safe samples and an evident increase of borderline and rare samples.

Classification of the artificial data sets
The results of the experiments on the proportion of each sample type identified under-sampling as an inferior choice to make up for the class imbalance, especially for the data sets with a large proportion of noisy examples (70%). This resonates well with the general conclusions drawn from numerous comparative studies available in the literature, which designate over-sampling as a usually more effective strategy than under-sampling.
To fairly assess whether or not there exists any link between the proportions of sample types and the classification performance, a classifier whose decisions can be easily interpreted was applied to each resampled data set, while other classifiers such as neural networks are generally perceived as being a black box whose specific predictions are extremely hard to understand. Visualizing TPR and TNR in Figure 7 and comparing these graphs with those in Figures 3 and 5 can help us discover and interpret the possible relationships between the structure of resampled data sets and the performance of the classifier.
Our discussion of Figure 7 focuses on the results over the data sets with 70% of noisy examples because these represent a more challenging problem combining imbalance and noise. As can be observed, when the classifier was applied to the class-imbalanced data, the TPR was 0 or close to 0 (i.e., all or almost all the positive samples were misclassified) and the TNR was equal to 1 (i.e., all negative samples were classified correctly). The most interesting feature of these graphs, however, is that both over-sampling and the hybrid sampling algorithms exhibited a good trade-off between high TPR and high TNR, whereas some under-sampling techniques produced high TNR but at the cost of yielding very low values of TPR (even less than 0.5).
In summary, the graphs in Figure 7 confirm that the performance of classifiers is related to the proportions of safe and unsafe samples, and that these depend on the resampling strategy applied to the class-imbalanced data. A qualitative comparison between these graphs and those in Figures 3-6 suggests that over-sampling mostly performs better than under-sampling because the former increases the proportion of safe samples and decreases the proportion of unsafe samples to a much greater extent than the latter does.

Figure 7: TPR and TNR over the artificial databases. Graphs on the right are for the averaged values
A close look at these scatterplots shows that the discussion of the results for the synthetic data also applies to those for the real-life databases. Indeed, as one can observe in Figure 8, the proportion of safe samples in many sets that were preprocessed by some under-sampling algorithms was even lower than that in the original data sets. Similarly, the amount of unsafe samples in many under-sampled data sets was greater than that in the original data sets. As to over-sampling (Figure 9) and the hybrid strategy (Figure 10), the graphs show that most algorithms increased the number of safe samples and also decreased the proportion of unsafe samples, which is especially remarkable for the positive class.
To summarize the results of the graphs in Figures 8-10, we averaged the proportions of each sample type over all algorithms for each resampling strategy. The most interesting feature of the graphs depicted in Figure 11 is that under-sampling produced a proportion of safe samples in both classes clearly lower than over-sampling and hybrid sampling, whereas the amount of unsafe samples was higher in the under-sampled sets than in the sets preprocessed by the other two resampling strategies.

Figure 9: Proportion of sample types in the positive (left) and negative (right) classes for the real-life databases preprocessed by over-sampling algorithms
As further evidence, Table 4 reports an index of improvement. For each resampling algorithm A, the index of improvement is calculated as the difference between wins and losses, where wins (losses) is the total number of times (databases) that the proportion of samples produced by A has been better (worse) than that in the original data set. Note that "better" means that the proportion of safe samples in the resampled data set is higher than that in the original data set, while for the unsafe sample types it means that the proportion of samples in the resampled data set is lower than that in the original data set. Such an index provides a means of estimating the benefits of using a resampling technique to face the imbalance problem. For each resampling strategy, the averaged index across all of its algorithms has also been included in this table. Table 4 shows that the over-sampling strategy produced the best outputs for the safe and borderline types, whereas the hybrid methods achieved the highest averaged index when analyzing the proportion of rare and outlier samples. Nevertheless, the superiority of the hybrid techniques over the over-sampling methods came from the poor behavior of the AHC algorithm in processing the unsafe samples. As already observed in the experiments with synthetic data, under-sampling was the worst strategy regarding the improvement of the balanced data over the original (imbalanced) data, revealing that it yielded the lowest proportion of safe samples and also the highest proportion of unsafe samples.
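The index of improvement can be computed in a few lines. The following sketch (the function name and data layout are our own illustration, not the authors' code) counts wins and losses over a set of databases for one algorithm and one sample type:

```python
def improvement_index(resampled, original, sample_type):
    """Wins minus losses across databases for one resampling algorithm.

    `resampled` and `original` map a database name to the proportion of
    `sample_type` in that data set. For the safe type a higher proportion
    counts as a win; for the unsafe types a lower proportion does.
    """
    higher_is_better = sample_type == "safe"
    wins = losses = 0
    for db in original:
        r, o = resampled[db], original[db]
        if r == o:
            continue  # ties count neither as a win nor as a loss
        if (r > o) == higher_is_better:
            wins += 1
        else:
            losses += 1
    return wins - losses
```

A positive index means the algorithm improved the given sample type more often than it degraded it across the benchmark databases.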
For the majority class, most algorithms achieved a negative score of the index of improvement, which means that the balanced data sets contain fewer safe samples and more unsafe samples than the original data sets. Note that this result is consistent with the ultimate objective of the resampling techniques as they mainly concentrate on improving the minority class.

[Figure 11: Proportion of sample types in the positive (left) and negative (right) classes for the real-life databases averaged over all algorithms.]
In summary, the numerical indices of improvement agree with the results depicted in the scatterplots of Figures 8-10. On the other hand, the conclusions drawn from the experiments over the real-life data closely resemble those reached in the experiments over the synthetic databases.

Classification of the real-life data sets
Like in the experiments on the synthetic data, a C4.5 decision tree was applied to both the imbalanced and the resampled data sets to check for any link between the proportions of sample types and the resulting classification performance.
As the characteristics of the 73 experimental databases may differ from each other considerably, we firstly categorized them into three groups according to the prevalent type of positive samples in the original data sets (see Appendix B): safe, borderline, and rare-outlier (databases in which the positive samples are mainly placed between the rare and the outlier types). The purpose of this categorization was to better understand the behavior of the resampling strategies as a function of the distribution of sample types in the imbalanced data sets. The scatterplots of TPR versus TNR are displayed in Figures 12-14. The uppermost graphs correspond to the results achieved with under-sampling, the middle ones are for the over-sampling algorithms, and the lowermost ones are for the hybrid methods.
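For reference, the categorization of examples into sample types from their local neighborhood can be sketched as follows. We assume the usual convention from the local-neighborhood literature (k = 5, with 4-5 same-class neighbors meaning safe, 2-3 borderline, 1 rare and 0 outlier); the exact rule used in the paper may differ in details:

```python
import math

def categorize(samples, labels, k=5):
    """Label each example as safe/borderline/rare/outlier according to
    how many of its k nearest neighbours (Euclidean) share its class."""
    types = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        neigh = sorted(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(x, samples[j]),
        )[:k]
        same = sum(labels[j] == y for j in neigh)
        if same >= 4:
            types.append("safe")
        elif same >= 2:
            types.append("borderline")
        elif same == 1:
            types.append("rare")
        else:
            types.append("outlier")
    return types
```

A positive example surrounded only by negatives is labeled an outlier, while one embedded in a homogeneous cluster of its own class is labeled safe.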
The graphs in these figures reveal that the over-sampling algorithms and the hybrid sampling methods performed similarly, irrespective of the prevalent type of positive samples. As already highlighted in the experiments on artificial data, both strategies led to a good trade-off between high TPR and high TNR, with points located towards the top right corner of the scatterplots for the safe and borderline databases. We observed, however, a different behavior pattern for the rare-outlier databases: in this case, the over-sampling and hybrid techniques still achieved very high values of TNR, but also an important degradation of accuracy on the positive class, with a majority of points lying on the left side of the graphs (i.e., TPR ≤ 0.5).

[Figure 12: TPR versus TNR over the safe data sets.]
Regarding the scatterplots for the under-sampling strategy, we found quite different behaviors among methods. For the safe databases, a majority of points are located close to the top right corner of the graph (high TPR and high TNR), but a few points lie near the bottom right corner (high TPR and very low TNR). This behavior was similar to that shown for the borderline databases, although both TPR and TNR were usually lower than those achieved for the safe databases. For the rare-outlier databases, one can see that the results of under-sampling were worse than those of the over-sampling and hybrid algorithms, with many points representing low TPR and low TNR.

[Figure 13: TPR versus TNR over the borderline data sets.]
In summary, these results reveal that there exist several links between the distribution of sample types produced by the resampling strategies and the classification performance, thus suggesting that the analysis of such a distribution is indeed a useful tool to understand the behavior of each preprocessing method. In general, the over-sampling and hybrid techniques can be claimed to be more effective than under-sampling, independently of the prevalent type of positive samples in the imbalanced data set. However, the most meaningful differences appeared when under-sampling was applied to the databases with a majority of rare and outlier samples, which correspond to the most difficult cases for standard classifiers.

Conclusions
Our motivation for this work came from the observation that many studies on class imbalance stated that over-sampling mostly performs better than under-sampling, but the reasons for its superiority were not adequately addressed. Thus we have intended to increase understanding of the behavior of resampling strategies by analyzing the distribution of sample types in the balanced data sets. Our hypothesis was that the apparent superiority of over-sampling techniques comes from the fact that these provide a higher proportion of safe samples and a lower proportion of some subtypes of unsafe samples than the under-sampling methods.

[Figure 14: TPR versus TNR over the rare-outlier data sets.]
The experiments to check whether or not our hypothesis holds consisted of gathering the information related to the local neighborhood of both classes, calculating the proportions of each sample type and investigating any links between these proportions and the classification performance of a decision tree. From the experiments over artificial and real-life data, we have found that the over-sampling algorithms and the hybrid resampling methods increased the proportion of safe samples and diminished the proportion of unsafe samples to a much greater extent than under-sampling did. We claim that this result is already important by itself because it suggests that classification with over-sampled data sets will presumably be easier and more effective than using under-sampled data sets.
When comparing the resulting distribution of sample types with the classification performance measured by the true-positive and true-negative rates, we have observed that our hypothesis mostly holds. In general, the strategies with the highest proportion of safe samples and the lowest proportion of unsafe samples corresponded to those with the highest overall performance, which may indicate that there are some relationships between the proportions of safe and unsafe samples and the performance of the classifier.
We believe that the findings of this study can be of interest for the research community in expert and intelligent systems because they allow gaining a more in-depth insight into the performance of resampling strategies for class-imbalanced data and expand the current knowledge about why over-sampling generally performs better than under-sampling. On the other hand, the conclusions drawn in this paper could provide support for the development of new preprocessing algorithms that incorporate some a priori knowledge about the internal structure of the imbalanced data sets. Another practical implication that deserves further study is the design of a meta-learning recommendation system for characterizing classification problems. This is based on the idea of using the categorization of examples as a means to guess the best performing algorithm according to the inner structure of each data set.
Despite its contributions, the results of this paper should not be interpreted without accounting for some limitations that could be addressed in future work. First, the research has focused on the analysis of relatively small data sets (at most 5472 examples and 34 features), and so any generalization is limited to this particular context. It would be useful to replicate this study when the number of examples is in the order of millions to billions and the number of features is in the order of thousands, where the boundary conditions are very different and much more complex. A second limitation is that the categorization of examples has been based on computing their k-neighborhood, but it would be worth comparing the results of this study with those given by the use of a kernel function. Finally, the emphasis of this paper has been on three common resampling strategies, but it could be extended to ensemble-based preprocessing methods such as RUSBoost (Seiffert et al., 2010), SMOTEBoost (Chawla et al., 2003), EasyEnsemble (Liu et al., 2009) and SMOTEBagging (Wang & Yao, 2009), which have been shown to be among the most effective techniques in many real-life applications.

Appendix A. Resampling methods
This appendix provides a brief description of the resampling algorithms used in the experiments.

Appendix A.1. Under-sampling

One such method (Batista et al., 2004) firstly finds a consistent subset and then applies the procedure based on the Tomek links.
Unlike the one-sided selection technique, the neighborhood cleaning (NCL) rule (Laurikkala, 2001) concentrates more on data filtering than on data reduction; to this end, Wilson's editing (ENN) (Wilson, 1972) is employed to identify and remove noisy negative examples. According to the authors, NCL performs better than OSS and processes noisy examples more carefully. However, this method is strongly biased in favor of the minority class and leads to poor specificity and overall accuracy. Yen & Lee (2006) presented an under-sampling algorithm based on clustering (SBC): it first groups all the original examples into several clusters, and then selects an appropriate number of majority class samples from each cluster by considering the ratio of the number of majority class examples to the number of minority class examples in the cluster. On the other hand, Yoon & Kwek (2005) proposed the class purity maximization (CPM) algorithm, which intends to split the majority class into dense clusters. The idea is to determine majority examples that are far away from the decision boundary, that is, to find as many clusters of majority samples as possible that do not contain any positive example or at most very few minority examples.
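The cluster-proportional selection step of SBC can be illustrated with a short sketch. The function name and the parameter m (the desired ratio of retained majority examples to minority examples) are our own illustration, not the authors' code:

```python
def sbc_sample_sizes(cluster_counts, m=1.0):
    """Number of majority samples to draw from each cluster under SBC.

    cluster_counts: list of (n_majority, n_minority) pairs per cluster.
    The number drawn from cluster i is proportional to the ratio
    n_majority / n_minority in that cluster; the total retained equals
    m times the overall number of minority examples.
    """
    total_min = sum(mi for _, mi in cluster_counts)
    ratios = [ma / max(mi, 1) for ma, mi in cluster_counts]  # guard /0
    total_ratio = sum(ratios)
    return [round(m * total_min * r / total_ratio) for r in ratios]
```

Clusters dominated by the majority class thus contribute more retained negatives, which preserves the internal structure of the majority class better than a purely random reduction.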

Appendix A.2. Over-sampling
The simplest strategy to augment the minority class is random over-sampling (ROS), which corresponds to a non-heuristic method that balances the class distribution through a random replication of positive examples (Batista et al., 2004). Although effective, this method may increase the likelihood of overfitting since it makes exact copies of the minority class examples. Chawla et al. (2002) proposed the SMOTE algorithm, which generates artificial samples of the minority class by interpolating existing examples that lie close together. It first finds the k positive nearest neighbors for each minority class example and then, the synthetic examples are generated in the direction of some or all of those nearest neighbors. Depending upon the amount of over-sampling required, a certain number of examples from the k nearest neighbors are randomly chosen.
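The interpolation mechanism of SMOTE can be sketched as follows; this is a minimal illustration of the idea, not the reference implementation:

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority examples, each interpolated
    between a seed example and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because each synthetic point lies on the segment between two existing minority examples, the new samples stay inside the convex hull of the minority neighborhood rather than being exact replicas.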
Although SMOTE has proven to be an effective method for the class imbalance problem, it may overgeneralize the minority class because it disregards the distribution of majority class neighbors; consequently, the generation of synthetic examples may increase the overlapping between classes (Maciejewski & Stefanowski, 2011). In order to address this weakness in SMOTE, the resampling process can be altered to account for the class density around the minority class examples. For instance, the borderline-SMOTE algorithm (Han et al., 2005) consists of using only positive examples close to the decision boundary since these are more likely to be misclassified.
The Safe-Level-SMOTE algorithm (Bunkhumpornpat et al., 2009) calculates a "safe level" coefficient (sl) for each minority class example, which is defined as the number of other minority class examples among its k neighbors, to generate new synthetic examples close to safe regions. If the coefficient sl is equal to or close to 0, such an example is considered as noise; if sl is close to k, then this example may be located in a safe region of the minority class. Another proposal modified the original SMOTE method by using the surrounding neighborhood concept when selecting the k positive neighbors of the minority class examples. The authors proposed three variations of the algorithm, each one based on a particular surrounding neighborhood realization (Sánchez & Marqués, 2002) for over-sampling the minority class: the nearest centroid neighborhood (NCN), the Gabriel graph (GG) and the relative neighborhood graph (RNG). He et al. (2008) introduced an adaptive synthetic over-sampling (ADASYN) approach for learning from imbalanced data sets. The rationale behind this algorithm is to use a weighted distribution for different minority class examples according to their level of difficulty in learning, thus shifting the decision boundary to be more focused on those examples that are harder to learn.
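The safe-level coefficient itself is straightforward to compute. The sketch below assumes sl counts the minority neighbors among the k nearest neighbors over the whole data set; the function name and data layout are ours:

```python
import math

def safe_level(i, samples, labels, positive=1, k=5):
    """Safe level of example i: how many of its k nearest neighbours
    (over the whole data set) belong to the minority (positive) class.
    sl close to 0 suggests noise; sl close to k suggests a safe region."""
    neigh = sorted(
        (j for j in range(len(samples)) if j != i),
        key=lambda j: math.dist(samples[i], samples[j]),
    )[:k]
    return sum(labels[j] == positive for j in neigh)
```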
The ADOMS algorithm proposed by Tang & Chen (2008) is based on generating artificial examples along the first principal component axis of local data distribution composed of a positive sample and its k nearest neighbors. When k = 1, the result of this method matches that of SMOTE.
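The core step of ADOMS, placing a synthetic point along the first principal axis of the local cloud formed by a positive sample and its neighbors, can be sketched with NumPy. This is a simplified single-sample illustration under our own assumptions, not the authors' implementation:

```python
import numpy as np

def adoms_sample(x, neighbours, rng):
    """Place one synthetic example along the first principal component
    of the local cloud formed by x and its nearest neighbours."""
    local = np.vstack([x, neighbours])
    centred = local - local.mean(axis=0)
    cov = centred.T @ centred / len(local)
    w, v = np.linalg.eigh(cov)
    axis = v[:, -1]  # eigh sorts eigenvalues ascending; take the largest
    # step towards a random neighbour, projected onto the principal axis
    nb = neighbours[rng.integers(len(neighbours))]
    step = rng.random() * np.linalg.norm(nb - x)
    direction = axis if axis @ (nb - x) >= 0 else -axis
    return x + step * direction
```

If the local cloud is essentially one-dimensional, the synthetic point lies on that line, which is how ADOMS keeps new examples aligned with the local data distribution.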
Another exciting proposal for populating the minority class is based on the application of an agglomerative hierarchical clustering (AHC) algorithm (Cohen et al., 2006). It uses single- and complete-linkage in succession to vary the clusters produced. Then the clusters are gathered from all levels of the resulting dendrograms and their centroids are computed and concatenated with the original positive samples. This results in augmenting the number of positive examples to match the size of the negative class.

Appendix A.3. Hybrid resampling
Although SMOTE produces well-balanced class distributions, some other difficulties often present in skewed data sets are not solved. For instance, class overlapping appears to be a widespread situation: some negative examples may be located within the clusters of the minority class and some synthetic positive examples may encroach on the majority class clusters. To overcome this problem and create non-overlapped class clusters, Batista et al. (2004) proposed the SMOTE-ENN technique: it consists in applying Wilson's editing algorithm to the over-sampled data set to remove misclassified examples of both classes.
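The cleaning step of SMOTE-ENN, Wilson's editing, can be sketched as follows (k = 3 is the usual choice; the function name and return convention are our illustration):

```python
import math
from collections import Counter

def wilson_editing(samples, labels, k=3):
    """Wilson's ENN: return the indices of examples whose class agrees
    with the majority class of their k nearest neighbours; the rest
    (i.e., the misclassified examples) are dropped."""
    keep = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        neigh = sorted(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(x, samples[j]),
        )[:k]
        majority = Counter(labels[j] for j in neigh).most_common(1)[0][0]
        if majority == y:
            keep.append(i)
    return keep
```

Applied after SMOTE, this removes both synthetic positives that intruded into majority regions and noisy negatives inside minority clusters.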
Another straightforward hybridization technique is based on the combination of SMOTE with the Tomek links (Batista et al., 2004). This method (SMOTE-TL) removes positive and negative examples that form a link after over-sampling the data set through SMOTE. Stefanowski & Wilk (2008) introduced a selective preprocessing and resampling algorithm (SPIDER) that firstly preprocesses the data set to identify the safe and noisy examples. After this initial stage, all the noisy negative samples are removed, and the safe negative examples are kept. On the other hand, the minority class is modified according to one of the following three strategies: weak amplification, weak amplification and relabeling, and strong amplification.
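The Tomek-link detection used by SMOTE-TL can be sketched as follows (our own illustration): a pair of opposite-class examples forms a link when each is the other's nearest neighbor.

```python
import math

def tomek_links(samples, labels):
    """Return pairs (i, j), i < j, of opposite-class examples that are
    each other's nearest neighbour (Tomek links)."""
    def nearest(i):
        return min(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(samples[i], samples[j]),
        )
    links = []
    for i in range(len(samples)):
        j = nearest(i)
        if labels[i] != labels[j] and nearest(j) == i and i < j:
            links.append((i, j))
    return links
```

Such pairs sit right on the class boundary (or are noise), so removing them after SMOTE cleans the most ambiguous region of the feature space.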
SPIDER2 is an extension of SPIDER, which consists of two phases that preprocess the majority class and the minority class, respectively (Napierala et al., 2010). Firstly, it identifies the safe and unsafe (noisy and borderline) negative examples and then either removes or relabels the noisy samples. In the second phase, the algorithm identifies the positive examples taking into account the changes introduced in the data set during the first phase. Next, it replicates the noisy examples of the minority class. The only difference between this technique and SPIDER is that the latter processes both classes simultaneously.