On the suitability of resampling techniques for the class imbalance problem in credit scoring

Abstract In real-life credit scoring applications, it is very common for the class of defaulters to be under-represented in comparison with the class of non-defaulters, yet this problem has received little attention so far. The present paper investigates the suitability and performance of several resampling techniques when applied in conjunction with statistical and artificial intelligence prediction models over five real-world credit data sets, which have been artificially modified to derive different imbalance ratios (the proportion of defaulter to non-defaulter examples). Experimental results demonstrate that the use of resampling methods consistently improves the performance obtained with the original imbalanced data. Moreover, over-sampling techniques generally perform better than any under-sampling approach.


Introduction
The recent world financial crisis has drawn increasing attention from banks and financial institutions to credit risk assessment, making it a key task because of the heavy losses associated with wrong decisions. One major risk comes from the difficulty of distinguishing creditworthy applicants from those who will probably default on repayments. In this context, credit scoring has been identified as a crucial tool to evaluate credit risk, improve cash flow, reduce possible risks and make managerial decisions (Thomas et al, 2002; Abrahams and Zhang, 2008), and as one of the most popular application fields for both data mining and operational research (Baesens et al, 2009).
In practice, the process of credit scoring can be viewed as a prediction problem where a new input sample (the credit applicant) must be categorized into one of the predefined classes (in general, 'good' applicants and 'bad' applicants, depending on how likely they are to default on their repayments) based on a number of observed variables or attributes related to that sample. The input to the model consists of a variety of information that describes the sociodemographic characteristics and economic conditions of the applicant, and the prediction method produces an output in terms of the applicant's creditworthiness.
The most classical approaches to credit scoring are based on parametric statistical models, such as discriminant analysis and logistic regression. However, most recent research has been addressed to implement solutions with non-parametric methods and computational intelligence techniques: decision trees, artificial neural networks, support vector machines, evolutionary algorithms, etc.
From the many comparative studies carried out (Baesens et al, 2003;Huang et al, 2004;Xiao et al, 2006;Wang et al, 2011), it is not possible to claim the superiority of a method over other competing algorithms regardless of data characteristics. For instance, noisy samples, missing values, skewed class distribution and attribute relevance may significantly affect the success of most prediction models.
This paper focuses on one of the data characteristics that may have most influence on the performance of classification techniques: the imbalance in class distribution (Japkowicz and Stephen, 2002;Chawla et al, 2004;He and Garcia, 2009). While some complexities have been widely studied in the credit scoring literature (eg, attribute relevance), the class imbalance problem has received relatively little attention so far. Nevertheless, imbalanced class distribution naturally happens in credit scoring where, in general, the number of observations in the class of defaulters is much smaller than the number of cases belonging to the class of non-defaulters (Pluto and Tasche, 2006).
In this paper, we conduct an experimental study over real-life credit scoring data sets using eight resampling algorithms to handle the class imbalance problem and two well-established prediction models (logistic regression and support vector machine). All techniques are evaluated in terms of their area under the ROC curve (AUC), and then compared for statistical differences using Friedman's average rank test and a post hoc test. The aim of this study is to determine whether or not resampling strategies are suitable to deal with the class imbalance problem, and to what extent different levels of imbalance affect the performance of each method.

Related works
Class imbalance hinders the performance of most standard classification systems, which assume a relatively well-balanced class distribution and equal misclassification costs (Japkowicz and Stephen, 2002). The class imbalance problem occurs when one class vastly outnumbers the other class, which is usually the most important one and with the highest misclassification costs (Chawla et al, 2008). Instances from the minority and majority classes are often referred to as positive and negative, respectively.

Class imbalance in credit scoring
As already mentioned, imbalanced class distribution happens in many credit scoring applications. For example, it is common to find that defaulters constitute less than 10% of the database. This is the main reason why the class imbalance problem has attracted growing attention in the literature, both to detect fraudulent financial activities and to predict creditworthiness of credit applicants.
In the credit scoring domain, research has mainly focused on analysing the behaviour of prediction models, showing that the performance on the minority class drops significantly as the imbalance ratio increases (Kennedy et al, 2010; Bhattacharyya et al, 2011; Brown and Mues, 2012). However, only a few works have addressed the design of solutions for imbalanced credit data sets. For example, Vinciotti and Hand (2003) introduced a modification to straightforward logistic regression by taking into account the misclassification costs when the probability estimates are made. Huang et al (2006) proposed two strategies for classification and cleaning of skewed credit data. One method involves randomly selecting instances to balance the proportion of examples in each class, whereas the second combines the ID3 decision tree and the PRISM filter.
An algorithmic level solution corresponds to the proposal by Yao (2009), who carried out a systematic comparative study on three weighted classifiers: C4.5 decision tree, support vector machine and rough sets. The experiments over two credit scoring data sets showed that the weighted methods outperform the standard classifiers in terms of type-I error. Within the PAKDD'2009 data mining competition, Xie et al (2009) proposed an ensemble of logistic regression and AdaBoost with the aim of optimizing the AUC for a highly imbalanced credit data set. In the same direction of combining classifiers, Florez-Lopez (2010) employed several cooperative strategies (simple and weighted voting) based on statistical models and computational intelligence techniques in combination with bootstrapping to handle the imbalance problem. Kennedy et al (2010) explored the suitability and performance of various one-class classifiers for several imbalanced credit scoring problems with varying levels of imbalance. The experimental results suggest that the one-class classifiers perform especially well when the minority class constitutes 2% or less of the data, whereas the two-class classifiers are preferred when the minority class represents at least 15% of the data. Tian et al (2010) proposed a new method based on the support vector domain description model, showing that this can be effective in ranking and classifying imbalanced credit data.
An exhaustive comparative study of various classification techniques when applied to skewed credit data sets was carried out by Brown and Mues (2012). They progressively increased the levels of class imbalance in each of five real-life data sets by randomly under-sampling the minority class of defaulters, so as to identify to what extent the predictive power of each technique was adversely affected. The results showed that traditional models, such as logistic regression and linear discriminant analysis, are fairly robust to imbalanced class sizes.

Resampling methods
Much work has been done to deal with the class imbalance problem, at both data and algorithmic levels. At the data level, the most popular strategies consist of applying different forms of resampling to change the class distribution of the data. This can be done by either over-sampling the minority class or under-sampling the majority class until both classes are approximately equally represented.
Both data level solutions present several drawbacks because they artificially alter the original class distribution. While under-sampling may result in throwing away potentially useful information about the majority class, over-sampling increases the computational burden of some learning algorithms and may introduce noise that results in a loss of performance (Barandela et al, 2003).
At the algorithmic level, solutions include internally biasing the discrimination-based process, assigning distinct costs to the classification errors and learning from one class. Conclusions about what is the best solution for the class imbalance problem are divergent. However, the data level methods are the most investigated because they are independent of the underlying classifier and can be easily implemented for any problem. Hence, the present study will concentrate on a number of resampling strategies.

Over-sampling
The simplest strategy to expand the minority class corresponds to random over-sampling, that is, a non-heuristic method that balances the class distribution through the random replication of positive examples. Nevertheless, this method may increase the likelihood of overfitting since it makes exact copies of the minority class instances.
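As a minimal sketch, random over-sampling can be written in a few lines of Python (the function name and toy data are illustrative, not from the paper):

```python
import random

def random_oversample(majority, minority, seed=0):
    """Non-heuristic random over-sampling: replicate randomly chosen
    minority-class examples until both classes are equally represented."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

# Toy example: 10 non-defaulters vs 3 defaulters
negatives, positives = random_oversample(list(range(10)), [100, 101, 102])
```

Because every added example is an exact copy of an existing one, a classifier trained on the result may overfit those replicated points, which is precisely the weakness SMOTE addresses.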
In order to avoid overfitting, Chawla et al (2002) proposed a technique, called Synthetic Minority Over-sampling TEchnique (SMOTE), to up-size the minority class. Instead of merely replicating cases belonging to the minority class, this algorithm generates artificial examples from the minority class by interpolating existing instances that lie close together. It first finds the k nearest neighbours belonging to the minority class for each positive example; the synthetic examples are then generated in the direction of some or all of those nearest neighbours. SMOTE allows the classifier to build larger decision regions that contain nearby instances from the minority class. Depending upon the amount of over-sampling required, a number of neighbours from the k nearest neighbours are randomly chosen (in the experiments reported in the original paper, k was set to 5). When, for example, the amount of over-sampling needed is 200%, only two neighbours from the k nearest neighbours are chosen and one synthetic prototype is generated in the direction of each of these two neighbours.
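The interpolation step can be sketched in Python as follows (a simplified illustration, not the authors' implementation; it assumes numeric feature vectors and more than one minority example):

```python
import random

def smote(minority, k=5, n_new=None, seed=0):
    """Generate synthetic minority examples by interpolating each chosen
    positive instance with one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    n_new = n_new if n_new is not None else len(minority)  # 100% over-sampling
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (squared Euclidean)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, x)))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment between x and nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Every synthetic point lies on a segment joining two existing minority examples, which is why SMOTE enlarges the minority decision region instead of merely deepening it with duplicates.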
Although SMOTE has proved to be an effective tool for handling the class imbalance problem, it may overgeneralize the minority class as it does not take care of the distribution of majority class neighbours. As a result, SMOTE generation of synthetic examples may increase the overlapping between classes (Maciejewski and Stefanowski, 2011). Numerous modifications to the original SMOTE have been proposed in the literature, most of them aiming to determine the region in which the positive examples should be generated. Thus, the Safe-Level SMOTE (SL-SMOTE) algorithm (Bunkhumpornpat et al, 2009) calculates a 'safe level' coefficient (sl) for each example from the minority class, which is defined as the number of other minority class instances among its k neighbours. If the coefficient sl is equal or close to 0, such an example is considered as noise; if sl is close to k, then this example may be located in a safe region of the minority class. The idea is to direct the generation of new synthetic examples close to safe regions.
On the other hand, Batista et al (2004) proposed a methodology that combines SMOTE and data cleaning, with the aim of reducing the possible overlapping introduced when the synthetic examples from the minority class are generated. In order to create well-defined classes, after over-sampling the minority class by means of SMOTE, the Wilson's editing algorithm (Wilson, 1972) is applied to remove any example (either positive or negative) that is misclassified by its three nearest neighbours. This method is here called SMOTE+WE.
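The cleaning step can be illustrated with a short Python sketch of Wilson's editing (the function name is hypothetical; squared Euclidean distance is assumed):

```python
def wilson_editing(data, labels, k=3):
    """Wilson's editing: keep only the examples whose label agrees with the
    majority vote of their k nearest neighbours; drop the rest as noise."""
    keep = []
    for i, x in enumerate(data):
        others = sorted((j for j in range(len(data)) if j != i),
                        key=lambda j: sum((a - b) ** 2 for a, b in zip(data[j], x)))
        votes = [labels[j] for j in others[:k]]
        if votes.count(labels[i]) > k // 2:  # correctly classified by its k-NN
            keep.append(i)
    return keep
```

In SMOTE+WE this filter runs after SMOTE and may remove both positive and negative examples, so the final class distribution is only approximately balanced.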

Under-sampling
Random under-sampling aims at balancing the data set through the random removal of examples from the majority class. Despite its simplicity, it has empirically been shown to be one of the most effective resampling methods. However, the major problem of this technique is that it may discard data potentially important for the prediction process. In order to overcome this limitation, other methods have been designed to provide a more intelligent selection strategy. For example, Kubat and Matwin (1997) proposed the One-Sided Selection technique (OSS), which selectively removes only those negative instances that are redundant or noisy (majority class examples that border the minority class). The border examples are detected by using the concept of Tomek links (Tomek, 1976), whereas the redundant cases (those that are distant from the decision boundary) are discovered by means of Hart's condensing (Hart, 1968).
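Random under-sampling itself reduces to a single sampling call; a minimal Python sketch (illustrative names, not the paper's implementation):

```python
import random

def random_undersample(majority, minority, seed=0):
    """Balance the data set by randomly discarding majority-class examples
    until both classes have the same size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority
```

The discarded examples are simply lost to the learner, which is the information-loss drawback discussed above; OSS and NCL instead try to choose which negatives to discard.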
Laurikkala (2001) introduced a new algorithm called Neighbourhood CLeaning rule (NCL) that operates in a similar fashion as OSS. In this case, Wilson's editing is used to remove majority class examples whose class label differs from the class of at least two of its three nearest neighbours. Besides, if a positive instance is misclassified by its three nearest neighbours, then the algorithm also eliminates the neighbours that belong to the majority class.
A quite different alternative corresponds to under-Sampling Based on Clustering (SBC) (Yen and Lee, 2006), which rests on the idea that there may exist different clusters in a given data set, and each cluster may have distinct characteristics depending on the ratio of the number of minority class examples to the number of majority class examples in the cluster. Thus the SBC algorithm first gathers all examples in the data set into some clusters, and then determines the number of majority class examples that will be randomly picked up. Finally, it combines the selected majority class instances and all the minority class examples to obtain a resampled data set.
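A simplified Python sketch of the SBC selection step, assuming the clusters have already been computed; note that this version allocates each cluster a quota proportional to its share of majority examples, whereas the original algorithm weights each cluster by its majority-to-minority ratio:

```python
import random

def sbc_select(maj_by_cluster, n_minority, m=1, seed=0):
    """Pick about m * n_minority majority-class examples, cluster by cluster,
    giving each cluster a quota proportional to its majority-class size."""
    rng = random.Random(seed)
    total = sum(len(v) for v in maj_by_cluster.values())
    target = m * n_minority  # desired number of majority examples to keep
    selected = []
    for idxs in maj_by_cluster.values():
        quota = round(target * len(idxs) / total)
        selected += rng.sample(idxs, min(quota, len(idxs)))
    return selected
```

The selected negatives are then merged with all the positives to form the resampled training set.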

Experiments
The aim of the experiments carried out here is to evaluate the performance of different under- and over-sampling algorithms and to investigate to what extent the behaviour of each technique is affected by different levels of imbalance. We also analyse the suitability of each resampling method as a function of the type of classifier when addressing the class imbalance problem. To this end, both statistical and artificial intelligence prediction models will be compared.
The resampling algorithms used in the experiments are the over-sampling and under-sampling techniques described in the previous section, that is, random over-sampling (ROS), SMOTE, SL-SMOTE, SMOTE+WE, random under-sampling (RUS), OSS, NCL and SBC. The classification methods correspond to two well-known models suitable for credit scoring: logistic regression (logR) and support vector machine (SVM) with a linear kernel. All resampling techniques and both prediction models have been implemented with the KEEL software (Alcalá-Fdez et al, 2009), using their default parameter settings.

Description of the experimental databases
Five real-world credit data sets have been taken to test the performance of the strategies investigated in the present paper. The widely used Australian, German and Japanese data sets are from the UCI Machine Learning Database Repository (http://archive.ics.uci.edu/ml/). The UCSD data set corresponds to a reduced version of a database used in the 2007 Data Mining Contest organized by the University of California San Diego and Fair Isaac Corporation. The Iranian data set (Sabzevari et al, 2007) comes from a modification to a corporate client database of a small private bank in Iran.
As we are interested in analysing the impact of different levels of class imbalance on resampling and classification algorithms, each original set has been altered by randomly under-sampling the minority class in order to construct six data sets with varying imbalance ratios (the ratio of the number of minority class examples to the number of majority class examples), iRatio = {1:4, 1:6, 1:8, 1:10, 1:12, 1:14}. Table 1 reports a summary of the main characteristics of the benchmarking data sets. As can be seen, the Iranian data set has not been modified because of its extremely high imbalance ratio, and it may be interesting to study the behaviour of the resampling techniques under this hard condition. In total, 25 data sets have been obtained for the experiments.
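The derivation of the altered sets can be sketched as follows (an illustrative Python fragment, not the authors' exact procedure):

```python
import random

def make_imbalanced(positives, negatives, ratio, seed=0):
    """Randomly under-sample the minority (positive) class so the data set
    reaches an imbalance ratio of roughly 1:ratio."""
    rng = random.Random(seed)
    n_pos = max(1, len(negatives) // ratio)
    return rng.sample(positives, min(n_pos, len(positives))), negatives

# eg, a 1:6 version of a set with 100 defaulters and 300 non-defaulters
pos, neg = make_imbalanced(list(range(100)), list(range(100, 400)), ratio=6)
```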

Experimental protocol
The standard way to assess credit scoring systems is to use a holdout sample, since large sets of past applicants are usually available. However, there are situations in which data are too limited to build an accurate scorecard and therefore other strategies have to be used in order to obtain a good estimate of the classification performance. The most common way around this is cross-validation (Thomas et al, 2002, Ch. 7).
Accordingly, a five-fold cross-validation method has been adopted for the present experiments: each data set in Table 1 has been randomly divided into five stratified parts of equal (or approximately equal) size. For each fold, four blocks have been pooled as the training data, and the remaining part has been employed as an independent test set. Ten repetitions have been run for each trial, giving a total of 50 pairs of training and test sets. Each resampling technique has been applied to each training set, thus obtaining the resampled data sets that have then been used to build the prediction models (logR and SVM). The non-preprocessed training sets have also been employed for model construction. The results from classifying the test samples have been averaged across the 50 runs.
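The stratified partitioning can be sketched in Python (an illustration; the actual experiments used the KEEL software):

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=5, seed=0):
    """Split example indices into n_folds parts of (approximately) equal size
    while preserving the class proportions within every fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_folds)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):  # deal the shuffled indices round-robin
            folds[pos % n_folds].append(i)
    return folds
```

For each of the five folds, the other four serve as training data; note that resampling is applied only to the training part, never to the test fold.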

Evaluation criteria
Standard performance evaluation criteria in the field of credit scoring include accuracy, error rate, Gini coefficient, Kolmogorov-Smirnov statistic, mean squared error, area under the ROC curve, type-I error and type-II error (Thomas et al, 2002; Yang et al, 2004; Hand, 2005; Abdou and Pointon, 2011). For a two-class problem, most of these metrics can be easily derived from a 2 × 2 confusion matrix like that given in Table 2, where each entry (i, j) contains the number of correct/incorrect predictions. For consistency with previous works on performance measures, the positive and negative classes correspond to bad and good applicants (or credit risk), respectively.

Most credit scoring applications employ the accuracy (or the error rate) as the criterion for performance evaluation. It represents the proportion of correctly (or wrongly) classified cases (good and bad) on a particular data set. However, empirical and theoretical evidence shows that this measure is strongly biased with respect to data imbalance and proportions of correct and incorrect predictions (Provost and Fawcett, 1997). Besides, the accuracy ignores the cost of different error types (bad applicants being predicted as good, or vice versa).
To deal with the class imbalance problem in credit scoring applications, the area under the ROC curve (AUC) has been suggested as an appropriate performance evaluator without regard to class distribution or misclassification costs (Baesens et al, 2003) and accordingly, this has been the evaluation measure adopted for the experiments. For a binary problem, the AUC defined by a single point on the ROC curve is also referred to as balanced accuracy (Sokolova and Lapalme, 2009):

AUC = (sensitivity + specificity) / 2

where sensitivity = TP/(TP + FN) measures the percentage of positive examples that have been predicted correctly, whereas specificity = TN/(TN + FP) corresponds to the percentage of negative instances predicted as negative.
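In code, this single-point AUC is one line (the confusion-matrix values below are hypothetical):

```python
def balanced_accuracy(tp, fn, tn, fp):
    """AUC for a single ROC point: the mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)  # true-positive rate (bad applicants caught)
    specificity = tn / (tn + fp)  # true-negative rate (good applicants kept)
    return (sensitivity + specificity) / 2

# eg, 50 of 100 defaulters caught, 90 of 100 non-defaulters kept -> 0.70
score = balanced_accuracy(tp=50, fn=50, tn=90, fp=10)
```

Unlike the plain accuracy, this measure gives both classes equal weight, so a classifier that labels everything as the majority class scores only 0.5.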

Statistical significance tests over multiple data sets
Probably the most common way to compare two or more classifiers over various data sets is the Student's paired t-test, which checks whether the average difference in their performance over the data sets is significantly different from zero. However, this appears to be conceptually inappropriate and statistically unsafe because parametric tests rely on the usual assumptions of independence, normality and homogeneity of variance, which are often violated due to the nature of the problems (Demšar, 2006; Zar, 2009; García et al, 2010). In general, non-parametric tests should be preferred over parametric ones because they do not assume normal distributions or homogeneity of variance. In this work, we have adopted the Friedman test to determine whether there exist significant differences among the strategies. The process starts by ranking the algorithms for each data set independently according to the AUC results: as there are nine competing strategies, the ranks for each data set will be from 1 (best) to 9 (worst). Then the average rank of each algorithm across all data sets is computed. Under the null hypothesis, which states that all strategies are equivalent and so their average ranks should be equal, the Friedman statistic is distributed according to the chi-square distribution with K - 1 degrees of freedom, K being the number of algorithms.
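The statistic itself is easy to compute from the average ranks (a Python sketch consistent with the description above):

```python
def friedman_statistic(avg_ranks, n):
    """Friedman chi-square from the average rank of each of the K strategies
    over N data sets; under the null hypothesis it follows a chi-square
    distribution with K - 1 degrees of freedom."""
    k = len(avg_ranks)
    return 12.0 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks)
                                       - k * (k + 1) ** 2 / 4.0)
```

When every strategy has the same average rank the statistic is exactly zero, and it grows as the rankings diverge across data sets.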
The Friedman test can only detect significant differences over the whole set of comparisons. For this reason, if the null hypothesis of equivalence of average ranks is rejected, we can then proceed with a post hoc test. In particular, the Nemenyi test, which is analogous to the Tukey test for ANOVA, states that the performances of two algorithms are significantly different if their average ranks differ by at least the critical difference (CD) at a given level of significance (α):

CD = q_α √(K(K + 1) / (6N))

where N denotes the number of data sets and q_α is a critical value based on the Studentized range statistic divided by √2 (Hochberg and Tamhane, 1987; Demšar, 2006).
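A Python sketch of the critical difference; the critical value q_0.05 = 3.102 for nine groups is taken from standard Studentized-range tables, and with it the formula reproduces the CD values reported in the results section:

```python
import math

def nemenyi_cd(q_alpha, k, n):
    """Critical difference for the Nemenyi post hoc test over K algorithms
    compared across N data sets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# K = 9 strategies, N = 12 low/moderate-imbalance data sets -> CD of about 3.468
cd_low = nemenyi_cd(3.102, 9, 12)
# K = 9 strategies, N = 13 highly imbalanced data sets -> CD of about 3.332
cd_high = nemenyi_cd(3.102, 9, 13)
```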

Results and discussion
To better understand the effect of the class imbalance ratio on the performance of the eight resampling algorithms using the logR and SVM models, we have divided the data sets into two groups: strongly imbalanced databases (iRatio of 1:10 or higher, ie, 1:10, 1:12 and 1:14) and those with a low/moderate imbalance (below 1:10, ie, 1:4, 1:6 and 1:8).

Low/moderate imbalance ratio
Tables 3 and 4 report the AUC values for the data sets with a low/moderate imbalance ratio when using the resampling techniques with the logistic regression and SVM classification models, respectively. The Friedman average ranking of the algorithms (K = 9) over the data sets (N = 12) at a significance level of α = 0.05 is also provided for each classifier, showing that the prediction results using the resampled sets are better than those with the original imbalanced data (except for the SBC algorithm, which achieves the worst AUC values independently of the classifier used). In general, the over-sampling algorithms outperform the under-sampling techniques, as can be seen by analysing either the average rankings or the AUC of each algorithm over each data set. The best resampling methods correspond to SMOTE+WE, ROS and SMOTE, both with the logistic regression and the SVM classifiers. It is also interesting to note that these algorithms usually perform better than the original imbalanced data (without resampling) even at a higher imbalance ratio; for example, in Table 3 the AUC using SMOTE over German (1:8) is 0.733, whereas the AUC over the original German (1:4) is 0.611. In many cases, this effect also appears when comparing over-sampling and under-sampling.
When comparing the results given by logR in Table 3 with those of SVM in Table 4, it seems that the logistic regression model consistently performs better than the SVM approach, independently of the imbalance ratio. This finding is in agreement with the conclusions drawn in some previous studies (Baesens et al, 2003; Xiao et al, 2006; Kennedy et al, 2010). A Nemenyi post hoc test (α = 0.05) has also been applied to report any significant differences between all pairs of algorithms. The results of this test are depicted by significance diagrams (Lessmann et al, 2008), plotting the Friedman average ranks and the critical difference tail. The diagram plots the resampling algorithms against their average rankings, with all methods sorted according to their ranks. The line segment to the right of each algorithm represents its critical difference (in this case, CD = 3.468). The vertical dotted line marks the end of the best performing method's critical difference tail: all algorithms to the right of this line perform significantly worse than the best method. Figure 1(a) displays the significance diagram for the logR model, where the best resampling technique has been SMOTE+WE with an average rank of 2.333. As can be seen, this method is significantly better than using the original imbalanced data set or any under-sampling algorithm; only the NCL under-sampling approach is not significantly worse than the best performing technique. Note that even the random over-sampling algorithm, with an average rank of 2.667, is significantly better than the imbalanced data set, OSS and SBC.
In the case of the SVM, Figure 1(b) clearly shows that only the results of using the imbalanced data and the SBC method are significantly worse than those given by the best performing algorithm (random over-sampling with an average rank of 3.500). From this, it seems that the use of a linear kernel SVM produces non-significant differences in performance among most resampling techniques.

High imbalance ratio
Tables 5 and 6 provide the AUC values for the highly imbalanced data sets when applying the logR and SVM prediction models, respectively. The Friedman average ranking of the strategies (K = 9) over the data sets (N = 13) has also been included for each classifier. As can be seen, both under-sampling and over-sampling methods outperform the original imbalanced data set independently of the classifier used.
In the case of the logistic regression model, all over-sampling algorithms perform better than the under-sampling techniques. The best performing approach corresponds to SMOTE+WE with an average rank of 1.846, followed by SMOTE with 2.423 and ROS with 3.385. Although the under-sampling methods perform worse than any over-sampling algorithm, it is worth pointing out that they still improve the AUC values achieved when classifying with the original imbalanced data set (this is the strategy with the highest, ie worst, average rank).
Focusing on the results of the SVM classifier in Table 6, one can observe that SMOTE, ROS and SMOTE+WE are the best approaches, with average ranks of 3.039, 3.346 and 3.423, respectively. In this case, the random under-sampling algorithm appears to be as good as those over-sampling strategies, with an average rank of 3.346. Once again, the SBC technique and the use of the imbalanced data set without any preprocessing correspond to the options with the highest (worst) average ranks (7.539 and 8.077, respectively).
If we now analyse the results obtained for the highly imbalanced data sets and those of a low/moderate imbalance ratio in Section 5.2, it is possible to notice that the best solution to the class imbalance problem consistently corresponds to over-sampling, independently of employing a statistical model or an artificial intelligence technique.
As with the data sets with a low/moderate ratio, a Nemenyi post hoc test (α = 0.05) has been applied to report any significant differences between all pairs of algorithms, again depicted by significance diagrams, here with a critical difference value of 3.332. Figure 2(a) shows the significance diagram for the logistic regression model, where the SMOTE+WE technique proves to be significantly better than using any under-sampling algorithm or the original imbalanced data set. The remaining over-sampling algorithms are also significantly better than OSS, SBC and the imbalanced sets.
For the SVM, Figure 2(b) shows that the differences among the resampling strategies are less marked than with logR. Nonetheless, one can see that five methods (SMOTE, ROS, RUS, SMOTE+WE, SL-SMOTE) perform significantly better than SBC and the original imbalanced sets.

Conclusions
This paper has studied a number of resampling techniques for statistical and computational intelligence prediction models when addressing the class imbalance problem. The performance of these methods has been assessed by means of the AUC (balanced accuracy) measure, and then the Friedman statistic and the Nemenyi post hoc test have been applied to determine whether the differences between the average ranked performances were statistically significant. In order to better illustrate these statistical differences, the significance diagram for each classifier has been analysed.
The experiments carried out over real-world data sets with varying imbalance ratios have demonstrated that resampling can be an appropriate solution to the class imbalance problem in credit scoring. The results have also shown that over-sampling outperforms under-sampling in most cases, especially with the logistic regression prediction model, for which the Nemenyi test has revealed more significant differences. Another interesting finding is that the resampling approaches have produced similar gains in performance regardless of the imbalance ratio.
In credit scoring applications, even a small increase in performance may result in significant future savings and have important commercial implications (Henley and Hand, 1997). Taking this into account, the improvement in performance achieved by the resampling strategies may become of great importance for banks and financial institutions. Therefore, it seems strongly advisable to address the class imbalance problem (preferably by means of an over-sampling technique) before building the prediction model.