Exploring the synergetic effects of sample types on the performance of ensembles for credit risk and corporate bankruptcy prediction

Credit risk and corporate bankruptcy prediction has widely been studied as a binary classification problem using both advanced statistical and machine learning models. Ensembles of classifiers have demonstrated their effectiveness for various applications in finance using data sets that are often characterized by imperfections such as irrelevant features, skewed classes, data set shift, and missing and noisy data. However, there are other corruptions in the data that might hinder the prediction performance, mainly on the default or bankrupt (positive) cases, where the misclassification costs are typically much higher than those associated with the non-default or non-bankrupt (negative) class. Here we characterize the complexity of 14 real-life financial databases based on the different types of positive samples. The objective is to gain some insight into the potential links between the performance of classifier ensembles (BAGGING, AdaBoost, random subspace, DECORATE, rotation forest, random forest, and stochastic gradient boosting) and the positive sample types. Experimental results reveal that the performance of the ensembles indeed depends on the prevalent type of positive samples.


Introduction
In response to the 2008 global financial crisis, banks and regulatory agencies have increased their efforts to streamline processes and increase efficiency in the prediction and proactive management of credit risk, financial distress and corporate bankruptcy. Classical studies on this subject were initially based on advanced statistical models [1,2,3,4,5], such as logistic regression, probit analysis, linear discriminant analysis, survival analysis, linear and quadratic programming, and multivariate adaptive regression splines. Nevertheless, empirical results have shown that most underlying assumptions of these statistical approaches, such as multivariate normality and independence of the explanatory variables, are frequently violated [6,7].
Unlike the statistical models, machine learning and computational intelligence methods do not assume any specific prior knowledge, but instead they automatically extract information from past observations. These are represented by a set of explanatory variables, which usually correspond to financial ratios, macroeconomic indicators and socio-demographic characteristics, either straightforwardly represented as continuous variables or discretized as qualitative information.
Support vector machines [8,9,10], genetic and evolutionary algorithms [11,12,13], artificial neural networks [14,15,16,17,18], rough sets [19,20,21], and decision trees [22,23] have received much attention and widespread application in the field of finance and, more specifically, in the prediction of credit risk, financial distress and corporate bankruptcy. Although numerous previous studies concluded that machine learning techniques are superior to statistical models, it has been argued that no single classifier can produce the best results in all cases. From this conclusion, ensembles emerged as a powerful tool for exploiting the different behavior of a pool of individual (base) learners and reducing prediction errors in several financial applications. In fact, practical investigations have demonstrated that ensembles generally outperform stand-alone prediction methods in most credit risk and corporate bankruptcy prediction problems [24,25,26,27]. However, extensive research has also shown the strengths and weaknesses of classifier ensembles against a diversity of intrinsic data characteristics, which can make the prediction of the positive cases much more difficult; for instance, one can find studies on class imbalance [28], attribute noise [29], and data set shift [30], among others.
Since the error rate of default or bankrupt (positive) cases is of great importance for credit risk and corporate bankruptcy assessment, it could be useful to carry out a proper analysis on how the presence of samples of different nature in the positive class may affect the predictive performance of classifier ensembles. However, as far as we are aware, no previously reported study has systematically analyzed this problem in the framework of finance.
Therefore, considering the particular characteristics of financial data, the ultimate aim of this paper is to characterize the databases according to the prevalent type of samples in the minority class and to explore the potential links between the performance of classifier ensembles and the different types of data sets. To this end, the experiments consist of characterizing 14 credit and bankruptcy data sets according to the positive sample types, and analyzing whether or not there exists any correlation between these types and the performance of several prediction systems based upon seven well-established ensembles built with three different base classifiers. As the number of positive samples is usually far smaller than the number of negative samples, which leads to the well-known class imbalance problem, we cannot neglect this scenario when discussing the experimental results.
The rest of the paper is organized as follows. Section 2 reviews some research works related to the use of ensembles to deal with various intrinsic data characteristics in the field of credit risk and corporate bankruptcy prediction. Next, Section 3 provides a categorization of the types of samples that can be found in a data set. The experimental set-up, databases and classifier ensembles are given in Section 4, whereas the results are reported and discussed in Section 5. Finally, the conclusions and possible avenues for further research are outlined in Section 6.

Intrinsic financial data characteristics in ensembles
The development of classifier ensembles for credit risk, financial distress and corporate bankruptcy prediction has attracted increasing attention from both researchers and practitioners in recent years. Many works have shown the superiority of ensembles over single classifiers, whereas others have proposed new algorithms as alternatives to existing ones. However, only a few works have studied the behavior of ensembles when learning from data sets with several intrinsic data characteristics. Here we summarize some of the most recent publications on this topic, although this is not intended to be a thorough review.
Das et al. [31] proposed that intrinsic data characteristics can be categorized into two groups: 1) distribution-based data irregularity (DistBI), and 2) feature-based data irregularity (FeatBI). The former involves the class imbalance problem, small disjuncts, and class distribution skew, whereas the latter includes missing and absent features. We consider that the first group might also cover other problems, such as outliers, noisy data, small data set size, and data set shift. Analogously, we believe that the second group might also include noisy, irrelevant and redundant features. Taking this taxonomy into account, the Venn diagram in Figure 1 shows the relationship between both categories and the number of works in each group (see Table A.4 of Appendix A for more detailed information). As can be seen, a majority of works have focused on the distribution-based data irregularities, where class imbalance appears as the most studied problem. Only three works have dealt with the feature-based data irregularities, whereas five of them addressed both intrinsic data characteristics.

Distribution-based data irregularity
The class imbalance problem has been considered a challenging task in a broad scope of financial problems. Imbalanced data sets have become increasingly common in recent years, and several works [32,33] have studied the performance of many different ensembles on data sets with this intrinsic data characteristic.
Feng et al. [34] presented a dynamic ensemble model based on soft probability, where classifier selection was based on accuracy, precision and the different costs of type I and type II errors. Experimental results showed that the proposed model outperforms BAGGING and random forest on several imbalanced credit data sets. In the same line, Xiao et al. [35] combined the dynamic classifier selection method with a cost-sensitive evaluation criterion. He et al. [36] introduced a cascade model that resamples the credit scoring data sets according to their imbalance ratio and a threshold; each adjusted data set is used for training several random forests and extreme gradient boosting models as base classifiers. Sun et al. [37] proposed an ensemble for imbalanced credit evaluation based on the SMOTE algorithm and the BAGGING technique with different sampling rates. Wang et al. [38] combined the Lasso-logistic regression model with the BAGGING approach, where the latter was applied to the minority class to generate balanced training data sets. Sun et al. [39] combined SMOTE with the BAGGING algorithm using a support vector machine (SVM) as base learner. While all these works incorporate the imbalance solutions into the ensemble, Louzada et al. [40] developed a new BAGGING algorithm, called Poly-BAGGING, where the resampling technique was not considered part of the ensemble approach.
Xia et al. [41] designed a heterogeneous ensemble credit scoring model by integrating the BAGGING algorithm with the stacking method; although the model was not designed for class imbalance problems, it performed well on moderately imbalanced data sets. Yu et al. [42] developed a three-stage ensemble model for dealing with class imbalance problems using BAGGING, SVM and a deep belief network. Abellán and Castellano [33] showed that an ensemble built with the credal decision tree performs better than others based on more complex base learners trained on balanced and imbalanced data sets. Ala'raj and Abbod [43] introduced a new combination approach based on classifier consensus that creates a ranking group as a fusion of individual classifiers; experimental results showed that the consensus model achieves better performance in terms of the H-measure on highly imbalanced data sets. Florez-Lopez and Ramon-Jeronimo [44] developed a novel ensemble technique that follows a three-stage structure called the correlated-adjusted decision forest; empirical results revealed its suitability for imbalance problems in terms of type I and type II errors. Kim et al. [45] proposed the geometric mean based boosting algorithm, which is a modification of AdaBoost using the concepts of geometric error and accuracy calculation. Zięba et al. [46] used extreme gradient boosting where each base learner was constructed using synthetic random features, with the aim of dealing with class imbalance and small data set size. Li et al. [47] proposed a three-stage ensemble framework where, in the first level, several perceptrons were used as base learners; in the middle level, a relevance vector machine was used to train weak learners; and at the top of the framework, a boosting algorithm was employed. The authors suggested that their proposal is suitable when the data set is imbalanced and contains noisy data.
The data set shift problem occurs when the training and test data come from different distributions. To deal with this problem, Xiao et al. [30] proposed to use transfer learning into an ensemble model.

Feature-based data irregularity
Twala [48] performed an analysis on the behavior of several ensemble models when the data set shows different levels of attribute noise. The experimental results suggested that the impact of noise depends upon the classifier and the proportion of noise.
To eliminate irrelevant and redundant features, Muslim et al. [49] combined split feature reduction and BAGGING. Xia et al. [50] introduced a sequential extreme gradient boosting model that incorporates a preprocessing step to scale the data and handle missing values; in addition, a feature selection system was used to remove redundant variables. Koutanaei et al. [51] used feature selection algorithms as a first stage to remove noisy attributes, and the reduced data sets were then fed to AdaBoost, BAGGING, random forest, and stacking. Wang et al. [52] introduced a feature selection algorithm into boosting to deal with irrelevant features.

Intersection between DistBI and FeatBI
Intrinsic data characteristics do not occur in isolation. Ala'raj and Abbod [53] used two preprocessing techniques, Gabriel neighborhood graph editing and multivariate adaptive regression splines, to reduce the size of the data set by filtering samples and choosing the most relevant features; both algorithms were combined with a consensus ranking approach. Liao et al. [54] introduced an ensemble model with majority vote that combines SVM, multiple feature selection, an artificial neural network (ANN), and rough set theory (RST); the SVM model was used to balance the training set, followed by a multiple feature selection algorithm to select the most representative features. To deal with noisy data and the class imbalance problem, Li et al. [47] proposed a relevance vector machine ensemble model that employs soft margin boosting. Wang et al. [55] introduced a two-stage ensemble model based on decision trees, BAGGING and random subspace to deal with noisy data and redundant attributes. Paleologo et al. [56] proposed a sub-BAGGING algorithm where the base learners were generated by random sub-sampling in order to handle the class imbalance problem; besides, an imputation method integrated into the ensemble model was used to handle missing data.

Types of samples
When analyzing the characteristics of a data set, an important question that deserves special attention is the identification of the different types of samples. This identification can be particularly useful to support interpretations of differences in the performance of classifiers because many data complexity factors are linked to the distribution of sample types in a data set [57,58].
According to the categorization proposed by several authors, two main types of samples can be distinguished: safe and unsafe [59,60,61]. Safe samples refer to those placed in homogeneous regions with data of a single class and are sufficiently separated from examples of the other class, whereas the remaining samples are deemed as unsafe. Most models classify the safe samples correctly, but the unsafe samples may make their learning especially difficult and more likely to be misclassified.
The property common to the unsafe samples is that they are located close to examples that belong to a different class. However, unsafe samples can be further divided into three subgroups depending on their particular characteristics: borderline, rare and outlier [60,62]. Borderline samples are located near the decision boundary between classes. Rare samples are small groups of examples located far from the core of their class, creating small data chunks or sub-clusters. Finally, outliers are single samples that are surrounded by examples from the other class.
A simple method to identify each sample type is based on analyzing the local neighborhood of the examples [60,61], which can be modeled either by their k-neighborhood or by using a kernel function. Thus, a safe sample is characterized by having a neighborhood dominated by examples that belong to its same class. Rare examples and outliers are mainly surrounded by examples from different classes, whereas the borderline samples are surrounded by examples both from their same class and also from a different class.
Following the standard strategy used in prior works [58,60,61,63], we determine the type of a sample s by comparing the number of its k nearest neighbors (with a constant value of k = 5) that belong to the class of s with the number of neighbors from the opposite class. Most authors choose k = 5 because smaller values may poorly distinguish the nature of examples and higher values would violate the local neighborhood assumption. Thus we can find the following cases:
• A sample s is considered to be safe if at least 4 out of the 5 nearest neighbors belong to the class of s.
• A sample s is considered to be borderline if 2-3 out of its 5 nearest neighbors belong to the class of s.
• A sample s is considered to be rare if only one nearest neighbor belongs to the class of s, and this has no more than one neighbor from its same class.
• A sample s is considered to be outlier if all its nearest neighbors are from the opposite class.
This method has been proposed for the identification of the different sample types in the minority class, which is especially relevant when the class distribution is imbalanced. Note that in such a situation, the percentage of each sample type belonging to the majority and minority classes may differ massively from each other. For instance, consider a credit data set where only 1% of samples are defaulters and 99% are non-defaulters; under these conditions, it is likely that most of the safe samples belong to the majority class and most of the unsafe samples are in the minority class, which may disguise the true distribution of sample types in the data set.
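As an illustration, the neighborhood-based labeling above can be sketched in a few lines of Python. This is a simplified first-order version: the second-order condition on the neighbors of rare samples is omitted for brevity, and the function and variable names are ours, not the paper's.

```python
import numpy as np

def sample_types(X, y, k=5):
    """Label each sample as safe/borderline/rare/outlier from its k-NN."""
    X, y = np.asarray(X, float), np.asarray(y)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)               # a sample is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]         # indices of the k nearest neighbors
    same = (y[nn] == y[:, None]).sum(axis=1)  # neighbors sharing the sample's class
    types = np.empty(len(y), dtype=object)
    types[same >= k - 1] = "safe"             # at least 4 of 5 same-class neighbors
    types[(same == 2) | (same == 3)] = "borderline"
    types[same == 1] = "rare"                 # simplified: no second-order check
    types[same == 0] = "outlier"
    return types
```

Restricting the labels to the minority class only yields per-class profiles of the kind discussed in Section 5.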

Databases and experimental set-up
The experiments were designed to explore the potential impact of the different sample types on the prediction performance of classifier ensembles over a collection of bankruptcy and creditworthiness data sets, which are summarized in Table 1. A 10-fold cross-validation procedure was adopted to avoid biased results [70]. Each data set was randomly split into ten stratified blocks (or folds) of equal size. In each round, nine blocks are used for training and the remaining block for testing. This is repeated ten times using a different block for testing each time, thus ensuring that all folds are employed for both training and testing.
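The stratified 10-fold protocol can be sketched with scikit-learn (the class names are scikit-learn's, and the data below are synthetic placeholders, not one of the 14 databases):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(200, dtype=float).reshape(100, 2)  # placeholder feature matrix
y = np.array([0] * 80 + [1] * 20)                # imbalanced labels, 20% positives

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
folds = list(skf.split(X, y))
# every test fold holds 10 samples and preserves the 80/20 class proportions
```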
The performance of the classifiers was evaluated with three standard scores that have typically been used in financial applications. First, the area under the ROC curve (AUC) is an overall performance measure that summarizes classifier behavior across all decision thresholds. Second, the true-positive rate (TPR) and true-negative rate (TNR) report performance on each class separately, thus taking care of the cost of different error types. Both are particularly meaningful for the kind of real-life applications faced in this paper because the cost of false negatives (predicting a default or bankrupt case as non-default or non-bankrupt) is often much higher than the cost associated with false positives (non-defaulters predicted as defaulters) [71].
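These three scores can be computed as follows (a sketch; the 0.5 decision threshold and the function name are our own illustrative choices):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_score, threshold=0.5):
    """Return AUC plus per-class rates for a vector of positive-class scores."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {"AUC": roc_auc_score(y_true, y_score),
            "TPR": tp / (tp + fn),    # true-positive rate (defaulters caught)
            "TNR": tn / (tn + fp)}    # true-negative rate (non-defaulters kept)
```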

Ensembles of classifiers
We have chosen seven standard ensembles to evaluate whether or not there exists any connection between their prediction performance and the sample types in the data sets: BAGGING (Bootstrap AGGregatING, Bag), AdaBoost (ABoost), stochastic gradient boosting (SGBoost), random subspace (RSP), rotation forest (RotF), random forest (RndF), and DECORATE (Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples, Decor).
The Bag technique [72] generates multiple bootstrap samples randomly drawn with replacement from the original training set. Next, each individual classifier is built for each sample, and predictions on new cases are made by combining the classification results using a majority voting policy.
Boosting [73] produces a sequence of base classifiers through successive bootstrap samples that are obtained by weighting the training data over a number of iterations. Initially, equal weights are assigned to all training examples; at each iteration, boosting increases the weights on the examples predicted incorrectly by the previous individual classifier, so that those misclassified examples are more likely to be chosen for the next bootstrap sample. Final decisions are based on a weighted majority voting scheme. Two of the most popular boosting algorithms are ABoost and SGBoost [74]; in the latter, at each iteration a subsample of the training data is drawn at random without replacement from the full training set and then used, instead of the full sample, to fit the base learner.
In the RSP method proposed by Ho [75], the base classifiers are trained on sets constructed with a given proportion of variables picked randomly from the original set of features. The outputs of the individual classifiers are then combined into a final decision rule through a simple majority voting procedure.
RotF [76] trains each base classifier with a different set of extracted variables. The original feature set is randomly split into a number of subsets, principal component analysis is run separately on each subset, and a new set of linear extracted variables is constructed by pooling all principal components. The data is transformed linearly into the new feature space, and the base classifier is trained with this new data set.
The RndF developed by Breiman [77] is an ensemble of decision trees, each built using a bootstrap sample of the training data, with the candidate set of variables at each split being a random subset of the features. Each tree is left unpruned so as to obtain low-bias trees; in addition, bagging and random variable selection result in low correlation among the individual trees.
The Decor algorithm [78] uses a base learner to build an ensemble iteratively by adding different randomly generated examples to the training set when building new ensemble members. These artificially generated examples are given class labels that disagree with the prediction of the current ensemble, thereby increasing diversity when a new classifier is trained on the augmented data and incorporated into the ensemble.
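Several of these ensembles have off-the-shelf counterparts in scikit-learn; the sketch below shows plausible instantiations, not the paper's exact configurations. Random subspace is approximated with BaggingClassifier using feature subsampling without bootstrap, SGBoost with GradientBoostingClassifier and subsample < 1; rotation forest and DECORATE have no scikit-learn implementation, and all parameter values are illustrative.

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)

# Default base learners for Bagging/AdaBoost are decision trees, loosely
# mirroring the C4.5-based configuration described in the text.
ensembles = {
    "Bag":     BaggingClassifier(n_estimators=100),
    "ABoost":  AdaBoostClassifier(n_estimators=100),
    "SGBoost": GradientBoostingClassifier(n_estimators=100, subsample=0.5),
    "RSP":     BaggingClassifier(n_estimators=100, bootstrap=False,
                                 max_features=0.5),   # random subspace
    "RndF":    RandomForestClassifier(n_estimators=100),
}
```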
The ensembles were built using different base classifiers that have been widely used in the financial industry: the unpruned C4.5 decision tree, the multi-layer perceptron (MLP) with one hidden layer, and the k nearest neighbors (kNN) rule. These base classifiers were chosen because some ensembles, such as bagging, are known not to work well with stable (low-variance and possibly high-bias) models. The hyperparameters of the MLP were tuned by withholding 50% of the training data as a validation set and testing learning rates and momentum values from 0.1 to 0.3. The kNN rule was optimized using leave-one-out on the training set to select the best k value between 1 and 30.
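The kNN tuning step can be sketched as a leave-one-out sweep over k (the function name is ours; scikit-learn is used for brevity):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def best_k(X, y, k_max=30):
    """Select k by leave-one-out accuracy on the training set alone."""
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X, y, cv=LeaveOneOut()).mean()
              for k in range(1, k_max + 1)}
    return max(scores, key=scores.get)   # the smallest k wins ties
```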

Characterization of the databases
To gain better insight into the structure of the classes and a deeper understanding of the data complexity, the data sets are here characterized according to the sample types introduced in Section 3 and the imbalance ratio reported in Table 1. This facilitates the subsequent analysis of the relationship between the predictive performance of classifier ensembles and the distribution of sample types in the data sets.
For each database, the percentage of samples that belong to each type was calculated for both the positive class ( Figure 2) and the negative class (Figure 3). The sample types were displayed as bar charts on the left y-axis of these graphs and the imbalance ratio was displayed on the right y-axis as a series plot. The data sets on the x-axis were sorted in ascending order of the imbalance ratio. Not surprisingly, comparison of both figures shows that the distribution of sample types in each class strongly depends on the imbalance ratio. Thus the amount of safe positive samples in the data sets with high imbalance was minimal (less than 3%), whereas the number of rare and outlier samples in the positive class increased with the imbalance ratio. On the other hand, for the highly imbalanced databases the proportion of safe samples in the negative class was very close to 100% and the percentage of unsafe samples was nearly 0%: the maximum amount of borderline and rare samples was for the Polish-5th database (4.69% and 0.27%, respectively), and that of outliers was for the Polish-1st database (0.03%).
These findings reinforce the idea that performing an accurate prediction is more complicated in the positive class than in the negative class because their proportions of safe and unsafe samples are very different, especially in the data sets with high imbalance; notwithstanding, some cases deserve further comment. For instance, both the Finland and SabiSPQ data sets are perfectly balanced (IR = 1.0), but the former has a lower amount of safe samples in the positive class than the latter (52.40% and 82.42%, respectively). Analogously, the Polish and Japanese databases, which are characterized by similar imbalance ratios, present very different percentages of safe samples in the positive class (17.86% and 39.19%, respectively). Even more interesting is the comparison between SabiSPQ (IR = 1.0) and Polish (IR = 1.14) because there exist large differences in the amount of safe and borderline samples in the minority class of each database. This suggests that, as pointed out in other research works [57,63,79], class imbalance is not the only problem that may degrade classifier performance; there are other intrinsic data characteristics that also hinder classification. Therefore, an analysis of the overall structure of the data can be of great relevance because it could provide some insight for choosing the most appropriate classifier ensemble depending on the distribution of sample types. From Figure 2, it is possible to categorize the experimental databases into five groups according to the prevalent type of samples in the minority class:
• Safe: The Australian, Finland and SabiSPQ databases, which contain more than 50% of safe samples.
• Outlier: All the high imbalanced data sets because more than 40% of examples have been characterized as outliers.
• Safe-borderline: The Japanese database has approximately the same amount of safe and borderline examples in the positive class, about 40% each one.
• Borderline-rare: The Taiwan database comprises a majority of borderline and rare samples, which represent close to 70% of the minority class: 37% of borderline samples and 32% of rare samples.
The last four groups refer to unsafe databases because less than 50% of their positive samples have been identified as safe. On the other hand, the last two categories correspond to data sets in which the positive samples are mainly placed between the safe and the borderline groups in the first case or between the borderline and the rare groups in the second one.

Results and discussion
We report the results obtained in the course of the experimental study. The aim is to investigate how the prevalent type of positive samples affects the performance of each ensemble model. In other words, the question to answer here is whether or not there exists any difference in performance of the classifier ensembles from one category of databases to another.
We have divided this section into two parts. First, we investigate the performance of the ensembles by comparing the safe databases against all the unsafe data sets (i.e., safe-borderline, borderline, borderline-rare and outlier data). The second part is devoted to analyzing the behavior of the ensembles for each category of unsafe databases.
As the values of AUC, TPR and TNR can be very different from one data set to another, the use of average scores across the databases could be inadequate. Instead, we calculated Friedman's average ranks for the classifier ensembles; note that the prediction model with the lowest average rank is the best-performing algorithm. In addition, the full set of results is provided in Tables B.5-B.10 of Appendix B.
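A sketch of this ranking computation follows (SciPy's rankdata assigns average ranks to ties, matching the Friedman convention; the score matrix and function name are illustrative):

```python
import numpy as np
from scipy.stats import rankdata

def friedman_average_ranks(scores):
    """scores: (n_datasets, n_models) matrix where higher is better.
    Models are ranked within each data set (rank 1 = best), then averaged."""
    per_dataset = np.vstack([rankdata(-row) for row in scores])
    return per_dataset.mean(axis=0)
```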

Performance analysis of the ensembles on the safe and unsafe data
The focus of the first block of experiments is on analyzing the possible differences in the behavior of the ensembles between the safe and the unsafe databases. To this end, Figure 4 displays the Friedman's average ranks of AUC for each classifier ensemble applied to both categories of data sets. The stand-alone classifiers were also included in these plots as a baseline. As can be observed, there exist significant differences between the safe data and the unsafe data, irrespective of the base classifier used to build the ensembles. In the case of the MLP-based models, the best ensembles were bagging and random subspace for the safe databases, and DECORATE and AdaBoost for the unsafe data sets. The behavior of AdaBoost is particularly striking because it was one of the best ensembles on the unsafe databases, but it performed even worse than the stand-alone MLP on the safe databases. Similar comments can be made for the kNN-based ensembles, in which RSP obtained the lowest average rank for the safe data sets but was the worst technique for the unsafe ones. For the C4.5-based ensembles, the random forest was the best performing method for both safe and unsafe data, but the remaining models showed a significantly different behavior when applied to safe or unsafe data sets.
As AUC is an overall, scalar performance measure, it can give rise to misleading conclusions when the cost of misclassifying examples in one class is very different from the cost of misclassifying examples in the other class, or when the class distribution is imbalanced [80,81,82]. In such cases, it is also especially important to evaluate the true-positive and true-negative rates. A high true-positive rate is the primary goal in credit risk and corporate bankruptcy prediction, but it should not compromise the correct classification of the majority class. To balance these two competing goals, a normalized Euclidean distance between each (average rank of TPR, average rank of TNR) pair and the ideal point (1, 1) was calculated and reported in Table 2. Using this measure, the best model for each type of data is the one that produces the smallest distance (highlighted in bold in Table 2). Another way of visualizing this consists of plotting the Friedman's average ranks of TPR versus the Friedman's average ranks of TNR in Figure 5 and looking for the point closest to the bottom left corner of the graphs; thus, the closer the ensemble is to the bottom left corner, the higher its performance on both classes. Note that this graph depicts relative trade-offs between TPR and TNR.
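The distance criterion can be written down explicitly. The paper does not spell out the normalization constant, so scaling by the largest attainable distance, sqrt(2)·(m − 1) for m competing models, is our assumption:

```python
import numpy as np

def distance_to_ideal(rank_tpr, rank_tnr, n_models):
    """Euclidean distance from an (avg rank TPR, avg rank TNR) pair to the
    ideal point (1, 1), scaled into [0, 1] by the maximum possible distance.
    The normalization constant sqrt(2) * (n_models - 1) is an assumption."""
    d = np.hypot(rank_tpr - 1.0, rank_tnr - 1.0)
    return d / (np.sqrt(2.0) * (n_models - 1))
```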
Although the ultimate objective of any classification system is to achieve high rates on both classes (that is, the ensembles with the smallest distance in Table 2), in general it will be preferable to maximize TPR rather than TNR. This means that the ensembles close to the left side of the charts will be considered better than the ensembles close to the bottom side. One can observe in Figure 5 that the best models were: (i) BAGGING for the safe databases and DECORATE for the unsafe ones in the MLP-based models; (ii) RSP for the safe data sets and BAGGING for the unsafe ones in the kNN-based ensembles; and (iii) random forest for the safe data and BAGGING, random forest and rotation forest (all three with nearly the same normalized distance) for the unsafe ones in the C4.5-based methods. As can be seen, some of these conclusions do not agree with those drawn from the AUC analysis because this measure can be biased towards the majority class.
In summary, the main conclusion from this first analysis is that there indeed exist differences in the performance of the ensembles depending on the prevalent type of data, that is, each particular classifier ensemble does not perform equally well on safe and unsafe databases. Therefore, the next step will be to explore the behavior of the ensembles on the different types of databases that belong to the general category of unsafe data in order to investigate the possible links between each type and the performance of the ensembles.

Performance analysis of the ensembles on the unsafe data
The purpose of the second block of experiments is to establish the best performing ensemble for each type of unsafe database when using each of the base classifiers. Thus Table 3 reports the normalized distance measure for all ensembles and the four types of unsafe data. In addition, each graph in Figures 6, 7 and 8 displays the Friedman's average ranks of TPR against those of TNR given by the MLP-based, kNN-based and C4.5-based ensembles, respectively.
Focusing on the results of the MLP-based ensembles in Table 3 and Figure 6, one can observe that the DECORATE algorithm achieved the highest performance (the smallest distance) for the safe-borderline databases, which correspond to the easiest type of unsafe data. BAGGING was the best ensemble for the borderline-rare databases. In the case of the outlier data sets (i.e., the most complex data structures), both DECORATE and random subspace were the techniques with the smallest normalized distance. On the other hand, the performance of AdaBoost was in general similar to that of the stand-alone MLP classifier, thus suggesting that this ensemble configuration is of little value for dealing with most types of unsafe data sets. In summary, it seems that both DECORATE and BAGGING can be claimed as the best overall MLP-based ensembles when there is a majority of unsafe samples in the positive class. With regard to the kNN-based methods, Table 3 indicates that RSP was the best performing algorithm for the safe-borderline data sets, BAGGING for the borderline data, DECORATE for the borderline-rare data, and AdaBoost for the outlier databases. However, inspection of the plots in Figure 7 reveals that BAGGING achieved a lower average rank of TPR than DECORATE for the borderline-rare data sets, thus suggesting that the former could be better than the latter. Similarly, BAGGING could also be considered better than AdaBoost for the outlier databases because it obtained a significantly lower average rank of TPR.
Finally, looking at the results obtained with the C4.5-based ensembles in Table 3, it is apparent that random forest was the most powerful model when the databases were characterized by a majority of safe-borderline, borderline or borderline-rare samples in the positive class. For these databases, Figure 8 shows that other ensembles achieved the best average ranks of TPR, but at the cost of producing significantly higher error rates on the negative class; for instance, AdaBoost obtained the lowest average rank of TPR on the borderline databases, but its average rank of TNR was much higher than that of random forest. In the case of the outlier databases, Table 3 indicates that the best ensembles were BAGGING and AdaBoost, but the results plotted in Figure 8 disclose that the latter performed much better on the positive class than the former. It is also worth pointing out that random forest applied to the outlier data sets gave very poor performance in terms of TPR, suggesting that this ensemble should not be used when a very large percentage of positive samples are outliers.

Concluding remarks and future work
This paper has addressed the problem of credit risk and bankruptcy prediction with classifier ensembles, pursuing to investigate whether or not there exists any potential difference in their performance due to the distribution of sample types in the positive class. The analysis on each category of databases has shown that the performance of any ensemble configuration indeed depends on the types of samples available in the data set. This finding can be especially useful when one has to decide which classifier to apply to a particular problem at hand, thus avoiding having to choose the most appropriate prediction model by a trial-and-error approach.
For future research, a natural extension to this work will consist of developing a meta-learning framework, viewed as a decision support tool that uses the characteristics of each database to design classification systems capable of achieving the highest performance. Another avenue for further research is to compare the performance of ensembles across different types of credit data sets (i.e., retail credit data versus corporate credit data) and investigate whether or not there exists any correlation with the performance of ensembles based on the sample types.

Appendix B. Full set of results
This appendix provides the results in terms of AUC, true-positive rate and true-negative rate achieved by each ensemble configuration over the databases included in the experiments. The first three columns in Tables B.5-B.7 are for the safe data sets, the following three are for the borderline ones, and the last column is for the safe-borderline database. The first seven columns in Tables B.8-B.10 are for the outlier data sets, and the last column is for the borderline-rare database.