An insight into the experimental design for credit risk and corporate bankruptcy prediction systems

In recent years, the finance and business communities have shown an increasing interest in application tools for the prediction of credit and bankruptcy risk, probably due to the need for more robust decision-making systems capable of managing and analyzing complex data. As a result, many techniques have been developed with the aim of producing accurate prediction models that are able to tackle these issues. However, the design of experiments to assess and compare these models has attracted little attention so far, even though it plays an important role in validating and supporting the theoretical evidence of performance. The experimental design should be done carefully for the results to hold significance; otherwise, it might be a potential source of misleading and contradictory conclusions about the benefits of using a particular prediction system. In this work, we review more than 140 papers published in refereed journals within the period 2000–2013, putting the emphasis on the bases of the experimental design in credit scoring and bankruptcy prediction applications. We provide some caveats and guidelines for the usage of databases, data splitting methods, performance evaluation metrics and hypothesis testing procedures in order to converge on a systematic, consistent validation standard.


Introduction
Credit risk and corporate bankruptcy prediction constitutes an application domain of major interest for banks and financial institutions because erroneous decisions may lead to very important costs (Horcher 2005). This is the reason why the development of a great variety of strategies to implement reliable prediction models has attracted considerable attention both from academicians and financial analysts over the last decades. These range from very traditional statistical techniques (e.g., weight of evidence, logistic regression, discriminant analysis, multivariate adaptive regression splines, probit analysis) to more sophisticated computational intelligence paradigms (e.g., neural networks, support vector machines, evolutionary computing, fuzzy algorithms, expert systems) and operations research methodologies (e.g., mathematical programming, multi-criteria decision making methods).
Prediction of credit risk and bankruptcy can be performed through the generation of models, which are usually based on a binary classification approach, in order to distinguish potential defaulters (bankrupters) from non-defaulters (non-bankrupters). From a practical point of view, classification refers to the assignment of a finite set of samples to predefined classes based on a number of observed variables or attributes (Thomas et al. 2002). For instance, the input of a credit scoring system may consist of a collection of historical information that describes socio-demographic characteristics and economic conditions of the applicant, and the classification model produces the output in terms of the customer creditworthiness.
Despite the growing interest in developing more accurate prediction models, the issue of how these models should be evaluated and their results thoroughly validated has not been investigated sufficiently so far. An example of this paradox is the considerable number of surveys that summarize the many techniques proposed in the literature and/or compare their performance results, but they do not concentrate on how the experiments have been designed. Just to cite a few recent examples, Crook et al. (2007) review a selection of statistical models, mathematical programming and soft computing techniques for consumer credit risk assessment. Ravi Kumar and Ravi (2007) present an extensive analysis of statistical and intelligent methods applied to the prediction of corporate bankruptcy risk in the period 1968-2005, highlighting the source of data, financial ratios and country of origin. Verikas et al. (2010) focus their review on how to combine different soft computing techniques to derive hybrid and ensemble-based bankruptcy prediction models. Lin et al. (2012) provide a statistical survey of machine learning papers published between 1995 and 2010 in the realm of credit scoring and bankruptcy prediction, summing up the data sets and comparing the performance of several methods with baseline classifiers. Abdou and Pointon (2011) present a literature review of works related to credit scoring applications in various areas, with the aim of investigating how this field has grown in importance over the last decades and also identifying the primary factors in the construction of a credit scoring model. Sadatrasoul et al. (2013) give a comprehensive review of studies where data mining techniques have been applied to credit scoring from 2000 to 2012.
Unfortunately, none of these surveys provides a deep insight into the process of experimentation and validation, even though it is widely accepted that the proper design of experiments constitutes a paramount factor to ensure a complete understanding and testing of the performance of the prediction models developed (Cohen 1995). At least four key components should be defined carefully in order to draw well-founded conclusions from the results: the experimental data, the data splitting methods, the performance evaluation metrics and the statistical tests of significance. Nevertheless, the configuration of these elements is often done in a blind manner within the experimental framework for the prediction of credit risk and bankruptcy.
This application area presents certain dominant characteristics that make the design of experiments especially critical and challenging, with a number of particularities that differ from other real-life applications in two aspects: first, some problems are recurrent and, second, they appear in combination with other complexities. The following list reports some of the most significant features of credit risk and corporate bankruptcy prediction.
- Data sets are typically characterized by a highly imbalanced class distribution with a scarcity of default observations, which is often referred to as the low-default portfolio problem.
Some of these characteristics should be carefully taken into consideration when designing the experiments because there is evidence that they may strongly affect the experimental results. For instance, the imbalanced nature of data in a credit risk application and the asymmetric misclassification costs require the use of performance evaluation metrics that are not biased towards the majority class. Also, the multiple (usually conflicting) criteria may give rise to contradictory predictions if an inappropriate single measure is used to evaluate the models. On the other hand, the size of data sets determines how to split the data, and this becomes even more important with a high imbalance ratio. In our opinion, these examples make clear the importance of keeping in mind the particularities of this application area in order to define a comprehensive experimental methodology.
Accordingly, this work conducts a systematic review of more than 140 papers published in refereed journals within the period between 2000 and 2013. The purpose of this survey is to study how the experiments have been designed and the results validated in the field of credit risk and corporate bankruptcy prediction. To this end, each of the four aforementioned experimental components will be analyzed, while discussing the limitations of the standard configurations used in current practice and providing suggestions to establish a more robust experimental methodology that can help authors enhance their studies. However, we would like to clarify that our analysis does not intend to criticize any previous research efforts.
The rest of this paper is organized as follows. Section 2 gives an overview of the research methodology we have adopted to conduct our investigation. Section 3 analyzes the experimental databases in terms of their sources and sizes. Section 4 discusses the data splitting methods and suggests how they should be applied in order to yield consistent results. Section 5 outlines the criteria used to measure the model performance and points out the adequacy of each one depending on the characteristics of the databases. The most common statistical methods used to test the significance of performance results are studied in Section 6. Two simple experimental scenarios are included in Section 7 in order to stress the different performance results and the conclusions that can be drawn from them depending on the experimental methodology adopted. Next, Section 8 offers a set of caveats and simple guidelines for better experimental design and validation of credit risk and corporate bankruptcy prediction models. Finally, Section 9 summarizes the main findings of our research.

Research methodology
Figure 1 illustrates an overall picture of the main steps involved in the research process followed for conducting our study. This process is based upon the suggestions given by Staples and Niazi (2007) and comprises two basic phases (each one with a sequence of steps): the definition stage to establish the purpose and the protocol for the research, and the development stage for collecting related papers, handling the relevant data, analyzing the results and drawing conclusions. It is worth noting that this process is not linear; rather, it requires iteration, feedback and refinement.
The definition stage consists of three steps. In the first one, we have to identify the need for new research to cover a gap in the domain of study. Afterwards, the definition of the general objectives allows us to better circumscribe the specific research to be undertaken, whereas the definition of the methodology aims at giving a formal and detailed protocol for the execution of the research. Both the identification of a gap and the definition of the objectives (the first two steps in Fig. 1) have already been addressed in the previous section, while the definition of the research methodology will be discussed next in this section.
The first step of the development stage employs a particular search strategy to retrieve an initial list of publications that may be relevant to the objectives. Nonetheless, this process needs further refinement in order to exclude some papers that do not fulfill the research requirements completely and include some others that may be of interest to our study. In particular, the present investigation on experimental design for credit scoring and corporate bankruptcy risk prediction was carried out by cross-searching for related journal papers with the support of eight comprehensive bibliographic databases: ISI Web of Science, Google Scholar, IEEE Xplore, SpringerLink, Scopus, Inspec, ScienceDirect, and ACM Digital Library. Conference papers were excluded from the initial list of studies because, in general, the empirical work in this kind of publication appears to be much less exhaustive than in journal articles due to the lack of space; therefore, their inclusion could give rise to erroneous conclusions. Besides, the reference section in each of the retrieved papers was also scanned to add some other relevant studies not included in the initial list of publications.
Fig. 1 Block diagram of the research process followed in this study
From the final list of 142 papers, the data extraction step was designed to collect data pertinent to the present study. For each article, we recorded the journal title and the year of publication, along with the databases, the data splitting procedures, the performance evaluation metrics and the hypothesis testing methods used in the experiments. Then all this information was organized in the form of a table to facilitate the computation of statistics and the analysis of results.
We collected papers on credit risk and bankruptcy prediction from more than 50 scientific journals, which are mostly related to the fields of management, operations research, information and computer science, economics, and finance. Table 1 summarizes the journals with at least four articles included in the present study, reporting the number of papers along with the proportions and cumulative proportions for each journal. As can be observed, eight journals account for almost 60 % of the total number of papers under review, but one should not overlook the remaining publications because some relevant results might otherwise be missed.

Databases
The first component that has to be chosen carefully in the experimental design is the data with which to perform the experiments. As soon as one starts to set up the experimental protocol, several questions regarding the number of databases to use, the data set size, or the type of variables arise. Therefore, one should take care of all these questions in order to define an appropriate configuration of the experiments with the aim of maximizing the significance of the results.
From the literature review carried out, we have mainly observed two significant trends regarding the data used for the experiments. First, several works have employed benchmarking databases such as the extremely well-known Australian and German data sets, which can be taken from the UCI Machine Learning Database Repository (Bache and Lichman 2013). Even though these data sets are among the most widely used for credit scoring and bankruptcy prediction, many other studies have experimented with private databases collected by several local financial institutions, which are generally intended to address a specific application problem. Each of these two options has its pros and cons. In the case of using benchmarking databases, the main advantage is that they allow future experimenters to make extensive comparisons between different prediction models; however, these data sets may not be representative enough of the current socio-economic conditions and hence the experiments may lead to outdated and worthless conclusions. Conversely, application-oriented databases are mainly intended to tackle some particular real-world problems, but they may be difficult to employ for further comparisons. Also, it is worth stressing that many studies with private data do not include a complete description of the variables that comprise the samples, and others do not even provide the database size (e.g. Pavlenko and Chernyak 2010), the number of variables (e.g. Pavlenko and Chernyak 2010; Ben-David and Frank 2009), or the proportion of samples that belong to each class of the data set (e.g. Galindo and Tamayo 2000; Hoffmann et al. 2002), thus making it difficult to understand in depth the merits (or faults) and procedural issues of each model.
Because of the shortcomings related to the individual usage of either public or private data, it will generally be better to employ a mixture of both benchmarking and application-oriented databases. Nevertheless, as can be seen in Fig. 2, only 13 % of the papers reviewed have involved both types of databases in their experiments. Of the remaining papers, nearly a quarter of the studies have focused only on benchmarking databases, whereas about two thirds have used only data gathered from their own sources.
As a final comment, it should be remarked that more than 68 % of the papers analyzed in this survey have limited their experiments to a single database (see Fig. 2); thus it is not possible to extrapolate any conclusions about the strengths and weaknesses of a prediction method to other data from different financial institutions. There is supporting empirical evidence that it is preferable to use several different data sets for model evaluation rather than a single database in order to draw significant and meaningful conclusions, but only 6.38 % of the papers have included five or more data sets in their experiments.

Data set size
An important characteristic of the databases that should be analyzed in depth refers to their size, which is determined by both the number of examples and the number of attributes or independent variables. A drawback common to most of the data sets used in these papers relates to the small sample size, which may produce a relatively high variance of any statistic calculated from them. In order to better characterize the databases, we have classified them into three categories based on the number of samples available: small size (less than 1,000 samples), medium size (1,000-10,000 samples) and large size (more than 10,000 samples). The papers reviewed have considered more than 110 different databases, with 59 being small, 40 medium and 13 large sized. Table 8 in Appendix A provides a brief description of the databases sorted in ascending order by the total number of samples (N), and also reports the number of attributes (D) and the references to the papers that have conducted experiments over each database. In a significant number of studies, it is possible to observe that most data sets consist of a very small number of samples.
Fig. 2 Percentages of papers (a) using benchmarking, application-oriented or a mixture of databases, (b) as a function of the number of databases used in the experiments
Figure 3 shows the number of papers per year as a function of the size of the data sets used in the experiments. It has to be noted that the Japanese, Australian and German databases have not been considered for this analysis because they have been employed extensively in many works and therefore their inclusion could distort any conclusions. As can be observed, there is a constant trend toward the use of small and medium sized databases across the period of study. However, it seems that experimentation with larger data sets has increased moderately over the last years.
On the other hand, the values of N and D in Table 8 suggest that there is no strong correlation between the number of variables and the sample size of the databases used in the reviewed papers. As a way of checking our claim, we have computed Pearson's correlation coefficient between these two features, which corroborates a low degree of correlation (r = 0.27). In fact, the number of attributes is in the range of 5 to 30 for 89 % of small, 82 % of medium and 46 % of large sized databases. While this number of variables (5-30) can be suitable for databases with more than 1,000 samples, it can become a hindrance for databases with a limited number of samples because, for a fixed sample size, the performance of a prediction model tends to decrease as the dimensionality increases. Therefore, it is important to examine the ratio of the number of samples to the number of variables because this can lead to problematic situations due to the so-called Hughes phenomenon (Hughes 1968), which states that the ratio of the number of samples to the number of attributes must be maintained at or above some minimum value to achieve accurate predictions. Although there is no strict guideline about what a sufficient data size is, Nagy (2004) claims that it should be around 10 × D × C, where C is the number of classes in a problem. Unfortunately, several databases included in Table 8 do not fulfill this rule, such as the Lithuanian database with 60 variables and 100 samples (Boguslauskas and Mileris 2009; Mileris 2010) or the Shanghai/Shenzhen Stock Exchange data with 30 variables and 153 samples (Li and Sun 2009).
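The 10 × D × C rule of thumb can be checked with a few lines of code. The following is only a minimal sketch: the function names are hypothetical, C = 2 is assumed because the problems discussed here are two-class, and the (N, D) pairs are the two examples quoted in the text.

```python
def min_samples_rule(n_attributes, n_classes=2):
    """Rule-of-thumb minimum sample size, 10 x D x C (function name is hypothetical)."""
    return 10 * n_attributes * n_classes


def satisfies_rule(n_samples, n_attributes, n_classes=2):
    """True when the data set has at least 10 x D x C samples."""
    return n_samples >= min_samples_rule(n_attributes, n_classes)


# (N, D) pairs quoted in the text; two classes are assumed in both cases.
examples = {
    "Lithuanian": (100, 60),
    "Shanghai/Shenzhen Stock Exchange": (153, 30),
}

for name, (n_samples, n_attributes) in examples.items():
    required = min_samples_rule(n_attributes)
    print(name, "N =", n_samples, "required =", required,
          "ok =", satisfies_rule(n_samples, n_attributes))
```

Under these assumptions, the rule asks for 1,200 samples in the first case and 600 in the second, well above the 100 and 153 samples available.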

Data splitting methods
The fundamental idea behind data splitting (or resampling) is very simple: we isolate one part of the data, learn on the rest, and then evaluate the model on the portion that was isolated. Briefly, data splitting methods are based on some form of partitioning of the available data into a training set for building the classifier or prediction model and a test set that will be used only for model assessment. In general, the larger the training set, the better the classifier; but also the larger the test set, the more accurate the performance estimate. In the case of credit risk and bankruptcy prediction, where the amount of data is usually very scarce, the resampling strategies become of great relevance for reliable model evaluation. Thus the correctness of experimental results strongly depends on the selection of an appropriate resampling method, which in turn should be based on the data available for the experiments. As data are limited, one has to find a trade-off between the size of the training set and the size of the test set. The data splitting procedures (Alpaydin 2010) that have mostly been adopted by the papers analyzed in this work are:
- Holdout. The data set S is randomly split into two disjoint subsets, S_tra and S_tst; the model is built using the samples in S_tra and assessed on the samples in S_tst.
- K-fold cross validation (CV). The data set S with N samples is randomly divided into K mutually exclusive subsets of approximately equal size. Each subset is in turn left out during model building; the model is trained on the union of the remaining K − 1 subsets and predictions are obtained for the left out subset. After the K rounds of training and testing are complete, all the test set predictions are used to estimate the model performance.
- Leave-one-out (LOO). This is a particular case of K-fold cross validation with K = N, that is, a single sample is left out each time; at each round, K − 1 = N − 1 cases are used for training and only one for testing.
- K_1 × K_2-fold CV. The K_2-fold cross validation method is repeated K_1 times and then the model performance is obtained as the average of the K_1 × K_2 estimates.
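The partitioning schemes listed above can be sketched as index generators. This is an illustrative sketch only: the function names and the `seed` parameter are assumptions introduced here, and the code produces the train/test indices without fitting any model.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold cross validation."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Fold f takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test_idx = folds[f]
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield train_idx, test_idx


def leave_one_out_indices(n_samples):
    """Leave-one-out is K-fold cross validation with K = N."""
    return k_fold_indices(n_samples, n_samples)


def repeated_k_fold_indices(n_samples, k1, k2, seed=0):
    """K1 x K2-fold CV: repeat K2-fold CV K1 times with a different shuffle each time."""
    for repetition in range(k1):
        yield from k_fold_indices(n_samples, k2, seed=seed + repetition)


# Example: 5-fold CV on 10 samples gives 5 test folds of 2 samples each.
for train_idx, test_idx in k_fold_indices(10, 5):
    print("test:", sorted(test_idx), "train size:", len(train_idx))
```

A model would be trained on `train_idx` and evaluated on `test_idx` at each round; in the repeated variant, the K_1 × K_2 resulting estimates are then averaged.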
An important issue closely related to resampling is stratification, which ensures that the class distribution of the original data set is preserved in the training and test sets, that is, the prior class probabilities should be kept in all partitions. This avoids the potential problem of generating some subsets with no examples from one of the two classes (Forman and Scholz 2010). On the other hand, it has been demonstrated that stratification helps to reduce the variance of the estimated performance (Kohavi 1995), especially for data sets with many classes (Sechidis et al. 2011). Despite its relevance, a great majority of papers do not indicate whether or not they have used a stratified data splitting technique.
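Stratification can be illustrated with a small sketch that deals the samples of each class over the folds in turn, so that every fold roughly preserves the original class proportions. The function name and the toy 10 %/90 % label vector are assumptions made for illustration, not part of any reviewed study.

```python
import random
from collections import defaultdict


def stratified_k_fold_indices(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs whose class proportions follow `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        class_indices = by_class[label]
        rng.shuffle(class_indices)
        # Deal the samples of this class round-robin over the K folds.
        for position, index in enumerate(class_indices):
            folds[position % k].append(index)

    for f in range(k):
        test_idx = folds[f]
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield train_idx, test_idx


# Toy data: 10 defaults (label 0) and 90 non-defaults (label 1).
labels = [0] * 10 + [1] * 90
for train_idx, test_idx in stratified_k_fold_indices(labels, 5):
    n_default = sum(1 for i in test_idx if labels[i] == 0)
    print("test size:", len(test_idx), "defaults in test fold:", n_default)
```

With a plain random split of the same toy data, a test fold could easily end up with no defaults at all, which is the problem stratification avoids.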
In the holdout method, the key question is how many samples should be left out for the test set. It has been observed that the holdout estimator tends to be too pessimistic because only a proportion of the data is used to build the model (Bischl et al. 2012). Correspondingly, a variation of the holdout method, which partially alleviates this biased behavior, consists of replicating the partition into training and test sets several times in different random ways; the classifier is trained and tested for each partition and the performances averaged to yield an overall estimate, which is generally more reliable.
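The repeated holdout variation described above can be sketched as follows. The `evaluate` callback, the 30 % test fraction, the 30 repetitions and the toy majority-class "model" are all assumptions made for illustration; any function that maps a (train, test) partition to a performance score could be plugged in.

```python
import random
from statistics import mean, stdev


def repeated_holdout(n_samples, evaluate, n_repeats=30, test_fraction=0.3, seed=0):
    """Average `evaluate(train_idx, test_idx)` over several random holdout splits."""
    rng = random.Random(seed)
    n_test = int(round(test_fraction * n_samples))
    scores = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        scores.append(evaluate(train_idx, test_idx))
    return mean(scores), stdev(scores)


# Toy usage: the "model" always predicts the majority class of the training set.
labels = [1] * 90 + [0] * 10


def majority_accuracy(train_idx, test_idx):
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)


average, spread = repeated_holdout(len(labels), majority_accuracy)
print("mean accuracy:", average, "standard deviation:", spread)
```

Reporting the spread alongside the mean gives a rough idea of how much a single holdout estimate could vary from one random partition to another.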
For K-fold cross validation, the question is how many subsets should be used. With a large number of subsets, the bias of the estimator will be small, but the variance will be large. Conversely, with a reduced number of subsets, the variance will be small, but the estimator will be largely biased (i.e., too conservative) (Bischl et al. 2012). Although K = 5 and K = 10 are common choices that perform reasonably well for data sets of different sizes, it is worth noting that for very small data sets, a bigger value of K (or even the leave-one-out method) may become slightly preferable in order to train on as many examples as possible.
Table 2 provides the distribution of papers according to the usage of the most typical resampling procedures. A simple glance at this table reveals that single holdout and K-fold cross validation are indeed the most popular resampling algorithms in the field of credit risk and corporate bankruptcy prediction, being applied in nearly 66 % of the articles. The repeated holdout method has been chosen in less than 15 % of the papers, showing that not many researchers are aware of the need for multiple runs. Paradoxically, despite the small size of many of these databases, the leave-one-out estimator has been employed in only 7 % of the studies. It also has to be noted that about 8 % of the papers have not indicated the data splitting procedure used in their experiments, which makes it quite difficult to assess the correctness of the results and the consistency of the conclusions.
Another question that deserves to be analyzed is whether there exists any relationship between the data set size and the resampling method used. To this end, Table 3 reports how many articles have used a given data splitting method with small, medium and large databases. For instance, three different papers have employed the K_1 × K_2-fold cross validation over small and medium sized data sets. Although holdout and K-fold cross validation are the resampling strategies with the lowest computational cost, they remain the most widely used methods even for small and medium sized databases, where cost is less of a concern. As can be observed, leave-one-out is applied when the data size is small because its computational burden is likely to be too high for databases with more than 1,000 samples. Although the reduced number of papers that have experimented with large databases does not allow us to draw any conclusions for this category, it seems that the use of leave-one-out and K_1 × K_2-fold cross validation has been discarded because of their highly time-consuming nature.
Despite the seeming relationship between data set size and resampling strategy, a more in-depth analysis of the papers shows that different authors have used different data splitting methods over the same database. This is especially obvious in the case of the Japanese, Australian and German databases, where holdout, K-fold cross validation and repeated holdout have all been applied equally. But this can also be found in many other data sets, such as the US bank database, where Li et al. (2008) applied the holdout method, Peng et al. (2008, 2011) used 10-fold cross validation and Zhou et al. (2011) employed the repeated holdout approach. In addition, some articles with experiments over various databases apply the same data splitting method regardless of the data size. For instance, Brown and Mues (2012) use holdout on five data sets with different sizes ranging from 547 to 7190 samples, and García et al. (2012) apply 5-fold cross validation on eight data sets with sizes ranging from 240 to 5000 samples. This suggests that the choice of a particular resampling strategy is not always based on the size of data, but it may depend on the preferences of each author.

Performance evaluation metrics
The third component to be considered in the design of experiments involves how to assess the performance of the models tested on the data that have previously been picked out, that is, one has to select the performance evaluation measure (or a collection of them) that best fits the specific problem under consideration. In the framework of classification, the purpose of most performance evaluation metrics is to estimate how well the learned model predicts the correct class of new input samples, but not all of them measure the same things. Therefore, the key question is to choose the most appropriate criteria that satisfy the special requirements of the problem at hand; otherwise, the results could lead to distorted conclusions since different metrics may yield different orderings of model performance (Hand 2012; Raeder et al. 2012). In this section we examine the most popular scalar metrics used in the credit scoring and bankruptcy literature, restricting the discussion to the two-class problem because this is the most general case when undertaking these financial applications. For consistency with the common terminology used in this context, we will refer to the 'good' risk class (i.e., non-default, non-bankrupt) as positive and the 'bad' class (i.e., default, bankrupt) as negative.
Classification accuracy (acc) and its counterpart, the error rate, have been by far the most frequently employed indicators of performance in the papers reviewed (more than 88 % of the papers include the accuracy or the error in their experiments). For a two-class problem, both these metrics can be derived from a 2 × 2 confusion or co-occurrence matrix such as the one in Table 4, where columns represent the predicted class and rows indicate the true class; each entry (i, j) contains the number of samples of true class i that have been predicted as class j. Its diagonal contains the number of cases that have correctly been predicted for each class, while the off-diagonal elements indicate the number of samples that have been classified wrongly.
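As an illustration of how accuracy and error rate follow from the four cells of the confusion matrix, consider the following minimal sketch. The encoding (1 for the positive 'good' class, 0 for the negative 'bad' class), the function names and the toy vectors are assumptions made here.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for a two-class problem."""
    tp = fn = fp = tn = 0
    for truth, prediction in zip(y_true, y_pred):
        if truth == positive and prediction == positive:
            tp += 1  # good customer predicted as good
        elif truth == positive:
            fn += 1  # good customer predicted as bad
        elif prediction == positive:
            fp += 1  # bad customer predicted as good
        else:
            tn += 1  # bad customer predicted as bad
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    """Proportion of correct predictions (diagonal of the confusion matrix)."""
    return (tp + tn) / (tp + fn + fp + tn)


def error_rate(tp, fn, fp, tn):
    """Proportion of wrong predictions (off-diagonal elements)."""
    return (fp + fn) / (tp + fn + fp + tn)


y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print("TP, FN, FP, TN =", counts)
print("accuracy =", accuracy(*counts), "error rate =", error_rate(*counts))
```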
Both accuracy and error rate assume symmetric misclassification costs for the positive and negative classes (good observations being predicted as bad, and vice versa). This is the reason why approximately 41 % of the papers also measure the error on each individual class by using the so-called type-I and type-II errors. Type-I error (or miss) is the rate of bad cases being categorized as good; when this happens, the misclassified bad customers are accepted and may eventually default. Consequently, if the credit granting policy of a financial institution is too generous, the institution will be exposed to high credit risk. On the other hand, type-II error (or false-alarm) defines the rate of good samples being predicted as bad; when this happens, the misclassified non-defaulters are refused and therefore, the financial institution incurs an opportunity cost caused by the loss of those good customers. In general, type-I errors have a much stronger impact on the creditor firms than type-II errors (Caouette et al. 2008).
Apart from these metrics, the papers gathered in the present survey indicate that some other straightforward indices, which can be formulated from the confusion matrix, are also considered in this context (about 10 % of the papers have used all or a subset of these measures). Among others, we can highlight sensitivity, specificity and precision. Sensitivity (Se) or recall is the proportion of positive cases that are correctly predicted as positive, specificity (Sp) is the proportion of negative examples that are correctly predicted as negative, and precision (Prec) or positive predictive value is defined as the proportion of cases predicted as positive that are truly positive. However, the use of these scores presents some apparent limitations (Hand 2012); for instance, one can achieve a sensitivity of 1 simply by predicting all the observations as positive, but at the cost of misclassifying all negative samples, thus producing a specificity of 0.
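The class-wise scores discussed above, together with the degenerate "predict everything as positive" case, can be sketched from the confusion matrix counts as follows. The function name and the toy counts (90 good and 10 bad cases) are assumptions; the sketch also assumes that every denominator is non-zero.

```python
def class_wise_scores(tp, fn, fp, tn):
    """Scores derived from TP, FN, FP, TN, with 'good' as the positive class."""
    return {
        "sensitivity": tp / (tp + fn),    # good cases predicted as good
        "specificity": tn / (tn + fp),    # bad cases predicted as bad
        "precision": tp / (tp + fp),      # predicted-good cases that are truly good
        "type_I_error": fp / (fp + tn),   # bad cases predicted as good
        "type_II_error": fn / (fn + tp),  # good cases predicted as bad
    }


# Degenerate classifier: all 90 good and 10 bad cases are predicted as good.
scores = class_wise_scores(tp=90, fn=0, fp=10, tn=0)
for name, value in scores.items():
    print(name, "=", value)
```

With the positive class defined as 'good', the type-I error equals 1 − specificity and the type-II error equals 1 − sensitivity, which is why the all-positive classifier reaches a sensitivity of 1 while misclassifying every bad case.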
Other criteria less commonly employed in the evaluation of credit risk and bankruptcy prediction models are the mean absolute error (MAE), the root mean squared error (RMSE), the area under the ROC curve (AUC), the Gini coefficient, the Kolmogorov-Smirnov (K-S) statistic, the F-measure, and the H-measure. Of these, the AUC is the most preferred score, which is usually calculated as the empirical probability that a randomly-chosen positive observation is ranked above a randomly-chosen negative example. The F-measure is a widely-used metric in information retrieval and represents the harmonic mean of sensitivity and precision, whereas the H-measure (Hand 2009) is a recently developed threshold-varying evaluation score that calculates the expected loss of the classifier (as a proportion of the maximum possible loss) under a hypothetical probability distribution of the class imbalance ratio.
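The two definitions given above translate almost directly into code. In this sketch the function names and the toy scores are hypothetical, and counting tied scores as one half is an assumption (a common convention, not something stated in the text).

```python
def auc_pairwise(scores_pos, scores_neg):
    """Empirical probability that a positive case is ranked above a negative one."""
    wins = 0.0
    for score_pos in scores_pos:
        for score_neg in scores_neg:
            if score_pos > score_neg:
                wins += 1.0
            elif score_pos == score_neg:
                wins += 0.5  # assumption: ties count as half a win
    return wins / (len(scores_pos) * len(scores_neg))


def f_measure(precision, sensitivity):
    """Harmonic mean of precision and sensitivity."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Toy model outputs: higher scores mean "more likely to be a good customer".
good_scores = [0.9, 0.8, 0.4]
bad_scores = [0.5, 0.3]
print("AUC =", auc_pairwise(good_scores, bad_scores))
print("F-measure =", f_measure(precision=0.5, sensitivity=1.0))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for the small data sets discussed in this survey but would normally be replaced by a rank-based computation for large ones.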
Another relevant evaluation metric in the areas of finance and banking is the estimated misclassification cost (West 2000), which takes care of the unequal costs associated with making type-I or type-II errors. However, the misclassification costs are seldom available because the estimate of their values is a complex and challenging task (Lee and Chen 2005). In fact, only 5 % of the papers reviewed have included the expected misclassification cost in their experimental protocol. If C_1 and C_2 denote the costs associated with type-I error for false positives (FP) and with type-II error for false negatives (FN) respectively, then the estimated misclassification cost (Provost and Fawcett 2001) can be calculated as Cost = C_1 × FP + C_2 × FN.
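A small sketch may help to show why the misclassification cost can rank models differently from the error count. Following the verbal definitions above, the first cost is attached to type-I errors (bad cases predicted as good) and the second to type-II errors (good cases predicted as bad). The function name, the 5:1 cost ratio used as default and the error counts of the two hypothetical models are assumptions chosen only for illustration.

```python
def misclassification_cost(n_type_i, n_type_ii, cost_type_i=5.0, cost_type_ii=1.0):
    """Weighted sum of type-I and type-II errors (cost values are illustrative)."""
    return cost_type_i * n_type_i + cost_type_ii * n_type_ii


# Two hypothetical models evaluated on the same test set.
model_a = {"type_i": 10, "type_ii": 30}  # 40 errors in total
model_b = {"type_i": 20, "type_ii": 5}   # 25 errors in total

for name, errors in (("A", model_a), ("B", model_b)):
    total_errors = errors["type_i"] + errors["type_ii"]
    cost = misclassification_cost(errors["type_i"], errors["type_ii"])
    print("model", name, "errors =", total_errors, "cost =", cost)
```

In this toy comparison, model B makes fewer errors overall but accepts more bad customers, so it ends up with the higher cost once type-I errors are weighted more heavily.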
Since the risk for false positives is usually much higher than that for false negatives, the assumption that the ratio of the cost C 1 to the cost C 2 is more than 2:1 is fairly realistic in this field. For example, the ratio of the misclassification cost for type-I error to the misclassification cost for type-II error in the German database was reported to be 5:1 (West 2000), which has further been taken as the ratio between the costs of both errors for other data in a number of papers (Abdou et al. 2007, 2008, 2009b; Lee and Chen 2005).
Figure 4 displays the percentages of papers that have employed each of the most typical performance metrics. As already pointed out, accuracy and error rate are the most frequently used measures in credit scoring and bankruptcy prediction. However, nearly half of these papers have also considered type-I and type-II error rates to measure the proportions of false positives and false negatives separately. AUC appears to be the third most used score in this context. Finally, although the misclassification cost is especially relevant for most applications of financial risk prediction, only a few papers have calculated this performance metric, mainly due to the difficulty of estimating the true costs associated with each type of error.
At this point, it is worth noting that a very common problem in credit risk and bankruptcy prediction arises when the data set is skewed, that is, when the class of non-defaults (non-bankrupts) vastly outnumbers the class of defaults (bankrupts) and the minority class probably has a higher misclassification cost (Phua et al. 2004; Kiefer 2009; Catal 2012). This is a very important issue that should be addressed carefully when choosing a model performance score because many metrics are biased towards the majority class and can therefore be inappropriate for this kind of financial application. It is surprising that the strongly biased classification accuracy is still the only measure reported in various studies, despite the voices arguing that other criteria should be used instead. In this sense, for instance, the AUC appears to be a more appropriate performance measure than accuracy for imbalanced data sets because it does not implicitly assume equal misclassification costs. However, several researchers do not take these arguments into account, as can be seen in Table 5, which reports the performance metrics chosen in a number of papers with skewed databases. For instance, the negative class in the paper by Malhotra and Malhotra (2003) represents about 7 % of the whole database, but the model performance is solely evaluated by means of the prediction accuracy. Similarly, the bankrupt firms in the paper by Sun and Shenoy (2007) constitute less than 12 % of the data, but the classification accuracy is the only measure included in the experiments.
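The bias of accuracy towards the majority class can be reproduced with a trivial baseline. The sketch below assumes a hypothetical binary data set with 950 majority-class and 50 minority-class cases and a model that assigns every case to the majority class; the labels (0 and 1) and the function name are illustrative only.

```python
def accuracy_and_class_rates(y_true, y_pred, minority_label):
    """Overall accuracy plus the rate of correct predictions within the
    minority class and within the majority class (binary labels assumed)."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    minority_total = sum(1 for t in y_true if t == minority_label)
    minority_hits = sum(1 for t, p in pairs if t == p == minority_label)
    majority_total = n - minority_total
    majority_hits = correct - minority_hits
    return {
        "accuracy": correct / n,
        "minority_rate": minority_hits / minority_total,
        "majority_rate": majority_hits / majority_total,
    }


# Hypothetical skewed data: label 1 marks the minority (default) class.
y_true = [0] * 950 + [1] * 50
y_pred = [0] * 1000  # every case is assigned to the majority class

baseline = accuracy_and_class_rates(y_true, y_pred, minority_label=1)
print(baseline)  # accuracy 0.95, minority_rate 0.0, majority_rate 1.0
```

An accuracy of 95 % is thus attainable without recognizing a single minority-class case, which is the situation that per-class rates make visible.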

Statistical tests of significance
It is important to take into consideration that simple superiority of a prediction model in terms of some performance score on a test set, or any other comparison based on data splitting, is naive and does not suffice to guarantee that the model actually performs better than the remaining methods. For a complete performance evaluation, it seems pertinent to adopt some hypothesis testing in order to assert that the observed differences in performance are statistically significant, and are not merely due to random splitting effects. Statistical validation of the results has long been considered an essential part of the experimental framework, but its practical use has led to much debate in several fields of science (Chow 1998; Berrar and Lozano 2013). Choosing the right test for a specific collection of experiments depends upon several factors such as the number of data sets, the number of algorithms to be compared and the scale of measurement of the output variable (binary, nominal, interval, ordinal) (Marusteri and Bacarea 2010). On the other hand, one also has to take into account that some statistical tests are based on the assumption that the data are sampled from a normal distribution. These are the parametric tests, in contrast to the non-parametric tests, which do not make assumptions about the population distribution. Although the parametric tests are, in general, more powerful than the non-parametric ones, it is not always easy to decide whether the sample comes from a normal population. In these cases, especially when the sample size is small, the use of a parametric test can be conceptually inappropriate and statistically inaccurate, and therefore it will often be preferable to apply a non-parametric procedure (Demšar 2006; García et al. 2010).
Based on the review carried out, several comments can be made: (i) the use of statistical procedures either for determining the optimal method or for comparing the performance of different prediction models appears to be infrequent, since more than 68 % of papers have not reported any form of hypothesis testing; (ii) the parametric tests have been applied in nearly 18 % of papers (especially the t-test with about 15 %), but without checking whether the samples satisfy the normality and homoscedasticity assumptions; (iii) approximately 13 % of papers have included a non-parametric test in the experimental protocol, with McNemar's (5.67 %) and Wilcoxon's signed-ranks (3.55 %) tests being the two most common techniques; (iv) only three papers (Canbas et al. 2005; Abdou et al. 2008; Abdou 2009a) have studied the statistical difference of variances through Bartlett's, Levene's or Cochran's C tests; and (v) the post hoc tests for comparisons with a control algorithm have seldom been applied, with only seven works using Tukey's method (Pendharkar 2005), Nemenyi's test (García et al. 2012; Marqués et al. 2013; Brown and Mues 2012), Holm's test (Hu and Chen 2011) or the Bonferroni-Dunn procedure (Marqués et al. 2012a, b).
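As an example of a non-parametric procedure for comparing two classifiers on a single test set, the following minimal sketch implements the exact (binomial) form of McNemar's test. It is one of several variants of the test and is not claimed to be the one used in the surveyed papers; the counts b and c are hypothetical and denote the test cases on which only the first or only the second classifier is correct.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases classified correctly by model A and incorrectly by model B.
    c: cases classified incorrectly by model A and correctly by model B.
    Under the null hypothesis the discordant cases split 50/50, so the
    smaller count follows a Binomial(b + c, 0.5) distribution.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant cases: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts on a shared test set.
p_value = mcnemar_exact(b=2, c=10)
print(round(p_value, 4))  # 0.0386
```

Only the discordant cases enter the statistic, so the test requires the per-case predictions of both models on the same test samples, not just their accuracies.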
From a practical point of view, it is possible to underline two scenarios with regard to the statistical testing of experimental results. First, the single-problem analysis involves the comparison of two or more algorithms over a unique database in terms of some metric(s). On the other hand, the multiple-problem analysis is related to the study of two or more algorithms over a number of data sets simultaneously, in terms of some performance score(s). Each of these two cases can be handled through suitable hypothesis testing methods, but we have observed that many papers simply present a matrix of tests comparing all pairs of models and then report a list of conclusions about the statistical significance for each pair. However, this kind of analysis is of little value because a proportion of the null hypotheses can be rejected by random chance (Demšar 2006).
Fig. 5 Year-wise distribution of papers according to the usage of statistical tests
The year-wise analysis of articles illustrated in Fig. 5 reveals that there are no substantial differences in the application of statistical tests across the years. As can be seen, regardless of the year, a considerable majority of studies have not applied any hypothesis testing procedure. On the other hand, although the percentages of papers using parametric and non-parametric methods are very similar, the latter seem to have gained a slight advantage in recent years. This suggests that the need for appropriate tests is now beginning to be better understood by the research community in this field.
Although the t-test is the most often used method for assessing the statistical significance of differences, it has been misapplied in a considerable number of studies. The most typical deficiency is that those works do not check for normality of data (e.g. Ben-David and Frank 2009;Sun 2009, 2013;Martens et al. 2007;Tsai and Wu 2008;Lu et al. 2013;Sun and Shenoy 2007). Another problem is that several works employ this parametric test to compare multiple algorithms (e.g. Ravisankar et al. 2010; Tsai and Wu 2008; Ribeiro et al. 2012), even though it is not suitable for this type of comparison.
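The risk of rejecting null hypotheses by chance when many pairwise tests are reported can be controlled with a step-down correction. The sketch below implements Holm's procedure for a set of p-values; the p-values and the significance level are hypothetical, and the code is an illustration of the general technique, not of the exact protocol followed in any of the surveyed papers.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure.

    Returns one boolean per hypothesis (True = rejected), in the
    original order of p_values. The i-th smallest p-value (i = 0, 1, ...)
    is compared with alpha / (m - i); testing stops at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all remaining (larger) p-values are retained
    return reject


# Hypothetical p-values from four pairwise comparisons.
p_values = [0.010, 0.040, 0.030, 0.005]

unadjusted = [p <= 0.05 for p in p_values]
adjusted = holm_reject(p_values, alpha=0.05)
print(unadjusted)  # [True, True, True, True]
print(adjusted)    # [True, False, False, True]
```

With these assumed values, all four comparisons look significant when tested separately at the 0.05 level, but only two of them survive the correction.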

A straightforward experimental analysis
A couple of experimental scenarios are presented in a more descriptive way in order to illustrate the importance of using one experimental methodology or another. We must remark that this study does not intend to select the best approach, but to present an overview of how the different experimental set-ups affect the conclusions. First we analyze how the performance evaluation metrics affect the conclusions derived from the results. To this end, we use the Iranian bank database (Sabzevari et al. 2007), which consists of 950 records of "good" customers and 50 samples of "bad" customers; therefore, this is an example of a medium-sized data set with a very strong imbalance. The second scenario is intended to show the effect of the data splitting methods on the performance of the prediction models, using in this case two benchmarking databases of different sizes: a small data set with bankruptcy information of 120 Polish companies recorded over a two-year period giving a total of 240 records (Pietruszkiewicz 2008), and the large UCSD data set with 2435 records, which corresponds to a random subset of the original database used in the 2007 Data Mining Contest organized by the University of California San Diego and Fair Isaac Corporation.
In both scenarios we have run four different prediction models: the k-nearest neighbor (k-NN) classifier with k = 1, the C4.5 decision tree, a support vector machine (SVM) with the linear kernel function, and a multi-layer perceptron (MLP) neural network with 10 hidden layers.

Comparing several performance evaluation metrics
The purpose of this first case study is to show the importance of using an appropriate performance evaluation measure when the data set suffers from a severe class imbalance, which has been recognized as a very common problem in the domain of credit risk and bankruptcy prediction. For this experiment, we have applied the stratified 10 × 5-fold cross validation resampling method and calculated some of the most widely-used metrics according to our discussion in Section 5. This case is a good example to illustrate that different measures can lead to different decisions about which algorithm is the best performing model. In addition, it allows us to demonstrate that several metrics are worthless when one class is more important than the other because of the unequal costs associated with each class. Under these conditions, as already stated in Section 5, the risk for false positive (type-I) errors is usually much higher than the risk for false negative (type-II) errors.
The first observation from the results reported in Table 6 is that accuracy does not reflect the true performance of each classifier because it is biased with respect to data imbalance and proportions of correct and incorrect predictions. In fact, this measure suggests that the SVM is the best performing model, but it appears evident that its superiority comes from disregarding the minority class (type-I = 1) and assigning all samples to the majority class (type-II = 0). On the contrary, AUC, specificity and precision seem to provide a better performance evaluation in this skewed scenario since these measures propose the 1-NN classifier as the best alternative, which also corresponds to the model with the lowest type-I error and still a moderate type-II error rate. Finally, as expected from its definition, sensitivity behaves similarly to accuracy and therefore it also becomes useless for performance evaluation of this strongly imbalanced data set.
In this second case study, the objective is to answer the following question: Do different data splitting methods give rise to different performance results? To this end, we compare 1-NN, C4.5, SVM and MLP on the two aforementioned benchmarking databases in terms of accuracy. Table 7 summarizes the accuracy rates achieved by each of the four prediction models when using four different resampling methods with stratification: holdout (with 70/30 splits for training and test data), 5-fold cross validation, 5 × 10-fold cross validation and leave-one-out. Although the results seem quite similar independently of the data splitting method used, one can see that all classifiers achieve the highest accuracies with 5 × 10-fold cross validation and leave-one-out for the small bankruptcy database. In the case of the UCSD data set, as it has sufficient samples to form both training and test sets, all methods except the holdout approach appear to be equally valid and reliable.

Some final guidelines
From the discussions given in the previous sections, a number of recommendations or guidelines for researchers and practitioners who are interested in credit risk and corporate bankruptcy prediction can be suggested. Although we do not intend to introduce stringent requirements for the design of experiments and the validation of performance results, we believe it is of utmost importance to outline a general framework with a set of key questions that leads to statistically reliable conclusions, allows for consistent comparisons among different works and supports reproducibility. Hopefully, the final guidelines given in this section will play an important role in future studies for improving rigor and objectivity of the research progress in this field.
Ideally, the empirical study of a research work should contain a mixture of benchmarking and application-oriented databases in order to profit from both views. This is especially important for this domain because the socio-economic and political dynamics of change may strongly affect the performance of the prediction models. Apart from our familiarity with the benchmarking data, which allow for an easy comparison of the performance results reported in different papers, they are also a valuable resource available for any researcher who is interested in credit scoring and corporate bankruptcy prediction. However, although most of these data were gathered from real-life applications, it is apparent that they may represent outdated conditions of only a small portion of all possible real situations; therefore, it is not correct to generalize from the benchmarking data sets to any other data. On the contrary, application-oriented databases allow us to explore different features of the current socio-economic and political circumstances, but it may be quite difficult to access them and knowledge about them is usually scarce.
The second issue to take into account refers to data splitting. Most researchers in the field apply holdout or K-fold cross validation (with K = 5 or K = 10), sometimes simply because both are well-known and widely-used techniques. However, these methods also present a series of limitations that should be considered carefully in order to ensure that they are indeed the most appropriate for a specific problem. In practice, it will usually be better to adopt the iterative versions of these procedures due to the generally small data set size in this kind of application. Besides, one should take care to keep the prior class probabilities in all partitions by applying stratification when splitting the data, thus avoiding the risk of producing a subset with no samples from some class. In the case of credit risk and bankruptcy prediction, it is very common to find small and medium-sized databases with imbalanced class distribution, which makes the choice of a proper approach even more critical. Apart from these factors, it has to be noted that the choice of a data splitting method also relies on other elements such as the nature of the classifiers and the complexity of the problem.
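The combination of repetition and stratification recommended here can be sketched as follows. This is a minimal illustration that assumes class labels stored in a plain list and a round-robin assignment of the shuffled cases of each class to the folds; the function names, the seed and the hypothetical 95/5 class distribution are not taken from the surveyed papers.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Split the indices of `labels` into k folds, dealing the shuffled
    cases of each class to the folds in round-robin fashion so that the
    class proportions are approximately preserved in every fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def repeated_stratified_kfold(labels, k, repeats, seed=0):
    """Yield (repetition, fold, train_indices, test_indices) tuples for
    `repeats` independent stratified k-fold partitions."""
    rng = random.Random(seed)
    for rep in range(repeats):
        folds = stratified_folds(labels, k, rng)
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield rep, f, train, test


# Hypothetical skewed data set: 95 majority and 5 minority cases.
labels = [0] * 95 + [1] * 5
splits = list(repeated_stratified_kfold(labels, k=5, repeats=10, seed=42))

print(len(splits))  # 50 train/test pairs (10 repetitions x 5 folds)
minority_per_test = {sum(labels[i] for i in test) for _, _, _, test in splits}
print(minority_per_test)  # {1}: every test fold holds one minority case
```

Without stratification, a random 5-fold partition of these hypothetical data could easily leave some test fold with no minority cases at all, making class-specific metrics undefined for that fold.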
As suggested by several researchers (Japkowicz and Shah 2011), more than a single performance score should be calculated to establish the worth of a classification model because a scalar metric cannot capture all important aspects of an algorithm. Following this recommendation in the domain of credit risk and bankruptcy prediction, the inclusion of type-I and type-II errors, probably along with other performance metrics, becomes especially important due to the unequal misclassification costs for false positives and false negatives. Another issue that should also be taken into consideration when choosing a performance evaluation measure relates to the intrinsic data characteristics because some of these may disguise the true performance of any prediction model; for instance, if the data set is skewed, which is very often the case in this application domain, one should discard those metrics that are strongly biased towards the majority class and opt for more suitable measures.
Despite the importance of validating the performance results, we have seen that many papers lack any statistical test of significance, while others usually apply some test without much concern about the assumptions upon which it depends. When applying a hypothesis testing method, researchers should pay close attention to the problem they are dealing with, that is, consider the number of prediction algorithms and the number of experimental databases and identify the distribution of data; otherwise, the statistical test used to validate the performance results may provide misleading conclusions. The analysis carried out reveals that many papers have experimented with a unique data set, which considerably limits the number and the type of testing methods that can be applied correctly.
Finally, four simple guidelines for a more complete, robust experimental design should be kept in mind: (i) to use various databases, both benchmarking and application-oriented ones; (ii) to apply an appropriate data splitting technique according to the data set size, while preserving the prior class probabilities; (iii) to choose the scalar performance metrics depending on the data characteristics and the requirements of the problem; and (iv) to validate the results with correct statistical tests of significance taking into account the problem at hand. Although these four recommendations are indeed quite general and common to many other application domains, it is worth remarking that a large proportion of the studies analyzed have not applied them properly. On the other hand, some of these guidelines are critical in the context of credit risk and corporate bankruptcy prediction because of the special characteristics mentioned in Section 1. In addition, this is an interdisciplinary domain with researchers from very different areas, some of whom are not comfortable with this kind of experimentation. These are the reasons why we believe it is still important to highlight that the experimental methodology in this field should take all these issues into consideration.

Conclusions
This paper has reviewed a representative sample of journal articles published between 2000 and 2013 in the context of credit risk and corporate bankruptcy prediction in order to gain insight into this subject. Unlike standard reviews in the related literature, the main objective of this work has been to study and analyze the current practice in experimental design and validation of performance results, putting the emphasis on four critical components: databases, data splitting methods, performance evaluation metrics and statistical tests of significance. The relevance of this issue comes from the fact that a well-specified experimental set-up allows other researchers to reproduce the experiments. As a by-product, however, this survey can also be useful for new practitioners who are interested in knowing the state-of-the-art in credit and bankruptcy risk evaluation processes.
Regarding the data used in the experiments, some shortcomings have been observed. First, the number of public databases available for experimentation is limited, which makes it difficult to compare models across different researchers. Second, the data set size in terms of number of samples is usually small, which may increase the variance of the results (and these are more affected by chance). Third, most papers experiment with a unique database and therefore, the conclusions from these studies should be taken with caution because they may rely on the particular characteristics of such a single database.
According to the review carried out, it appears that nearly all studies have implemented some kind of data splitting, with holdout and K-fold cross validation being the most frequently used procedures. However, despite the small size of many databases and the need for multiple replications, the repeated holdout and the K1 × K2-fold cross validation have been adopted in only very few studies. We have also found that several works do not specify which data splitting technique has been employed, thus making it impossible to reproduce the experiments and acquire a complete understanding of the correctness and consistency of the empirical results. Stratification plays an important role in resampling, but our analysis has revealed that most papers neglect this issue in the realm of credit scoring.
When analyzing the performance evaluation metrics, we have seen that a vast majority of papers have used the accuracy or the error rate, even with class imbalanced data and different misclassification costs. The problem in these cases is that the biased behavior of accuracy (and error rate) may induce misleading conclusions about the worthiness of a prediction model. Several works have tried to overcome this drawback by also including the type-I and type-II errors, which make it possible to assess the performance on each individual class. Another question related to model performance refers to the expected misclassification cost, which has rarely been considered in these papers in spite of the unequal costs associated with false negatives and false positives. As a final statement, it seems clear that one should choose the most appropriate performance assessment metric taking into account the particular characteristics of the data to which a prediction model will be applied.
Finally, this survey of papers suggests that the use of statistical tests of significance is not yet very frequent. Some studies show only means and standard deviations with no hypothesis testing to conclude that one model performs better than the others, whereas some other papers apply a parametric test (mostly the t-test) without checking for normality of data. Only a few works have employed some non-parametric test, especially McNemar's and Wilcoxon's methods.

Appendix A: Credit databases
The databases used in the experiments of the studies here analyzed are presented in Table 8. For each database, we report the number of samples, the number of independent variables and the papers in which it has been employed.