Dissimilarity-Based Linear Models for Corporate Bankruptcy Prediction

Bankruptcy prediction has acquired great relevance for financial institutions due to the complexity of global economies and the growing number of corporate failures, especially since the world financial crisis of 2008. In this paper, the problem of corporate bankruptcy prediction is addressed by means of four linear classifiers (Fisher’s linear discriminant, linear discriminant classifier, support vector machine and logistic regression), which are built on the dissimilarity space instead of the classical feature space. Experimental results indicate that the prediction methods implemented with the dissimilarity representation perform considerably better than the same techniques applied to the feature space, in terms of overall accuracy, true-positive rate and true-negative rate.


Introduction
In brief, bankruptcy refers to the financial failure of a corporation or an individual, which not only leads to significant costs to shareholders and creditors but may also result in a considerable macroeconomic impact (Altman 1993; Zopounidis and Dimitras 1998). In order to avoid the financial losses associated with failure, financial analysts have long seen the need for the early discovery of bankruptcy. This is the main reason why bankruptcy prediction is deemed a subject of key relevance for financial institutions. As a consequence, improving the performance of existing techniques and building highly effective models have attracted the attention of many researchers and practitioners (Aziz and Dar 2006).
A vast number of techniques have been developed to help decision-makers and analysts predict financial failure. The most traditional approaches have been based on statistical and operational research methods (Balcaen and Ooghe 2006), such as factor analysis (West 1985), linear and multivariate discriminant analysis (Altman et al. 1977; Karels and Prakash 1987), logit analysis (Ohlson 1980; Jones and Hensher 2004; Tseng and Lin 2005), probit analysis (Zmijewski 1984), linear and quadratic programming (Kwak et al. 2012), and data envelopment analysis (Cielen et al. 2004; Premachandra et al. 2009).
After the Basel II recommendations issued by the Basel Committee on Banking Supervision in 2004, financial institutions realized the need to use more complex systems based upon computational intelligence techniques. Unlike the statistical models, these methods do not assume any specific prior knowledge, but automatically extract information from past observations. Kumar and Ravi (2007) reported a comprehensive review of statistical and computational intelligence methods in the context of bankruptcy prediction. Among other techniques, support vector machines (Shin et al. 2005; Min and Lee 2005; Erdogan 2013), genetic and evolutionary algorithms (Lensberg et al. 2006; Acosta-González and Fernández-Rodríguez 2014), artificial neural networks (Wilson and Sharda 1994; Sun and Shenoy 2007; Cleofas-Sánchez et al. 2016; Zhao et al. 2016), rough sets (Slowinski and Zopounidis 1995; Mckee 2000), and hybrid and classifier ensembles (Verikas et al. 2010; Fedorova et al. 2013; Abellán and Mantas 2014; Tsai 2014) have received much attention. Many works have empirically compared and contrasted these soft computing methods (Alfaro et al. 2008; Chen 2012; Olson et al. 2012; Erdal and Ekinci 2013; Tsai et al. 2014).
All these statistical and computational intelligence techniques applied in the field of bankruptcy prediction are based on the assumption that samples are represented by a set of features (explanatory variables), which defines a feature space. These features usually correspond to financial ratios and/or macroeconomic indicators, either represented as continuous variables or discretized in a straightforward manner as qualitative information. However, in a few cases the samples are described by means of qualitative variables whose values are gathered from expert judgments (Kim and Han 2003).
Apart from the feature space, there exist other approaches to pattern representation that could also be exploited for very distinct financial applications. One is the dissimilarity representation, in which samples to be classified/predicted are represented by pairwise dissimilarities (distances from other samples in the data set). The justification for constructing classifiers in a dissimilarity space is that a dissimilarity measure should be small for similar samples and large for distinct samples, thus allowing for efficient and more reliable discrimination of classes. Another important characteristic is that the dimensions of a dissimilarity space symbolize homogeneous types of information and, therefore, all dimensions can be considered as equally relevant. Moreover, for a complex problem, a simple linear prediction model in a dissimilarity space could separate the classes more easily than the same classifier in a feature space.
Taking into account the practical advantages of the dissimilarity representation over the classical feature-based one (Pelillo 2013), this paper addresses the problem of corporate bankruptcy prediction in a way different from that traditionally followed by the methods reported in the literature. As far as we know, the dissimilarity-based paradigm, which has been shown to be truly effective on various real-life problems, has not been applied in the financial scenario. Accordingly, the present paper analyzes the performance of four standard linear classifiers built on the dissimilarity space for the discovery of corporate financial failure using a data set whose explanatory variables are qualitative, and compares them with their feature-based counterparts. The reasons for focusing this study on linear models are threefold (Yuan et al. 2012): (1) they are good at handling sparse data; (2) they are easy to describe mathematically, computationally simple and easy to interpret; and (3) when applied to dissimilarity data, they often lead to very good performance.
The remainder of the paper is organized as follows. Fundamental concepts related to the dissimilarity representation are summarized in Sect. 2. The prediction methodology proposed in this paper is described in Sect. 3. Next, Sect. 4 introduces the bankruptcy database and describes the experimental set-up. Results are presented and discussed in Sect. 5. Finally, a number of concluding remarks and possible directions for future research are outlined in Sect. 6.

Dissimilarity Space
From a practical viewpoint, the bankruptcy prediction problem can be defined as a binary classification problem where a new input sample has to be categorized into one of the predefined classes based on a number of observed variables or features related to that sample. Formally, it can be described as follows: Given a set of past observations $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where each example $x_i$ is characterized by a vector of $m$ features, $[x_{i1}, x_{i2}, \ldots, x_{im}]$, and $y_i$ denotes the class (bankrupt/non-bankrupt), then the bankruptcy prediction problem consists of constructing a model $\delta$ to predict the value $y$ for a new input sample $x$, that is, $\delta(x) = y$.
Traditional prediction models rely on the description of examples through a set of explanatory variables. A reliable alternative to the feature (variable) space is the dissimilarity space, in which the dimensions are defined by vectors measuring pairwise dissimilarities between examples and individual prototypes from a given representation set $R = \{p_1, \ldots, p_r\}$, where $r$ is its cardinality. This set can be chosen as the complete training set $T$, a set of generated prototypes, a subset of $T$ that covers all classes, or even an arbitrary set of labeled or unlabeled samples (Pekalska et al. 2006). Although the representation set can be selected either in a systematic or in a random way, it has been shown that both strategies produce similar classification results (Duin et al. 1999).
Given a dissimilarity measure $d(\cdot, \cdot)$, which is required to fulfill the positivity and reflexivity ($d(x_i, x_i) = 0$) conditions but need not be a metric, a dissimilarity representation is defined as a data-dependent mapping function $D(\cdot, R)$ from $T$ to the dissimilarity space. This means that every example $x_i \in T$ can directly be represented by an $r$-dimensional vector in the dissimilarity space, $D(x_i, R) = [d(x_i, p_1), \ldots, d(x_i, p_r)]$, that is, each dimension corresponds to a dissimilarity to a prototype from $R$. Therefore, the dissimilarities between all examples in $T$ and the prototypes in $R$ are represented by a matrix $D(T, R)$ of size $n \times r$, which corresponds to the dissimilarity representation we want to learn from (Pȩkalska and Duin 2005).
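As an illustration of the mapping just defined, the following minimal sketch builds the matrix D(T, R) for a toy data set. The Euclidean distance is an assumption made only for this example (the text does not fix a particular dissimilarity measure), and the function names and data values are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed measure)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def dissimilarity_matrix(T, R, d=euclidean):
    """Return D(T, R): one row per example in T, one column per prototype in R."""
    return [[d(x, p) for p in R] for x in T]

# Toy data: four examples with two features each (made-up values).
T = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
R = [T[0], T[3]]  # representation set: two prototypes taken from T

D = dissimilarity_matrix(T, R)  # size n x r = 4 x 2
for row in D:
    print([round(v, 3) for v in row])
```

Each row of D is the new representation of one example, so any classifier that accepts numeric vectors can be trained on it unchanged.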
In general, a drawback related to the use of features is that completely different examples may have the same feature representation, which results in class overlap (examples that belong to different classes are represented by the same feature vectors). In the dissimilarity space, however, only identical examples (with the same class label) have a zero distance, which means that there is no class overlap. In addition, the dissimilarity-based classifiers may be robust against variations in scale (Duin and Pȩkalska 2012). Note that, in principle, any standard classifier can be built on the dissimilarity space in the same way as on the feature space.

Methodology
This section provides a general overview of the complete methodology for constructing the model and classifying new corporate samples. Figure 1 shows a flowchart of the learning and prediction processes for both a classical feature-based representation (black lines) and a dissimilarity-based representation (red lines).
Using a feature-based representation, the learning stage (continuous lines) simply consists of building the classifier with the training set T. In the case of a dissimilarity-based representation, the first step of learning consists of choosing a representation set R, whose prototypes will be used to measure the pairwise dissimilarities to the training examples in T. Next, the training set T is mapped into a dissimilarity space, which is finally used to build the classifier.
In the testing stage (dashed lines), when a new sample x has to be classified, it is mapped into the dissimilarity space by calculating the dissimilarity between x and all prototypes in the representation set R, which results in a one-dimensional matrix $D(x, R) = [d(x, p_1), \ldots, d(x, p_r)]$ of size $1 \times r$.
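The learning and testing stages for the dissimilarity-based representation can be sketched end to end as follows. This is only an illustrative outline under stated assumptions: the Euclidean distance, the hand-written gradient-descent logistic regression (a stand-in for the four classifiers studied in the paper), the learning rate, the number of epochs and the toy data are all hypothetical choices.

```python
import math

def euclidean(a, b):
    """Assumed dissimilarity measure for this sketch."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def to_dissimilarity(x, R):
    """Map one sample to the row vector D(x, R) of length r."""
    return [euclidean(x, p) for p in R]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent for logistic regression (illustrative stand-in)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(row, w, b):
    """Linear decision rule: 1 (bankrupt) if the score is non-negative, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, row)) + b >= 0 else 0

# Learning stage: choose R, map T into the dissimilarity space, build the classifier.
T = [[0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [1.0, 1.0], [1.0, 0.5], [0.5, 1.0]]
y = [1, 1, 1, 0, 0, 0]  # 1 = bankrupt, 0 = non-bankrupt (toy labels)
R = [T[0], T[3]]
D_train = [to_dissimilarity(x, R) for x in T]
w, b = fit_logit(D_train, y)

# Testing stage: map the new sample with the same R, then apply the classifier.
x_new = [0.1, 0.2]
d_new = to_dissimilarity(x_new, R)
label = predict(d_new, w, b)
print(d_new, label)
```

The key point is that the same representation set R is reused at test time, so the new sample lands in the same r-dimensional space as the training data.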

Database and Experimental Protocol
The database used in the present experiments was taken from the UCI Machine Learning Database Repository (Lichman 2013). This is a subset of samples collected during the period 2001-2002 from one of the largest commercial banks in Korea (Kim and Han 2003). It consists of 250 instances, with about 43% of them labeled as bankrupt. Each sample is represented by explanatory variables that correspond to levels (negative, average, and positive) of six qualitative risk factors (see Table 1) evaluated by loan officers. These risk factors are the ones established and used by the bank in order to estimate the default risk of manufacturing and service companies. Since all these variables were categorical, they were first converted into numeric values (negative = 1, average = 2, and positive = 3) as reported in the paper by Kim and Han (2003), and then normalized to the range [0, 1].
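The encoding step can be illustrated with a short sketch. The level-to-number mapping is the one stated above; the use of min-max scaling to reach the range [0, 1] is an assumption (the text does not name the normalization method), and the sample values are made up.

```python
# Mapping stated in the text: negative = 1, average = 2, positive = 3.
LEVELS = {"negative": 1, "average": 2, "positive": 3}

def encode(sample):
    """Convert the qualitative levels of one sample into numeric values."""
    return [LEVELS[v] for v in sample]

def min_max(value, lo=1, hi=3):
    """Min-max scaling to [0, 1] (assumed normalization method)."""
    return (value - lo) / (hi - lo)

# One hypothetical company described by six qualitative risk factors.
raw = ["positive", "negative", "average", "average", "negative", "positive"]
encoded = encode(raw)
normalized = [min_max(v) for v in encoded]
print(encoded)     # [3, 1, 2, 2, 1, 3]
print(normalized)  # [1.0, 0.0, 0.5, 0.5, 0.0, 1.0]
```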
Table 1 (excerpt): Operating risk (OP): volatility and stability of procurement, efficiency of production, and stability of sales.
Even though the key question of this paper is not to select the most relevant explanatory variables, two feature ranking methods were applied to evaluate the usefulness of each variable: the ReliefF algorithm and the Pearson's correlation-based approach. The former evaluates the worth of a variable by repeatedly sampling an instance and considering the value of the given variable for the nearest instance of the same and different classes, whereas the latter evaluates the worth of a variable by measuring the correlation between it and the class. Results in Table 2 indicate that competitiveness (CO) is the most meaningful variable and the industry risk (IR) corresponds to the least relevant feature in terms of both ranking scores.
Bearing in mind that the purpose of this study is to compare the two representations (feature-based and dissimilarity-based) in the field of bankruptcy prediction, not to select the most meaningful variables, the experiments focused on four linear classifiers: the Fisher's linear discriminant (FLD), the linear discriminant classifier (LDC), a support vector machine (SVM) with a linear kernel and the soft-margin constant C = 1.0, and the logistic regression (logit) model (this is considered a classical econometric method that can be viewed as a reference approach for various financial applications). The performance of these techniques was explored both on the feature space (FS) and the dissimilarity space (DS). For the latter case, we chose the representation set R to be equal to a percentage of examples from the training set T, varying from 1% to 50% in steps of 1%. Here two variants were used: (1) the representation set was randomly drawn by picking examples from T without taking care of their class label (R-DS), and (2) the representation set was created by randomly selecting the same proportion of examples from each class (RC-DS).
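The two ways of drawing the representation set can be sketched as follows. The function names, the fixed random seed, the toy labels and the rule of keeping at least one prototype are assumptions introduced only for illustration.

```python
import random

def select_random(n, fraction, rng):
    """R-DS: draw prototype indices from the training set, ignoring class labels."""
    r = max(1, round(fraction * n))  # keep at least one prototype (assumption)
    return sorted(rng.sample(range(n), r))

def select_per_class(y, fraction, rng):
    """RC-DS: draw the same proportion of prototype indices from each class."""
    chosen = []
    for label in sorted(set(y)):
        members = [i for i, yi in enumerate(y) if yi == label]
        k = max(1, round(fraction * len(members)))
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)

# Toy labels: 8 bankrupt (1) and 12 non-bankrupt (0) training examples.
y = [1] * 8 + [0] * 12
rng = random.Random(42)  # fixed seed so the sketch is repeatable

r_ds = select_random(len(y), 0.25, rng)   # 5 prototypes, any class mix
rc_ds = select_per_class(y, 0.25, rng)    # 2 bankrupt + 3 non-bankrupt prototypes
print(r_ds, rc_ds)
```

With very small percentages, the R-DS variant may by chance draw prototypes from a single class, whereas RC-DS always covers both classes.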
The common method to evaluate the performance of bankruptcy prediction systems when databases are small or medium sized corresponds to K-fold cross-validation because it appears to be a better estimator than other strategies, such as bootstrap with a high computational cost or re-substitution with a biased behavior (García et al. 2015). Here a stratified fivefold cross-validation was applied: the data set was randomly divided into five stratified blocks of equal size; for each fold, four blocks were pooled as the training set, and the remaining block was used as an independent test set. Thus, the learning procedure was run a total of five times on different training sets and the results from predicting the class of the test samples were averaged across the five trials. Note that stratification preserves the class proportions of the whole data set in each one of the blocks, thus reducing the prior probability of data set shift and the variance in the estimation process (Santafe et al. 2015).
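A stratified fivefold split of this kind can be sketched as below. This is a minimal illustration, not the authors' implementation: the round-robin assignment within each class, the seed and the toy labels are assumptions.

```python
import random

def stratified_folds(y, k=5, seed=0):
    """Split sample indices into k blocks that preserve the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        members = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(members)
        # Deal the shuffled members of this class round-robin over the blocks.
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds

# Toy labels: 10 bankrupt (1) and 15 non-bankrupt (0) samples.
y = [1] * 10 + [0] * 15
folds = stratified_folds(y)

splits = []
for f, test_idx in enumerate(folds):
    # Four blocks are pooled for training; the remaining block is the test set.
    train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
    splits.append((train_idx, test_idx))
    print(f, len(train_idx), len(test_idx), sum(y[i] for i in test_idx))
```

A model would be trained on each `train_idx`, evaluated on the matching `test_idx`, and the five results averaged.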
In most financial applications, it is important to assess not only the overall accuracy of the model, but also the true-positive and true-negative hits because the misclassification costs are usually asymmetric (the cost of predicting a bankrupt sample as non-bankrupt is generally much higher than the opposite situation) (Caouette et al. 2008). The true-positive rate (or sensitivity) is the proportion of positive samples that are correctly predicted, whereas the true-negative rate (or specificity) is the proportion of negative cases that are correctly predicted. Note that the bankrupt examples form the positive class and the non-bankrupt ones form the negative class.
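The three performance measures follow directly from the counts of correct and incorrect predictions per class, as the following sketch shows. The function name and the toy label vectors are hypothetical.

```python
def rates(y_true, y_pred, positive=1):
    """Return accuracy, true-positive rate and true-negative rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    accuracy = (tp + tn) / len(y_true)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # sensitivity: hits on bankrupt samples
    tnr = tn / (tn + fp) if tn + fp else 0.0  # specificity: hits on non-bankrupt samples
    return accuracy, tpr, tnr

# Toy example: 4 bankrupt (1) and 6 non-bankrupt (0) test samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
accuracy, tpr, tnr = rates(y_true, y_pred)
print(accuracy, tpr, round(tnr, 3))  # 0.8 0.75 0.833
```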

Results and Discussion
Figures 2 and 3 display the accuracy, the true-positive rate (TPr) and the true-negative rate (TNr) averaged across the five runs. For each prediction model, we have plotted the results for the feature space and also the results of the two variants for the dissimilarity space when varying the percentage of examples from T that have been chosen to generate the representation set R. Note that the line parallel to the X-axis corresponds to the case of the feature space, which indicates that the results do not depend on the size of R because they were achieved by learning directly from the training set T. These plots show that the models built with either variant of the dissimilarity representation perform much better than the respective feature-based classifiers.
Focusing on the plots of Fig. 3, it is worth noting that differences between the dissimilarity space and the feature space are especially significant in the case of the true-positive rate, which measures the hits on the most critical class because of the high cost of failing in the prediction of bankrupt samples.
When comparing R-DS and RC-DS, the plots in Figs. 2 and 3 indicate that, in general, there are no differences in prediction performance, independently of the classifier used. However, when the percentage of prototypes is less than 5%, the option of generating the set R with the same proportion of examples from each class (RC-DS) performs slightly better than the R-DS variant.
Tables 3 and 4 report a summary of the experimental results for 10, 20, 30, 40 and 50% of prototypes used to build the representation set. As can be observed, using a dissimilarity space instead of a feature space consistently produces considerable gains in terms of accuracy, true-positive rate and true-negative rate. In the case of accuracy, whilst the performance of the prediction models on the feature space is about 51-58%, that on the dissimilarity space is about 96-99%. Differences are even more significant when the performance is assessed by means of the true-positive rate, especially with the Fisher's linear discriminant model. On the other hand, various configurations of the dissimilarity representation yield a true-negative rate of 100%. These results support the claim that the linear models generally lead to very high performance when they are built on the dissimilarity space.
To gain some insight into these results, we have projected the data onto a two-dimensional subspace through PCA. Figure 4 shows the scatter plots of the original feature space and the two variants of the dissimilarity space (for the percentages of prototypes reported in Tables 3 and 4). In addition, as the size of the original training set is 250 × 6 (250 examples and 6 explanatory variables), we have also included the scatter plots of the dissimilarity representations obtained by random selection of six examples, which results in a matrix D(T, R) of size 250 × 6. In this way, one can compare the class distribution on both spaces under identical conditions (sizes).
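A two-dimensional PCA projection of this kind can be sketched without external libraries as follows. The power-iteration solver with deflation, the iteration count, the starting vector and the toy data are assumptions chosen to keep the example self-contained; in practice a library routine would normally be used. The same function applies to a feature matrix or to a dissimilarity matrix D(T, R).

```python
import math

def top_eigenvector(C, iters=200):
    """Power iteration for the dominant eigenpair of a symmetric PSD matrix."""
    m = len(C)
    v = [1.0 / (j + 1) for j in range(m)]  # arbitrary non-zero starting vector
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[a] * sum(C[a][b] * v[b] for b in range(m)) for a in range(m))
    return lam, v

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - means[j] for j in range(m)] for row in X]
    # Sample covariance matrix of the centered data.
    C = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(m)] for a in range(m)]
    components = []
    for _ in range(2):
        lam, v = top_eigenvector(C)
        components.append(v)
        # Deflation: remove the component just found before searching for the next.
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(m)] for a in range(m)]
    return [[sum(r[j] * c[j] for j in range(m)) for c in components] for r in Xc]

# Toy data whose variance is largest along the first axis, then the second.
X = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0],
     [0.0, -1.0, 0.0], [0.0, 0.0, 0.5], [0.0, 0.0, -0.5]]
Z = pca_2d(X)
for row in Z:
    print([round(v, 3) for v in row])
```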
As can be seen in Fig. 4, the overlap between bankrupt and non-bankrupt examples is very high in the feature space, whereas both dissimilarity-based variants give rise to good separability between classes, irrespective of the size of the representation set R. The lack of separability between classes in the feature space may result in many false-positives or false-negatives, which helps to explain the low performance of the prediction models when they were applied on this space.

Conclusions and Future Work
In the present study, we have explored the feasibility of applying the dissimilarity representation to effectively discriminate between bankrupt and non-bankrupt companies. To this end, four well-known linear prediction techniques (FLD, LDC, SVM and logit) have been implemented both on the feature space and the dissimilarity space and tested over a database generated by a commercial bank in Korea. The experimental results have demonstrated that all the linear models analyzed here for bankruptcy prediction perform clearly better on the dissimilarity space than on the feature space in terms of accuracy, true-positive rate and true-negative rate. Projection of data onto a two-dimensional subspace has shown that the dissimilarity representation provides significantly higher separability between classes than the original feature representation, which helps to explain why the dissimilarity-based prediction models outperform their feature-based counterparts.
In the future, it would be of interest to perform further simulation studies that compare linear and non-linear prediction models on both the dissimilarity and the feature spaces. Other research directions might include the application of the methodology described in this paper to analyze the effects of class imbalance and data set shift on the dissimilarity-based models for bankruptcy prediction or even for other economic and financial problems. A final avenue for further research is to study the applicability of the dissimilarity representation to select the most relevant explanatory variables. This is a non-trivial problem that may require a significant effort, but deserves to be taken into account.