Equilibrating the Recognition of the Minority Class in the Imbalance Context

In pattern recognition, it is well known that classifier performance depends on the classification rule and on the complexities present in the data sets (such as class overlapping, class imbalance, outliers and high dimensionality, among others). The class imbalance issue arises when one class is under-represented with respect to the other classes. If a classifier is trained with an imbalanced data set, its natural tendency is to recognize the samples of the majority class and to ignore the minority classes. This situation is not desirable because in real problems it is necessary to recognize the minority class better without sacrificing the accuracy of the majority class. In this work we analyze the behaviour of four classifiers taking into account a relative balance between the per-class accuracies.


Introduction
In pattern recognition, class imbalance is a major classification problem. In this context, classifiers commonly assume that the class distribution in the data sets is balanced; in real problems (detection of oil spills, medical diagnosis, face recognition, among others [1]) this assumption does not hold. For example, consider a medical problem in which the number of healthy cases (900), forming the majority class, is higher than the number of ill cases (100), forming the minority class. Both classes are important, but in this example the classifier may skew its learning towards the majority class and, as a consequence, the patterns of the minority class will be ignored [1].
The class imbalance problem is in some cases correlated with other problems in the training data, such as class overlapping, data set size, small disjuncts, high dimensionality, and others [2]. The classifier behaviour also depends on the classification rule. For example, some algorithms generalize knowledge, such as decision-tree algorithms (C4.5) [3], neural networks and support vector machines, among others. This situation appears mainly when the training data set is imbalanced, due to the tendency to assign a given test sample to the most represented class [4], [5], [1], [6].
This study focuses on four models which generalize knowledge: three neural networks and one associative memory. Artificial Neural Networks (ANN) are mathematical models inspired by the functioning of the human brain, simulating the interconnections existing between neurons, which enable information processing. The learning process of Artificial Neural Networks is carried out in parallel through the interconnections between the node layers. In most cases it is not necessary to train the neural network twice, and its knowledge is obtained by adjusting the weights [7]. In this sense, some network models are very useful in classification tasks, such as the Bayesian Network, the Multilayer Perceptron and the Radial Basis Functions Network [8], [9], [10].
On the other hand, associative memories have the ability to correctly recover input patterns; for this, the associative models comprise two phases: a learning phase and a recovery phase. In the first phase, the associative memory stores its learning as a matrix, which represents the associations made between the input patterns (vectors of n components or features) and the output patterns (classes). In the second phase, the input patterns are recovered [11].
Some approaches proposed for handling the imbalance problem focus on increasing the number of samples of the minority class (over-sampling), diminishing the number of samples of the majority class (under-sampling), or biasing the classifier behaviour in the training step so that it identifies the minority class better [1]. The first method randomly duplicates minority samples with the aim of balancing the classes. The second method randomly eliminates majority-class patterns. The third method consists in modifying the cost associated with misclassifying the minority class [3].
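The first two strategies above can be sketched as follows; this is a minimal illustration with hypothetical function names and toy data, not any specific library implementation:

```python
import random

def random_oversample(minority, majority, rng=random.Random(0)):
    """Randomly duplicate minority samples until both classes have the same size."""
    out = list(minority)
    while len(out) < len(majority):
        out.append(rng.choice(minority))
    return out

def random_undersample(majority, minority, rng=random.Random(0)):
    """Randomly keep only as many majority samples as there are minority samples."""
    return rng.sample(majority, len(minority))

minority = [[0.1, 0.2], [0.3, 0.1]]
majority = [[1.0, 1.1], [0.9, 1.2], [1.1, 0.8], [1.0, 0.9]]
print(len(random_oversample(minority, majority)))   # balanced: 4 minority samples
print(len(random_undersample(majority, minority)))  # balanced: 2 majority samples
```

Both sketches balance the class sizes exactly; real preprocessing methods (such as Wilson editing or Smote, described later) replace the purely random choices with neighbourhood-based criteria.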
The previous techniques are widely used. However, none of them is designed to obtain a relative balance between the performances on each class. That is to say, in some cases applying a given method can invert the imbalance, so that the majority class becomes a minority and the minority becomes the majority [12]. This situation is not desirable, because the imbalance problem is then not resolved, only inverted.
Taking into account the neural approach, some works have been carried out in the class imbalance context. In this sense, the research by [13] shows an improvement in imbalanced data classification by applying Principal Component Analysis (PCA) before adding Gaussian noise to the samples used for network learning. In another work, the redundant samples belonging to the majority class are eliminated through a stochastic sensitivity measure, thereby improving the performance of the Radial Basis Function Neural Network [14].
Few works have been found relating the associative models to the issues implicit in the data sets, such as class imbalance, outliers and high dimensionality, among others. A first work analyzes the performance of the HACT model taking into account the geometric mean and under-sampling methods, on eleven imbalanced data sets [15]. On the other hand, [16] considered feature selection methods to treat the data sets before training the HACT model.
In terms of a balanced recognition between the class rates, this work analyzes the behaviour of four models which generalize knowledge. Specifically, the Hybrid Associative Classifier with Translation (HACT) and three well-known neural classifiers (the Bayesian Network, the Multilayer Perceptron and the Radial Basis Functions Network) are considered. Experiments with thirteen real-life data sets show that the improvement in classifier performance is most noticeable when the imbalanced data sets are preprocessed beforehand. These results are obtained considering a balanced recognition: the accuracy on the minority class is increased without significantly diminishing the accuracy on the majority class.
The paper is structured as follows: in Section 2 the HACT model is presented; in Section 3 the neural models are described; the preprocessing methods are shown in Section 4; the experimental set-up and experimental results are presented in Sections 5 and 6; finally, the main concluding remarks are given in Section 7.

Hybrid Associative Classifier with Translation (HACT)
The HAC model combines two associative memories: the Lernmatrix and the Linear Associator. The first associative memory requires that the input patterns be binary vectors; the second requires that the input patterns be orthonormal vectors. These requirements are considered disadvantages of those models, and the HAC model arose to overcome them. Additionally, the model has a low computational cost in its recognition process [11].
The disadvantage of the HAC model appears when the input patterns of one class have a large magnitude in comparison with the magnitude of the input patterns belonging to another class. In this case, the input patterns of smaller magnitude will be assigned to the class of the patterns with larger magnitude. To correct this limitation of the HAC associative model, an axis translation was implemented in the HACT model: the coordinate axes are moved to a new, parallel set of axes.
To carry out the procedure of the HACT associative model, the mean vector of all input patterns is obtained. The mean vector works as the centre of the new coordinate axes, and in this way a new data set is generated. The mean vector is obtained through x̄ = (1/p) ∑_{µ=1}^{p} x^µ, and the axis translation is made with x^{µ′} = x^µ − x̄ [11].
The HACT associative model obtains its learning through the first phase of the Linear Associator model [17], where the outer product is used to associate input patterns with output patterns. The final matrix, which represents the learning of the HACT model, is obtained as the sum of all outer products, M = ∑_{µ=1}^{p} y^µ (x^{µ′})^T. The recovery phase of the HACT model is carried out through the second phase of the Lernmatrix associative model, using the matrix obtained in the learning phase of the HACT model and the input patterns.
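The two phases described above can be sketched as follows. This is a minimal illustration under the assumptions that the output patterns are one-hot class vectors and that recovery picks the class with the strongest response; the function names and toy data are ours, not part of the original model description:

```python
import numpy as np

def hact_train(X, Y):
    """Learning phase: translate the axes to the mean vector, then sum the
    outer products of each one-hot output pattern with its translated input."""
    x_bar = X.mean(axis=0)                       # mean vector = new origin
    Xt = X - x_bar                               # translation x' = x - x_bar
    M = sum(np.outer(yk, xk) for yk, xk in zip(Y, Xt))
    return M, x_bar

def hact_classify(M, x_bar, x):
    """Recovery phase: translate the test pattern and take the class whose
    row of M responds most strongly (Lernmatrix-style recovery)."""
    return int(np.argmax(M @ (x - x_bar)))

X = np.array([[0., 0.], [0., 1.], [5., 5.], [6., 5.]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])   # one-hot output patterns
M, x_bar = hact_train(X, Y)
print(hact_classify(M, x_bar, np.array([0., 0.5])))  # class 0
```

Note how the translation makes the decision depend on the pattern's position relative to the data centroid rather than on its raw magnitude, which is precisely the HAC limitation the translation corrects.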

Neural Networks
In the following subsections, three neural networks are described: the Bayesian Network, the Multilayer Perceptron and the Radial Basis Functions Network. Their main characteristics are the following: the first network bases its learning on probability theory, the second takes into account a single hidden layer in its topology, and the third uses function nodes in its hidden layer.

Bayesian Network
The probabilistic approach called Bayesian Network (BN) was developed by Pearl in the 1980s and has been widely used in pattern recognition as a robust classifier. The BN operates through a network structure, taking into account conditional probabilities (considering a priori knowledge) in its training and the Bayes theorem in the classification [25], [19], [20], [21]. The BN approach can be seen as BN = (DAG, P), where DAG represents a directed acyclic graph topology and P indicates the conditional probabilities.
Besides, it is important to mention that the approach reports the most probable variable values, with the probabilities distributed throughout the network. Each random variable (event) is represented as an independent network node [22], [23], [24]. Additionally, the BN cannot obtain a good network structure when the feature space has high dimensionality [25].

Multilayer Perceptron
This network model has been widely used in pattern recognition for its generalization ability. The Multilayer Perceptron (MLP) was developed as a nonlinear network model organized in layers: the input layer, the hidden layer and the output layer. The first layer is made up of input units that represent the attributes of the examples. The nodes of the second layer make it possible to obtain several decision boundaries, which are combined to obtain a classification decision. Finally, in the output layer, all output nodes have a zero value except the node that indicates the class [26], [27], [28].
The MLP network has traditionally been trained with backpropagation, using gradient descent on the error function in order to minimize it. On the other hand, the literature notes that if the network training becomes stuck in a local minimum, then the posterior probabilities cannot be obtained [27], [29]. In addition, the classification of the examples is read from the output network nodes.

Radial Basis Functions Network
The Radial Basis Functions Network (RBFN) is a feedforward network well known in pattern recognition, which emerged from research by Broomhead, Lowe and Lee, among other authors [7]. The RBF network topology is formed by an input layer, a hidden layer and an output layer. The hidden layer of the RBF is made up of kernel function nodes (each node associated with different weights) instead of the simple hidden nodes of the MLP network. Traditionally, the basis function used in the nodes of the hidden layer has been the Gaussian function [7]. In addition, the RBF network is faster in its learning process than the MLP network [30].
The learning process of the RBF network uses a basis function to map the input samples onto the hidden layer nodes. Thus, the function can be seen as φ(‖x − x_n‖), where φ denotes the non-linear function and ‖x − x_n‖ expresses the distance (for example, the Euclidean distance) [30].
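As a small illustration of φ(‖x − x_n‖) with the Gaussian basis mentioned above, a hidden node's activation can be sketched as follows (the function name and the width parameter σ are ours):

```python
import math

def gaussian_rbf(x, centre, sigma=1.0):
    """Gaussian basis function: exp(-||x - centre||^2 / (2 * sigma^2)).
    The activation is 1 at the centre and decays with Euclidean distance."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x, centre))
    return math.exp(-dist2 / (2 * sigma ** 2))

print(gaussian_rbf([0.0, 0.0], [0.0, 0.0]))              # 1.0 at the centre
print(round(gaussian_rbf([1.0, 0.0], [0.0, 0.0]), 4))    # exp(-0.5) ≈ 0.6065
```

Each hidden node applies one such function around its own centre; the output layer then combines the activations with the learned weights.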
The learning of the RBF network is not finished until the network parameters are adjusted. In addition, the error must be reduced until a minimum error value is reached [32].

Preprocessing Methods
Traditionally, the imbalance issue has been addressed at the algorithm level, at the sampling level, and with cost-sensitive methods. In the first approach, the minority class is handled inside the algorithm: a modification is made to the algorithm, for which it is necessary to know the classification rule and the application domain. Some authors note that preprocessing methods are effective solutions for balancing the class distribution. When a sampling method is applied, it is not necessary to know the classification rule, inasmuch as the method treats the class imbalance inside the data sets [33]. The cost-sensitive technique combines the previous approaches, taking into account the cost of misclassification in the learning phase or modifying the algorithm by considering the cost in the classification [1].
A preprocessing method at the sampling level is Smote (Synthetic Minority Oversampling Technique). This approach, proposed by Chawla et al., is an oversampling method that generates synthetic examples of the minority class through random interpolation [34]. This is performed until a balance between the classes is obtained. The procedure to obtain a synthetic example consists in taking the difference between the current example and one of its k nearest neighbours (selected randomly); after that, the difference vector is multiplied by a value between zero and one, and the resulting synthetic example is incorporated [1]. It is important to mention that the Smote method alleviates the overfitting problem generated by random oversampling methods; this issue appears when examples are duplicated without generating new information in the data sets [35]. A formal description of the Smote procedure can be found in [36].
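The interpolation step described above can be sketched as follows. This is a simplified illustration with hypothetical function names, not the reference implementation of [34]:

```python
import math
import random

def smote_sample(x, neighbour, rng):
    """One synthetic example: x + gap * (neighbour - x), with gap in [0, 1]."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(x, neighbour)]

def smote(minority, n_new, k=3, rng=random.Random(0)):
    """Generate n_new synthetic minority samples, each by interpolating a
    randomly picked sample with one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [m for m in minority if m is not x]     # exclude x itself
        others.sort(key=lambda m: math.dist(x, m))       # nearest first
        neighbour = rng.choice(others[:k])               # one of k nearest
        synthetic.append(smote_sample(x, neighbour, rng))
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(len(smote(minority, n_new=4)))  # 4 synthetic samples
```

Because each synthetic sample lies on the segment between two existing minority samples, the method adds new points in the minority region instead of duplicating existing ones.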

Description of data sets
The data sets were taken from the imbalanced data sets section of the KEEL repository (http://www.keel.es/dataset.php). All data sets are two-class problems with different characteristics, such as the imbalance ratio (IR), the feature dimensionality (number of features, F) and the data set size (number of patterns, P); this can be seen in Table 1. The data sets are sorted by the level of class imbalance they present. The IR is obtained by dividing the number of patterns of the majority class (May) by the number of patterns of the minority class (Min); this can be seen as IR = May/Min. In the literature, an IR is considered high when its value is greater than ten. It is possible to observe in Table 1 a high imbalance ratio in four data sets. In addition, k-fold cross-validation was used to obtain five partitions of each original data set.
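As a small illustration of the IR computation, using the healthy/ill example from the introduction (the function name is ours):

```python
def imbalance_ratio(n_min, n_maj):
    """IR = May / Min; an IR greater than ten is taken here as high imbalance."""
    return n_maj / n_min

# 900 healthy (majority) cases vs. 100 ill (minority) cases.
ir = imbalance_ratio(100, 900)
print(ir, ir > 10)  # 9.0 False -> below the high-imbalance threshold
```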

Performance Measures
In this section, the measures for checking classifier performance in the imbalance context are described. In this paper, two performance measures are used to evaluate the neural networks and the HACT model: the geometric mean and the area under the ROC curve (AUC).
Traditionally, the overall accuracy (Acc) has been used in the balanced data sets context. However, the Acc measure is not appropriate for imbalanced data sets, because it does not consider the correct classification of each class separately; in this way, it is possible to obtain a classification model which reports an accuracy of 90% under a very high imbalance ratio. The overall accuracy is expressed as the number of patterns classified correctly (over all classes) divided by the total number of patterns in the test data set, Acc = (TP + TN)/(TP + TN + FP + FN) [41], where TP and TN indicate the correct classification of the minority and majority classes, and the misclassifications of both classes are expressed as FN (minority class) and FP (majority class).
On the other hand, a measure which considers the per-class accuracies is the geometric mean. This measure gives a symmetric treatment to the negative recognition rate (TN_r = TN/(TN + FP)) and the positive recognition rate (TP_r = TP/(TP + FN)) [41]:

MG = √(TP_r × TN_r)    (4)

In some cases the geometric mean gives only a partial picture, namely when one of the rates has a zero value; in that case, most of the accuracy is provided by a single class. This disadvantage can be resolved with the area under the ROC (Receiver Operating Characteristic) curve, or AUC. This measure is used in the class imbalance context and takes into account the positive and negative classification rates separately. For a single classifier operating point, the AUC can be seen as [42], [43]:

AUC = (1 + TP_r − FP_r)/2    (5)

where FP_r = FP/(TN + FP) = 1 − TN_r.

Experimental results and discussion
In pattern recognition it is of great importance to recognize the minority class. However, this is difficult to achieve with imbalanced data sets, as classifiers tend to bias their learning towards the majority class. The purpose of this paper is to analyze the balanced recognition between the TP_r and TN_r rates without degrading the accuracy of the majority class. It is important to mention that this recognition is performed with imbalanced data sets. In addition, the recognition is considered balanced when there is a difference of at most 20% between the accuracies of the classes (majority and minority).
The first subsection shows the experimental results obtained without preprocessing the imbalanced data sets. After that, the results obtained with the undersampling and oversampling preprocessing methods are exhibited. In addition, all results presented in the tables show the average accuracy over the five partitions obtained from each data set with the cross-validation method. Finally, the best results obtained by the classifiers are underlined, and a relative (balanced) recognition between the classes is indicated in bold.

Experimental results without preprocessing
This section exhibits the experimental results without preprocessing of the imbalanced data sets. The values of the true positive and true negative rates are shown in Table 2, and the results obtained with the AUC and MG measures are reported in Table 3. From these experiments, the results obtained with the HACT model show a balance between the accuracy rates (TP_r and TN_r) in all data sets; in this case the class recognition is achieved without sacrificing the accuracy of the majority class. This cannot be observed for the three neural networks. For example, the Bayesian Network reports a balanced recognition between the classes in 61.54% of the cases (eight data sets), while the MLP and RBF networks exhibit a balanced recognition in 38.46% (five data sets) and 53.85% (seven data sets) of the cases, respectively. Regarding the neural networks, it is possible to observe that the BN shows a better performance with respect to the other two neural models (MLP and RBF). Table 3 shows the values obtained with the AUC and MG measures. From these results it is possible to observe that the HACT model benefits most, with a balanced recognition in four data sets, while the neural networks reach their maximum benefit in two data sets. For example, the BN presents its best performance on the Glass6 and Shuttle-c0 vs c4 data sets, whereas the MLP and RBF networks exhibit their best behaviour on one data set each.

Experimental results using oversampling and undersampling methods
The experimental results obtained with the preprocessing methods are presented in this section, specifically the undersampling (Wilson) and oversampling (Smote) techniques. Table 4 shows the results by class, that is to say, the accuracy of each class represented separately. The HACT model keeps a balanced recognition in all thirteen data sets when the undersampling technique called Wilson is used. However, this is not the case for the neural networks: the BN, MLP and RBF networks show a balanced recognition in only 38.46% (five data sets) or 30.77% (four data sets) of the cases. It is important to mention that the balanced recognition is desirable because the classifier then ensures an adequate recognition of both classes (minority and majority). Table 5 shows the AUC and MG values after preprocessing with Wilson, obtained in terms of (balanced) accuracy by class. From these experiments it is possible to observe that the HACT model shows the best performance on six (using AUC) and seven (using MG) data sets in comparison with the other classifiers. This situation cannot be observed in the results obtained without preprocessing; in this way, it is possible to say that better results are obtained when the Wilson method is used. Tables 6 and 7 show the experimental results obtained after preprocessing the data sets with the oversampling method, specifically Smote. The balanced recognition between the TP_r and TN_r rates was not fully achieved by the HACT model and the three neural networks. The maximum balanced recognition was reached by the HACT model and the BN network, on ten data sets (76.92% of the cases). The experiments presented in Table 6 show that the MLP and RBF networks obtain a balanced recognition on eight (61.54% of the cases) and nine (69.23% of the cases) data sets. Despite this, the best balanced recognition between the classes is obtained using the Smote method.
This situation cannot be observed in the results obtained without preprocessing or with the Wilson method.
In the context of a balanced recognition between the class accuracies, the BN and MLP neural networks show a better classification performance on four and five data sets, respectively, when the data sets are preprocessed with Smote. This situation cannot be observed in the experiments without preprocessing or with the Wilson method.
In addition, the experimental results obtained with preprocessing (Wilson and Smote) show a greater benefit in classifier performance when there is a balance between the per-class accuracies. Figure 1 shows the original data set sizes, as well as the sizes of the data sets after preprocessing; the x axis corresponds to the data sets, while the y axis indicates the data set size. The results show the convenience of using the undersampling method, because a good classification performance can be obtained with considerably smaller data sets. An interesting case can be observed with the data set called shuttle-c0 c4: the Smote method uses 13220 samples, in comparison with 7311 samples for the Wilson method.

Conclusions
In this work, the behaviour of the HACT, BN, MLP and RBF models was analyzed in the context of a balanced recognition between the classes. The experiments were carried out with and without preprocessing methods, considering thirteen real-world data sets.
In terms of a balanced recognition, the four classifiers show a situation of great interest when no preprocessing is applied to the imbalanced data sets: each classifier recognizes the minority class without sacrificing the accuracy of the majority class in at least one data set. In this sense, the better classification performance in the context of a relative recognition is most pronounced for the HACT model, in comparison with the results obtained with the other classifiers, when no preprocessing is considered.
In the domain of a balanced recognition, the results obtained with preprocessing methods show a better behaviour for three classifiers: HACT, BN and MLP. From this it is possible to conclude that the neural models need a balanced class distribution to obtain a good classification performance. In addition, it is convenient to use the Wilson method, inasmuch as a good per-class performance can be obtained with fewer samples. In this sense, when the Wilson method is used, the HACT performance improves in comparison with the results obtained with the Smote method and without preprocessing. On the other hand, it was possible to observe that the BN and MLP networks improve their performance in the balanced recognition context when the Smote method is considered; this cannot be observed in the experiments with the Wilson method or without preprocessing.
As future work, we point to the study of other classifiers and to a deeper study of imbalance in the associative memories context.