Author Topic: A New Approach for Classification of Highly Imbalanced Datasets using Evolutionary Algorithms

Authors: Satyam Maheshwari, Prof. Jitendra Agrawal, Dr. Sanjeev Sharma
International Journal of Scientific & Engineering Research Volume 2, Issue 6, June-2011
ISSN 2229-5518

Abstract— Much of today’s research interest lies in the application of evolutionary algorithms; one example is learning classification rules in imbalanced domains. Imbalanced data sets pose a major challenge to the data mining community. In an imbalanced data set, the number of instances of one class is much higher than that of the others, and the class with fewer representatives is the one of greater interest for the learning task. Traditional machine learning algorithms work well with balanced data sets but are unable to deal with the classification of imbalanced data sets. In this paper we use the crossover and mutation operators of Genetic Algorithms (GA) for over-sampling, enlarging the proportion of positive samples, and then apply clustering to the over-sampled training set as a data cleaning method for both classes, removing redundant or noisy samples. The proposed approach was analyzed experimentally, and the results show an improvement in classification performance measured as the area under the receiver operating characteristic (ROC) curve.
Index Terms— classification, data mining, evolutionary algorithm, imbalanced data sets, re-sampling, support vector machine.

1   INTRODUCTION                                                                     
THE problem of imbalanced data sets occurs when the majority class accounts for a large percentage of the samples while the minority class occupies only a small part. Such a condition poses challenges for classical machine learning algorithms, which are designed to optimize overall classification accuracy. Imbalanced data sets exist in many domains, such as medical applications [1], risk management [2], face recognition [3], and information technology. In these domains, the minority class is of more interest than the majority class. On imbalanced data sets, the traditional approach of maximizing overall performance will often fail to learn anything useful about the minority class because of the dominating effect of the majority class. A learner can easily achieve 99% accuracy yet still fail to correctly classify any rare examples. Therefore, analyzing the imbalanced data set (IDS) problem requires new and more adaptive methods than those used in the past.
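The accuracy paradox above can be demonstrated with a few lines of plain Python. The 990/10 class split below is a made-up example, not data from the paper; it simply shows how a trivial classifier that always predicts the majority class reaches 99% accuracy while never identifying a single minority example.

```python
# Hypothetical imbalanced data set: 990 negatives, 10 positives.
labels = [0] * 990 + [1] * 10
# A trivial classifier that always predicts the majority class.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
minority_recall = (
    sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    / sum(y == 1 for y in labels)
)
print(accuracy)         # 0.99
print(minority_recall)  # 0.0
```

Overall accuracy is 99%, yet recall on the minority class is zero, which is exactly why accuracy alone is a misleading objective on imbalanced data.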
In this paper we over-sample the minority class with mutation and crossover operators to decrease the imbalance ratio, and then apply clustering to both classes to delete redundant and noisy samples. By combining the two methods, only the samples of interest remain, improving computational efficiency.
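The two-stage idea can be sketched as follows. This is a minimal illustration under assumed operator details (single-point crossover, Gaussian mutation, and a simple distance-based cleaning step standing in for the clustering phase); the authors' exact operators and clustering method may differ.

```python
import random

def crossover(parent_a, parent_b):
    """Single-point crossover of two minority-class feature vectors."""
    point = random.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(sample, sigma=0.05):
    """Gaussian mutation: jitter each feature slightly."""
    return [x + random.gauss(0.0, sigma) for x in sample]

def oversample_minority(minority, target_size):
    """Generate synthetic minority samples until target_size is reached."""
    synthetic = list(minority)
    while len(synthetic) < target_size:
        a, b = random.sample(minority, 2)
        synthetic.append(mutate(crossover(a, b)))
    return synthetic

def clean(samples, eps=0.01):
    """Drop near-duplicate samples (a stand-in for the clustering step)."""
    kept = []
    for s in samples:
        if all(sum((x - y) ** 2 for x, y in zip(s, k)) > eps ** 2
               for k in kept):
            kept.append(s)
    return kept

random.seed(0)
# Hypothetical 2-D minority-class samples.
minority = [[random.random(), random.random()] for _ in range(10)]
augmented = clean(oversample_minority(minority, 50))
print(len(minority), "->", len(augmented))
```

The over-sampling step raises the minority count toward a target ratio, and the cleaning step then removes redundant points so that only informative samples remain, which is the computational-efficiency gain the paragraph above describes.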
The contribution is organized as follows: Section 2 introduces the problem of imbalanced data sets, describing its features and how to deal with it. Next, in Section 3 we review the related work in this field. Section 4 describes the characteristics of our proposal. Section 5 presents the measures used for performance evaluation on imbalanced data sets. Section 6 analyses the experimental results. Finally, conclusions and future work are given in Section 7.

2   THE PROBLEM OF IMBALANCED DATA SETS
Learning from imbalanced data is an important topic that has recently drawn attention in the machine learning community [4]. Imbalanced data sets occur in many real-world applications, such as detection of fraudulent telephone calls [5], text classification [6], information retrieval and filtering tasks [7], and data mining for direct marketing [8]. The problem of imbalanced data sets in classification arises when the number of instances of one class is much lower than that of the other classes. Specifically, when the data set has only two classes, this happens when one class is represented by a large number of examples while the other is represented by only a few, and usually the minority class represents the concept of interest.

Traditional classifier algorithms are biased towards the majority class (negative samples), since the rules that predict the larger number of examples are weighted positively during the learning process in favor of the accuracy metric. Consequently, samples belonging to the minority class (positive samples) are misclassified more often than those belonging to the majority class [20].
Imbalanced data sets present several challenges. The first is the measurement of performance: to overcome this problem, appropriate evaluation metrics are used to guide the learning process towards the desired solution. The second is lack of data: if a class has very few samples, it is very difficult to construct accurate decision boundaries between the classes. The third is noise: noisy data have a more serious impact on minority classes than on majority classes, and classical machine learning algorithms tend to treat samples from the minority class as noise.
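One evaluation metric suited to imbalanced data, used in the abstract's experiments, is the area under the ROC curve. A standard rank-based formulation of AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one; the scores below are made-up classifier outputs for illustration, not results from the paper.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs where the
    positive is scored higher, with 0.5 credit for tied scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.1]
print(roc_auc(labels, scores))  # 5/6 ≈ 0.833
```

Unlike accuracy, this measure is insensitive to the class ratio itself, which is why it is the performance criterion adopted in this work.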
