A New approach for Classification of Highly Imbalanced Datasets using Evolutionary Algorithms

International Journal of Scientific and Engineering Research

ISSN Online: 2229-5518

ISSN Print: 2229-5518

Website: http://www.ijser.org

IJSER >> Volume 2, Issue 7, July 2011 Edition

Author(s)

Satyam Maheshwari, Prof. Jitendra Agrawal, Dr. Sanjeev Sharma

KEYWORDS

classification, data mining, evolutionary algorithm, imbalanced datasets, re-sampling, samplings, support vector machine

Much current research interest lies in the application of evolutionary algorithms; one example is learning classification rules in imbalanced domains. Imbalanced data sets pose a major challenge for the data mining community. In an imbalanced data set, one class has far more instances than the others, and the class with fewer representatives is of more interest for the learning task. Traditional machine learning algorithms work well with balanced data sets but are not able to classify imbalanced data sets reliably. In the present paper we use different operators of Genetic Algorithms (GA) for over-sampling to increase the proportion of positive samples, and then apply clustering to the over-sampled training dataset as a data cleaning method for both classes, removing redundant or noisy samples. The proposed approach was analyzed experimentally, and the results show an improvement in classification performance, measured as the area under the receiver operating characteristic (ROC) curve.
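The over-sampling step can be illustrated with a minimal sketch. The abstract does not say which GA operators are used, so the uniform crossover and Gaussian mutation below, along with the function names `crossover`, `mutate`, and `ga_oversample` and the `rate` and `scale` parameters, are assumptions chosen for illustration and not the authors' implementation. The clustering-based cleaning step is not shown.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: copy each feature from one of the two parents."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def mutate(sample, rng, rate=0.1, scale=0.05):
    """Gaussian mutation: perturb each feature with probability `rate`.

    `rate` and `scale` are illustrative defaults, not values from the paper.
    """
    return [x + rng.gauss(0.0, scale) if rng.random() < rate else x
            for x in sample]


def ga_oversample(minority, target_size, seed=0):
    """Grow the minority (positive) class to `target_size` samples.

    Synthetic samples are offspring of two randomly chosen minority
    samples, produced by crossover followed by mutation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples for crossover")
    rng = random.Random(seed)
    synthetic = []
    while len(minority) + len(synthetic) < target_size:
        parent_a, parent_b = rng.sample(minority, 2)
        synthetic.append(mutate(crossover(parent_a, parent_b, rng), rng))
    # Original samples come first, followed by the synthetic ones.
    return minority + synthetic


# Hypothetical toy data: three positive samples with two features each.
minority = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1]]
balanced = ga_oversample(minority, target_size=10)
print(len(balanced))  # 10
```

Because every offspring is built from the features of two real minority samples, the synthetic points stay close to the region the positive class already occupies. This is one plausible reason to follow over-sampling with a cleaning step that removes redundant or noisy samples.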


