A New Approach for Classification of Highly Imbalanced Datasets Using Evolutionary Algorithms
|
Author(s) |
Satyam Maheshwari, Prof. Jitendra Agrawal, Dr. Sanjeev Sharma |
|
KEYWORDS |
classification, data mining, evolutionary algorithm, imbalanced datasets, re-sampling, samplings, support vector machine
|
|
ABSTRACT |
Much current research interest lies in the application of evolutionary algorithms; one example is the learning of classification rules in imbalanced domains. Imbalanced data sets pose a major challenge for the data mining community. In an imbalanced data set, the number of instances of one class is much higher than that of the others, and the class with fewer representatives is of more interest from the point of view of the learning task. Traditional machine learning algorithms work well with balanced data sets but are not able to deal with the classification of imbalanced ones. In this paper we use different operators of Genetic Algorithms (GA) for over-sampling to enlarge the ratio of positive samples, and then apply clustering to the over-sampled training dataset as a data cleaning method for both classes, removing redundant or noisy samples. The proposed approach was analyzed experimentally, and the results show an improvement in classification performance, measured as the area under the receiver operating characteristic (ROC) curve.
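
The pipeline described in the abstract (GA-style over-sampling of the positive class, clustering-based cleaning of both classes, evaluation by area under the ROC curve) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the operators or the cleaning rule. Uniform crossover, Gaussian mutation, plain k-means, the `min_share` threshold, and all function names are assumptions made for illustration. The classifier itself is omitted.

```python
import numpy as np

def ga_oversample(X_min, n_new, rng, mutation_scale=0.05):
    """Create n_new synthetic minority samples (assumed operators:
    uniform crossover followed by Gaussian mutation)."""
    n, d = X_min.shape
    # Pick two parents uniformly at random for every child.
    p1 = X_min[rng.integers(0, n, size=n_new)]
    p2 = X_min[rng.integers(0, n, size=n_new)]
    # Uniform crossover: each feature is copied from one parent or the other.
    mask = rng.random((n_new, d)) < 0.5
    children = np.where(mask, p1, p2)
    # Gaussian mutation, scaled by each feature's standard deviation.
    sigma = X_min.std(axis=0) * mutation_scale
    return children + rng.normal(0.0, 1.0, size=(n_new, d)) * sigma

def kmeans_labels(X, k, rng, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster index per row."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def cluster_clean(X, y, k, rng, min_share=0.2):
    """Assumed cleaning rule: within each cluster, drop the samples of a
    class that makes up less than min_share of that cluster."""
    labels = kmeans_labels(X, k, rng)
    keep = np.ones(len(X), dtype=bool)
    for j in range(k):
        in_cluster = labels == j
        for c in (0, 1):
            members = in_cluster & (y == c)
            if members.any() and members.sum() / in_cluster.sum() < min_share:
                keep[members] = False
    return X[keep], y[keep]

def auc(scores, y):
    """Area under the ROC curve via the rank-sum formula (assumes no ties)."""
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = (y == 1).sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy data (hypothetical): 200 negative and 10 positive samples in 2-D.
rng = np.random.default_rng(0)
X_neg = rng.normal(0.0, 1.0, size=(200, 2))
X_pos = rng.normal(2.5, 1.0, size=(10, 2))
X_new = ga_oversample(X_pos, n_new=190, rng=rng)   # balance the classes
X = np.vstack([X_neg, X_pos, X_new])
y = np.array([0] * 200 + [1] * 200)
X_clean, y_clean = cluster_clean(X, y, k=4, rng=rng)
```

The cleaned set `X_clean`, `y_clean` would then be used to train a classifier such as a support vector machine, and `auc` applied to its decision scores on held-out data.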
|