A Novel Class Imbalance Learning Method using Subset Filtering

International Journal of Scientific and Engineering Research

ISSN Online 2229-5518

ISSN Print: 2229-5518

Website: http://www.ijser.org

IJSER, Volume 3, Issue 9, September 2012

Full Text (PDF), pp. 95-103

Author(s)

K. Nageswara Rao, Prof. T. Venkateswara Rao, Dr. D. Rajya Lakshmi

KEYWORDS

Classification, class imbalance, weighted sampling, subset filtering.

In many real-world applications, learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem concerns the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Because of the inherently complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we present a new hybrid subset filtering approach for learning from skewed training data. This algorithm provides a simpler and faster alternative by using C4.5 as the base algorithm. We conduct experiments using eleven UCI data sets from various application domains, four base learners, and five evaluation metrics. Experimental results show that our method achieves higher area under the ROC curve (AUC), F-measure, precision, TP rate, and TN rate values than many existing class imbalance learning methods.
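
For readers unfamiliar with the five evaluation metrics named above, the following is a minimal sketch of how they are conventionally computed for a binary problem in which the minority class is treated as positive. It is not taken from the paper: the function names `imbalance_metrics` and `auc`, the label encoding (1 for the minority class), and the toy data are all assumptions made for illustration, and the AUC is computed with the standard pairwise-ranking definition rather than any procedure the authors may have used.

```python
def imbalance_metrics(y_true, y_pred, positive=1):
    """Precision, TP rate, TN rate, and F-measure from hard predictions.

    `positive` marks the minority class (an assumption of this sketch).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    tp_rate = tp / (tp + fn) if (tp + fn) else 0.0   # recall / sensitivity
    tn_rate = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    denom = precision + tp_rate
    f_measure = 2 * precision * tp_rate / denom if denom else 0.0
    return {
        "precision": precision,
        "tp_rate": tp_rate,
        "tn_rate": tn_rate,
        "f_measure": f_measure,
    }


def auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise ranking.

    Equals the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy, hypothetical data: 2 minority and 8 majority examples.
    y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1]
    print(imbalance_metrics(y_true, y_pred))
    print(auc(y_true, scores))
```

On skewed data these metrics are more informative than plain accuracy: a classifier that always predicts the majority class scores a high accuracy but a TP rate of zero, which is why the TP rate and TN rate are reported separately.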


