International Journal of Scientific and Engineering Research
ISSN Online 2229-5518
ISSN Print: 2229-5518
Website: http://www.ijser.org
Volume 2, Issue 4, April 2011 Edition
Feature Selection for Cancer Classification: A Signal-to-noise Ratio Approach
Debahuti Mishra, Barnali Sahu
Classification, Feature selection, Cancer data, Microarray, Signal-to-noise ratio
Cancers are generally caused by abnormalities in the genetic material of the transformed cells. Cancer has a reputation as a deadly disease; hence cancer research is an intense scientific effort to understand the disease. Classification is a machine learning technique used to predict group membership for data instances. There are several classification techniques, such as decision tree induction, Bayesian classifiers, k-nearest neighbor (k-NN), case-based reasoning, support vector machines (SVM), and genetic algorithms. Feature selection for the classification of cancer data aims to discover gene expression profiles of diseased and healthy tissues and to use that knowledge to predict the health state of a new sample. It is usually impractical to examine every feature in detail before picking the right ones. This paper provides a model for feature selection using signal-to-noise ratio (SNR) ranking. We propose two approaches to feature selection. In the first approach, the genes of the microarray data are clustered by k-means clustering, SNR ranking is then applied to obtain the top-ranked features from each cluster, and these features are given to two classifiers, SVM and k-NN, for validation. In the second approach, the features (genes) of the microarray data set are ranked by SNR ranking alone, and the top-scoring features are given to the classifiers and validated. We tested the proposed approaches on the Leukemia data set and used 10-fold cross-validation to validate the classifiers. The 10-fold cross-validation results of the two approaches are compared with hold-out validation results and with the leave-one-out cross-validation (LOOCV) results of different approaches in the literature. In the experimental evaluation, the first approach achieved 99.3% accuracy for both the k-NN and SVM classifiers with five genes under 10-fold cross-validation.
This accuracy is compared with the accuracy of different methods available in the literature for the Leukemia data set with LOOCV, where only the multiple-filter-multiple-wrapper approach gives 100% accuracy.
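The two selection strategies described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the SNR formula, so the sketch assumes the commonly used two-class definition |mean1 - mean2| / (std1 + std2). The function names, the toy data, and the cluster assignment (which stands in for the output of k-means) are all hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def snr_scores(samples, labels):
    """Score each gene by an assumed SNR: |mean0 - mean1| / (std0 + std1).

    samples: list of rows, one per tissue sample, each a list of expression values.
    labels:  list of 0/1 class labels, one per sample (at least two per class).
    """
    class0 = [row for row, y in zip(samples, labels) if y == 0]
    class1 = [row for row, y in zip(samples, labels) if y == 1]
    scores = []
    for g in range(len(samples[0])):
        a = [row[g] for row in class0]
        b = [row[g] for row in class1]
        spread = stdev(a) + stdev(b)
        # A gene that is constant within both classes gets score 0 here
        # (an arbitrary choice made for this sketch).
        scores.append(abs(mean(a) - mean(b)) / spread if spread > 0 else 0.0)
    return scores


def top_genes(scores, k):
    """Second approach: rank all genes by SNR and keep the k best indices."""
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:k]


def top_genes_per_cluster(scores, cluster_of_gene, per_cluster):
    """First approach: keep the best-scoring genes inside each gene cluster.

    cluster_of_gene[g] is the cluster id of gene g, e.g. produced by k-means.
    """
    by_cluster = {}
    for gene, cluster in enumerate(cluster_of_gene):
        by_cluster.setdefault(cluster, []).append(gene)
    selected = []
    for cluster in sorted(by_cluster):
        ranked = sorted(by_cluster[cluster], key=lambda g: scores[g], reverse=True)
        selected.extend(ranked[:per_cluster])
    return selected


# Toy data (hypothetical): 6 samples, 3 genes, two classes.
samples = [
    [1.0, 5.0, 3.0],
    [1.2, 5.1, 9.0],
    [0.8, 4.9, 1.0],
    [4.0, 5.0, 2.0],
    [4.2, 5.2, 8.0],
    [3.8, 4.8, 5.0],
]
labels = [0, 0, 0, 1, 1, 1]
cluster_of_gene = [0, 1, 0]  # hypothetical stand-in for a k-means result

scores = snr_scores(samples, labels)
print(top_genes(scores, 2))
print(top_genes_per_cluster(scores, cluster_of_gene, 1))
```

In this sketch the selected gene indices would then be used to subset the expression matrix before training and cross-validating a classifier such as SVM or k-NN. Selecting per cluster trades some raw SNR for diversity, since a weaker gene can be kept when it is the best representative of its cluster.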