Feature Selection for Cancer Classification: A Signal-to-noise Ratio Approach
|
|
|
Author(s) |
Debahuti Mishra, Barnali Sahu |
|
KEYWORDS |
Classification, Feature selection, Cancer data, Microarray, Signal-to-noise ratio
|
|
ABSTRACT |
Cancers are generally caused by abnormalities in the genetic material of the transformed cells. Because cancer has a reputation as a deadly disease, cancer research is an intense scientific effort to understand it. Classification is a machine learning technique used to predict group membership for data instances. Several classification techniques exist, such as decision tree induction, Bayesian classifiers, k-nearest neighbor (k-NN), case-based reasoning, support vector machines (SVM), and genetic algorithms. Feature selection for the classification of cancer data aims to discover gene expression profiles of diseased and healthy tissues and to use that knowledge to predict the health state of a new sample. It is usually impractical to examine every feature in detail before picking the right ones. This paper provides a model for feature selection using signal-to-noise ratio (SNR) ranking. We propose two approaches to feature selection. In the first approach, the genes of the microarray data are clustered by k-means clustering, SNR ranking is then applied to obtain the top-ranked features from each cluster, and these features are given to two classifiers, SVM and k-NN, for validation. In the second approach, the features (genes) of the microarray data set are ranked by SNR ranking alone, and the top-scoring features are given to the classifiers and validated. We tested the proposed approaches on the leukemia data set and used 10-fold cross-validation to validate the classifiers. The 10-fold cross-validation results of the two approaches are compared with hold-out validation results and with the leave-one-out cross-validation (LOOCV) results of different approaches in the literature. In the experimental evaluation, the first approach achieved 99.3% accuracy for both the k-NN and SVM classifiers with five genes under 10-fold cross-validation. Among the methods in the literature evaluated with LOOCV on the leukemia data set, only the multi-filter-multi-wrapper approach gives 100% accuracy.
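The two selection strategies described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the SNR formula, so the sketch assumes the common two-class definition SNR = (mean_a - mean_b) / (std_a + std_b) and uses the population standard deviation. The function names, the toy expression values, and the gene-to-cluster assignment are hypothetical; in the first approach the cluster labels would come from k-means, which is not implemented here.

```python
import math
from statistics import mean, pstdev


def snr_score(values, labels):
    """SNR of one gene over two classes (assumed definition, see lead-in)."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this SNR definition needs exactly two classes")
    a = [v for v, y in zip(values, labels) if y == classes[0]]
    b = [v for v, y in zip(values, labels) if y == classes[1]]
    diff = mean(a) - mean(b)
    denom = pstdev(a) + pstdev(b)
    if denom == 0:
        # Assumption: zero spread in both classes. Equal means carry no
        # signal; different means separate the classes perfectly.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / denom


def rank_genes(expr, labels, top_n):
    """Second approach: rank all genes by |SNR| and keep the top_n indices."""
    scores = [snr_score(row, labels) for row in expr]
    order = sorted(range(len(expr)), key=lambda g: -abs(scores[g]))
    return order[:top_n]


def top_per_cluster(expr, labels, gene_clusters, n_per_cluster=1):
    """First approach: keep the top |SNR| genes inside each gene cluster.

    gene_clusters[g] is the cluster id of gene g (assumed to come from a
    k-means run over the genes, which is outside this sketch).
    """
    scores = [snr_score(row, labels) for row in expr]
    selected = []
    for cluster in sorted(set(gene_clusters)):
        members = [g for g, c in enumerate(gene_clusters) if c == cluster]
        members.sort(key=lambda g: -abs(scores[g]))
        selected.extend(members[:n_per_cluster])
    return selected


# Toy data (invented for illustration): rows are genes, columns are samples.
expr = [
    [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],  # gene 0: separates the classes strongly
    [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # gene 1: same mean in both classes
    [3.0, 3.5, 2.5, 1.0, 1.5, 0.5],  # gene 2: separates, but noisier
    [0.5, 0.7, 0.6, 0.9, 0.4, 0.8],  # gene 3: weak difference
]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
gene_clusters = [0, 1, 0, 1]  # hypothetical k-means assignment of genes

print(rank_genes(expr, labels, top_n=2))             # [0, 2]
print(top_per_cluster(expr, labels, gene_clusters))  # [0, 3]
```

On this toy data the global ranking keeps genes 0 and 2, while the per-cluster variant keeps one gene from each cluster (0 and 3). The selected rows would then be passed to a classifier such as SVM or k-NN and evaluated with 10-fold cross-validation.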
|
|