Feature Selection for Cancer Classification: A Signal-to-noise Ratio Approach

Author : Debahuti Mishra, Barnali Sahu
International Journal of Scientific & Engineering Research, IJSER - Volume 2, Issue 4, April-2011
ISSN 2229-5518

Abstract— Cancers are generally caused by abnormalities in the genetic material of the transformed cells. Cancer has a reputation as a deadly disease; hence cancer research is an intense scientific effort to understand the disease. Classification is a machine learning technique used to predict group membership for data instances. There are several classification techniques, such as decision tree induction, Bayesian classifiers, k-nearest neighbor (k-NN), case-based reasoning, support vector machines (SVM), and genetic algorithms. Feature selection for the classification of cancer data aims to discover gene expression profiles of diseased and healthy tissues and to use that knowledge to predict the health state of a new sample. It is usually impractical to examine every feature in detail before picking the right ones. This paper provides a model for feature selection using signal-to-noise ratio (SNR) ranking. We propose two approaches to feature selection. In the first approach, the genes of the microarray data are clustered by k-means clustering, and SNR ranking is then applied to obtain the top-ranked features from each cluster, which are given to two classifiers, SVM and k-NN, for validation. In the second approach, the features (genes) of the microarray data set are ranked by SNR ranking alone, and the top-scored features are given to the classifiers and validated. We tested the proposed approaches on the Leukemia data set and used 10-fold cross-validation to validate the classifiers. The 10-fold validation results of the two approaches are compared with hold-out validation results and with the leave-one-out cross-validation (LOOCV) results of different approaches in the literature. In the experimental evaluation, the first approach achieved 99.3% accuracy for both the k-NN and SVM classifiers with five genes under 10-fold cross-validation. This accuracy is compared with the accuracy of different methods available in the literature for the leukemia data set with LOOCV, where only the multiple-filter-multiple-wrapper approach gives 100% accuracy.

Index Terms—Classification, Feature selection, Cancer data, Microarray, Signal-to-noise ratio

All organisms except viruses consist of cells. Yeast has one cell, whereas humans have trillions of cells. Each cell contains a nucleus, and inside the nucleus is DNA, which encodes the program for making future organisms. Genes make proteins in two steps: first, DNA is transcribed into mRNA, and then the mRNA is translated into proteins [1]. Gene expression is the activation of a gene that results in a protein. Proteins are the blueprints for the characteristics of living organisms.
A microarray is a sequence of dots of DNA, protein, or tissue arranged in an array for easy simultaneous analysis. The best known is the DNA microarray, which plays an integral role in gene expression profiling. The substrate material is glass, plastic, or a silicon chip. Important applications of microarrays include identifying the genetic individuality of tissues or organisms and diagnosing genetic and infectious diseases [2][3].

Cancers are caused by abnormalities in the genetic material of the transformed cells. They mostly result from acquired mutations and epigenetic changes that influence gene expression. A major focus in cancer research is identifying genetic markers. Clinical diagnosis of cancer based on gene expression data has two main targets: first, to achieve the correct diagnosis for a cancer patient with the greatest confidence; second, to identify the genes responsible for a particular type of cancer, which helps in the diagnosis and prognosis of cancer. These objectives call for classification models that classify a cancer sample correctly with a low risk of misclassification. Many high-level data analysis techniques, such as clustering and classification algorithms, work better with a smaller number of genes. This approach usually covers one or more components of microarray data analysis, including dimensionality reduction through gene subset selection, the construction of new predictive features, and model inference [2].
The goal of this paper is to study the techniques available for finding patterns among genes, that is, feature selection using SNR ranking, and to analyze the results of our two feature selection approaches, which help identify the genes responsible for cancer.
This paper is organized as follows. Section 1 introduces cancer classification data. Section 2 covers the preliminary concepts of microarrays, classification techniques, SNR ranking, and k-means clustering. Section 3 reviews related work on feature selection for cancer data using the SNR approach. Section 4 presents the proposed model, Section 5 contains the experimental evaluation, Section 6 explains the validation and comparison of our work, and Section 7 concludes the paper.

2.1    Microarray

All cells in an organism carry the same genetic information, but only a subset of the genes is active (expressed). Analyzing genes with respect to whether and to what degree they are expressed can help characterize and understand their functions. One can further analyze how the activation level of genes changes under different conditions, such as specific diseases [3][4].
Microarray data are generally high-dimensional, with a large number of genes in comparison to the number of samples or conditions. There are many efficient methods for the analysis of microarray data, such as clustering, classification, and feature selection.
Feature selection is a preprocessing task for both clustering and classification. Different types of experiments can be done with microarray technology, which measures the expression levels of genes. These measurements can be used in diagnosis, by classifying the gene expression patterns associated with different cancer types [5]. Basically, the genes of microarray data are treated as features, and a set of features (genes) gives rise to a pattern. If we can extract the correct pattern from the data set, it becomes easier to classify an unknown sample based on that pattern.

2.2    Classification Technique Revisited

Our study is mainly based on feature selection and pattern classification for gene expression data related to cancer diagnosis. There are several classification techniques, such as SVM, k-NN, neural networks, naïve Bayes, decision trees, random forests, and top scoring pairs.

k-NN: k-NN is one of the simplest machine learning techniques; it classifies objects based on the closest training examples in the feature space [6]. It is a form of instance-based learning. It stores all training data and classifies a new data point, usually by majority vote, according to the classes of its k nearest neighbors in the given data set. k-NN finds the neighbors of each data point using the Euclidean or Mahalanobis distance between pairs of data items. The major advantage of k-NN is its simplicity.
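The voting rule described above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the function names, the toy expression values, and the choice k=3 are all assumptions made for the example, and only the Euclidean distance is shown.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Indices of the training samples, ordered by distance to the query.
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    # Count the class labels of the k closest samples.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical values): each row is one sample described by two genes.
train_X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
           [5.0, 5.2], [4.8, 5.1], [5.3, 4.9]]
train_y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

print(knn_predict(train_X, train_y, [1.0, 0.9], k=3))
print(knn_predict(train_X, train_y, [5.1, 5.0], k=3))
```

Because the whole training set is scanned for every query, this form is practical only for small data sets such as a handful of selected genes.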

Support Vector Machine (SVM): A support vector machine (SVM) is a supervised learning technique that analyzes data and recognizes patterns, and it is used for classification and regression analysis [7]. An SVM training algorithm builds a model that predicts whether a new sample falls into one category or the other. An SVM model is a representation of the samples as points in space, mapped so that the samples of the separate categories are divided by a clear gap that is as wide as possible. New samples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. A support vector machine constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.
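For reference, the idea of a gap that is "as wide as possible" is usually written as the following hard-margin optimization problem for linearly separable two-class data. The notation is introduced here for illustration and does not come from the paper: x_i is a training sample, y_i is its class label coded as +1 or -1, and w and b define the separating hyperplane.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,\bigl(w^{\top} x_i + b\bigr) \ge 1, \qquad i = 1, \dots, n
```

Under this formulation the width of the gap is 2/||w||, so minimizing ||w|| maximizes the margin, and a new sample x is assigned to a class according to the sign of w^T x + b.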

2.3    k-means Clustering Algorithm

Input:     k  =   Number of clusters
           P  =   A data set containing n features (n number of genes)

1.    Select the number of clusters k.
2.    Randomly choose k features from the data set as the initial cluster centers.
3.    Repeat
3.1    Assign each feature to one of the clusters according to the similarity measure.
3.2    Update the cluster means.
4.    Until there is no change in the values of the cluster means.

In this approach we have used Euclidean distance as the distance measure.
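The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name kmeans, the seed and max_iter parameters, and the toy gene values are all hypothetical, and each gene is represented by its expression values across the samples.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster `points` (one vector per gene) into k groups."""
    rng = random.Random(seed)
    # Step 2: pick k genes at random as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Step 3.1: assign every gene to its nearest center.
        labels = [min(range(k), key=lambda c: euclidean(p, centers[c]))
                  for p in points]
        # Step 3.2: recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        # Step 4: stop once the cluster means no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Toy data (hypothetical): six genes measured in two samples.
genes = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
         [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]]
labels, centers = kmeans(genes, k=2)
print(labels)
```

The max_iter cap is a safeguard added for the example; the listing itself stops only when the means stop changing.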

2.4    Signal-to-Noise Ratio
The signal-to-noise ratio (SNR) test identifies expression patterns with a maximal difference in mean expression between two groups and minimal variation of expression within each group [8]. In this method, genes are first ranked according to their expression levels using the SNR test statistic.
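The exact statistic is not shown in this excerpt. A commonly used form, assumed here, is SNR = (mean1 - mean2) / (sd1 + sd2), where mean and sd are the mean and standard deviation of a gene's expression within each of the two classes. The sketch below ranks genes by the absolute value of that score; the function names, the use of the sample standard deviation, and the toy expression values are assumptions made for illustration.

```python
from statistics import mean, stdev

def snr_score(values, labels, class_a, class_b):
    """SNR of one gene: difference of class means over the sum of class std devs."""
    a = [v for v, lab in zip(values, labels) if lab == class_a]
    b = [v for v, lab in zip(values, labels) if lab == class_b]
    # Assumption: sample standard deviation; a gene that is constant in both
    # classes would give a zero denominator and needs separate handling.
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))

def rank_genes(expr, labels, class_a, class_b, top=None):
    """Return gene names sorted by |SNR| (largest first) and the raw scores."""
    scores = {g: snr_score(v, labels, class_a, class_b) for g, v in expr.items()}
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return ranked[:top], scores

# Toy data (hypothetical): three genes measured in six samples.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expr = {
    "gene_a": [1.0, 1.2, 0.8, 5.0, 5.2, 4.8],  # large class difference, low spread
    "gene_b": [2.0, 3.0, 4.0, 3.0, 4.0, 5.0],  # small difference, high spread
    "gene_c": [3.0, 3.1, 2.9, 3.0, 3.2, 2.8],  # no class difference
}
ranked, scores = rank_genes(expr, labels, "ALL", "AML")
print(ranked)
```

Ranking by the absolute value keeps genes that are strongly over-expressed in either class; passing top=5, for example, would keep only the five highest-ranked genes.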
