Author Topic: Biological Sequence Matching Using Fuzzy Logic  (Read 2336 times)

0 Members and 1 Guest are viewing this topic.

IJSER Content Writer

  • Sr. Member
  • ****
  • Posts: 327
  • Karma: +0/-1
    • View Profile
Biological Sequence Matching Using Fuzzy Logic
« on: August 21, 2011, 06:34:42 am »
Quote
Author : Nivit Gill, Shailendra Singh
International Journal of Scientific & Engineering Research Volume 2, Issue 6, June-2011
ISSN 2229-5518
Download Full Paper : PDF

Abstract: Sequence alignment is the most basic and essential module of computational bio-informatics. In this paper, we propose a multiple sequence alignment algorithm that employs fuzzy logic to measure the similarity of sequences based on fuzzy parameters. To guarantee the optimal alignment of the sequences, dynamic programming is used to align the sequences. The algorithm is tested on few sets of real biological sequences taken from NCBI bank and its performance is evaluated using SinicView tool.
Index Terms: Biological sequences, Dynamic programming, Fuzzy logic, Fuzzy matching score, Fuzzy parameters, Global alignment, Multiple sequence alignment.

1 INTRODUCTION
Genetic code of every organism can be represented as a sequence of alphabets, such as four base pairs of DNA and RNA, or twenty amino acids of protein. Over time, these biological sequences undergo changes, called mutations, and as a result, many organisms evolve. All living organism cell are composed of DNA molecules that are passed from one generation to other. This is the reason for some living organisms being biologically similar and some being distinct. The goal of bioinformatics is to align a large number of sequences in order to study their evolutionary relationships through comparative sequence analysis.
With the help of bio-informatics, computations are applied to the biological sequences in order to analyze and manipulate them. The prime objective behind this is to discover and record the role of genetics in an organism’s biological characteristics. Sequence alignment is the most basic and essential part of computational bio-informatics and provides a base for other tasks of bioinformatics, such as sequence assembly, sequence annotation, structural and functional prediction, evolutionary or phylogeny relationship analysis. A sequence alignment refers to the method of arranging biological sequences in order to search similar regions in the sequences. The sequences with high degree of similarity have similar structure and function, and such sequences help in deriving evolutionary or phylogenetic relationships among organisms.
In this paper, we propose to align multiple biological sequences by matching the sequences using fuzzy logic. In the proposed method, the given biological sequences are compared pair wise so as to determine the number of matches, and mismatches between them. Then these counts are fuzzified using fuzzy membership functions, and then fuzzified counts are put in an aggregate fuzzy function in order to find the fuzzy match value of the two sequences. The fuzzy match value is further used to order the sequences according to the similarity. The most similar pair is aligned first and the rest of the sequences are then aligned progressively, to this aligned pair.
The outline of this paper is: Section 2 discusses the basics of sequence alignment and its types and Section 3 reviews the related work. The fuzzy logic concepts and its usage in the proposed algorithm are provided in Section 4. Section 5 details the classical Needleman-Wunsch algorithm. The proposed algorithm is described in Section 6. Experimental results and their discussions are presented in Section 7 and finally Section 8 concludes the paper.

2 SEQUENCE ALIGNMENT CONCEPTS

Any biological sequence is formed from a sequence of characters drawn from an alphabet. For DNA sequence, the character alphabet is {A, C, G, T}, for RNA sequence, the alphabet is {A, C, G, U}, and for protein sequence, the character set is {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V}. A sequence alignment is the process of identifying one-to-one correspondence among sub-units of sequences in order to measure the similarities among them. These similar regions provide functional, structural, and evolutionary information about the sequences under study. Aligned sequences are generally represented as rows within a matrix. Gaps (‘-‘) are inserted between the characters so that identical or similar characters are aligned in successive columns. Gaps are also called indels, as they represent insertion of a character in or a deletion of a character from a biological sequence. Sequence alignment of two biological sequences is called pair-wise sequence alignment, and in case more than two biological sequences are involved, it is called multiple sequence alignment [12]. The sequence alignment can be global or local. Global alignment "forces" the alignment to span the entire length of all query sequences (Figure 1). Local alignments identify regions of similarity within long sequences that are often widely divergent overall (Figure 2).

Scoring Matrices are used to quantify the similarity achieved by an alignment [12]. These matrices contain a value (positive, zero or negative value) for each possible substitution, and the alignment score is the sum of the matrix's entries for each aligned pair. For gaps (indels), a special gap score is used--a very simple one is just to add a constant penalty score for each indel. The optimal alignment is the one which maximizes the alignment score. Commonly used matrices are PAM (Percent Accepted Mutations) matrices, BLOSUM (BLOck SUbstitution Matrix), etc. The most popular algorithms for local and global alignment are based on dynamic programming: Needleman-Wunsch algorithm for global alignment, and Smith-Waterman algorithm for local alignment.

Read More: Click here...