Tuesday, June 4, 2019

Sequence Alignment and Dynamic Programming

Introduction

Sequence alignment

Sequence alignment is a standard technique for comparing two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences [1]. It is also a way of arranging two or more sequences of characters to recognize regions of similarity [2].

Importance of sequence alignment

Sequence alignment is significant because in biomolecular sequences (DNA, RNA, or protein), high sequence similarity usually implies significant functional or structural similarity, which is the first step of many biological analyses [3]. In addition, sequence alignment can address significant questions such as identifying gene sequences that cause disease or susceptibility to disease, identifying changes in gene sequences that cause evolution, finding relationships between gene sequences that can indicate common ancestry [4], detecting functionally important sites, and demonstrating mutation events [5].

Analysis of the alignment can reveal important information. If the proteins are involved in similar processes, it is possible to identify the parts of the sequences that are likely to be important for function. Random mutations accumulate more easily in the parts of a protein sequence that are not essential for its function. In the parts that are essential, hardly any mutations are accepted, because almost all changes in such regions destroy the function [6]. Moreover, sequence alignment is important for assigning function to unknown proteins [7]. Aligning two residues in a protein alignment implies that those residues perform similar roles in the two different proteins [8].

Methods

The main purpose of sequence alignment methods is to find the maximum degree of similarity and the minimum evolutionary distance.
Generally, computational approaches to sequence alignment problems can be divided into two categories: global alignments and local alignments. Global alignments span the entire length of all query sequences and match as many characters as possible from end to end. These methods are most useful when the sequences are approximately the same size or are similar. The alignment is performed from the beginning of the sequences to the end to find the best possible alignment. Local alignments, on the other hand, find local regions with a high level of similarity. They are more useful for sequences that are suspected to contain regions of similarity within their larger sequence context [9].

Pairwise sequence alignment is used to find the regions of similarity between two sequences. As the number of sequences increases, comparing each sequence to every other may become impractical. So we need multiple sequence alignment, where all similar sequences can be compared in a single figure or table. The basic idea is that the sequences are aligned on top of each other so that a coordinate system is set up, where each row is the sequence of one protein and each column is the same position in each sequence [10].

There are many different approaches and implementations of methods to perform sequence alignment. These include dynamic programming, heuristic algorithms (BLAST and FASTA similarity searching), probabilistic methods, dot-matrix methods, and progressive methods, as well as tools such as ClustalW, MUSCLE, T-Coffee, and DIALIGN.

Dynamic programming

Dynamic programming (DP) is a problem-solving method for a class of problems that can be solved by dividing them into simpler sub-problems. It finds the alignment by assigning scores to matches and mismatches (scoring matrices). This method is widely used in sequence alignment problems.
[11] However, when there are more than two sequences, multi-dimensional dynamic programming is infeasible because of its large storage and computational complexity [16].

Dynamic programming algorithms use gap penalties to increase the biological meaning of an alignment [9]. There are different gap penalties, such as linear gap, constant gap, and affine gap (gap open and gap extension). The gap score is a penalty given to an alignment when there is an insertion or deletion. During evolution there may be cases where gaps run continuously along the sequence, so a linear gap penalty would not be suitable for the alignment. Therefore, a gap opening penalty and a gap extension penalty were introduced for continuous gaps. The gap opening penalty is applied at the start of the gap, and each following gap position is given a gap extension penalty, which is smaller than the opening penalty. Different gap penalty functions require different dynamic programming algorithms [12]. There is also a substitution matrix to score alignments. The predefined scoring matrices mainly used for sequence alignment are PAM (Point Accepted Mutation) and BLOSUM (Blocks Substitution Matrix).

The two algorithms, Smith-Waterman for local alignment and Needleman-Wunsch for global alignment, are based on dynamic programming. The Needleman-Wunsch algorithm requires the alignment score for a pair of residues to be greater than or equal to zero. No gap penalty is required, and the score cannot decrease between two cells of a pathway. Smith-Waterman requires a gap penalty to work efficiently.
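The gap penalty schemes described above can be compared with a small calculation. The following Python sketch is illustrative only: the function names and the penalty values (5 for the linear and constant schemes, 11 for gap opening, 1 for gap extension) are assumptions chosen for the example, not values taken from the cited sources. The affine form charges the opening penalty for the first gap position and the extension penalty for each following one.

```python
def linear_gap(length, g):
    """Linear gap: every gap position costs g."""
    return g * length


def constant_gap(length, g):
    """Constant gap: any gap costs g once, regardless of its length."""
    return g if length > 0 else 0


def affine_gap(length, gap_open, gap_extend):
    """Affine gap: gap_open for the first position, gap_extend for each later one."""
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# Hypothetical penalty values, chosen only to illustrate the difference.
gap_length = 4
print(linear_gap(gap_length, 5))      # 20
print(constant_gap(gap_length, 5))    # 5
print(affine_gap(gap_length, 11, 1))  # 14
```

With these example values, one gap of length 4 is cheaper under the affine scheme (14) than four separate gaps of length 1 (44), which is how the scheme favors a single continuous gap.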
In Smith-Waterman, the residue alignment score may be positive or negative, and the score can increase, decrease, or stay level between two cells of a pathway [13].

Sequence Alignment Problems

For an m-character sequence s and an n-character sequence t, we construct an (m+1) × (n+1) matrix F.

Global alignment: F(i, j) = score of the best alignment of s[1..i] with t[1..j]
Local alignment: F(i, j) = score of the best alignment of a suffix of s[1..i] and a suffix of t[1..j]

There are three steps in the sequence alignment algorithms:

Initialization
In the initialization phase, we assign values to the first row and column of the alignment matrix. The next step of the algorithm depends on these values.

Fill
In the fill stage, the entire matrix is filled with scores from top to bottom and left to right, with values that depend on the gap penalties and the scoring matrix.

Trace back
For each F(i, j), we save a pointer to the cell that resulted in the best score. For global alignment, we follow pointers back from F(m, n) to F(0, 0) to recover the alignment. For local alignment, we look for the maximum value of F(i, j), which can be anywhere in the matrix. We trace pointers back from that cell and stop when we reach a cell with value 0.

Local alignment with scoring matrix

After creating and initializing the alignment matrix F and the trace back matrix, the score F(i, j) of every cell is calculated as follows:

    for i = 1 to m:
        for j = 1 to n:
            left_score     = F[i][j-1] - gap
            diagonal_score = F[i-1][j-1] + PAM250(s[i], t[j])
            up_score       = F[i-1][j] - gap
            scores         = [0, left_score, diagonal_score, up_score]
            F[i][j]        = max(scores)

We should also keep a reference for each cell in order to perform backtracking:

            traceback_matrix[i][j] = scores.index(F[i][j])

After filling the F matrix, we find the optimal alignment score and the optimal end point by finding the highest-scoring cell, the maximum of F(i, j) over all i and j.
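The three steps for local alignment with a linear gap penalty can be sketched in Python as follows. This is a minimal sketch rather than a reference implementation: the function names, the pointer encoding, and the simple match/mismatch scoring function (used here in place of PAM250 to keep the example self-contained) are assumptions made for illustration.

```python
def simple_score(a, b):
    """Hypothetical scoring function standing in for a PAM250 lookup."""
    return 2 if a == b else -1


def smith_waterman(s, t, score, gap):
    """Local alignment of s and t with a linear gap penalty.

    Returns (best_score, aligned_s, aligned_t).
    """
    m, n = len(s), len(t)
    # Initialization: first row and column stay 0 for local alignment.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Pointer codes: 0 = stop, 1 = diagonal, 2 = up, 3 = left.
    pointer = [[0] * (n + 1) for _ in range(m + 1)]
    best_score, best_i, best_j = 0, 0, 0

    # Fill: top to bottom, left to right.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            scores = [
                0,
                F[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # diagonal
                F[i - 1][j] - gap,                            # up
                F[i][j - 1] - gap,                            # left
            ]
            F[i][j] = max(scores)
            pointer[i][j] = scores.index(F[i][j])
            if F[i][j] > best_score:
                best_score, best_i, best_j = F[i][j], i, j

    # Trace back from the highest-scoring cell until a cell with score 0.
    aligned_s, aligned_t = [], []
    i, j = best_i, best_j
    while pointer[i][j] != 0:
        move = pointer[i][j]
        if move == 1:
            aligned_s.append(s[i - 1])
            aligned_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif move == 2:
            aligned_s.append(s[i - 1])
            aligned_t.append("-")
            i -= 1
        else:
            aligned_s.append("-")
            aligned_t.append(t[j - 1])
            j -= 1
    return best_score, "".join(reversed(aligned_s)), "".join(reversed(aligned_t))


print(smith_waterman("XXXACGTYYY", "ZZACGTWW", simple_score, 2))
# (8, 'ACGT', 'ACGT')
```

In this example the only shared region is ACGT, so the best local alignment consists of four matches with a score of 8. Storing a pointer code of 0 whenever the cell score is 0 makes the stopping rule of the trace back a simple check on the pointer.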
best_score has a default value of -1:

    if F[i][j] > best_score:
        best_score = F[i][j]
        i_maximum_score, j_maximum_score = i, j

To recover the optimal alignment, we trace back from position (i_maximum_score, j_maximum_score), terminating the trace back when we reach a cell with score 0.

The time and space complexity of this algorithm is O(mn), where m is the length of sequence s and n is the length of sequence t.

Local alignment with affine gap penalty

For this problem, there are a gap opening penalty and a gap extension penalty. The gap opening penalty is applied at the start of the gap, and each following gap position is given a gap extension penalty.

Initialization
There are four different matrices: up_score, left_score, m_score, and trace_back.

Filling matrix

    for i = 1 to m:
        up_score[i][0] = -gap_opening_penalty - (i-1) * gap_extension_penalty
    for j = 1 to n:
        left_score[0][j] = -gap_opening_penalty - (j-1) * gap_extension_penalty
    for i = 1 to m:
        for j = 1 to n:
            up_score[i][j]   = max(up_score[i-1][j] - gap_extension_penalty,
                                   m_score[i-1][j] - gap_opening_penalty)
            left_score[i][j] = max(left_score[i][j-1] - gap_extension_penalty,
                                   m_score[i][j-1] - gap_opening_penalty)
            scores           = [left_score[i-1][j-1], m_score[i-1][j-1], up_score[i-1][j-1], 0]
            m_score[i][j]    = BLOSUM62(s[i], t[j]) + max(scores)

We find the highest-scoring cell, the position of that cell, and the best alignment by following the same steps as in the previous problem.

The time and space complexity of this algorithm is O(mn).

Global alignment with constant gap penalty

In this case every gap receives a fixed score, regardless of the gap length.

    for i = 1 to m:
        alignment_matrix[i][0] = -gap_penalty
    for j = 1 to n:
        alignment_matrix[0][j] = -gap_penalty
    for i = 1 to m:
        for j = 1 to n:
            scores = [alignment_matrix[i][j-1] - gap_penalty,
                      alignment_matrix[i-1][j] - gap_penalty,
                      alignment_matrix[i-1][j-1] + BLOSUM62(s[i], t[j])]
            alignment_matrix[i][j] = max(scores)

alignment_matrix[m][n] holds the optimal alignment score. Note that this single-matrix recurrence subtracts gap_penalty for every gap symbol; to charge each gap exactly once regardless of its length, the three-matrix affine formulation can be used with gap_extension_penalty set to 0.

The time and space complexity of this algorithm is O(mn), where m is
the length of sequence s and n is the length of sequence t.

Global alignment with scoring matrix

In this problem there is a linear gap penalty: each inserted or deleted symbol is charged g, so a gap of length L has a total penalty of g × L.

    for i = 1 to m:
        alignment_matrix[i][0] = -i * gap_penalty
    for j = 1 to n:
        alignment_matrix[0][j] = -j * gap_penalty
    for i = 1 to m:
        for j = 1 to n:
            scores = [alignment_matrix[i][j-1] - gap_penalty,
                      alignment_matrix[i-1][j] - gap_penalty,
                      alignment_matrix[i-1][j-1] + BLOSUM62(s[i], t[j])]
            alignment_matrix[i][j] = max(scores)

alignment_matrix[m][n] holds the optimal alignment score.

The time and space complexity of this algorithm is O(mn), where m is the length of sequence s and n is the length of sequence t.

Global alignment with scoring matrix and affine gap penalty

There are four different matrices: up_score, left_score, m_score, and trace_back.

Filling matrix

    for i = 1 to m:
        up_score[i][0] = -gap_opening_penalty - (i-1) * gap_extension_penalty
    for j = 1 to n:
        left_score[0][j] = -gap_opening_penalty - (j-1) * gap_extension_penalty
    for i = 1 to m:
        for j = 1 to n:
            up_score[i][j]   = max(up_score[i-1][j] - gap_extension_penalty,
                                   m_score[i-1][j] - gap_opening_penalty)
            left_score[i][j] = max(left_score[i][j-1] - gap_extension_penalty,
                                   m_score[i][j-1] - gap_opening_penalty)
            m_score[i][j]    = BLOSUM62(s[i], t[j]) + max(m_score[i-1][j-1],
                                                          left_score[i-1][j-1],
                                                          up_score[i-1][j-1])

    maximum_alignment_score = max(m_score[m][n], left_score[m][n], up_score[m][n])

The time and space complexity of this algorithm is O(mn), where m is the length of sequence s and n is the length of sequence t.

The above algorithms require too much time for searching large databases, so they cannot be used for that purpose. There are several methods to overcome this problem.

Heuristic Method

A heuristic is an algorithm that gives only an approximate solution to a problem. Sometimes we cannot formally prove that this solution actually solves the problem, but since heuristic methods are much faster than exact algorithms, they are commonly used.
FASTA is a heuristic method for sequence alignment. The main idea of this method is to choose regions of the two sequences that have some degree of similarity and to use dynamic programming to compute local alignments in these regions. The disadvantage of these methods is a significant loss of sensitivity. Parallelization is a possible solution to this problem [14].

Parallel Algorithm

In [15], a parallel method is introduced to reduce the complexity of the dynamic programming algorithm for pairwise sequence alignment. The time consumption of the sequential algorithm depends mainly on the computation of the score matrix. The computation of F(i, j) can start only when F(i-1, j-1), F(i-1, j), and F(i, j-1) have acquired their values. Consequently, it is possible to compute the score matrix in order of anti-diagonals, and the values in the same anti-diagonal can be calculated simultaneously (Figure 1).

Figure 1. Computing the score matrix in a parallel manner. The values of the marked cells can be computed simultaneously.

There are two models for problem solving using the parallel method that improve the performance of the pairwise alignment algorithm.

Pipeline model: Each row of the score matrix is computed successively by a processor, which blocks until the required values in the row above are computed.

Anti-diagonal model: From the top-left corner to the bottom-right corner of the score matrix, all processors compute concurrently along an anti-diagonal of the matrix. Each idle processor selects a cell from the current anti-diagonal and computes its value. When all values in the current anti-diagonal are computed, the computation moves on to the next anti-diagonal.

In the algorithm that is based on the pipeline model, the score matrix is partitioned into several blocks by column and several bands by row.
All the bands are distributed to multiple processors, and each processor computes the blocks in its own band simultaneously. By applying the parallel algorithm, the time complexity is O(n) when n processors are used [15].

Progressive Method

For solving multiple sequence alignment problems, the most commonly used algorithm is the progressive method. This algorithm consists of three main stages. The first stage compares all the sequences with each other and produces similarity scores (a distance matrix). This stage can be parallelized. The second stage groups the most similar sequences together using the similarity scores and a clustering method such as Neighbor-Joining to create a guide tree. Finally, the third stage sequentially aligns the most similar sequences and groups of sequences until all the sequences are aligned. Before alignment with a pairwise dynamic programming algorithm, groups of aligned sequences are converted into profiles. A profile represents the character frequencies for each column in an alignment. In the final stage, aligning groups of sequences requires trace back information from the full pairwise alignment [17].

ClustalW

ClustalW, which has become the most popular algorithm for multiple sequence alignment, implements the progressive method. The time complexity of this method is O(N^4 + L^2) and the space complexity is O(N^2 + L^2), where N is the number of sequences and L is the sequence length [18].

Conclusion

By comparing the different methods for pairwise and multiple sequence alignment, we can conclude that parallel algorithms implementing the pipeline model or the anti-diagonal model are effective for performing pairwise sequence alignments. Algorithms that implement the progressive method, such as ClustalW, are effective for solving multiple sequence alignment problems.

References

[1] Robert F. Murphy, Computational Biology, Carnegie Mellon University, www.cmu.edu/bio//LecturesPart03.ppt
[2] http://en.wikipedia.org/wiki/Sequence_alignment
[3] Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology (Cambridge University Press, 1997).
[4] http://cs.calvin.edu/activities/blasted/intro03.html
[5] http://www.embl.de/seqanal/courses/commonCourseContent/commonMsaExercises.html
[6] Per Kraulis, Stockholm Bioinformatics Center (SBC), http://www.avatar.se/molbioinfo2001/seqali-why.html
[7] http://iitb.vlab.co.in/?sub=41brch=118sim=656cnt=1
[8] Andreas D. Baxevanis, B. F. Francis Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins
[9] http://amrita.vlab.co.in/?sub=3brch=274sim=1433cnt=1
[10] David S. Moss, Sibila Jelaska, Sandor Pongor, Essays in Bioinformatics, ISBN 1-58603-539-8
[11] http://amrita.vlab.co.in/?sub=3brch=274sim=1431cnt=1
[12] Burr Settles, Sequence Alignment, IBS Summer Research Program 2008, http://pages.cs.wisc.edu/bsettles/ibs08/lectures/02-alignment.pdf
[13] Aoife McLysaght, Biological Sequence Comparison/Database Homology Searching, The University of Dublin, http://www.maths.tcd.ie/lily/pres2/sld001.htm
[14] Rapid alignment methods: FASTA and BLAST, http://www.cs.helsinki.fi/bioinformatiikka/mbi/courses/07-08/itb/slides/itb0708_slides_83-116.pdf
[15] Yang Chen, Songnian Yu, Ming Ling, Parallel Sequence Alignment Algorithm for Clustering System, School of Computer Engineering and Science, Shanghai University
[16] Heitor S. Lopes, Carlos R. Erig Lima, Guilherme L. Moritz, A Parallel Algorithm for Large-Scale Multiple Sequence Alignment, Bioinformatics Laboratory/CPGEI, Federal University of Technology Paraná
[17] Scott Lloyd, Quinn O. Snell, Accelerated large-scale multiple sequence alignment
[18] Kridsadakorn Chaichoompu, Surin Kittitornkun, and Sissades Tongsima, MT-ClustalW: Multithreading Multiple Sequence Alignment
