Local and Global Alignment and Multiple Sequence Alignment


What is Local Alignment?

Local alignment is a type of pairwise alignment in which the two sequences are not assumed to be similar over their entire length. Instead, it finds the regions with the highest similarity between them. Because the two sequences may differ in length, a local alignment matches a sub-region of the query to the corresponding sub-region of the target, without regard to how the rest of the sequences align. This makes it suitable for more distantly related sequences and for finding conserved patterns in DNA sequences. Tools that perform local alignment include BLAST, LALIGN, and EMBOSS Water.

Smith-Waterman and Dynamic Programming

  • In 1981, Smith and Waterman adapted the Needleman-Wunsch algorithm for local alignment.
  • Local alignment aligns the regions of the sequences where the similarity is highest.
  • It involves initializing, filling (scoring), and tracing back through a matrix whose rows and columns correspond to the bases or residues of the two sequences.
  • In local alignment, the first row and the first column are filled with zeros, and each remaining cell is filled with a value derived from its neighboring cells.
Figure: (A) shows an empty matrix, initialized for a Smith-Waterman alignment. (B) and (C) are alignments calculated using the specified scoring parameters. Image source: https://www.researchgate.net/figure/Smith-Waterman-local-alignment-example-A-shows-an-empty-matrix-initialized-for-a_fig2_24309044
  • The match score is added to the value of the diagonal neighbor if the current cell corresponds to a match (identical bases or residues); otherwise, the mismatch score is added.
  • The match score is a positive number, while the mismatch and gap scores are typically negative; any cell whose value would fall below zero is set to zero.
  • The matrix is filled starting from the upper-left corner and proceeding to the lower-right corner.
  • For the traceback, the largest value in the matrix is located first; the path is then traced back, one step at a time, to the cell that produced each value, until a zero is reached.
  • The method is guaranteed to find an optimal local alignment for a given set of scores and penalties.
  • More than one optimal local alignment may exist for a given matrix.
  • For protein sequences, weight (substitution) matrices were developed in the late 1970s to address the problem of choosing scores and penalties for alignments.
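
The initialization, filling, and traceback steps listed above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `smith_waterman`, the default scores (+5, -3, -4), the linear gap penalty, and the test sequences are all assumptions chosen for the example, and only one optimal alignment is returned even when several exist.

```python
def smith_waterman(a, b, match=5, mismatch=-3, gap=-4):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment.

    Scores are illustrative assumptions; a linear gap penalty is used.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # Initialization: the first row and first column stay at zero.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill from the upper-left corner to the lower-right corner.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # never let a cell drop below zero
                H[i - 1][j - 1] + s,  # diagonal: match or mismatch
                H[i - 1][j] + gap,    # up: gap in b
                H[i][j - 1] + gap,    # left: gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback: start at the largest value and stop at the first zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTTACGTAAA", "CCACGTCC")
print(score, top, bottom)
```

With these made-up sequences, the only shared region is `ACGT`, so the traceback recovers that region and ignores the unrelated flanks, which is the behavior that distinguishes local from global alignment.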
Scoring for best alignment. Image source: https://vlab.amrita.edu/?sub=3&brch=274&sim=1433&cnt=1

Here, the two alignments shown in the figure above are scored with +5 for a match, -3 for a mismatch, and -4 for a gap. The score of each alignment is the sum of the scores of its individual columns, and the best alignment is the one with the maximum total. In this example, both alignments sum to 18, so both are equally good.
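
The column-by-column summation can be sketched as a small Python function. The scores (+5, -3, -4) come from the text above; the function name `score_alignment` and the example alignment are hypothetical and are not the alignments from the figure.

```python
def score_alignment(row_a, row_b, match=5, mismatch=-3, gap=-4):
    """Sum the per-column scores of an already gapped pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # one sequence has a gap in this column
        elif x == y:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


# Hypothetical alignment: 4 matches, 1 mismatch, 1 gap.
print(score_alignment("ACGT-A", "ACCTGA"))  # 4*5 - 3 - 4 = 13
```

Scoring several candidate alignments this way and keeping the one with the highest total is the selection rule described above.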

What is Global Alignment?

Global alignment is a type of pairwise sequence alignment in which the two sequences are assumed to be similar over their whole length: it is an end-to-end alignment of two strings that takes the entire sequences into account. The standard global alignment algorithm is the Needleman-Wunsch algorithm. Global alignment is usually used to compare homologous genes, such as two genes with the same function, and is best suited to closely related sequences of roughly the same length. EMBOSS Needle is a commonly used tool for global sequence alignment.

Local Alignment and Global Alignment. Image source: https://www.majordifferences.com/2016/05/difference-between-global-and-local.html

Needleman and Wunsch Alignment

  • Needleman and Wunsch adapted dynamic programming in 1970 to solve the problem of global sequence alignment.
  • A global alignment aligns every amino acid or nucleotide along the full length of both sequences.
  • This kind of alignment is also used when building multiple sequence alignments.
  • Because the statistical value (E-value) does not apply to global alignments, global alignment is of limited use for discovering new similarities.
  • There are other reasons for making a global alignment, including:
    • Checking for minor differences between two sequences.
    • Analyzing polymorphisms such as SNPs between closely related species.
    • Comparing two sequences that partially overlap with each other.
  • Dynamic programming solves the original problem by dividing it into smaller, overlapping sub-problems and reusing their stored solutions.
  • The scoring scheme is simple: a positive, higher value is assigned to a match, and a negative, lower value is assigned to a mismatch.
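
A minimal Python sketch of the global alignment procedure described in this list is given below. The function name `needleman_wunsch`, the default scores, and the linear gap penalty are assumptions for illustration. The difference from local alignment is that the first row and column hold accumulated gap penalties, no cell is reset to zero, and the traceback runs from the lower-right corner all the way to the upper-left corner.

```python
def needleman_wunsch(a, b, match=5, mismatch=-3, gap=-4):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    Scores are illustrative assumptions; a linear gap penalty is used.
    """
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # Initialization: leading gaps are penalized in a global alignment.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap

    # Each cell reuses the stored solutions of three smaller sub-problems.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Traceback from the lower-right corner to the upper-left corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

For the made-up pair `ACGT` and `AGT`, every character of both sequences appears in the result, with a single gap inserted opposite the `C`.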

Alignment Matrices

Meaningful scoring requires alignment (substitution) matrices, which include:

PAM Matrices (Substitution Matrices)

A meaningful scoring system for nucleotide and amino acid substitutions is essential to increase the specificity of alignment algorithms and to provide a means of evaluating their statistical significance. The first scoring (weight) matrices were developed by Dayhoff et al. in 1978 from substitutions observed over evolutionary history. Groups of protein sequences with more than 85% sequence similarity were analyzed, and 1,572 observed substitutions were cataloged to build Dayhoff's PAM (point accepted mutation) matrices. The PAM1 matrix corresponds to an evolutionary distance of one accepted substitution per 100 residues, and each cell describes how readily one amino acid is replaced by another at that distance. Every other PAM matrix corresponds to a specific evolutionary distance and is an extrapolation of the original. For example, PAM250 is constructed by raising the PAM1 matrix to the 250th power (multiplying it by itself repeatedly), and it is commonly used as a scoring matrix for distantly related proteins separated by 250 PAM units.
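
The extrapolation step can be illustrated with a toy example. The sketch below raises a small, made-up 3-state substitution probability matrix to a power with plain Python lists. The matrix `toy_pam1`, its values, and the helper names are hypothetical: a real PAM1 matrix has 20 x 20 entries estimated from data, and the final scoring matrix is derived from the extrapolated probabilities in a further step that is not shown here.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Multiply matrix m by itself `power` times, starting from the identity."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical 3-state matrix: each row sums to 1, and each state is
# unchanged with probability 0.99 (about 1 change per 100 positions).
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After 250 steps the diagonal entries are much smaller than 0.99, which reflects that many positions are expected to have changed at that evolutionary distance, while each row still sums to 1.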

BLOSUM Matrices

In 1992, Henikoff and Henikoff developed the BLOSUM matrices to overcome some of the drawbacks of PAM matrices. These matrices are derived from the BLOCKS database, which organizes proteins into families; each block, defined by an alignment of conserved motifs, corresponds to a family. Each BLOSUM matrix is calculated separately from conserved motifs clustered at a specific identity threshold (for example, 62% for BLOSUM62), whereas the PAM matrices were derived from proteins with at least 85% identity and then extrapolated. Because the BLOSUM matrices are based on larger datasets, they are generally more robust and accurate for detecting similarity at greater evolutionary distances.

Dot plot

A dot plot is one way to visualize the similarity between two protein or nucleic acid sequences. It is a two-dimensional matrix, introduced by Gibbs and McIntyre in 1970, in which the two sequences are compared along the vertical and horizontal axes. It is useful for finding aligned regions between individual sequences but becomes time-consuming on a large scale. To construct a dot plot, the two sequences are written along the top row and the leftmost column of a two-dimensional matrix. A dot is then placed in every cell where the character in the column matches the character in the row. For very closely related sequences, a single line appears along the main diagonal of the matrix; the remaining isolated dots represent random matches.

Dot plots have some problems, including noise, lack of clarity, non-intuitiveness, and difficulty extracting match summary statistics and match positions on the two sequences.
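
The construction of a dot plot can be sketched in a few lines of Python. The function names `dot_plot` and `render` and the example sequences are assumptions for illustration; no window or threshold filtering is applied, so isolated random matches (the noise mentioned above) remain visible.

```python
def dot_plot(seq_x, seq_y):
    """Return a grid where grid[i][j] is True when seq_y[i] == seq_x[j].

    seq_x runs along the top (columns), seq_y down the left side (rows).
    """
    return [[x == y for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Render the dot plot as text, using '*' for a match and '.' otherwise."""
    grid = dot_plot(seq_x, seq_y)
    lines = ["  " + " ".join(seq_x)]
    for ch, row in zip(seq_y, grid):
        lines.append(ch + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GATTACA", "GATTACA"))
```

Comparing a sequence with itself produces an unbroken main diagonal, and the repeated letters (`A` and `T` here) add off-diagonal dots that illustrate random matches.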

Alignment with Gap and Penalties

To keep alignments biologically meaningful, dynamic programming algorithms use gap penalties: a penalty is subtracted from the alignment score for each gap that is introduced. When an insertion or deletion occurs in a sequence, the gap score defines the penalty given to the alignment. During evolution, a single event can insert or delete a run of contiguous positions, so a linear gap penalty, which charges every gap position equally, is not always appropriate. To address this, two types of gap penalty are used, particularly for longer gaps (for example, more than five positions): gap open and gap extension. The gap open penalty is always applied to the first position of a gap, and a smaller gap extension penalty is applied to each gap position that follows it. Typical values are -12 for gap opening and -4 for gap extension.
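
The difference between the two schemes can be made concrete with a small Python sketch. The values -12 and -4 come from the text above; the function names and the convention that the open penalty covers the first gap position (with extension charged only from the second position onward) are assumptions, and some tools instead charge the extension penalty on every position, including the first.

```python
def linear_gap_penalty(length, per_position=-12):
    """Linear scheme: every gap position costs the same."""
    return length * per_position


def affine_gap_penalty(length, gap_open=-12, gap_extend=-4):
    """Affine scheme: the first position pays gap_open, the rest pay gap_extend.

    Assumed convention: penalty = gap_open + (length - 1) * gap_extend.
    """
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


for k in (1, 2, 5):
    print(k, linear_gap_penalty(k), affine_gap_penalty(k))
```

Under these assumptions a single gap of five positions costs -28 with the affine scheme but -60 with the linear scheme at the same opening cost, so one long gap is favored over several scattered short ones.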

What is Multiple Sequence Alignment?

Multiple sequence alignment (MSA) compares three or more biological sequences at once and is usually a global alignment. In the common progressive approach, the most closely related pair of sequences is aligned first, then the next most similar sequence is added, and so on; the algorithms involved are more complicated and sophisticated than those for pairwise alignment. MSA is used to detect conserved and variable regions among sequences, for phylogenetic analysis, to detect homology between new sequences and existing sequences in a database, and to detect homology among multiple sequences for a better understanding of them. MUSCLE, ClustalW, MAFFT, and T-Coffee are frequently used multiple sequence alignment tools.

Multiple sequence alignments can also be built in R with the DECIPHER package, without relying on standalone tools such as MUSCLE.

Multiple Sequence Alignment. Image source: https://www.majordifferences.com/2016/05/difference-between-pairwise-and-multiple-sequence-alignment.html

Guide Tree in Multiple Sequence Alignment

  • Guide trees are trees used to decide the order in which sequences are aligned in progressive multiple sequence alignment heuristics.
  • Considerable effort has gone into making guide tree construction accurate and fast over the years, because it is often the limiting factor when making large alignments.
  • Guide trees are used in many multiple sequence alignment methods in conjunction with progressive alignment techniques.
  • It is generally believed that a better guide tree gives a more accurate alignment.
  • According to the topology of the guide tree, more closely related sequences are aligned first, and less related ones are aligned afterward.
  • Following the branching order of the tree, the sequences are aligned into progressively larger alignments.
  • A guide tree is calculated from the distance matrix generated from the pairwise scores. The resulting guide tree is typically written in the dnd file format.
  • Some guide tree construction schemes are based on pairwise distances among the unaligned sequences.
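
As a rough illustration of how a guide tree can be derived from a pairwise distance matrix, the sketch below uses UPGMA-style average-linkage clustering. UPGMA is only one possible clustering method and is an assumption here, since the text does not say which method a given tool uses. The function name, the input format, and the distances are hypothetical, and the output is a parenthesized tree string without branch lengths, in the Newick style that dnd files generally follow.

```python
from itertools import combinations


def upgma_guide_tree(names, distances):
    """Build a guide tree topology from pairwise distances.

    distances maps (name_a, name_b) tuples to a distance; order in the
    tuple does not matter. Returns a Newick-style string, topology only.
    """
    dist = {frozenset(pair): d for pair, d in distances.items()}
    size = {name: 1 for name in names}   # number of sequences per cluster
    active = list(names)

    while len(active) > 1:
        # Join the two closest clusters first (most related sequences).
        x, y = min(combinations(active, 2), key=lambda p: dist[frozenset(p)])
        merged = "(" + x + "," + y + ")"
        new_size = size[x] + size[y]
        # Distance to the new cluster: size-weighted average of its members.
        for z in active:
            if z not in (x, y):
                dist[frozenset((merged, z))] = (
                    size[x] * dist[frozenset((x, z))]
                    + size[y] * dist[frozenset((y, z))]
                ) / new_size
        active = [z for z in active if z not in (x, y)] + [merged]
        size[merged] = new_size
    return active[0] + ";"


# Hypothetical distances: A and B are closest, then C and D.
example = {
    ("A", "B"): 2, ("C", "D"): 4,
    ("A", "C"): 10, ("A", "D"): 10, ("B", "C"): 10, ("B", "D"): 10,
}
print(upgma_guide_tree(["A", "B", "C", "D"], example))
```

Reading the nested parentheses from the inside out gives the alignment order: the innermost pairs are aligned first, and the resulting groups are then aligned to each other.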

Applications of Sequence Alignment

  • Sequence alignment is used to identify unknown sequences.
  • It is used to find other members of multigene families.
  • It provides information for designing primers.
  • It is used to assemble short DNA sequences and reconstruct them into longer DNA sequences.
  • It helps determine physical and genetic maps under various experimental protocols.
  • It supports prediction of the function of gene products.
  • It provides information for molecular modeling.
  • It is useful for the structural, functional, and evolutionary analysis of sequences.
  • It is used to assess sequence homology and sequence similarity.
  • It is used to observe conserved domains or elements in sequences.
  • It helps identify probes for similar sequences in other organisms.
  • It is used in the development of PCR primers.
  • It is used for phylogenetic analysis between different organisms or within the same organism.
  • It is used to align the primary sequences of DNA, RNA, or proteins to identify regions of similarity that may reflect functional, structural, or evolutionary relationships between the sequences.
  • It helps locate mutated regions in a sequence by comparing the query sequence with a reference sequence.

References

  1. https://www.majordifferences.com/2016/05/difference-between-global-and-local.html
  2. https://vlab.amrita.edu/?sub=3&brch=274&sim=1433&cnt=1
  3. https://bio.libretexts.org/Bookshelves/Computational_Biology/Book%3A_Computational_Biology_-Genomes_Networks_and_Evolution(Kellis_et_al.)
  4. https://vlab.amrita.edu/?sub=3&brch=274&sim=1431&cnt=1
  5. https://omicstutorials.com/interpreting-dot-plot-bioinformatics-with-an-example/
  6. https://webstor.srmist.edu.in/web_assets/srm_mainsite/files/files/5(6).pdf
  7. https://sites.google.com/site/pairwisesequencealignment/dot-plot
  8. https://www.labxchange.org/library/items/lb:LabXchange:24d0ec21:lx_image:1
  9. http://www.cs.tau.ac.il/~rshamir/algmb/98/scribe/html/lec03/node10.html
  10. http://www.cs.tau.ac.il/~rshamir/algmb/98/scribe/html/lec03/node9.html
  11. http://www.cs.rice.edu/~ogilvie/comp571/2018/09/04/pam-vs-blosum.html
  12. https://www.bionity.com/en/encyclopedia/Gap_penalty.html
  13. https://academic.oup.com/peds/article/19/3/129/1524388
  14. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-338
  15. https://www.researchgate.net/post/Concerning-both-method-and-function-what-are-the-main-differences-between-a-phylogenetic-tree-and-a-guide-tree
  16. https://www.researchgate.net/publication/5621575_The_effect_of_the_guide_tree_on_multiple_sequence_alignments_and_subsequent_phylogenetic_analyses
  17. https://www.pnas.org/doi/10.1073/pnas.1405628111
  18. https://www.slideserve.com/bambi/multiple-sequence-alignments
  19. https://www.slideserve.com/elmo/pairwise-and-multiple-sequence-alignments
  20. https://pubmed.ncbi.nlm.nih.gov/10463075/
  21. https://www.cs.mcgill.ca/~rwest/wikispeedia/wpcd/wp/s/Sequence_alignment.htm
