FASTA: Definition, Programs, Working, Algorithms, Uses

FASTA is a pairwise sequence alignment tool that compares a nucleotide or protein query sequence against the sequences in an existing database. The same name also refers to the FASTA format, a text-based sequence format that can be read and written with any text editor or word processor.

  • It performs a sequence similarity search of a protein or nucleotide query against a database and can be used to infer functional and evolutionary relationships between sequences.
  • FASTA finds regions of local similarity between sequences and calculates the statistical significance of each match.
  • Before running the more time-consuming optimized search, it uses word hits to identify potential matches.
  • ktup is the FASTA parameter that specifies the word size and thereby controls the speed and sensitivity of the search.
  • As ktup increases, the number of background hits decreases. FASTA does not examine every word hit; it first looks for segments that contain several nearby hits.
  • FASTA generally takes longer than BLAST to produce results, but it can be more sensitive than the BLAST program.
  • FASTA computes local alignment scores to compare the query sequence with every sequence in the database.
  • Sequences stored in FASTA format are generally obtained by DNA sequencing methods (the Sanger method and the Maxam-Gilbert method) or protein sequencing methods (Edman degradation and mass spectrometry).
Figure: FASTA Format. Image Source: NCBI
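
Because the FASTA format is plain text, it can be read with a few lines of code. The following is a minimal sketch, assuming the common layout in which each record starts with a ">" description line followed by one or more sequence lines. The function name parse_fasta and the example records are hypothetical and used only for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {description: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # text after '>' is the description line
            records[header] = ""
        elif header is not None:
            records[header] += line  # a sequence may span several lines
    return records


# Hypothetical example input, not taken from any real database.
example = """>seq1 hypothetical example record
ATGCGTAC
GTTAGC
>seq2 another hypothetical record
MKVLAA
"""
records = parse_fasta(example.splitlines())
```

In this sketch the two sequence lines of the first record are joined into a single string, which is how multi-line records are usually interpreted.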

FASTA programs

  • FASTA: compares a nucleotide query sequence with a nucleotide sequence database, or a protein query sequence with a protein sequence database.
  • FASTX and FASTY: compare a nucleotide query sequence with a protein sequence database by translating the nucleotide sequence.
  • SSEARCH: performs a local Smith-Waterman alignment between a nucleotide query and the nucleotide sequences in the database, or between a protein query and the protein sequences in the database.
  • GGSEARCH: uses a global alignment algorithm to compare a protein or DNA query with the database, and considers only database sequences whose length is within about 80% to 125% of the length of the query.
  • GLSEARCH: performs an alignment that is global in the query and local in the database sequence, and is used to compare a protein or DNA query with the sequences in the database.
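
The Smith-Waterman local alignment that SSEARCH performs can be sketched as a small dynamic programming routine. This is an illustration only: the function name and the match, mismatch and gap values are assumptions chosen for simplicity, whereas a real search would use a scoring matrix and its own gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score using a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0,
                          prev[j - 1] + s,    # align a[i-1] with b[j-1]
                          prev[j] + gap,      # gap in b
                          curr[j - 1] + gap)  # gap in a
            best = max(best, curr[j])
        prev = curr
    return best
```

A global alignment of the kind GGSEARCH uses differs mainly in that cells are allowed to go negative and the score is read from the final cell rather than the table maximum.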

Parameters used in the FASTA algorithm

  • Threshold: The threshold is a cutoff value, lying between the minimum and maximum possible values, that is used to filter the words.
  • True Homology: True homology indicates how closely a database sequence is related to the query sequence.
  • E-value: The E-value (expect value) is the number of alignments with a given score or better that are expected to occur by chance. It decreases exponentially as the score assigned to an alignment between two sequences increases.
  • Putative conserved domains: These are predicted domains in the sequence that are associated with particular functions.
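
The exponential relationship between score and E-value can be illustrated with the textbook Karlin-Altschul form, E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths. This is only a sketch: the function name and the values of K and lambda below are placeholder assumptions, and FASTA's own significance estimates may be computed differently.

```python
import math


def expected_hits(score: float, query_len: int, db_len: int,
                  k: float = 0.1, lam: float = 0.3) -> float:
    """Textbook E-value: E = K * m * n * exp(-lambda * S).

    k and lam are placeholder values (assumptions); in practice they
    depend on the scoring matrix and gap penalties in use.
    """
    return k * query_len * db_len * math.exp(-lam * score)


# Raising the score by 10 shrinks E by a constant factor of exp(lam * 10).
e40 = expected_hits(40, query_len=200, db_len=1_000_000)
e50 = expected_hits(50, query_len=200, db_len=1_000_000)
```

Under these assumed values, each additional point of score multiplies the E-value by the same factor, which is what "decreases exponentially" means here.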

Working of the FASTA algorithm

  • First, a nucleotide or protein sequence is taken as input.
  • The ktup parameter sets the word size and so controls the sensitivity and speed. Word hits are then used to identify potential matches between the query sequence and the database sequences. The smaller the ktup value, the more sensitive the search; by default, ktup is 2 for proteins and 4 or 6 for nucleotides. FASTA also checks for nearby hits within the same segment.
  • Next, based on matches and mismatches, it finds similar local regions and separates the highest-scoring matches from the background hits. BLOSUM50 is the scoring matrix used for protein sequences, and the identity matrix is used for nucleotide sequences.
  • It then saves the best local regions, and rescans and scores them with a suitable scoring matrix.
  • The highest-scoring sub-region is taken from each local region, and the highest of these scores is referred to as init1. Sequences that score below the cutoff value are eliminated.
  • Finally, it uses the Smith-Waterman algorithm to calculate the optimized score for the alignment, and the initial similarity score is used to rank the library sequences.
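
The word-hit stage of the steps above can be sketched as follows. The query is indexed by words of length ktup, the database sequence is scanned for the same words, and hits are counted per diagonal (the offset between the two positions), so that a diagonal with many hits marks a candidate local region. The function names and test sequences are hypothetical, and the rescoring and optimization stages are not included.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def index_words(seq: str, ktup: int) -> Dict[str, List[int]]:
    """Map each word of length ktup to the positions where it starts."""
    table: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        table[seq[i:i + ktup]].append(i)
    return table


def diagonal_hits(query: str, subject: str, ktup: int) -> Dict[int, int]:
    """Count word hits per diagonal (subject position minus query position)."""
    table = index_words(query, ktup)
    hits: Dict[int, int] = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in table.get(subject[j:j + ktup], []):
            hits[j - i] += 1
    return dict(hits)


def best_diagonal(query: str, subject: str,
                  ktup: int) -> Optional[Tuple[int, int]]:
    """Return (diagonal, hit count) for the diagonal with the most hits."""
    hits = diagonal_hits(query, subject, ktup)
    return max(hits.items(), key=lambda kv: kv[1]) if hits else None


# Hypothetical sequences for illustration only.
query = "ACGTACGG"
subject = "TTACGTACGGTT"
print(best_diagonal(query, subject, ktup=4))  # (2, 5)
```

With these example sequences, lowering ktup from 4 to 2 raises the total number of word hits from 6 to 13, which reflects the trade-off described above: smaller words give more sensitivity but more background hits to process.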

Uses of FASTA

  • Identification of species.
  • Establishment of phylogeny.
  • DNA mapping.
  • Understanding the biochemical functions of proteins.
  • Studying the evolution of a species, including where it evolved from and the identity of its ancestors.
  • Calculation of molecular weight.
  • Identification of mutations by comparing sequences with reference sequences.
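
As an illustration of the last use, the sketch below lists the point differences between a sample and a reference sequence. It assumes the two sequences have already been aligned and have the same length with no gaps; in practice an alignment step would come first. The function name and the example sequences are hypothetical.

```python
from typing import List, Tuple


def point_mutations(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """List (1-based position, reference residue, sample residue) mismatches.

    Assumes both sequences are already aligned, gap-free and equal in length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [(i + 1, r, s)
            for i, (r, s) in enumerate(zip(reference, sample))
            if r != s]


# Hypothetical sequences for illustration only.
print(point_mutations("ATGCGT", "ATGAGT"))  # [(4, 'C', 'A')]
```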
