- Computational gene prediction, which locates the protein-coding regions of a genome, is one of the essential problems in bioinformatics.
- Gene prediction basically means locating genes along a genome. Also called gene finding, it refers to the process of identifying the regions of genomic DNA that encode genes.
- This includes protein-coding genes, RNA genes and other functional elements such as regulatory regions.
Importance of Gene Prediction
- Helps to annotate large, contiguous sequences
- Aids in identifying the fundamental elements of a genome, such as functional genes, introns, exons, splice sites, regulatory sites, genes encoding known proteins, motifs, ESTs, ACRs, etc.
- Distinguishes between coding and non-coding regions of a genome
- Predicts complete exon-intron structures of protein-coding regions
- Describes individual genes in terms of their function
- Has wide application in structural genomics, functional genomics, metabolomics, transcriptomics, proteomics, genome studies and other genetics-related studies, including the detection, treatment and prevention of genetic disorders.
Bioinformatics and the Prediction of Genes
- With databases of human and model organism DNA sequences growing rapidly, it has become almost impossible to identify genes solely through conventional, painstaking experimentation on living cells and organisms.
- Formerly, statistical analysis of the rates of homologous recombination of several different genes could determine their order on a certain chromosome, and information from many such experiments could be combined to create a genetic map specifying the rough location of known genes relative to each other.
- Today, however, advances in bioinformatics research are making it increasingly possible to predict the function of this deluge of genes from their sequences alone.
Methods of Gene Prediction
Two classes of methods are generally adopted:
A. Similarity-based searches
It is a method based on sequence similarity searches.
- It is a conceptually simple approach that looks for similarity between the input genome and known sequences: ESTs (expressed sequence tags), proteins, or other genomes.
- This approach is based on the assumption that functional regions (exons) are more conserved evolutionarily than nonfunctional regions (intergenic or intronic regions).
- Once similarity is found between a genomic region and an EST, DNA, or protein sequence, that information can be used to infer the gene structure or function of the region.
- Similarity searches use two kinds of alignment: local and global. The most common local alignment tool is the BLAST family of programs, which detects sequence similarity to known genes, proteins, or ESTs.
- Two other programs, PROCRUSTES and GeneWise, use global alignment of a homologous protein to translated ORFs in a genomic sequence for gene prediction.
- A new heuristic method based on pairwise genome comparison has been implemented in the software called CSTfinder.
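The local alignment idea behind similarity searches can be sketched with the Smith-Waterman dynamic programming algorithm. This is only an illustration: BLAST itself uses a faster seed-and-extend heuristic rather than full dynamic programming, and the function name, the scoring values (match 2, mismatch -1, gap -2) and the toy sequences are assumptions made for this example, not taken from any of the tools named above.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions, with a simple linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy data (hypothetical): a genomic fragment and a short EST-like query.
genomic = "TTTTACGTACGTTTTT"
est = "ACGTACGT"
best_score, aligned_genomic, aligned_est = smith_waterman(genomic, est)
print(best_score, aligned_genomic, aligned_est)
```

A high-scoring local hit of this kind is what suggests that a genomic region may correspond to a known transcript or protein. Real tools add statistical significance estimates and substitution matrices on top of the raw score.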
B. Ab initio prediction
It is a method based on gene structure and signal-based searches.
- It uses gene structure as a template to detect genes
- Ab initio gene predictions rely on two types of sequence information: signal sensors and content sensors.
- Signal sensors refer to short sequence motifs, such as splice sites, branch points, polypyrimidine tracts, start codons and stop codons.
- Content sensors, on the other hand, refer to patterns of codon usage that are characteristic of a species and allow coding sequences to be distinguished from the surrounding non-coding sequences by statistical detection algorithms. Exon detection must rely on the content sensors.
- The search by this method thus relies on the major features present in genes.
- Many algorithms are applied to model gene structure, such as dynamic programming, linear discriminant analysis, linguistic methods, hidden Markov models and neural networks.
- Based on these models, a great number of ab initio gene prediction programs have been developed. Some of the frequently used ones are GeneID, FGENESH, GeneParser, GlimmerM, GENSCAN etc.
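The two kinds of sequence information used by ab initio methods can be illustrated with a small sketch. Here the signal sensor is a scan for start and stop codons on the forward strand only, and the content sensor is a codon-usage log-odds score trained on known coding sequences. The function names, the uniform 1/64 background, the pseudocount and the toy sequences are all assumptions made for this example. Programs such as GENSCAN or GeneID use considerably richer models (splice-site signals, higher-order Markov chains, both strands).

```python
import math
from collections import Counter

START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}
ALL_CODONS = [x + y + z for x in "ACGT" for y in "ACGT" for z in "ACGT"]


def find_orfs(seq, min_codons=3):
    """Signal sensor: list (start, end) of forward-strand ORFs, end exclusive.

    An ORF here runs from the first ATG in a frame to the next in-frame stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def codon_log_odds(training_cds, pseudocount=1.0):
    """Content sensor training: log-odds of each codon vs. a uniform background."""
    counts = Counter()
    for cds in training_cds:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            counts[cds[i:i + 3]] += 1
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    background = 1.0 / len(ALL_CODONS)
    return {
        c: math.log(((counts[c] + pseudocount) / total) / background)
        for c in ALL_CODONS
    }


def coding_score(seq, log_odds):
    """Average per-codon log-odds; higher values look more like the training set."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    if not codons:
        return 0.0
    # Codons containing ambiguous bases (e.g. N) contribute a neutral score of 0.
    return sum(log_odds.get(c, 0.0) for c in codons) / len(codons)


# Toy data (hypothetical): two "known" coding sequences and a short genomic fragment.
training = ["ATGGCTGCTGCTAAATAA", "ATGGCTAAAGCTGCTTGA"]
genome = "CCCATGGCTGCTAAAGCTTAACCC"

log_odds = codon_log_odds(training)
for start, end in find_orfs(genome):
    print(start, end, round(coding_score(genome[start:end], log_odds), 3))
```

In this sketch the signal sensor proposes candidate regions and the content sensor ranks them. Combining both kinds of evidence in a single probabilistic model is what approaches such as hidden Markov models are used for.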