The rapid growth in the amount of biological data created a need for computational tools to manage and analyze it, which led to the emergence of a new field called bioinformatics.
Bioinformatics is a rapidly growing field in biology that focuses on the development and application of computational tools to analyze and interpret biological data. Bioinformatics software, tools, and databases are used to process, store, analyze, and interpret biological data.
A large number of bioinformatics tools are currently available, each designed to address a specific need in biological research. Bioinformatics tools vary in complexity and format, ranging from simple command-line tools to complex graphical programs and web services hosted by bioinformatics organizations and public institutions.
Biological databases are archives of biological data, including genetic and protein sequences, annotations, pathways, and disease information. These databases are used for the storage and organization of data in a way that allows easy retrieval of information.
Types of Biological Databases
Biological databases can be classified into the following three types based on their contents:
Primary databases are collections of unprocessed biological data, consisting of raw sequences or structural information. These databases are repositories of original information and are not modified in any way. Examples of primary databases include GenBank, PDB, and DDBJ.
Secondary databases contain information that has been processed or curated using computational or manual methods. The information in these databases is based on the original data from primary databases. Examples of secondary databases include PIR, SWISS-PROT, and Pfam.
Specialized databases are designed to serve a specific research interest. They focus on a particular organism or type of data. Examples of specialized databases include FlyBase, the HIV sequence database, and the Ribosomal Database Project.
Some of the most popular biological databases are discussed below:
- GenBank is a comprehensive and well-annotated collection of nucleic acid sequence data developed by the National Center for Biotechnology Information (NCBI). It contains data for nearly all types of organisms.
- EMBL (European Molecular Biology Laboratory) is a nucleotide sequence database managed by the European Bioinformatics Institute (EBI). It is an extensive repository of primary nucleotide sequences (DNA and RNA). The EBI also hosts related resources covering gene expression, proteins, structures, pathways, and literature.
- DDBJ (DNA Data Bank of Japan) is a nucleotide sequence database that collects and maintains nucleotide sequence data from researchers. It is operated by the National Institute of Genetics in Japan, collaborating with the National Center for Biotechnology Information (NCBI) and the European Molecular Biology Laboratory (EMBL).
- PDB (Protein Data Bank) is a biological database that contains structural data of biological macromolecules. PDB stores the three-dimensional structural data for large biological molecules such as proteins, DNA, and RNA, determined by experimental methods such as X-ray crystallography and NMR spectroscopy.
- PIR (Protein Information Resource) is a publicly accessible resource for protein informatics. PIR maintains three databases: the Protein Sequence Database (PSD), the Non-redundant Reference (NREF) database, and the integrated Protein Classification (iProClass) database.
- PROSITE is a protein database that contains a large collection of protein patterns or profiles. These patterns are linked to documentation providing useful biological information on the protein family, domain, or functional site.
- Pfam is a database of protein families and domains represented by multiple sequence alignments, profile hidden Markov models (HMMs), and annotations. The database is accessible online and is used by researchers worldwide for various applications, including genome annotation, protein classification, and protein structure prediction.
- KEGG (Kyoto Encyclopedia of Genes and Genomes) is a biological database that contains genomic, chemical, and systemic functional information used to study molecular-level information about various cellular processes, including metabolism, signaling, and diseases.
- OMIM (Online Mendelian Inheritance in Man) is a freely available database of human genes and genetic disorders that contains detailed and referenced overviews of all known Mendelian genetic disorders and over 16,000 genes.
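To make the idea of a PROSITE pattern more concrete, the minimal sketch below converts a pattern written in PROSITE-style syntax into a regular expression and scans a protein sequence with it. The function names `prosite_to_regex` and `find_sites` are hypothetical, the example sequence is invented, and only the most common syntax elements are handled. It illustrates the pattern idea and is not PROSITE's own scanning software.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handled elements (a simplified subset): 'x' for any residue, [ABC] for
    allowed residues, {ABC} for forbidden residues, (n) or (n,m) repeats,
    and '<' / '>' anchors at the very start or end of the pattern.
    """
    pattern = pattern.strip().rstrip(".")  # patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as x(3) or x(2,4).
        repeat = ""
        counted = re.fullmatch(r"(.+)\((\d+(?:,\d+)?)\)", element)
        if counted:
            element, repeat = counted.group(1), "{" + counted.group(2) + "}"
        if element == "x":
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"
        else:
            core = element  # a single residue or an [ABC] class
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_sites(pattern: str, sequence: str):
    """Return (1-based start, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# N-glycosylation motif: Asn, anything but Pro, Ser or Thr, anything but Pro.
print(find_sites("N-{P}-[ST]-{P}", "MNGTAANPSA"))  # [(2, 'NGTA')]
```

The second asparagine in the invented sequence is followed by a proline, so it is correctly rejected by the pattern.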
Importance of Biological Databases
- Biological databases allow for the organization of vast amounts of biological data in a structured manner.
- Biological databases give researchers ready access to reference data that can support their own work.
- Biological databases can be used to develop new bioinformatics tools and methods to drive further research.
- Biological databases also enable collaboration between researchers and facilitate the sharing of data and resources.
Along with the construction and curation of biological databases, bioinformatics also involves the development of computational tools for sequence, structure, and function analysis.
Bioinformatics tools are software programs that allow researchers to analyze biological data.
Types of Bioinformatics Tools
Bioinformatics tools can be classified into various categories based on their functionality, purpose, and complexity. Some of the widely used tools are:
Sequence Analysis Tools
These tools are used for analyzing nucleotide or protein sequences. They are also used for identifying homologous sequences and understanding the evolutionary relationships between different organisms. They include tools used for sequence alignment, sequence database searching, motif discovery, phylogeny, and genome assembly and comparison. Some popular sequence analysis tools include:
- BLAST (Basic Local Alignment Search Tool) is a widely used sequence similarity search tool that compares query sequences to a database of known sequences. It can identify similar sequences, infer evolutionary relationships, and identify potential functional domains within a sequence.
- ClustalW is a multiple sequence alignment program for DNA and protein sequences.
- T-Coffee is another widely used multiple sequence alignment tool that uses a combination of progressive and consistency-based alignment algorithms to produce accurate alignments. It is particularly useful for aligning distantly related sequences.
- MEME (Multiple EM for Motif Elicitation) is used for motif discovery and search. The MEME Suite is a software toolkit that performs four types of motif analysis: motif discovery, motif-motif database searching, motif-sequence database searching, and assignment of function.
- MEGA (Molecular Evolutionary Genetics Analysis) is a user-friendly software package that provides many tools for phylogenetic analysis, including multiple sequence alignment, model selection, and tree inference.
- PHYLIP (Phylogeny Inference Package) is a collection of software applications used to determine the evolutionary relationships among species.
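Sequence similarity searching rests on the idea of local alignment: finding the best-scoring pair of matching regions in two sequences. The sketch below computes a local alignment score with the classic Smith-Waterman dynamic programming method. It is not BLAST's algorithm, which uses faster heuristics to approximate this kind of search. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best Smith-Waterman local alignment score of a and b.

    Uses a linear gap penalty and keeps only two rows of the scoring
    matrix, since only the score (not the alignment itself) is returned.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which lets an alignment restart anywhere.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("TTACGTTT", "GGACGGG"))  # 6 (shared region "ACG")
```

With these assumed scores, the two sequences share only the region "ACG", so the best local score is three matches at 2 points each.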
Structure Analysis Tools
These tools are used for analyzing the structure of proteins and nucleic acids. Structural analysis tools include tools for nucleic acid and protein structure comparison, classification, and prediction. Some popular structure analysis tools include:
- Cn3D is a software package used to view and analyze three-dimensional structures of macromolecules, including nucleic acids and proteins. It provides tools for visualizing and manipulating the 3D structure of macromolecules.
- PyMOL is a molecular visualization tool used for three-dimensional molecular structure visualization, analysis, and animation. It can display molecules in various ways, including as ball-and-stick models, cartoons, or surface representations.
- RasMol is a molecular graphics program designed to visualize and display proteins, nucleic acids, and small molecules in a graphical format.
- MODELLER is a comparative protein structure modeling tool that predicts the structure of a protein from its alignment to one or more proteins of known structure.
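Structure comparison is commonly summarized with the root-mean-square deviation (RMSD) between equivalent atoms of two structures. The minimal sketch below computes RMSD for two coordinate lists, assuming the atoms are already paired one-to-one and the structures are already superimposed. Real comparison tools first find an optimal superposition, which is omitted here. The function name `rmsd` and the coordinates are made up for illustration.

```python
import math


def rmsd(coords_a, coords_b) -> float:
    """Root-mean-square deviation between two equally long lists of (x, y, z).

    Assumes atom i in coords_a corresponds to atom i in coords_b and that
    both structures are already in the same frame of reference.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
shifted = [(x + 3.0, y + 4.0, z) for x, y, z in reference]  # move every atom by 5
print(rmsd(reference, shifted))  # 5.0
```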
Function Analysis Tools
These tools help researchers understand the functions of genes and proteins and the relationships between them, and identify key pathways involved in diseases. They include tools for profiling gene expression, predicting protein-protein interactions, predicting protein subcellular localization, and modeling metabolic pathways. Some popular function analysis tools include:
- GEO (Gene Expression Omnibus) is a public repository of gene expression data that provides tools for searching, downloading, and analyzing gene expression datasets.
- InterProScan is a software package that scans protein sequences against multiple databases of protein domains and families.
- COBRA Toolbox is a software package for constraint-based metabolic modeling that provides a suite of tools for simulating and analyzing metabolic networks.
- Pathway Tools is a software package for constructing and analyzing metabolic pathway models, which includes a database of curated metabolic pathways and tools for metabolic engineering.
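Constraint-based metabolic modeling is built on a mass-balance constraint: at steady state, each internal metabolite is produced as fast as it is consumed, so the stoichiometric matrix multiplied by the flux vector is zero. The sketch below checks that constraint for a toy three-reaction network. The network, the function names, and the flux values are all invented for illustration, and this is not the COBRA Toolbox API.

```python
def net_production(stoichiometry, fluxes):
    """Return the net production rate of each metabolite (S multiplied by v).

    stoichiometry maps a metabolite name to its row of coefficients,
    one coefficient per reaction, in the same order as fluxes.
    """
    for metabolite, row in stoichiometry.items():
        if len(row) != len(fluxes):
            raise ValueError(f"row for {metabolite} does not match flux count")
    return {
        metabolite: sum(coef * flux for coef, flux in zip(row, fluxes))
        for metabolite, row in stoichiometry.items()
    }


def is_steady_state(stoichiometry, fluxes, tol: float = 1e-9) -> bool:
    """True if no metabolite accumulates or is depleted (within tol)."""
    return all(abs(rate) <= tol
               for rate in net_production(stoichiometry, fluxes).values())


# Toy network (hypothetical): uptake -> A, A -> B, B -> export.
toy_network = {
    "A": [1, -1, 0],
    "B": [0, 1, -1],
}

print(is_steady_state(toy_network, [2, 2, 2]))  # True
print(net_production(toy_network, [2, 1, 1]))   # {'A': 1, 'B': 0}
```

In the second call, uptake runs faster than the conversion of A to B, so A accumulates and the flux vector violates the steady-state constraint.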
Applications of Bioinformatics Tools
Bioinformatics tools have numerous applications, including:
- Sequence analysis and phylogenetic analysis tools are used to understand the evolutionary relationships and similarities between sequences.
- Structure and function analysis tools are used to annotate and identify new genes and proteins, as well as to predict their function.
- Protein sequence analysis tools are also used to predict the three-dimensional structure of proteins.
- Functional analysis tools are also used to identify and understand different metabolic pathways.
- Bioinformatics tools are also used to identify potential drug targets and to design new drugs.