Information retrieval refers to the approaches, processes, methods, and procedures for searching, locating, and retrieving recorded data and information from a file or database stored on a computer. Modern information retrieval involves searching full-text databases, locating items in bibliographic databases, and supplying documents over a network.
What is Web Search and Data Retrieval?
- Information retrieval employs two generic techniques: matching the words in the query against the database, and traversing the database with the aid of hypertext or hypermedia links.
- Plain word matching is often not very accurate, so various modifications have been introduced to improve it.
- One enhancement sorts the search output by degree of relevance, based on the statistical match between the keywords in the query and the document.
- In another technique, the program automatically generates a new query from one or more documents that the user has judged relevant.
- Since the early 1960s, one of the dominant approaches for text retrieval has been keyword searching.
- The systems and techniques of information retrieval have changed significantly with the exponential growth and use of computer networks.
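The relevance sorting described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular search engine: it assumes TF-IDF as the "statistical match" between query keywords and documents, and the function names `tokenize` and `rank_documents` and the sample documents are made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def rank_documents(query, documents):
    """Return (index, score) pairs sorted by descending TF-IDF match score."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))
    scores = []
    for i, tokens in enumerate(doc_tokens):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # The +1 keeps terms that occur in every document from scoring zero.
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(tokens)) * idf
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Illustrative documents only.
docs = [
    "DNA sequence databases store nucleotide sequences",
    "Protein structure prediction from amino acid sequence",
    "Keyword searching of bibliographic databases",
]
ranking = rank_documents("protein sequence", docs)
print([index for index, _ in ranking])
```

Documents that share more, and rarer, query keywords rise to the top, while documents with no matching keyword score zero.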
Web-based Molecular Biology search tools
Molecular biology is the study of biological macromolecules (DNA and proteins) at the structural and functional level. Many free resources for studying these macromolecules are available on the internet, and some of them are listed below.
- BYU DNA Sequencing Center Resources: The DNA Sequencing Center is a web resource that helps researchers and students process DNA samples efficiently and effectively. It aims to reduce the overall cost and increase the quality of the generated data.
- DBGET: DBGET is a simple database retrieval system for finding and obtaining specific entries from diverse databases. Here, a database is simply a sequential collection of entries that may be stored in a single file or in multiple files. Because each entry is given a unique identifier, any entry in a molecular biology database can be retrieved uniformly by the combination of the database name and the identifier.
- European Bioinformatics Institute: The European Bioinformatics Institute (EBI) is a center for research and services in bioinformatics that manages databases of biological data, including nucleic acid sequences, protein sequences, and macromolecular structures.
- Expasy: Expasy is a molecular biology server dedicated to the analysis of proteins and nucleic acids, including the identification and characterization of proteins from peptide mass fingerprinting data.
- Java-based Molecular Biologist’s workbench: This web-search tool contains a workbench of tools for the analysis of DNA and proteins, such as data entry, data manipulation, data analysis, and primer design.
- National Center for Biotechnology Information (NCBI): NCBI provides resources for understanding the fundamental molecular and genetic processes that control health and disease. It links to datasets and tools such as the GenBank database, BLAST, MapViewer, the Human-Mouse Homology Map, and the Cancer Genome Anatomy Project, and gives access to Entrez (a retrieval system for searching several linked databases such as PubMed, the nucleotide sequence database, the protein sequence database, structure, and genome) and to online books.
- National Center for Genome Resources (NCGR): NCGR provides links to and information about various genome-related projects.
- Google: Google is a general-purpose search engine that gives web searchers access to a very large number of pages and can also search images and products, along with other features.
- Google Scholar: Google Scholar is a specialized search tool that focuses primarily on information from scholarly and peer-reviewed sources. It gives access to many papers in microbiology, molecular biology, medicine, diseases, and immunology, as well as various bioinformatics-related papers.
- Science Research: Science Research is a search tool similar to Google Scholar that gives access to various peer-reviewed journals.
- ScienceDirect: ScienceDirect is a multidisciplinary database of peer-reviewed journal articles covering research in science, technology, medicine, social science, and the humanities.
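The DBGET idea in the list above, retrieving an entry uniformly from the combination of database name and identifier, can be illustrated with a small sketch. It is not the DBGET implementation: the `DATABASES` dictionary, the database names, the identifiers, and the `get_entry` function are all hypothetical stand-ins for flat-file databases.

```python
# Hypothetical in-memory "databases"; a real system would read entries from files.
DATABASES = {
    "demo_genes": {"G0001": "example gene entry", "G0002": "another gene entry"},
    "demo_proteins": {"P0001": "example protein entry"},
}


def get_entry(reference, databases=DATABASES):
    """Resolve a 'dbname:identifier' reference to the stored entry, or None."""
    dbname, sep, identifier = reference.partition(":")
    if not sep or not dbname or not identifier:
        raise ValueError(f"expected 'dbname:identifier', got {reference!r}")
    # Unknown databases and unknown identifiers both yield None.
    return databases.get(dbname, {}).get(identifier)


print(get_entry("demo_genes:G0002"))
```

Because every lookup goes through the same two-part key, the calling code does not need to know how each individual database is organized.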
Data retrieval tools
Data retrieval is the process of identifying and extracting data from a database in response to a query provided by the user. Based on the query, the database system searches for the data that has been requested and returns it.
Some of the data retrieval tools are:
PubMed is a search and data retrieval resource for biomedical and life science literature, with the aim of improving health both globally and personally. It contains more than 34 million citations and abstracts of biomedical literature and is freely available to the public.
Nucleotide sequence tools
Several tools are available for working with nucleotide sequences, including:
Biosyn Gizmo Tools: a bundle of databases (such as siRNA, proteins, and peptide antigens) and other tools such as a genetic code table and nucleic acid and protein calculations.
BLASTn: BLASTn compares a nucleotide query sequence against the nucleotide sequences in a database, so the search is performed on DNA sequences.
Codon Usage Database: the Codon Usage Database provides a query box for looking up the codon usage table of an organism, searched by its Latin name. It is a useful tool when designing primers and probes.
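To make the idea of a codon usage table concrete, the sketch below counts codons in a coding sequence and converts the counts to a frequency per thousand codons. It is an illustration only: the function names and the short sequence are made up, and the sequence is assumed to be in frame from its first base.

```python
from collections import Counter


def codon_usage(cds):
    """Count codons in a coding sequence read in frame from its first base."""
    seq = cds.upper().replace("U", "T")  # accept RNA input as well
    usable = len(seq) - len(seq) % 3     # ignore a trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def per_thousand(counts):
    """Convert raw codon counts to frequency per 1000 codons."""
    total = sum(counts.values())
    return {codon: 1000.0 * n / total for codon, n in counts.items()}


# Illustrative sequence: ATG GCT GCT GCC TAA
counts = codon_usage("ATGGCTGCTGCCTAA")
print(per_thousand(counts))
```

A table built this way over all coding sequences of an organism shows which synonymous codons it prefers, which is the information used when choosing codons for primers and probes.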
Genomic Resources/ Gene Bank:
GenomeNet: GenomeNet is a Japanese network of database and computational services used for genome research and related research in molecular and cell biology. It was established in 1991 under the Human Genome program.
National Center for Genome Resources: it contains information related to various genome projects.
SoftBerry: SoftBerry is a developer of software tools for genomic research whose primary areas of interest and expertise are genome annotation, functional site identification in DNA and proteins, sequence database management, genome comparison, and protein structure prediction.
UCSC Genome Browser: The University of California, Santa Cruz (UCSC) Genome Browser website hosts a collection of genomes and provides reference sequences that can be browsed and compared with a query sequence.
dbGaP (NCBI): The database of Genotypes and Phenotypes (dbGaP) provides the results of studies on the interaction of genotype and phenotype. These include genome-wide association studies, medical sequencing, and molecular diagnostic assays, as well as associations between genotype and non-clinical traits.
Ensembl: Ensembl is a free online data retrieval tool that produces genome databases for vertebrates and other eukaryotic species.
Protein Sequence Analysis Tools
Some of the protein sequence analysis tools are listed below:
- Expasy: It is a molecular biology server that provides tools for analyzing protein and nucleic acid sequences. Its protein identification and characterization tools include identification and characterization from MS/MS data, and identification by isoelectric point, molecular weight, and amino acid composition.
- FramePlot: FramePlot is used for the prediction of protein-coding regions in bacterial DNA.
- MPEx: Membrane Protein Explorer (MPEx) is a software tool that is used for exploring the topology and other features of membrane proteins using hydropathy plots based on thermodynamic principles.
- PredictProtein: PredictProtein is a service for the analysis of sequences and the prediction of protein structure and function. Users submit protein sequences or alignments, and PredictProtein returns multiple sequence alignments, low-complexity regions, PROSITE sequence motifs, nuclear localization signals, regions lacking regular structure, and predictions of secondary structure. It also provides information on solvent accessibility, transmembrane helices, globular regions, coiled-coil regions, structural switch regions, disulfide bonds, sub-cellular localization, and functional annotations. In addition, CHOP domain assignments, prediction of transmembrane strands, and inter-residue contacts are available.
- ProDom: ProDom is a protein domain family database constructed automatically by clustering homologous segments. The building of ProDom is based on recursive PSI-BLAST searches. The source protein sequences are the non-fragmentary sequences derived from the SWISS-PROT and TrEMBL databases.
- ProtScale: ProtScale computes and displays the profile produced by any amino acid scale on a selected protein. An amino acid scale assigns a numerical value to each type of amino acid. The most frequently used scales are the hydrophobicity or hydrophilicity scales and the secondary structure conformational parameter scales, but many other scales exist that are based on different physical and chemical properties of the amino acids.
- Worldwide Protein Data Bank (wwPDB): it maintains a single Protein Data Bank archive of macromolecular structural data that is freely available to researchers and the wider community.
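The amino acid scale profile described in the ProtScale entry, which is also the basis of the hydropathy plots mentioned for MPEx, can be sketched as a sliding-window average. This is a simplified illustration, not the code of either tool: it assumes the Kyte-Doolittle hydropathy scale, an unweighted window, and a made-up function name `scale_profile`.

```python
# Kyte-Doolittle hydropathy values, one number per amino acid (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def scale_profile(sequence, scale=KYTE_DOOLITTLE, window=9):
    """Average the scale over each full window; one value per window position."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [scale[aa] for aa in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Illustrative peptide: a hydrophobic stretch followed by a charged stretch.
profile = scale_profile("IIIDDD", window=3)
print(profile)
```

Plotting the returned values against window position gives the familiar profile in which sustained high values suggest hydrophobic, possibly membrane-spanning, segments.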
Entrez is a molecular biology database system produced by the National Center for Biotechnology Information (NCBI). It provides access to nucleotide and protein sequence data, gene-centered and genome mapping information, 3D structure data, PubMed, and much more. The Entrez retrieval system uses an intuitive user interface for rapid searching of sequence and bibliographic data, and it covers more than 20 databases, including protein sequence data from SWISS-PROT, PIR, and PDB along with nucleotide sequence data from GenBank. Entrez results can be viewed in various formats such as flat file, FASTA, and XML. Similarly, Entrez Global Query is an integrated search and retrieval system that provides access to all the databases simultaneously with a single query string and user interface, and can efficiently retrieve related sequences, structures, and references.
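As an illustration of one of the result formats mentioned above, the following sketch parses FASTA-formatted text into header and sequence pairs. It does not contact Entrez; the function name `parse_fasta` and the sample records are hypothetical, and only the general FASTA convention (a ">" header line followed by sequence lines) is assumed.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records for illustration.
sample = "\n".join([
    ">seq1 hypothetical example record",
    "ATGGCT",
    "GCTTAA",
    ">seq2 another hypothetical record",
    "MKV",
])
records = parse_fasta(sample)
print(records)
```

Sequence lines that belong to the same record are joined, so a sequence wrapped over several lines is returned as a single string.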
The following databases are searched by Entrez:
- OMIM: Online Mendelian Inheritance in Man is a comprehensive and authoritative knowledge base of human genes and genetic disorders, compiled to support human genetics research and education, which was started by Dr. Victor A.
- OMIA: Online Mendelian Inheritance in Animals is an online search tool that collects data on animal genes and genetic disorders.
- PubChem Compound: it provides data on unique small molecules and their chemical structures.
- PubChem Substance: it provides records of deposited chemical substances.
- PubChem BioAssay: it provides data on bioactivity screens of chemical substances.
- UniGene: it contains gene-oriented clusters of transcript sequences.
SRS (Sequence Retrieval System)
The Sequence Retrieval System (SRS) is an information indexing and retrieval system designed for libraries in flat-file format. It supports the data structure of these libraries by providing dedicated indices for features such as feature tables or taxonomic classifications. It is designed for flat-file libraries such as the EMBL nucleotide sequence databank, the SWISS-PROT protein sequence database, or the PROSITE library of protein sequence consensus patterns. Because it can be readily customized, a particular strength of the system is that it can be used with any defined set of databases. It also serves as a network browser for databases in molecular biology and is one of the more powerful search and retrieval systems.
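The core idea of indexing a flat-file library can be sketched as follows. This is not how SRS itself is implemented: it is a minimal illustration that assumes EMBL-style entries, each starting with an "ID" line and ending with a "//" line, and the entry names and function names are hypothetical.

```python
def build_index(flatfile_text):
    """Map each entry name (first word after 'ID') to its (start, end) line range."""
    index = {}
    lines = flatfile_text.splitlines()
    name, start = None, None
    for lineno, line in enumerate(lines):
        parts = line.split()
        if line.startswith("ID ") and len(parts) > 1:
            name, start = parts[1].rstrip(";"), lineno
        elif line.startswith("//") and name is not None:
            index[name] = (start, lineno + 1)  # end is exclusive
            name = None
    return index, lines


def fetch(index, lines, name):
    """Return the full text of one entry using the index, without rescanning."""
    start, end = index[name]
    return "\n".join(lines[start:end])


# Made-up two-entry library for illustration.
sample = "\n".join([
    "ID   DEMO1; hypothetical entry",
    "DE   First demo record.",
    "//",
    "ID   DEMO2; hypothetical entry",
    "DE   Second demo record.",
    "//",
])
index, lines = build_index(sample)
print(fetch(index, lines, "DEMO2"))
```

The file is scanned once to build the index, after which any entry can be retrieved directly by name. Additional indices, for example on taxonomy lines, could be built in the same way.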