Databases and Biological Data Mining in Bioinformatics

What are Primary Databases?

Primary databases hold experimentally derived data, such as nucleotide sequences, protein sequences, or macromolecular structures, taken directly from primary sources. Researchers or other primary data providers submit their experimental results directly to the database. Once an entry has been assigned an accession number, the database preserves it as submitted rather than curating or altering it. For this reason a primary database is also known as an archival database, and it typically has a high level of data duplication. Examples of primary databases include GenBank, DDBJ, and PDB.

What are Secondary Databases?

A secondary database contains data derived from the analysis of primary data; it is also known as a curated or derived database. Secondary databases often combine complex computational algorithms with manual analysis and interpretation to derive new knowledge from the public record of science, and they have become one of the standard references for molecular biologists over the past decade. Because they are curated, secondary databases have a low level of data duplication. Their content consists of significant derived information such as conserved sequences, signature sequences, and active-site residues of proteins. Examples of secondary databases include Pfam, InterPro, PROSITE, BLOCKS, PRINTS, and OMIM.
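As an illustration of how a signature sequence from a secondary database might be used, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It handles only the basic syntax elements (x, [..], {..}, repeat counts, and end anchors); the function names and the toy peptide are assumptions made for this example, and it is not PROSITE's own scanning tool.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    regex = pattern.rstrip(".")                          # patterns end with a period
    regex = regex.replace("-", "")                       # '-' only separates positions
    regex = regex.replace("x", ".")                      # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} = any residue except P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) = 2 to 4 repeats
    regex = regex.replace("<", "^").replace(">", "$")    # sequence-end anchors
    return regex

def find_motifs(pattern: str, sequence: str):
    """Return (start, matched_text) for each non-overlapping match."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in compiled.finditer(sequence)]

# Made-up peptide, used only for illustration.
toy_sequence = "MKNGSAANPSQ"
print(prosite_to_regex("N-{P}-[ST]-{P}."))            # N[^P][ST][^P]
print(find_motifs("N-{P}-[ST]-{P}.", toy_sequence))   # [(2, 'NGSA')]
```

The pattern used here, N-{P}-[ST]-{P}, is the N-glycosylation site signature; the second asparagine in the toy peptide is not reported because it is followed by a proline.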

What are Structural Databases?

Structural databases are essential tools for crystallographic work and are needed at several stages of producing, solving, refining, and publishing the structure of a new material. They are used to verify the results of a structure refinement by finding structures whose bond distances, bond angles, or coordination environments are comparable to those of the new structure. A structural database records the chemical compound name, the formula, and the oxidation states of the elements present. It also records the contents (number of formula units per unit cell), dimensions, and symmetry of the unit cell, along with the symmetry of the structure, atomic coordinates, occupancies, and thermal parameters. The structures in the database were solved using X-ray, neutron, or electron diffraction, primarily on single-crystal samples.
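The comparison of bond distances and bond angles can be sketched in a few lines. The example below assumes Cartesian coordinates in ångströms; crystallographic entries usually store fractional coordinates, which would first have to be converted using the unit cell parameters. The function names, the tolerance, and the coordinates are hypothetical and chosen only for illustration.

```python
import math

def bond_distance(a, b):
    """Euclidean distance between two atoms given as (x, y, z) tuples."""
    return math.dist(a, b)

def bond_angle_deg(a, center, b):
    """Angle a-center-b in degrees."""
    v1 = [p - c for p, c in zip(a, center)]
    v2 = [p - c for p, c in zip(b, center)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cosine = dot / (math.hypot(*v1) * math.hypot(*v2))
    cosine = max(-1.0, min(1.0, cosine))   # guard against rounding error
    return math.degrees(math.acos(cosine))

def is_comparable(value, reference, tolerance):
    """True if a refined value lies within `tolerance` of a reference value."""
    return abs(value - reference) <= tolerance

# Made-up coordinates for three atoms.
center = (0.0, 0.0, 0.0)
atom_1 = (1.0, 0.0, 0.0)
atom_2 = (0.0, 1.0, 0.0)

d = bond_distance(center, atom_1)
angle = bond_angle_deg(atom_1, center, atom_2)
print(d, angle)
print(is_comparable(d, reference=1.02, tolerance=0.05))
```

In practice the reference values would come from comparable structures retrieved from the database, and the tolerance would depend on the bond type and data quality.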

3D Macromolecular Structure Tools

  • Cn3D: Cn3D is a helper application for web browsers that lets users view 3-dimensional structures from NCBI’s Entrez retrieval service. It displays structures, sequences, and alignments, offers powerful annotation and alignment-editing features, and runs on Windows, Mac, and Unix.
  • DeepView: DeepView (also known as Swiss-PdbViewer) is an application for analyzing several proteins at the same time. The proteins can be superimposed to deduce structural alignments and to compare their active sites or other relevant parts. Amino acid mutations, H-bonds, angles, and distances between atoms are easy to obtain with this tool.
  • POV-Ray: POV-Ray is a ray-tracing program that can be used to render images of protein structures. When it is used together with Swiss-PdbViewer, the resulting images are sharper and the colors more vivid.
  • RasMol: RasMol is a powerful software tool for visualizing macromolecular structures and their relation to function. It runs on Mac and Windows computers and lets the user rotate a protein or DNA molecule to examine its 3D structure.
  • RCSB Protein Data Bank: It provides tools and resources for studying the structures of biological macromolecules and their relationships to sequence, function, and disease, and it offers tools for browsing, searching, and reporting the data. The Research Collaboratory for Structural Bioinformatics (RCSB) works to make the study of 3D molecular structures easier and more efficient.
Figure: Databases and Biological Data Mining in Bioinformatics. Image Source: Respective Database Website.

Data Mining of Biological Databases

Biological databases are libraries of life-science information collected from scientific experiments, published literature, high-throughput technologies, and in silico (computational) analyses. The discovery of knowledge from the data held in these databases is called biological data mining.

  • In the clinical context, biologists and clinicians are stepping up their efforts to unravel the biological processes that underlie disease pathways.
  • As a result, large amounts of data are generated in the biological and clinical fields, ranging from genome sequences and DNA microarrays to protein and small-molecule structures, biomolecular interactions, disease pathways, biomedical images, and electronic health records.
  • The large volume of biological data now held in databases has driven the development of our ability to mine and analyze the data effectively. At present, researchers increasingly invest time in data mining rather than only in data generation.
  • The data mining process involves collecting, selecting, and transforming the data, followed by visualizing and evaluating the extracted information.
  • Data mining employs techniques and algorithms from statistics, machine learning, artificial intelligence, databases, and data warehousing. Its core tasks include classification, clustering, association analysis, sequence analysis, and regression.
  • Biological data mining is therefore expected to be one of the crucial ways to better understand intrinsic disease mechanisms, discover new drugs, and support decisions in the health and clinical sectors that ultimately benefit patients.
  • However, translating the vast amount of biological data into valuable insights is not easy, because it requires proper handling of noisy and incomplete data.
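The steps listed above (selection, transformation, and a mining task such as clustering) can be sketched on a toy data set. The example below drops incomplete records, standardizes each column, and runs a small k-means clustering with fixed starting centroids. All values are made up, the function names are hypothetical, and k-means is just one of many clustering methods that could be used here.

```python
import math
import statistics

def select_complete(records):
    """Selection: keep only records without missing (None) values."""
    return [r for r in records if None not in r]

def zscore_columns(rows):
    """Transformation: scale each column to mean 0 and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.pstdev(c) or 1.0 for c in columns]   # avoid division by zero
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]

def kmeans(points, centroids, iterations=10):
    """Mining: assign each point to its nearest centroid, then update centroids."""
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        updated = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append(tuple(statistics.mean(c) for c in zip(*members)))
            else:
                updated.append(centroids[k])   # keep an empty cluster's centroid
        centroids = updated
    return labels

# Made-up expression levels of two genes in six samples; one value is missing.
raw = [(1.0, 1.2), (0.9, 1.0), (None, 1.1), (5.0, 5.2), (5.1, 4.9), (1.1, 0.8)]

clean = select_complete(raw)
scaled = zscore_columns(clean)
labels = kmeans(scaled, centroids=[scaled[0], scaled[2]])
print(labels)   # [0, 0, 1, 1, 0]
```

Dropping incomplete records is the simplest way to handle missing data; imputation is a common alternative when too many records would otherwise be lost.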

Applications of Data Mining in Bioinformatics

  • Sequence analysis: Sequence analysis consists of finding the parts of biological sequences that are alike and identifying the regions that differ, for example during medical analysis and genome mapping.
  • Genome annotation: Genome annotation is the process of marking genes and other biological features in a DNA sequence.
  • Analysis of gene expression: Gene expression is analyzed by measuring RNA levels with techniques such as microarrays, expressed sequence tag (EST) sequencing of cDNA, massively parallel signature sequencing (MPSS), or various applications of multiplexed in situ hybridization.
  • Analysis of protein expression: The proteins present in a sample are identified with techniques such as protein microarrays and high-throughput mass spectrometry.
  • Analysis of mutations in cancer: The variety of point mutations in cancer genes is identified by massive sequencing efforts.
  • Protein structure prediction: In most cases, the primary structure (amino acid sequence) of a protein largely determines its three-dimensional structure, and the primary structure can be easily determined from the sequence of the gene that codes for it.
  • Comparative genomics: Comparative genomics is the study of genome structure and function across different biological species. It is used for finding genes and discovering new non-coding functional elements of the genome.
  • Modeling biological systems: This is one of the significant tasks of systems biology and mathematical biology. It aims to develop and use efficient algorithms, data structures, and visualization and communication tools to integrate vast quantities of biological data, with the goal of building efficient computer models.
  • Protein-protein docking: Many three-dimensional protein structures have been determined with techniques such as X-ray crystallography and protein nuclear magnetic resonance (NMR) spectroscopy, and a variety of methods are used to predict protein-protein docking.
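The simplest form of the sequence analysis described in the first item above, finding where two sequences agree and where they differ, can be sketched as follows. The example assumes the two sequences are already aligned and of equal length; the function name and the sequences are made up for illustration, and real analyses would normally start from an alignment tool.

```python
def compare_aligned(seq_a: str, seq_b: str):
    """Return (percent identity, mismatch positions) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # 0-based positions at which the two sequences differ.
    mismatches = [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
    identity = 100.0 * (len(seq_a) - len(mismatches)) / len(seq_a)
    return identity, mismatches

# Made-up DNA fragments.
identity, mismatches = compare_aligned("ACGTACGT", "ACGAACCT")
print(identity)     # 75.0
print(mismatches)   # [3, 6]
```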

Biological File Formats

Several file formats are used in bioinformatics; some of them are listed below:

  • FASTA: FASTA is a very basic text format for representing nucleotide sequences (nucleic acids) or amino acid sequences (proteins). Each record has at least two lines: a single-line identifier and description that begins with the > sign, followed by one or more lines of sequence data. Common file extensions for the FASTA format are file.fa, file.fasta, and file.fsa.
Figure: FASTA Format. Image Source: NCBI
  • FASTQ: FASTQ is one of the most widely used formats in sequence analysis because it contains more information than FASTA. Each sequence takes four lines: the first line is the sequence header, which starts with @; the second line is the sequence; the third line starts with ‘+’; and the fourth line holds the quality scores, one character per base.
  • SAM (Sequence Alignment/Map): SAM is a human-readable, tab-delimited text format for storing alignment data. It is usually produced by mapping sequencing reads to a reference sequence, and it is often converted to BAM, which stores the same data in a compressed, indexed, binary form. A SAM file has a header and a body. Header lines start with @ and hold generic information about the file. The body consists of the alignment records. The file extension of the SAM format is file.sam.
  • BAM: The BAM (Binary Alignment/Map) format is the compressed binary version of the SAM format and represents aligned sequences in a compact form. SAM and BAM contain exactly the same data, but BAM files are smaller and are ideal for storing alignments. Because BAM is binary, viewing a file requires a tool such as samtools.
  • VCF (Variant Call Format): VCF is a text file format with a header and a body. The header consists of meta-information lines, each beginning with the ‘##’ string, followed by a column header line beginning with a single ‘#’; the body consists of data lines, one per variant. To document the data, it is recommended that the header include INFO, FILTER, and FORMAT entries describing the fields used. Each data line has eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); when genotype data are present, a ninth FORMAT column and one additional column per sample follow. VCF is a highly flexible format: the header can also carry information such as alternate allele, assembly, pedigree, and sample fields. The file extension for the VCF format is file.vcf.
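To make the text formats above concrete, the sketch below reads FASTA records, FASTQ records, and a single VCF data line using only the Python standard library. It assumes well-formed input, four-line FASTQ records, and Phred+33 quality encoding; the VCF parser relies on the eight fixed fields defined by the VCF specification (CHROM through INFO), optionally followed by FORMAT and sample columns. The function names and all sample data are made up, and production work would normally use an established parsing library instead.

```python
def parse_fasta(text: str) -> dict:
    """Return {identifier: sequence}; multi-line sequences are concatenated."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'.
            name = (line[1:].split() or [""])[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def parse_fastq(text: str) -> list:
    """Return (identifier, sequence, quality scores) for each 4-line record."""
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    records = []
    for i in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        # Assumption: Phred+33 encoding (ASCII code minus 33).
        scores = [ord(c) - 33 for c in quality]
        records.append((header[1:], sequence, scores))
    return records

VCF_FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line: str) -> dict:
    """Split one tab-delimited VCF data line into named fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_FIXED, fields[:8]))
    record["POS"] = int(record["POS"])
    record["FORMAT"] = fields[8] if len(fields) > 8 else None
    record["SAMPLES"] = fields[9:]          # one entry per sample column
    return record

# Made-up inputs, for illustration only.
print(parse_fasta(">seq1 toy record\nACGT\nTTGA\n>seq2\nGGCC\n"))
print(parse_fastq("@read1\nACGT\n+\nII5!\n"))
print(parse_vcf_line("chr1\t100\t.\tA\tG\t50\tPASS\tDP=10\tGT\t0/1"))
```

SAM bodies are tab-delimited in the same way as VCF data lines, so a similar split on tab characters applies there; BAM files are binary and need a dedicated tool such as samtools.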


