- As biology has increasingly turned into a data-rich science, the need for storing and communicating large datasets has grown tremendously.
- The obvious examples are the nucleotide sequences, the protein sequences, and the 3D structural data produced by X-ray crystallography and macromolecular NMR.
- Nucleic acid data are available as sequences, whereas protein data are available as both sequences and structures. A sequence is a one-dimensional representation, while a structure adds the three-dimensional arrangement of the molecule that the sequence encodes.
- A biological database is a collection of data that is organized so that its contents can easily be accessed, managed, and updated.
- The database is complemented with generalized software for processing, archiving, querying and distributing data.
- Such databases consisting of nucleotide sequences are called nucleic acid sequence databases.
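To make the one-dimensional nature of sequence data concrete, the sketch below reads sequence records into plain strings. It assumes the widely used FASTA text format, which the text above does not itself mention; the function name `parse_fasta` and the two records are invented for illustration only.

```python
def parse_fasta(text):
    """Return a dict mapping each record identifier to its sequence string."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first word after '>' as the identifier
            words = line[1:].split()
            current_id = words[0] if words else ""
            records[current_id] = []
        elif current_id is not None:
            # Sequence line: a record may be wrapped over several lines
            records[current_id].append(line)
    return {rid: "".join(parts) for rid, parts in records.items()}


# Hypothetical records, made up for this example (not real database entries)
example = """>seq1 made-up nucleotide record
ATGGCGTAC
GGCTTA
>seq2 made-up protein record
MKTAYIAK
"""

sequences = parse_fasta(example)
for rid, seq in sequences.items():
    print(rid, len(seq), seq)
```

Each record reduces to an identifier and a single string of letters, which is why sequence databases can be stored and searched as text.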
Nucleic acid Sequence Databases
The NCBI Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, TPA (Third Party Annotation) and PDB. Genome, gene and transcript sequence data provide the foundation for biomedical research and discovery.
1. Primary databases of nucleotide sequences
- Three chief databases store raw nucleic acid sequences and make them available to researchers and the public alike: GenBank, EMBL, and DDBJ.
- They are referred to as the primary nucleotide sequence databases since they are the repository of all nucleic acid sequences.
- GenBank is physically located in the USA and is accessible through the NCBI portal over the internet.
- The EMBL (European Molecular Biology Laboratory) database is hosted in the UK, and DDBJ (DNA Data Bank of Japan) is in Japan.
- All three accept nucleotide sequence submissions and then exchange new and updated data on a daily basis to achieve optimal synchronization between them.
- These three databases are primary databases, as they house original sequence data.
- They collaborate with the Sequence Read Archive (SRA), which archives raw reads from high-throughput sequencing instruments.
a. GenBank
The GenBank sequence database is an open-access, annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced and maintained by the National Center for Biotechnology Information (NCBI) as part of the International Nucleotide Sequence Database Collaboration (INSDC). GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. GenBank has become an important database for research in biological fields and has grown at an exponential rate in recent years, doubling roughly every 18 months.
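The doubling figure can be turned into simple arithmetic. The sketch below assumes idealized exponential growth with a constant 18-month doubling time, which is a simplification of the "roughly every 18 months" stated above; the function names are illustrative.

```python
import math


def growth_factor(months, doubling_time_months=18.0):
    """Factor by which the database grows after `months`, assuming a constant doubling time."""
    return 2.0 ** (months / doubling_time_months)


def months_to_grow(factor, doubling_time_months=18.0):
    """Months needed to grow by `factor` under the same assumption."""
    return doubling_time_months * math.log2(factor)


print(growth_factor(36))   # two doubling periods
print(months_to_grow(8))   # three doublings are needed for an 8-fold increase
```

Under this assumption, three years correspond to two doublings (a 4-fold increase), and an 8-fold increase takes three doublings, or 54 months.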
b. EMBL (European Molecular Biology Laboratory)
The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database is a comprehensive collection of primary nucleotide sequences maintained at the European Bioinformatics Institute (EBI). Data are received from genome sequencing centers, individual scientists and patent offices.
c. DDBJ (DNA Data Bank of Japan)
It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is the only nucleotide sequence data bank in Asia. Although DDBJ mainly receives its data from Japanese researchers, it can accept data from contributors from any other country.
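Since GenBank is accessible through NCBI over the internet, a record can also be requested programmatically. The sketch below builds a request URL for NCBI's E-utilities `efetch` service. The endpoint and its `db`, `id`, `rettype` and `retmode` parameters come from NCBI's public E-utilities interface rather than from the text above, and the accession number is only an example; substitute any valid one. The helper names `build_efetch_url` and `fetch_record` are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint (an assumption of this sketch, not named in the text)
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, rettype="fasta"):
    """Build an efetch URL for one nucleotide record, returned as plain text."""
    params = {
        "db": "nuccore",      # NCBI Nucleotide database
        "id": accession,      # accession number of the record
        "rettype": rettype,   # "fasta" for sequence only, "gb" for the annotated flat file
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_record(accession, rettype="fasta"):
    """Download the record over the network (not called here; requires internet access)."""
    with urlopen(build_efetch_url(accession, rettype), timeout=30) as response:
        return response.read().decode("utf-8")


url = build_efetch_url("U00096.3")  # example accession only
print(url)
```

The download step is kept in a separate function so that the URL can be inspected without any network traffic. Because the three primary databases exchange data daily, the same accession should also be retrievable from the EMBL and DDBJ services, each through its own interface.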
2. Secondary databases of nucleotide sequences
- Many of the secondary databases are simply sub-collections of sequences culled from one or another of the primary databases, such as GenBank or EMBL.
- They usually also add a great deal of value in terms of annotation, software, presentation of the information and cross-references.
- Other secondary databases do not present sequences at all, but only information gathered from sequence databases.
a. Omniome Database:
Omniome Database is a comprehensive microbial resource maintained by TIGR (The Institute for Genomic Research). It has not only the sequence and annotation of each of the completed genomes, but also has associated information about the organisms (such as taxon and gram stain pattern), the structure and composition of their DNA molecules, and many other attributes of the protein sequences predicted from the DNA sequences.
It facilitates meaningful multi-genome searches and analyses, for instance alignment of entire genomes and comparison of the physical properties of proteins and genes from different genomes.
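One simple compositional property that can be compared across genomes is GC content, the fraction of bases that are G or C. The sketch below is a generic illustration of such a comparison, not the method used by the Omniome Database; the organism names and sequences are invented.

```python
def gc_content(sequence):
    """Return the fraction of G and C among the unambiguous bases (A, C, G, T)."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        return 0.0  # avoid division by zero for empty or fully ambiguous input
    gc = sum(1 for base in counted if base in "GC")
    return gc / len(counted)


# Hypothetical sequences, made up for this example
genomes = {
    "organism_A": "ATGCGCGGCTTAACG",
    "organism_B": "ATATTTAAGCATATA",
}

for name, seq in genomes.items():
    print(f"{name}: GC content = {gc_content(seq):.2f}")
```

Ambiguity codes such as N are ignored here so that they do not dilute the ratio; a real analysis would need to decide how to treat them.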
b. FlyBase Database:
FlyBase is the database for the fruit fly Drosophila melanogaster, whose entire genome was sequenced by a consortium to a high degree of completeness and quality.
c. ACeDB:
It is a repository of not only the sequence but also the genetic map and phenotypic information about the nematode worm C. elegans.