Modern genomic research generates vast amounts of raw sequence data, which has created the need for biological databases to store and organize it. Biological databases are collections of biological data organized so that information can be retrieved easily.
What are Primary Databases?
Primary databases are a type of biological database that contains original, unprocessed biological data. These databases typically consist of raw sequences, such as nucleotide or protein sequences, or structural information, such as molecular structures.
Several primary sequence databases are widely used in bioinformatics. The three main ones are GenBank at the National Center for Biotechnology Information (NCBI), the DNA Data Bank of Japan (DDBJ), and the nucleotide sequence database of the European Molecular Biology Laboratory (EMBL). Other examples of primary databases include Protein Data Bank (PDB), Gene Expression Omnibus (GEO), and ArrayExpress.
GenBank
- GenBank is a primary biological database managed by the National Center for Biotechnology Information (NCBI). It is an annotated collection of publicly available sequences, which includes information about genes, proteins, and other genetic elements.
- GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC), which is a joint effort between three primary databases: GenBank, DDBJ, and EMBL. These organizations work collaboratively to share sequence data from around the world on a daily basis and ensure that the data in each database is up-to-date and accurate.
- The GenBank flat file format is used to represent the sequence data and annotations in the database.
- GenBank accepts mRNA or genomic sequence data with proper source organism information and annotation provided by the submitter. However, the database does not accept noncontiguous sequences, primer sequences, protein sequences without underlying nucleotide submission, mixed genomic and mRNA sequences, consensus sequences, or sequences with lengths of less than 200 nucleotides.
- Several tools are available for submitting sequences to this database, including the web-based BankIt and the stand-alone Sequin and tbl2asn.
- BankIt is a web-based submission tool that allows users to submit gene sequences to the GenBank database. It allows for the submission of sets of sequences.
- The Sequin submission tool is used for more complex submissions, such as those containing long sequences, multiple annotations, or gapped sequences. Sequin is a stand-alone submission tool provided by NCBI that can be downloaded from the FTP site for use on Mac, PC, and UNIX platforms. To ensure maximum performance, each Sequin file should have fewer than 10,000 sequences.
- For even larger submissions, the tbl2asn submission tool should be used. Like Sequin, tbl2asn is a stand-alone tool that can be downloaded from the FTP site. The submitter can work offline, using tbl2asn to prepare the submission files before sending them to GenBank.
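The flat file layout and the acceptance rules described above can be illustrated with a short sketch. It assumes a heavily simplified flat file (only the ACCESSION, ORGANISM, and ORIGIN fields are read), and the record, its accession `TOY000001`, and the function names are hypothetical, made up for illustration. It checks only two of the listed criteria: that a source organism is present and that the sequence is at least 200 nucleotides long.

```python
# A toy record in a simplified GenBank-style flat file layout.
# It is NOT a real GenBank entry; the accession is hypothetical.
TOY_RECORD = """\
LOCUS       TOY000001                 120 bp    mRNA    linear   SYN 01-JAN-2000
DEFINITION  Illustrative toy record, not a real GenBank entry.
ACCESSION   TOY000001
SOURCE      synthetic construct
  ORGANISM  synthetic construct
ORIGIN
        1 acgtacgtac acgtacgtac acgtacgtac acgtacgtac acgtacgtac acgtacgtac
       61 acgtacgtac acgtacgtac acgtacgtac acgtacgtac acgtacgtac acgtacgtac
//
"""

MIN_LENGTH = 200  # minimum sequence length stated in the text above


def parse_flat_file(text):
    """Extract accession, organism, and sequence from one simplified record."""
    record = {"accession": None, "organism": None, "sequence": ""}
    seq_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines hold a position number followed by blocks of bases;
            # keep only the letters.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.strip().startswith("ORGANISM"):
            record["organism"] = line.strip()[len("ORGANISM"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
    record["sequence"] = "".join(seq_parts).upper()
    return record


def precheck(record):
    """Return a list of problems found for two of the acceptance criteria."""
    problems = []
    if not record["organism"]:
        problems.append("missing source organism")
    if len(record["sequence"]) < MIN_LENGTH:
        problems.append(f"sequence shorter than {MIN_LENGTH} nucleotides")
    return problems


record = parse_flat_file(TOY_RECORD)
print(record["accession"], len(record["sequence"]), precheck(record))
```

The toy record is deliberately too short, so the check reports one problem. Real flat files carry many more fields (features, references, version information), and a production workflow would normally rely on an established parser rather than a hand-written one.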
European Molecular Biology Laboratory (EMBL)
- The EMBL Nucleotide Sequence Database, often referred to simply as EMBL, is a collection of nucleotide sequence data maintained by the European Bioinformatics Institute (EBI), an outstation of the European Molecular Biology Laboratory. Like GenBank and DDBJ, it is part of INSDC.
- EMBL’s main focus is on the storage and distribution of nucleotide and protein sequences, as well as providing tools and resources for researchers to analyze and interpret this data.
- Like other primary databases, EMBL collects and archives data from various sources, including scientific publications and direct submissions from researchers.
- One of the main features of EMBL is its user-friendly interface, which allows researchers to easily search for and retrieve data.
- EMBL also offers a range of tools and resources for sequence analysis, including alignment tools, phylogenetic trees, and protein structure prediction software.
- EMBL uses a sequence submission tool called Webin. This tool is web-based and can be accessed through EMBL’s website. With Webin, researchers can submit single sequences, multiple sequences, or a large number of sequences.
DNA Data Bank of Japan (DDBJ)
- DDBJ (DNA Data Bank of Japan) is a primary database that collects and stores genetic information, mainly from Japanese researchers. It also accepts submissions from researchers in other countries and assigns accession numbers to their sequences.
- DDBJ is also a member of INSDC and regularly exchanges collected data with EMBL and GenBank.
- Its main activities include collecting and exchanging nucleotide sequence data, managing bioinformatics tools for data submission and retrieval, developing tools for biological data analysis, and organizing Bioinformatics Training Courses in Japanese to teach people how to analyze biological data.
- DDBJ uses a web-based tool called the Nucleotide Sequence Submission System (NSSS) for sequence submissions. In November 2012, the NSSS replaced Sakura, which had been used for sequence submission since 1995. In cases where the sequences are very long or numerous, DDBJ recommends using its Mass Submission System (MSS).
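The data exchange among the three INSDC members can be pictured with a toy model. The sketch below is only an illustration of the idea, not of how the exchange is actually implemented: it assumes that each database is a simple mapping from accession number to sequence and that accession numbers are unique across the three databases. The accession labels and the function name `exchange` are hypothetical.

```python
def exchange(databases):
    """Return each database's contents after one full exchange round.

    `databases` maps a database name to a dict of {accession: sequence}.
    After the exchange, every database holds the union of all records.
    """
    all_records = {}
    for records in databases.values():
        all_records.update(records)
    # Give every member its own copy of the combined record set.
    return {name: dict(all_records) for name in databases}


# Hypothetical new submissions received by each member on one day.
before = {
    "GenBank": {"ACC_A": "ACGT"},
    "EMBL": {"ACC_B": "GGCC"},
    "DDBJ": {"ACC_C": "TTAA"},
}
after = exchange(before)
print(sorted(after["DDBJ"]))
```

After one round, a query against any of the three members returns the same set of records, which is the practical consequence of the collaboration for users.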
Protein Data Bank (PDB)
- PDB (Protein Data Bank) is a global database that stores information about the structure of biological macromolecules.
- It is managed by the Research Collaboratory for Structural Bioinformatics (RCSB) and provides many services to help researchers access and analyze the structural data.
- It collects and archives the 3D atomic-level structural models of these macromolecules obtained through three commonly used experimental techniques: X-ray crystallography, nuclear magnetic resonance spectroscopy (NMR), and electron microscopy (3DEM).
- The database entries are mostly structures of proteins, although there are also entries for nucleic acids, carbohydrates, and theoretical models.
- In addition to the structural models, PDB also archives experimental data, associated metadata, and other details about the molecules.
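To make "atomic-level structural model" concrete, the following sketch reads atom coordinates from two lines written in the fixed-column ATOM layout of the legacy PDB text format and computes the distance between the atoms. The file format is not described in the text above, so its use here is an assumption made for illustration. The two atoms and their coordinates are toy values, not taken from any real entry, and the function name is hypothetical.

```python
import math

# Two toy ATOM lines in the legacy fixed-column PDB layout.
# Coordinates are invented so that the distance is easy to verify by hand.
TOY_ATOM_LINES = [
    "ATOM      1  N   GLY A   1       0.000   0.000   0.000",
    "ATOM      2  CA  GLY A   1       3.000   4.000   0.000",
]


def parse_atom_line(line):
    """Read atom name, residue name, and x/y/z from fixed columns."""
    return {
        "name": line[12:16].strip(),     # columns 13-16: atom name
        "residue": line[17:20].strip(),  # columns 18-20: residue name
        "xyz": (
            float(line[30:38]),          # columns 31-38: x in angstroms
            float(line[38:46]),          # columns 39-46: y
            float(line[46:54]),          # columns 47-54: z
        ),
    }


atoms = [parse_atom_line(line) for line in TOY_ATOM_LINES]
distance = math.dist(atoms[0]["xyz"], atoms[1]["xyz"])
print(atoms[0]["name"], atoms[1]["name"], distance)
```

Real entries contain thousands of such lines plus the experimental metadata mentioned above, and dedicated structure-parsing libraries are the usual choice for reading them.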
Gene Expression Omnibus (GEO)
- GEO (Gene Expression Omnibus) is a public database that stores high-throughput gene expression and functional genomics data.
- It was created in 2000 as a resource for gene expression studies but has since expanded to include other types of data such as genome methylation and chromatin structure.
- The database requires that researchers provide raw data, processed data, and descriptive metadata.
- The original submitter-supplied GEO records are of three types: Platform, Sample, and Series. Platform describes the array or sequencer used, Sample describes the source and analysis of the sample, and Series links related Samples and describes a whole study.
- From these records, GEO builds two types of curated record: DataSet and Profile. A DataSet is a curated collection of comparable Samples that share a common set of array elements. A Profile consists of expression measurements for a gene across all Samples in a DataSet.
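The relationships between these record types can be sketched as a small data model. This is a simplified illustration under stated assumptions, not GEO's actual schema: the class fields, identifiers, and expression values are hypothetical, and "sharing a common set of array elements" is approximated by requiring that all Samples in a DataSet use the same Platform.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Platform:
    platform_id: str
    description: str  # the array or sequencer used


@dataclass
class Sample:
    sample_id: str
    platform_id: str
    source: str
    expression: Dict[str, float]  # gene name -> measured value


@dataclass
class Series:
    series_id: str
    title: str
    sample_ids: List[str]  # links related Samples into one study


@dataclass
class DataSet:
    dataset_id: str
    samples: List[Sample]

    def __post_init__(self):
        # Comparable Samples: here, all must come from the same Platform.
        platforms = {s.platform_id for s in self.samples}
        if len(platforms) > 1:
            raise ValueError("Samples in a DataSet must share one Platform")

    def profile(self, gene):
        """Expression of one gene across all Samples in this DataSet."""
        return {s.sample_id: s.expression[gene] for s in self.samples}


# Hypothetical toy records.
platform = Platform("platform_1", "toy expression array")
s1 = Sample("sample_1", "platform_1", "control tissue", {"geneA": 2.0, "geneB": 5.0})
s2 = Sample("sample_2", "platform_1", "treated tissue", {"geneA": 8.0, "geneB": 5.5})
series = Series("series_1", "toy treatment study", ["sample_1", "sample_2"])
dataset = DataSet("dataset_1", [s1, s2])
print(dataset.profile("geneA"))
```

Modelling the Profile as a method on the DataSet mirrors the description above: a Profile is not submitted separately but is derived from the Samples that a DataSet groups together.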
Applications of Primary Databases
- Primary databases such as GenBank and EMBL can be used as a reference for genome analysis and comparison.
- The primary database PDB can be used for protein structure identification.
- Primary databases such as Gene Expression Omnibus (GEO) contain transcriptome data that can be analyzed to identify differentially expressed genes and to understand gene expression.
- Curated pathway databases such as KEGG, which build on primary data, can be used to obtain information on metabolic and signaling pathways in various organisms.
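As a rough illustration of how expression data retrieved from a resource such as GEO might be screened for differentially expressed genes, the sketch below computes a log2 fold change between two groups of samples. It is a minimal example under stated assumptions: the expression values are toy numbers, the pseudocount and threshold are arbitrary choices, and the function names are hypothetical. A real analysis would also apply normalization and statistical testing.

```python
import math
from statistics import mean


def log2_fold_change(control, treated, pseudocount=1.0):
    """log2 ratio of mean treated to mean control expression.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


def flag_genes(expression, threshold=1.0):
    """Return genes whose absolute log2 fold change meets the threshold."""
    return [
        gene
        for gene, (control, treated) in expression.items()
        if abs(log2_fold_change(control, treated)) >= threshold
    ]


# Toy values: gene -> (control measurements, treated measurements).
toy_expression = {
    "geneA": ([3, 3, 3], [15, 15, 15]),
    "geneB": ([7, 7], [7, 7]),
}
changes = {g: log2_fold_change(c, t) for g, (c, t) in toy_expression.items()}
print(changes, flag_genes(toy_expression))
```

With these toy numbers, geneA shows a fourfold increase (a log2 fold change of 2) and is flagged, while geneB is unchanged.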