Bioinformatics is the use of computational tools to organize, analyze, understand, visualize, and store data or information associated with biomolecules. The large and complicated data or information are really difficult to handle and become a challenging task to organize and analyze those data. So to process, analyze, visualize, and annotate data efficiently requires some coding.
Thus, an individual need to develop and understand some programing language to read, write, change, or optimize code dealing with data (and especially omics data). This does not mean writing the new code from the scratch but rather knowing and using libraries and packages efficiently and correctly or using software with a user interface.
- R Programming is the most widely used software in bioinformatics because of its data handling and flexibility capacity.
- R programming for Bioinformatics helps to explore the programming skills for solving bioinformatics and computational biology problems.
- Learning the skills and codes in R Programming opens the full possibility of computing and most of the bioinformatics tools exist with the command line.
- R is a powerful programing language in bioinformatics and is generally required during the publication of quality graphs and figures.
Working with DNA sequences and R Programming
To work with the DNA or RNA sequences it is necessary to understand the genomic sequences.
What is DNA?
- DNA (Deoxyribonucleic acid) is a hereditary material that is found in all organisms including all Prokaryotes, eukaryotes, and many viruses. Genetic information is coded by DNA for transferring inherited characteristics.
- DNA is located in the nucleus of the cell and is also called nuclear DNA, while some DNA can be found in the Mitochondria and is also called Mitochondrial DNA or mtDNA.
- DNA is composed of nucleotides intertwined in opposite directions and consists of a phosphate group, a sugar group, and a nitrogen base. There are four types of nitrogen bases called Adenine (A), Guanine (G), Thymine (T), and cytosine (C).
- In the DNA code, these are organized to form various “genomic elements” like introns, exons, promoters, etc.
- DNA undergoes replication to form another set of DNA, which is transcribed into intermediate RNA (mRNA) that can be further translated into proteins which are made up of Amino acids.
- Tri-nucleotide sequences also called codons appear in triplet forms which encode one of twenty amino acids.
For analyzing DNA sequences it is necessary to arrange or align the sequences. DNA sequences are aligned using different types of alignments. For this, we are using R to understand the alignment process and how these programs interpret the similarities between the DNA sequences.
Here we will learn about Pairwise sequence alignment and Multiple sequence alignment.
Pairwise Sequence Alignment and R Programming
Pairwise sequence alignment is used to identify regions of similarities by comparing biological sequences of either protein, DNA, or RNA. During this alignment, comparatively simple algorithms are used. It is an important first step for the structural and functional analysis of newly determined sequences. In recent times a lot of biological sequences are generated and these sequences need to be compared to draw the functional and evolutionary relationships between the known and unknown sequences.
The application behind pairwise sequence alignment is to find out conserved regions between two sequences. It is also applicable for similarity searches in databases to identify newly discovered sequences.
Some of the examples of pairwise sequence alignment tools include BLAST, LALIGN, EMBOSS Needle, and EMBOSS Water.
Methods of Pairwise sequence alignment
Two different alignment strategies are used: Global Alignment and Local Alignment
Global Alignment is a type of pairwise sequence alignment where two sequences are generally similar over the length. It is the end-to-end alignment of two strings and takes account of entire sequences. It is suitable for aligning two closely related sequences. A global alignment algorithm is of Needleman-Wunsch algorithm and is usually done for comparison of the homologous genes like comparing the two genes having the same function. Mainly EMBOSS Needle tool is used for Global sequence alignment. It is more applicable for the sequences that are closely related and are basically of the same length.
Local Alignment is another type of pairwise alignment where two sequences are not assumed to be similar over the entire length. It finds the regions with the highest level of similarity between the two sequences. Since the two sequences may be of different sizes, the local alignment aligns the specific sequences of the query with that of the target sequence of the same specific region. It finds a high level of matches between the sequences without considering the alignment of the rest of the sequence regions. It is suitable for aligning more distantly related sequences. It is used for finding the conserved patterns in DNA sequences. BLAST, LALIGN, and EMBOSS water are some examples of tools used during local alignment.
Loading the sequences and calculating the length in R Programming
Loading a sequence as a DNA string:
In R programming the collection of letters is referred to as a “string”. There are different ways to store sequences of letters in R, for example, we can just make a string and call it something like:
String1 <- “Gene Sequence” String2 <- “ATTTCGGTAATACGGT”
Once we store the sequences in the variable then we will be able to perform various operations like Calculating the length of sequences, finding the number of characters of a specific target, or taking the subset of a string.
Code for calculating the length of sequences:
nchar(string1) or str_length(string1)
Converting DNA code to RNA
DNA consists of four nitrogen bases called Adenine (A), Guanine (G), Thymine (T), and Cytosine (C) whereas RNA consists of Uracil (U) instead of thymine and others are all the same. In R, we can use the function ‘chartr’ to convert an old character into a ‘new character’ in a given string.
Code used during conversion of DNA to RNA in R:
string2 <- “ATTACCGGTAATTGCTTGAACGTAAGTTGGACATTAG” RNAstring2 <- chartr(old=”T”, new=”U”, string2) Print(RNAstring2)
Then the output will be AUUACCGGUAAUUGCUUGAACGUAAGUUGGACAUUAG where T is replaced by U.
Convert RNA to Amino acids:
Unlike DNA to RNA conversion, converting RNA to amino acids, we have to look manually the tri-nucleotide combination and convert it into amino acids. But in R we can do it automatically by loading the library called Biostrings which is present in R that automates this type of functionality.
Code for this:
Library (Biostrings) string2 <- “ATTACCGGTAATTGCTTGAACGTAAGTTGGACATTAG” dna1 <- DNA string(string2) translate(dna1)
Pairwise sequence alignment: Aligning and comparing sequences
There are 100s and 1000s of nucleotides present in the same sequences and analyzing them becomes very difficult. In many cases, we even don’t know what that sequence might do and for one to understand the sequence variations is by aligning the sequences.
Codes for alignment:
#load libraries Library(DECIPHER) Library(Biostrings) #load data as a DNA string DNASeq1 <- DNAString(“ATACTGGGTAACCCATGCTCTAGCATGCTA”) DNASeq2 <- DNAString(“ATACTGGCTAACCGAAGCTCTAGCATGCTA”) DNASeq3 <- DNAString(“CATCGGATGCATGATGCTATCG”) #perform pairwise alignment myAlign <- pairwiseAlignment(DNASeq1, DNASeq2) #print aligned sequences print(myAlign)
Comparing the long and short sequences
myAlign <- PairwiseAlignment(DNASeq1, DNASeq3) print(myAlign) #To make an “overlapping” alignment R overlapAlign <- pairwiseAlignment(DNASeq1, DNASeq3, type = “overlap”) Print(overlapAlign)
Multiple Sequence Alignment and R Programming
Multiple sequence Alignment is for the alignment of multiple sequences. It compares three or more biological sequences and is a global multiple sequence alignment. It consists of complicated and sophisticated algorithms. In this kind of alignment, it first aligns the most closely related pair of sequences and then next goes for the most similar one, and so on. It is applicable for the detection of variable regions between the sequences, for Phylogenetic analysis, for detection of homology genes between the new sequences and the existing sequences in the database, and for detecting homology in multiple sequences. Some of the examples of multiple sequence alignment tools used include MUSCLE, CLUSTALW, MAFFT, and T-Coffee.
Multiple sequence alignment can be formed in R without the use of traditional algorithms like MUSCLE, and for this, we can use DECIPHER packages.
Codes for multiple sequence alignment:
#load libraries library(DECIPHER) library(Biostrings) #load sequences DNASeq <- list( ) DNASeq <- DNAString(“ATCGGCTACGTGGCATCAGTCAGTACTCATAC”) DNASeq <- DNAString(“ATCGGCTACGTGTCATCAGTCAGTACAGTTAC”) DNASeq <- DANString(“ATCGGCTACGTCAGTTCAGTCAGTACATTTACCC”) seqs <- DNAStringSet(unlist(DNASeq)) #perform the alignment of multiple sequences aligned <- AlignSeqs(seqs) #print out the aligned sequences print(aligned)
We should use the verbose=FALSE next to the alignment, like this
aligned <- AlignSeqs(seqs, verbose=FALSE)
We can also save the output as a FASTA file using the function
writeXStringSet( ) #Visualize the alignment using BrowseSeqs( ) #view or save the alignment in an HTML file for a browser view TF <- tempfile(“plot__”, fileext = “.html”) TF <- BrowseSeqs(aligned, highlight=0, htmlFile = TF) file.copy(TF,’./’)
These are some of the codes for working with the data in R, which includes loading the sequences, changing the DNA sequences to RNA, and performing translations by defining certain function. Also when it is very difficult to analyze any kind of changes in the sequences then R becomes one of the helplines for arranging and analyzing these sequences. From this visualization becomes more suitable and drawing of conclusion becomes much easier.
- Robert Gentleman. R Programming for Bioinformatics. 2008. Chapman and Hall/CRC.