Bioinformatics is the use of computational tools to generate, collect, analyze, visualize and store information associated with biomolecules.
- Biopython is a set of freely available tools for biological computation, written in Python by an international team of developers.
- Biopython is also the ongoing effort to develop Python libraries that address the needs of bioinformatics.
A huge amount of biological data is available in different databases, and handling it effectively for visualization is a challenging task. New scientific innovations and research in different fields also generate large amounts of data, so arranging the required data and visualizing it to draw conclusions is difficult. Some coding knowledge is therefore essential for dealing with data efficiently and accurately. Individuals working with data, especially omics data, should be able to read, write, modify and optimize code. This does not mean writing code from scratch; it means knowing the available libraries and packages and using them efficiently and correctly. Much biological and biomedical data analysis relies on programming languages such as Python and R.
    # import packages
    import pandas as pd
    import numpy as np
Here, import is the statement that loads the named packages (pandas and numpy), and the keyword as binds each package to a short alias. Any alias is allowed, but pd and np are the standard in the Python community.
NumPy stands for 'Numerical Python' (sometimes written 'Numeric Python'). It is an open-source Python module that provides fast mathematical computation on arrays and matrices, which are essential in the machine learning ecosystem. Libraries such as scikit-learn, pandas, Matplotlib and TensorFlow complete the Python machine learning ecosystem. In NumPy, dimensions are called axes, and the number of axes is called the rank.
Important attributes and methods of a NumPy array are:
- ndim: the number of dimensions (axes) of the array.
- shape: a tuple of integers giving the size of the array along each axis.
- size: the total number of elements in the array.
- dtype: the data type of the elements in the array.
- itemsize: the size of each element in bytes.
- reshape( ): a method that returns the array with a new shape and the same elements.
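The attributes listed above can be inspected on a small array. The following is a minimal sketch; the array values and the variable names arr and reshaped are made up for illustration.

```python
import numpy as np

# Hypothetical example data: 2 rows x 3 columns of 64-bit integers.
arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

print(arr.ndim)      # number of axes
print(arr.shape)     # size along each axis
print(arr.size)      # total number of elements
print(arr.dtype)     # data type of the elements
print(arr.itemsize)  # bytes used by one element

# reshape returns an array with a new shape holding the same six elements.
reshaped = arr.reshape(3, 2)
print(reshaped.shape)
```

Note that reshape only succeeds when the new shape holds exactly the same number of elements as the original array.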
NumPy array elements can be accessed using the following indexing (slicing) forms. Indexing in NumPy arrays starts from 0, and the end of a slice is excluded.
- A[2:5] returns the items at index 2 to 4.
- A[2::2] returns every second item from index 2 to the end.
- A[::-1] returns the array in reverse order.
- A[1:] returns everything from index 1 to the end (for a 2D array, the rows from index 1 onward).
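The slicing forms above can be tried on a small array. This is a sketch with made-up data; the names A and M are only for illustration.

```python
import numpy as np

A = np.arange(10)  # the integers 0 through 9

print(A[2:5])    # items at index 2, 3 and 4
print(A[2::2])   # every second item starting at index 2
print(A[::-1])   # the whole array in reverse order
print(A[1:])     # everything from index 1 to the end

# On a 2D array the same slice selects whole rows.
M = np.arange(6).reshape(3, 2)
print(M[1:])     # rows at index 1 and 2
```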
Another important library is pandas.
pandas is one of the most widely used Python libraries in data science. It provides easy-to-use, high-performance data structures and data analysis tools. Its central object is the DataFrame, an in-memory 2D table, with which pandas can create pivot tables, compute columns from other columns, and plot graphs.
pandas can be imported in Python using the following statement:
    import pandas as pd
pd.Series creates a pandas Series object (a one-dimensional labeled array). pandas also provides SQL-like functionality for filtering and sorting rows based on conditions.
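As a rough illustration of a Series, a computed column, and SQL-like filtering and sorting, consider the sketch below. The gene names, sample columns and expression values are invented for the example.

```python
import pandas as pd

# Hypothetical expression values, used only for illustration.
expr = pd.Series([5.2, 0.8, 3.1], index=["geneA", "geneB", "geneC"], name="expression")
print(expr)

df = pd.DataFrame({
    "gene": ["geneA", "geneB", "geneC", "geneD"],
    "sample1": [5.2, 0.8, 3.1, 7.4],
    "sample2": [4.9, 1.1, 2.7, 8.0],
})

# Compute a new column from existing columns.
df["mean_expr"] = (df["sample1"] + df["sample2"]) / 2

# Filter rows on a condition (like SQL WHERE) and sort them (like ORDER BY).
high = df[df["mean_expr"] > 3].sort_values("mean_expr", ascending=False)
print(high)
```

The boolean expression inside the square brackets selects the rows to keep, and sort_values orders the result by the chosen column.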
DataFrames can easily be exported to and imported from CSV, Excel, JSON, HTML and SQL databases. Some essential DataFrame methods are:
- head( ): returns the first 5 rows of the DataFrame by default.
- tail( ): returns the last 5 rows of the DataFrame by default.
- info( ): prints a summary of the DataFrame (column names, non-null counts and data types).
- describe( ): returns a statistical summary of the numeric columns.
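The methods above can be tried on a small table. This is a minimal sketch on made-up numbers; the column names are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical table with ten rows and two numeric columns.
df = pd.DataFrame({"sample1": np.arange(10), "sample2": np.arange(10) * 2.0})

top = df.head()       # first 5 rows (the default)
bottom = df.tail(3)   # last 3 rows; an explicit row count can be passed
df.info()             # prints column names, non-null counts and dtypes
summary = df.describe()  # count, mean, std, min, quartiles and max per column

print(top)
print(bottom)
print(summary)
```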
Importing files in Python
To import a file in Python, it is necessary to specify the file path on the computer. The file path is the sequence of disks and folders that leads to where the file is stored. To copy a file path in Windows, locate the file, press Shift + right-click, select "Copy as path", and paste the path where it is needed. Similarly, to copy a file path in Mac OS X, select the file or folder in the Finder and press Command + I to open the file information. Navigate to "Where", select the path, press Command + C to copy the full path to the clipboard, and finally paste the path where needed.
Importing and viewing a text file in Python
    # load data into the object "df" using the pandas "read_table" function
    df = pd.read_table('path/where/the/txtfile/is/located/filename.txt')
When working online, data can also be loaded directly from a URL that links to a raw dataset.
    # load data into the object "df" using the pandas "read_table" function
    df = pd.read_table('URL link')
    df
The data consists of rows and columns, and the argument sep='\t' tells pandas that the columns are separated by tab characters. The argument header=0 specifies that the first row of the file contains the column names.
    # load data into the object "df" using the pandas "read_table" function
    df = pd.read_table('path/where/the/txtfile/is/located/filename.txt', sep='\t', header=0)
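To see what sep and header do without needing a file on disk, the sketch below reads a small tab-separated text from memory. The content of raw is invented for the example; io.StringIO simply makes a string behave like an open file.

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for a real .txt file.
raw = "Id\tsample1\tsample2\ngeneA\t10\t12\ngeneB\t0\t3\n"

# sep="\t" splits columns on tabs; header=0 takes column names from the first row.
df = pd.read_table(io.StringIO(raw), sep="\t", header=0)

print(df)
print(df.columns.tolist())
```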
Importing and viewing Excel and .csv files in Python
    # import an Excel spreadsheet using pandas
    df = pd.read_excel('path/where/the/excelfile/is/located/filename.xlsx')

    # import a .csv dataset using pandas (here the columns are separated by semicolons)
    df = pd.read_csv('path/where/the/csvfile/is/located/filename.csv', sep=';')

    # view the data and inspect the data type of each column
    df
    df.dtypes
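The following sketch shows a semicolon-separated CSV round trip in memory: a small table is written with to_csv and read back with read_csv. The table contents and column names are made up, and an in-memory buffer is used in place of a real file path.

```python
import io
import pandas as pd

# Hypothetical table used only for illustration.
df = pd.DataFrame({"Id": ["geneA", "geneB"], "count": [10, 3], "score": [0.5, 1.25]})

# Write the table as semicolon-separated text into an in-memory buffer.
buffer = io.StringIO()
df.to_csv(buffer, sep=";", index=False)
buffer.seek(0)  # rewind so the buffer can be read from the start

# Read it back; sep must match the separator used when writing.
loaded = pd.read_csv(buffer, sep=";")

print(loaded)
print(loaded.dtypes)  # pandas infers a type for each column
```

If the separator passed to read_csv does not match the file, pandas typically reads each line as a single column, so checking df.dtypes and the column count after loading is a useful sanity check.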
In Python, a function is a piece of code written to carry out a specific task, and it can accept arguments (parameters). A function is defined with the keyword def, followed by the function name, the parameters in parentheses, and then the indented statements that form the body of the function.
    # define a function named seq_length
    def seq_length(seq):
        counter = 0
        # the slice seq[counter:] becomes empty once counter reaches the end
        while seq[counter:]:
            counter += 1
        return counter  # returns the sequence length

    # execute our function and print the length of the given sequence
    seq1 = "ATCGGTCAAT"
    print(seq_length(seq1))
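As a further illustration of defining and calling a function, here is a sketch of a second, hypothetical function, gc_content, which is not part of the original example. It also shows that the built-in len gives the same length that seq_length computes by hand.

```python
def gc_content(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    if not seq:
        return 0.0  # avoid dividing by zero for an empty sequence
    seq = seq.upper()  # treat lowercase and uppercase bases alike
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq)


seq1 = "ATCGGTCAAT"
print(len(seq1))         # built-in way to get the sequence length: 10
print(gc_content(seq1))  # 4 of the 10 bases are G or C: 0.4
```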
Data Visualization with Python
Data visualization is an important part of extracting the required information and understanding trends in data. When data is collected and presented in tabular format, it is challenging to understand it properly. To understand the information the data conveys, it helps to visualize it in a picture format such as graphs, plots or charts. Data visualization is therefore the practice of finding correlations and trends in data with the help of pictures. Different Python modules, including Matplotlib, Seaborn and Plotly, can be used for data visualization.
- A visual summary can be obtained for both small and large datasets, but visualization is mainly used for understanding large, bulky data.
- It is difficult to draw any conclusion by simply looking at large gene expression tables. Visualization plots help in understanding the differences and reaching a conclusion.
- Before visualization, it is important to process the data, because raw data often contains a lot of noise.
    # read the dataset
    data = pd.read_table("URL link", sep='\t', header=0)
    # remove the Id column to keep only numeric data
    data = data.drop(['Id'], axis=1)
    # convert integers to floats
    datafinal = data.astype(float)
    # perform log transformation using the numpy package and show the data description
    log = np.log(datafinal + 1)
    log.describe()
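The cleaning steps above can be sketched on a small, made-up table. Adding 1 before taking the logarithm keeps zero counts valid, because log(0) is undefined while log(0 + 1) is 0. NumPy's np.log1p computes the same quantity and is shown only for comparison; the gene names and counts are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical count table with a text Id column.
data = pd.DataFrame({
    "Id": ["geneA", "geneB", "geneC"],
    "sample1": [0, 9, 99],
    "sample2": [1, 4, 999],
})

# Drop the text column and convert the remaining integers to floats.
numeric = data.drop(["Id"], axis=1).astype(float)

# Natural log of (value + 1), applied element by element.
log = np.log(numeric + 1)

# np.log1p(x) computes log(1 + x) and gives matching values.
log_alt = np.log1p(numeric)

print(log)
print(log.describe())
```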
A commonly used Python package for visualization is described below.
Matplotlib is a plotting library for creating static and animated graphics such as line plots, bar charts and histograms, and it mainly works with datasets and arrays. Matplotlib can be used in Python scripts, is highly customizable, and pairs well with pandas and NumPy for data analysis.
    # import the library
    import matplotlib.pyplot as plt
pyplot is a Matplotlib module that provides a MATLAB-like interface. Some of the pyplot functions are:
- plt.plot(x, y): plots y against x using the default line style and color.
- plt.axis([xmin, xmax, ymin, ymax]): sets the limits of the x-axis and y-axis to the given minimum and maximum values.
- plt.plot(x, y, color='green', marker='o', linestyle='dashed', linewidth=2, markersize=12): marks the x and y coordinates with circular markers of size 12, joined by a green dashed line of width 2.
- plt.xlabel('x-axis'): sets the x-axis label.
- plt.ylabel('y-axis'): sets the y-axis label.
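The pyplot functions above can be combined into one short script. This is a sketch on made-up x and y values; the non-interactive "Agg" backend is selected here only so that the script also runs on a machine without a display, and it is an assumption of this example rather than a requirement.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit in interactive sessions
import matplotlib.pyplot as plt

# Hypothetical data points.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.figure()
# Green dashed line of width 2 with circular markers of size 12.
plt.plot(x, y, color="green", marker="o", linestyle="dashed", linewidth=2, markersize=12)
plt.axis([0, 5, 0, 20])  # xmin, xmax, ymin, ymax
plt.xlabel("x-axis")
plt.ylabel("y-axis")

ax = plt.gca()  # the current axes, useful for inspecting what was drawn
print(ax.get_xlim(), ax.get_ylim())
```

In an interactive session, calling plt.show() displays the figure; plt.savefig with a file name of your choice writes it to disk instead.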
Importing packages and libraries
    # import packages
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
Read or import the dataset and clean data from the text row
    data = pd.read_table("specific name of file.txt", sep='\t', header=0)
    # remove the Id column to keep only numeric data
    data = data.drop(['Id'], axis=1)
    # convert integers to floats
    datafinal = data.astype(float)
    # perform log transformation using the numpy package and show the data description
    log = np.log(datafinal + 1)
    log.describe()

    # create the boxplot
    plt.figure(figsize=(20, 16))
    plt.boxplot(log)
    plt.show()

    # create the histogram
    plt.figure(figsize=(20, 16))
    plt.hist(log)
    plt.show()

    # create the heatmap
    plt.imshow(datafinal, cmap='viridis')
    plt.colorbar()
    plt.show()

    # keep only values greater than 3 (all other cells become NaN)
    datafinal = datafinal[datafinal > 3]
    datafinal
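The final filtering step deserves a closer look: indexing a DataFrame with a boolean condition keeps the shape of the table and replaces every cell that fails the condition with NaN. The sketch below uses made-up values; the additional dropna call is an optional step not present in the original code.

```python
import pandas as pd

# Hypothetical numeric table.
datafinal = pd.DataFrame({"sample1": [1.0, 5.0, 8.0], "sample2": [4.0, 2.0, 9.0]})

# Cells that are not greater than 3 become NaN; the table keeps its shape.
masked = datafinal[datafinal > 3]

# Optional: keep only the rows in which every value passed the filter.
complete = masked.dropna()

print(masked)
print(complete)
```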
Data visualization is an important aspect of AI and machine learning. Python lets us explore our data in depth through different graphical representations, and color can be used to make those representations easier to read.
Data that is present in an unorganized manner can be rearranged and plotted to visualize the results and to draw conclusions about the differences between datasets.