The goal of this analysis is to solve an unsupervised classification problem on genomic data. The data consists of shuffled protein sequences of the form
$$\text{GLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKF...}$$
We will perform a clustering analysis to separate them into several protein families.
We address the problem in five major steps:
- Load the data and check for ambiguous letters.
- Extract numerical features from each sequence with Biopython's ProtParam module.
- Select features by removing those with very low variance.
- Cluster the proteins with DBSCAN, using persistent homology to choose the number of clusters.
- Visualize a representative protein of each family as a network graph.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import DBSCAN
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from sklearn.preprocessing import StandardScaler
import networkx as nx # to draw graphs
from ripser import Rips # to compute the topological diagrams
from persim import plot_diagrams # to plot the topological diagrams
data=pd.read_csv('/Users/miradain/Documents/Genomic analysis/test_data_scientist_shuffled_sequences.csv')
proteins=data['original_sequence']
print(data.head(5))
print('The data has',len(data), 'sequences.')
Some letters in a protein sequence do not refer to a single amino acid: 'B', 'Z', 'X' and 'J' are ambiguous codes. We will check whether any protein in our data contains one of these letters. The code below defines "unambigous_checker", a function that flags protein sequences containing such letters.
# The function returns 0 if the protein contains an ambiguous letter, and 1 otherwise.
def unambigous_checker(protein):
    Ambiguous_Letters = ['B', 'Z', 'X', 'J']  # letters that do not denote a single amino acid
    checker = 1
    for acid in list(protein):  # iterate over the letters of the protein
        if acid in Ambiguous_Letters:
            checker = 0
    return checker
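An equivalent, more compact check uses a set intersection. This is only a sketch of the same logic, under the hypothetical name "unambigous_checker_set"; it is not the function used in the rest of the notebook.
# Sketch: returns 1 if no ambiguous letter is present, 0 otherwise.
def unambigous_checker_set(protein):
    return int(not set(protein) & {'B', 'Z', 'X', 'J'})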
# Test on the data
N=len(data)
check=np.zeros(N)
for i in range(N):
    check[i] = unambigous_checker(proteins[i])
print('There are exactly', int(N - sum(check)), 'protein(s) with at least one ambiguous letter.')
We extract features by analysing different properties of each protein in the data: amino acid frequencies per protein, molecular weight, aromaticity, instability index, flexibility, isoelectric point, secondary structure fractions, and protein scale. We use the methods defined in the ProtParam module of Biopython.
For each amino acid, we create a numerical feature that counts its occurrences in each protein sequence.
def amino_acid_features(data):
    N = len(data)
    # Create the alphabet
    Alphabet = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
    # Create feature names
    Acid_Features = [letter + '_feature' for letter in Alphabet]
    # Initialize the dataframe with amino acid features
    matrix = np.zeros((N, 26))
    df = pd.DataFrame(matrix, columns=Alphabet)
    for i in range(N):
        my_seq = data[i]
        analysed_seq = ProteinAnalysis(my_seq)
        count = analysed_seq.count_amino_acids()  # count the standard amino acids
        for acid in count:  # iterate over the counted amino acids
            df.loc[i, acid] = count[acid]
    # Rename dataframe columns
    df.columns = Acid_Features
    return df
# Test on the data
df_Amino_acid=amino_acid_features(proteins)
print(df_Amino_acid)
This feature contains the molecular weight of each protein, obtained by summing the molecular weights of its amino acids and subtracting the mass of one water molecule for each peptide bond.
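As a rough formula, using average masses and subtracting one water molecule (about 18.02 Da) per peptide bond:
$$MW \approx \sum_{i=1}^{L} w(a_i) \;-\; (L-1)\times 18.02,$$
where $L$ is the sequence length and $w(a_i)$ is the molecular weight of the free amino acid at position $i$.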
# Get the molecular weight of the protein
def protein_weight_feature(data):
    N = len(data)
    weight = np.zeros(N)
    for i in range(N):
        my_seq = data[i]
        analysed_seq = ProteinAnalysis(my_seq)
        if unambigous_checker(my_seq) == 1:  # skip proteins containing an ambiguous letter
            weight[i] = analysed_seq.molecular_weight()  # get the weight of the protein
    return weight
# Test on the data
weight=protein_weight_feature(proteins)
print(pd.DataFrame(weight) )
We compute, for each protein sequence, the relative frequency of the three aromatic amino acids: phenylalanine (F), tryptophan (W) and tyrosine (Y) [2]. This feature is the sum of these three frequencies.
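In other words, with $n_X$ the number of occurrences of amino acid $X$ and $L$ the sequence length:
$$\text{aromaticity} = \frac{n_F + n_W + n_Y}{L}.$$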
# Get the aromaticity of the protein
def aromaticity_feature(data):
    N = len(data)
    aromaticity = np.zeros(N)
    for i in range(N):
        my_seq = data[i]
        analysed_seq = ProteinAnalysis(my_seq)
        aromaticity[i] = analysed_seq.aromaticity()
    return aromaticity
# Test on the data
aromaticity= aromaticity_feature(proteins)
print(pd.DataFrame(aromaticity))
The instability index [1] provides an estimate of the stability of the protein in a test tube. A protein whose instability index is below 40 is predicted to be stable; a value above 40 suggests that the protein may be unstable.
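For reference, the index introduced in [1] is a length-normalised sum of dipeptide instability weight values (DIWV):
$$\text{II} = \frac{10}{L}\sum_{i=1}^{L-1}\mathrm{DIWV}(x_i, x_{i+1}),$$
where $x_i$ is the amino acid at position $i$ and $L$ is the sequence length.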
# Get the instability index of the protein
def instability_index_Feature(data):
    N = len(data)
    instability_index = np.zeros(N)
    for i in range(N):
        my_seq = data[i]
        analysed_seq = ProteinAnalysis(my_seq)
        if unambigous_checker(my_seq) == 1:  # skip proteins containing an ambiguous letter
            instability_index[i] = analysed_seq.instability_index()
    return instability_index
# Test on the data
instability_index=instability_index_Feature(proteins)
print(pd.DataFrame(instability_index))
Flexibility is predicted from the amino acid sequence with a sliding-window averaging technique [3]. This feature contains, for each protein sequence, the mean of the per-window flexibility values.
# Get the mean flexibility of the protein
def flexibility_mean_feature(data):
    N = len(data)
    flexibility_mean = np.zeros(N)
    for i in range(N):
        my_seq = data[i]
        analysed_seq = ProteinAnalysis(my_seq)
        if unambigous_checker(my_seq) == 1:  # skip proteins containing an ambiguous letter
            flexibility_mean[i] = np.mean(analysed_seq.flexibility())
    return flexibility_mean
# Test on the data
flexibility=flexibility_mean_feature(proteins)
print(pd.DataFrame(flexibility))
The isoelectric point (pI) of a protein is the pH at which it carries no net electric charge. It is estimated by evaluating the protein's net charge at candidate pH values and narrowing the search interval until the charge is approximately zero; the resulting pH is the pI value of the protein. A sketch of this bisection idea is given below.
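A minimal sketch of the bisection idea, assuming a hypothetical helper net_charge(pH) that returns the protein's net charge at a given pH (an illustration only, not Biopython's actual implementation):
# Sketch only: bisection search for the pH at which the net charge is zero.
# net_charge is a hypothetical callable, e.g. built from Henderson-Hasselbalch terms.
def estimate_pI(net_charge, lo=0.0, hi=14.0, tol=1e-3):
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(mid) > 0:  # still positively charged: the true pI lies above mid
            lo = mid
        else:                    # negatively (or zero) charged: the true pI lies below mid
            hi = mid
    return (lo + hi) / 2.0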
# Get the isoelectric point of the protein
def isoelectric_point_feature(data):
    N = len(data)
    isoelectric_point = np.zeros(N)
    for i in range(N):
        my_seq = data[i]
        analysed_seq = ProteinAnalysis(my_seq)
        isoelectric_point[i] = analysed_seq.isoelectric_point()
    return isoelectric_point
# Test on the data
isoelectric_point=isoelectric_point_feature(proteins)
print(pd.DataFrame(isoelectric_point))
This method returns the fractions of amino acids that tend to be in a helix, turn or sheet:
- Amino acids in helix: V, I, Y, F, W, L.
- Amino acids in turn: N, P, G, S.
- Amino acids in sheet: E, M, A, L.
We create three features: Helix, Turn and Sheet.
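Each fraction is simply the summed relative frequency of the corresponding amino acids:
$$\text{Helix} = p_V + p_I + p_Y + p_F + p_W + p_L, \qquad \text{Turn} = p_N + p_P + p_G + p_S, \qquad \text{Sheet} = p_E + p_M + p_A + p_L,$$
where $p_X$ denotes the fraction of amino acid $X$ in the sequence.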
# Get the secondary structure fractions of the protein
def secondary_structure_fraction_feature(data):
    N = len(data)
    matrix = np.zeros((N, 3))
    df = pd.DataFrame(matrix, columns=['Helix', 'Turn', 'Sheet'])
    for i in range(N):
        my_seq = data[i]
        analysed_seq = ProteinAnalysis(my_seq)
        triple = analysed_seq.secondary_structure_fraction()
        df.loc[i, 'Helix'] = triple[0]
        df.loc[i, 'Turn'] = triple[1]
        df.loc[i, 'Sheet'] = triple[2]
    return df
# Test on the data
df_secondary_structure_fraction=secondary_structure_fraction_feature(proteins)
print(df_secondary_structure_fraction)
The protein_scale method returns a list of values that can be plotted to view how a chosen scale changes along the protein sequence. Several parameters control the computation of the scale profile, such as the window size (here 9) and the relative weight of the window edges (here 0.4). This feature contains, for each protein sequence, the mean of the per-window scale values.
# Get the mean protein scale of the protein
def protein_scale_mean_feature(data):
    N = len(data)
    protein_scale_mean = np.zeros(N)
    for i in range(N):
        my_seq = data[i]
        analysed_seq = ProteinAnalysis(my_seq)
        if unambigous_checker(my_seq) == 1:  # skip proteins containing an ambiguous letter
            protein_scale_mean[i] = np.mean(analysed_seq.protein_scale(analysed_seq.get_amino_acids_percent(), window=9, edge=0.4))
    return protein_scale_mean
# Test on the data
protein_scale=protein_scale_mean_feature(proteins)
print(pd.DataFrame(protein_scale))
New_data=df_Amino_acid.copy()
New_data['weight']=weight
New_data['aromaticity']=aromaticity
New_data['instability']=instability_index
New_data['flexibility']=flexibility
New_data['isoelectric_point']=isoelectric_point
New_data['Helix']=df_secondary_structure_fraction['Helix']
New_data['Turn']=df_secondary_structure_fraction['Turn']
New_data['Sheet']=df_secondary_structure_fraction['Sheet']
New_data['protein_scale']=protein_scale
print('The new data has', New_data.shape[1], 'features.')
In this part, we will remove features with very low variances.
Covariances=New_data.cov() # Computing the covariance matrix
# Creating a dataframe with feature variances
variances=pd.DataFrame(np.diag(Covariances), index=Covariances.index, columns=['variance'])
features=list(variances.index) # Creating a feature list
variances['features']=features # Creating a feature column
sns.swarmplot(x="variance", y="features", data=variances)# Plotting the variances
In this plot, the variable "weight" has the highest variance, and its large value hides the other features. Therefore, we drop this feature from the plot and re-plot the remaining variances.
df=variances.drop(index='weight')# drop the feature "weight"
sns.swarmplot(x="variance", y="features", data=df)
Many features have very low variances (close to 0): protein_scale, Sheet, Turn, aromaticity, flexibility, some amino acid features such as U_feature and O_feature, and, unsurprisingly, the ambiguous letters B_feature, Z_feature, X_feature and J_feature.
From this plot we choose the threshold 0.05; in other words, we drop features whose variance is less than 0.05.
Var=variances[variances.variance>0.05]
New_features=Var.index.values.tolist()
print('The features to consider in our analysis are:',New_features)
Final_data=New_data[New_features]
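As a side note, scikit-learn offers the same kind of filtering through VarianceThreshold. The sketch below is meant to be equivalent to the manual selection above; it assumes New_data from the previous step. Note that VarianceThreshold uses the population variance (denominator $N$) while pandas' cov() uses $N-1$, so features whose variance is very close to the threshold could be treated slightly differently.
from sklearn.feature_selection import VarianceThreshold
# Sketch: keep only features whose variance exceeds 0.05.
selector = VarianceThreshold(threshold=0.05)
selector.fit(New_data)
kept = New_data.columns[selector.get_support()].tolist()
print('Features kept by VarianceThreshold:', kept)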
In this part, we will group the proteins using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. The idea behind this approach consists of thickening the data points into balls and then exploring the density of the point cloud. The main parameters of the algorithm are the radius $\varepsilon$ of the ball around each point and the minimum number of points required to form a cluster.
To make the DBSCAN algorithm more effective, we use a topological trick that consists of varying the radius $\varepsilon$. This approach is robust to noise: through a filtration, we can observe when a cluster is created (its birth) and when it merges with another cluster (its death). This lets us decide, with less bias, how many clusters to consider.
To implement this topological part, we use the "ripser" library to compute the (birth, death) diagrams and the "persim" library to plot them.
rips=Rips()
scaled_data=StandardScaler().fit_transform(Final_data)
diagrams=rips.fit_transform(scaled_data)
rips.plot(diagrams)
The blue points represent connected components (clusters) and the orange points, which we will not interpret here, represent holes in the space. Blue points far from the diagonal are the clusters that persist the longest through the filtration. These are the points of interest, and our threshold is to keep clusters whose death value exceeds 4 on the plot. There are exactly 4 such clusters.
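The same cluster count can be read off the $H_0$ diagram programmatically; a small sketch, assuming diagrams is the output of rips.fit_transform above:
# Sketch: count connected components (H0 features) whose death exceeds the threshold.
h0 = diagrams[0]            # array of (birth, death) pairs for H0
deaths = h0[:, 1]
threshold = 4
n_clusters = np.sum(deaths > threshold)  # the component with death = inf is counted as well
print('Number of persistent clusters:', int(n_clusters))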
scaled_data=StandardScaler().fit_transform(Final_data)# Scaling the data
db=DBSCAN(eps=4, min_samples=1).fit(scaled_data) # Creating the DBSCAN object and fitting the data
labels=np.array(db.labels_) # Getting the label out of the model
print('The clusters are labeled in the group:',set(labels))
classified_data=data.copy()
classified_data['labels']=labels
print(classified_data)
To plot a protein sequence, we consider each letter (amino acid) as a node. Then, reading the sequence from left to right, we create an edge between every pair of consecutive letters (nodes). In that sense, the graph represents the interactions between the amino acids.
def Protein_Network_graph(protein):
    G = nx.Graph()  # create the network
    Letters = list(protein)  # extract the letters from the protein
    # assign a colour to each letter of the alphabet
    color_map = {'A': 'silver', 'B': 'rosybrown', 'C': 'red', 'D': 'sienna', 'E': 'peru', 'F': 'gold', 'G': 'orange',
                 'H': 'green', 'I': 'turquoise', 'J': 'steelblue', 'K': 'slategray', 'L': 'lightsteelblue', 'M': 'blue',
                 'N': 'navy', 'O': 'plum', 'P': 'orchid', 'Q': 'pink', 'R': 'violet', 'S': 'salmon', 'T': 'sienna',
                 'U': 'palegreen', 'V': 'slategray', 'W': 'blueviolet', 'X': 'ivory', 'Y': 'lime', 'Z': 'cyan'}
    # adding nodes to the network (one node per distinct letter)
    Nodes = list(set(Letters))
    G.add_nodes_from(Nodes)
    color_letters = [color_map[letter] for letter in Nodes]  # node colours, in the same order as the nodes
    # adding edges between consecutive letters
    for i in range(len(Letters) - 1):
        edge = (Letters[i], Letters[i + 1])
        G.add_edge(*edge)
    nx.draw(G, node_color=color_letters, with_labels=True)
    plt.show()  # display
We extract a protein sequence from the first family and plot the corresponding graph.
# First group setting
Seq0_data=classified_data[classified_data['labels']==0]
Seq_0=Seq0_data['original_sequence'][0]
# Plot the first group
Protein_Network_graph(Seq_0)
# Graph description
Nodes=set(list(Seq_0))
print('Nodes:',list(Nodes))
print('This graph has',len(list(Nodes)), 'nodes.')
We extract a protein sequence from the second family and plot the corresponding graph.
# Second group setting
Seq1_data=classified_data[classified_data['labels']==1]
Seq_1=Seq1_data['original_sequence'][58]
# Plot the second group
Protein_Network_graph(Seq_1)
# Graph description
Nodes=set(list(Seq_1))
print('Nodes:',list(Nodes))
print('This graph has',len(list(Nodes)), 'nodes.')
We extract a protein sequence from the third family and plot the corresponding graph.
# Third group setting
Seq2_data=classified_data[classified_data['labels']==2]
Seq_2=Seq2_data['original_sequence'][275]
# Plot the third group
Protein_Network_graph(Seq_2)
# Graph description
Nodes=set(list(Seq_2))
print('Nodes:',list(Nodes))
print('This graph has',len(list(Nodes)), 'nodes.')
We extract a protein sequence from the fourth family and plot the corresponding graph.
# Fourth group setting
Seq3_data=classified_data[classified_data['labels']==3]
Seq_3=Seq3_data['original_sequence'][303]
# Plot the fourth group
Protein_Network_graph(Seq_3)
# Graph description
Nodes=set(list(Seq_3))
print('Nodes:',list(Nodes))
print('This graph has',len(list(Nodes)), 'nodes.')
Our analysis suggests that the dataset contains 4 major families of proteins. We have drawn a graph for a representative of each family. This gives some insight into protein density, i.e. the pattern of interactions between amino acids, and we see that the four representatives have visibly different densities.
To perform this cluster analysis, we relied on some of the literature on the subject, which made it possible to obtain the numerical features. However, we believe that some improvements can still be made:
We are not sure that we have included all of the features needed to differentiate proteins. As an improvement of this work, one could survey the state of the art in order to take into account all of the features that matter for protein classification.
One question caught our attention during this work: have we sufficiently exploited the connections between amino acids along the protein sequence? Even though the "protein scale" and "flexibility" features are based on a sliding-window approach, we are not sure they are sufficient from a generalization perspective on protein classification. One could, for example, add categorical features that record the letter of the protein sequence at each given position.
To decide on the number of clusters, we relied heavily on the topological (spatial) configuration of the points in Euclidean space. We could further study the space of the protein points that we created (by transformation) and choose a more appropriate metric (for example, the pseudometric that comes from pairwise sequence alignment). We believe that the study of such a space is more adequate in a supervised classification context.
To make our code scale to a larger dataset, in addition to adding more functionality, we believe that we need to evaluate our results against real protein families and tune hyperparameters such as the size of the sliding window used for "protein scale" and "flexibility".
[1] K. Guruprasad, B. V. B. Reddy, M. W. Pandit. Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Engineering, Design and Selection, 4(2):155–161, 1990. https://doi.org/10.1093/protein/4.2.155
[2] J. R. Lobry, C. Gautier. Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes. Nucleic Acids Research, 22:3174–3180, 1994.
[3] M. Vihinen, E. Torkkila, P. Riikonen. Accuracy of protein flexibility predictions. Proteins, 19(2):141–149, 1994. https://doi.org/10.1002/prot.340190207