02 - Unsupervised - Hierarchical Clustering of Small-Molecule Compounds - 《001 - Machine Learning》

Importing libraries
Reading and visualizing molecules
Computing molecular fingerprints
HCL clustering

In this chem-workflow, I will show you a straightforward strategy for calculating the similarity between the molecules of a database.
For this purpose, I will use a fraction of the Enamine REAL Compound Library. You can read more about the database here.
Although I will use a SMILES file, the same analysis can be performed starting from SDF files or any other molecule format. Just be sure to use the appropriate supplier when loading your molecules into RDKit.

Importing libraries

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
from rdkit import Chem, DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
from rdkit.Chem import Draw
# All we need for clustering
from scipy.cluster.hierarchy import dendrogram, linkage

Reading and visualizing molecules

The library contains more than 8 000 000 SMILES codes (file size 542.8 M). However, we can read the .smiles file as plain text and keep only the first 100 molecules.

# Print the first SMILES record as an example; its output is shown below
with open('Enamine_REAL_diversity.smiles','r') as file:
    for index,line in enumerate(file):
        if index==1: # Line 0 is the header, line 1 is the first SMILES record
            print (index, line.split())
            break # No need to read the rest of the file

1 ['CC(C)C(C)(C)SCC=CBr', 'PV-002312932038']

The result of the above cell is a list whose first element (0) is the SMILES code of the molecule and whose second element (1) is its name.
So, we can use the same code to convert the SMILES codes into RDKit molecules.

working_library=[]
with open('Enamine_REAL_diversity.smiles','r') as file:
    for index,line in enumerate(file):
        if index==0: # The first line (0) of the file is the header, not a SMILES code
            continue
        if index>100: # Stop after the 100 molecules we want
            break
        fields=line.split()
        mol=Chem.MolFromSmiles(fields[0]) # Converting the SMILES code into an RDKit mol
        if mol is None: # MolFromSmiles returns None for SMILES it cannot parse
            continue
        mol.SetProp('_Name',fields[1]) # Adding the name for each molecule
        working_library.append(mol)
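The reading step above can also be written with `itertools.islice`, which stops consuming the file as soon as the requested number of records has been read. This is only a minimal sketch: the helper name `parse_smiles_lines` and the toy header line are my own assumptions, and it returns plain `(smiles, name)` tuples rather than RDKit molecules.

```python
from itertools import islice

def parse_smiles_lines(lines, n_molecules, has_header=True):
    """Return up to n_molecules (smiles, name) pairs from an iterable of lines.

    `lines` can be an open file handle or any list of strings.
    (Hypothetical helper, not part of RDKit.)
    """
    lines = iter(lines)
    if has_header:
        next(lines, None)  # skip the header line if there is one
    records = []
    for line in islice(lines, n_molecules):  # stops after n_molecules lines
        fields = line.split()
        if len(fields) >= 2:  # ignore blank or malformed lines
            records.append((fields[0], fields[1]))
    return records

# In-memory stand-in for the file; the header text and the last two records are made up
example_lines = [
    "smiles idnumber\n",
    "CC(C)C(C)(C)SCC=CBr PV-002312932038\n",
    "CCO example-2\n",
    "CCN example-3\n",
]
records = parse_smiles_lines(example_lines, 2)
print(records)
```

With the real file, the same function would be called as `parse_smiles_lines(file, 100)` inside the `with open(...)` block, and each SMILES string could then be passed to `Chem.MolFromSmiles`.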

Now we have a list of 100 RDKit molecules on which to perform our similarity analysis.
Let's draw them to see what is inside.

Draw.MolsToGridImage(working_library,molsPerRow=10,subImgSize=(150,150),legends=[mol.GetProp('_Name') for mol in working_library])

Computing molecular fingerprints

For a fingerprint similarity analysis, we first need to compute the fingerprint of each molecule.
For this purpose we type:

fps= [FingerprintMols.FingerprintMol(mol) for mol in working_library]

As a result, we have as many fingerprints as molecules:

print(len(working_library))
print(len(fps))

100
100

Now we can get the similarity for each pair of molecules.
For instance, with mol_1 as the reference and mol_2 as the target:

DataStructs.FingerprintSimilarity(fps[0],fps[1])

0.3438735177865613
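When no `metric` is given, `DataStructs.FingerprintSimilarity` uses the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number of bits set in at least one of them. The sketch below reproduces that formula on toy fingerprints written as sets of "on" bit positions. The bit positions are made up for illustration and are not real RDKit fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two collections of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union

# Toy fingerprints (made-up bit positions)
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}

# 2 shared bits out of 5 distinct bits
print(tanimoto(fp_a, fp_b))
```

A value of 1 means the two fingerprints set exactly the same bits, and a value of 0 means they share none.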

Following the above example, we can run an all-vs-all analysis to see the similarity within our database (something similar can be found in another Chem-workflow on this site).

size=len(working_library)
names=[mol.GetProp('_Name') for mol in working_library]
hmap=np.empty(shape=(size,size))
for index, i in enumerate(fps):
    for jndex, j in enumerate(fps):
        hmap[index,jndex]=DataStructs.FingerprintSimilarity(i,j)
# Labelled copy of the matrix: rows and columns are the molecule names
table=pd.DataFrame(hmap,index=names,columns=names)

Let's take a look at the similarity values.

table.head(10) # just the first 10 rows, because our table is 100 x 100
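The double loop above evaluates every pair twice, although the similarity of (i, j) equals that of (j, i). A minimal sketch of the same all-vs-all idea that computes only the upper triangle and mirrors it is shown here. The helper `pairwise_similarity` and the toy set-based fingerprints are assumptions for illustration. With RDKit fingerprints, one would pass `DataStructs.FingerprintSimilarity` as the `similarity` argument instead.

```python
def pairwise_similarity(items, similarity):
    """Fill a symmetric n x n matrix, calling `similarity` once per unordered pair."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # upper triangle, including the diagonal
            value = similarity(items[i], items[j])
            matrix[i][j] = value
            matrix[j][i] = value  # mirror into the lower triangle
    return matrix

def tanimoto(a, b):
    """Tanimoto coefficient between two sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Toy fingerprints as sets of bit positions (made up)
toy_fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
matrix = pairwise_similarity(toy_fps, tanimoto)
for row in matrix:
    print(row)
```

For 100 molecules this needs 5 050 similarity calls instead of 10 000. RDKit also offers `DataStructs.BulkTanimotoSimilarity`, which compares one fingerprint against a list in a single call and may be a better fit for larger libraries.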

We can cluster our compounds by similarity by applying a hierarchical clustering (HCL) analysis (something similar can be found in another Chem-workflow on this site).
I won't discuss the different linkage methods for HCL (e.g. single, complete, average) because the literature has plenty of such discussions. For example, you can check the scikit-learn documentation.
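As a quick illustration of what the method names mean, here is a deliberately naive sketch of agglomerative clustering on a small distance matrix. It repeatedly merges the two closest clusters, where "closest" is the minimum (single), maximum (complete), or mean (average) of the pairwise distances between their members. The function `agglomerate` is a hypothetical teaching helper, not how SciPy implements `linkage`, and the toy distances are made up.

```python
def agglomerate(dist, method="single"):
    """Naive agglomerative clustering on a square distance matrix.

    Returns the merges in order as (members_a, members_b, distance) tuples.
    """
    combine = {
        "single": min,                           # closest pair of members
        "complete": max,                         # farthest pair of members
        "average": lambda v: sum(v) / len(v),    # mean over all member pairs
    }[method]
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_d = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = combine(pair_d)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Toy distances between four points placed at 0, 1, 3 and 7 on a line
dist = [
    [0, 1, 3, 7],
    [1, 0, 2, 6],
    [3, 2, 0, 4],
    [7, 6, 4, 0],
]
for method in ("single", "complete", "average"):
    print(method, [m[2] for m in agglomerate(dist, method)])
```

The order of the merges is the same for all three methods on this toy data, but the merge heights differ, which is what changes the shape of the dendrogram.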

HCL clustering

linked = linkage(hmap,'single')
labelList = [mol.GetProp('_Name') for mol in working_library]

plt.figure(figsize=(8,15))
ax1=plt.subplot()
o=dendrogram(linked,  
            orientation='left',
            labels=labelList,
            distance_sort='descending',
            show_leaf_counts=True)
ax1.spines['left'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)
plt.title('Similarity clustering',fontsize=20,weight='bold')
plt.tick_params ('both',width=2,labelsize=8)
plt.tight_layout()
plt.show()
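Note that `linkage(hmap,'single')` receives the square similarity matrix, so SciPy treats each row as an observation vector and clusters the molecules by the Euclidean distance between their similarity profiles. An alternative, sketched below under the assumption that `1 - similarity` is an acceptable distance, is to convert the similarities into distances and pass them to `linkage` in condensed form. The 4 x 4 matrix `sim` is made-up data standing in for `hmap`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Toy symmetric similarity matrix standing in for hmap (values are made up)
sim = np.array([
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
])

dist = 1.0 - sim                  # assumption: distance = 1 - similarity
np.fill_diagonal(dist, 0.0)       # make sure self-distances are exactly zero
condensed = squareform(dist)      # square matrix -> condensed vector of n*(n-1)/2 values
linked_dist = linkage(condensed, method='single')
print(linked_dist)
```

The resulting array can be passed to `dendrogram` exactly like `linked` above. The two approaches can produce different trees, so which one to use depends on what "close" should mean for the data at hand.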

Now let's sort our initial values according to the HCL analysis.

# This gives us the leaves in the same order as in the last plot
new_data=list(reversed(o['ivl']))
# We create a new matrix that follows the HCL order
hmap_2=np.empty(shape=(size,size))
for index,i in enumerate(new_data):
    for jndex,j in enumerate(new_data):
        hmap_2[index,jndex]=table.loc[i].at[j]
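The reordering above looks up every cell by label. The same permutation can be expressed with integer positions, as in the minimal sketch below. The helper `reorder_square` and the toy labels are assumptions for illustration, and the names are assumed to be unique.

```python
def reorder_square(matrix, labels, ordered_labels):
    """Return a copy of a square matrix with rows and columns permuted.

    `labels` gives the current row/column order and `ordered_labels` the
    desired one (for example the leaf order returned by the dendrogram).
    """
    position = {name: i for i, name in enumerate(labels)}
    idx = [position[name] for name in ordered_labels]
    return [[matrix[i][j] for j in idx] for i in idx]

# Toy data (made up)
labels = ["mol_a", "mol_b", "mol_c"]
matrix = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.3],
    [0.9, 0.3, 1.0],
]
reordered = reorder_square(matrix, labels, ["mol_a", "mol_c", "mol_b"])
for row in reordered:
    print(row)
```

For a NumPy array such as `hmap`, the expression `hmap[np.ix_(idx, idx)]` applies the same row and column permutation in one step.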

figure= plt.figure(figsize=(30,30))
gs1 = gridspec.GridSpec(2,7)
gs1.update(wspace=0.01)
ax1 = plt.subplot(gs1[0:-1, :2])
dendrogram(linked, orientation='left', distance_sort='descending',show_leaf_counts=True,no_labels=True)
ax1.spines['left'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)
ax2 = plt.subplot(gs1[0:-1,2:6])
f=ax2.imshow (hmap_2, cmap='PRGn_r', interpolation='nearest')
ax2.set_title('Fingerprint Similarity',fontsize=20,weight='bold')
ax2.set_xticks (range(len(new_data)))
ax2.set_yticks (range(len(new_data)))
ax2.set_xticklabels (new_data,rotation=90,size=8)
ax2.set_yticklabels (new_data,size=8)
ax3 = plt.subplot(gs1[0:-1,6:7])
m=plt.colorbar(f,cax=ax3,shrink=0.75,orientation='vertical',spacing='uniform',pad=0.01)
m.set_label ('Fingerprint Similarity')
plt.tick_params ('both',width=2)
plt.show()

As you can see, our approach was able to identify several clusters based on fingerprint similarity. RDKit can also calculate other frequently used similarity metrics, for example Tanimoto, Dice, Cosine, Sokal, Russel, Kulczynski, McConnaughey, and Tversky. The implementation is very easy and can be set by doing:
similarity=DataStructs.FingerprintSimilarity(i,j, metric=DataStructs.DiceSimilarity)
where:
metric - should be one of the similarity functions listed above.
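To see how some of these metrics relate to each other, the sketch below writes the Tversky index on toy set-based fingerprints and derives Tanimoto and Dice from it as special cases (weights 1, 1 and 0.5, 0.5 respectively). The function names and bit positions are made up for illustration. As far as I know, the RDKit `TverskySimilarity` function likewise needs the two weights as extra arguments, so it may have to be wrapped before being passed as `metric`.

```python
def tversky(bits_a, bits_b, alpha, beta):
    """Tversky index between two collections of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0

def tanimoto(bits_a, bits_b):
    # Tversky with alpha = beta = 1 is the Tanimoto coefficient
    return tversky(bits_a, bits_b, 1.0, 1.0)

def dice(bits_a, bits_b):
    # Tversky with alpha = beta = 0.5 is the Dice coefficient
    return tversky(bits_a, bits_b, 0.5, 0.5)

# Toy fingerprints (made-up bit positions)
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}

t = tanimoto(fp_a, fp_b)
d = dice(fp_a, fp_b)
print(t, d)
```

For the same pair of fingerprints, Dice is never smaller than Tanimoto (d = 2t / (1 + t)), so similarity values and thresholds are not directly comparable across metrics.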