In this chem-workflow, I will show you a strategy to calculate the similarity of a molecule database in a straightforward manner.
For this purpose, I will use a fraction of the REAL Compound Library of Enamine. You can read more about the database here.
Despite I will use a SMILES file, this analysis can be performed from SDF codes of molecules or any other format in the same way. Just be sure of using the appropriate supplier for loading your molecules into RDKIT.

导库

  1. import os
  2. import pandas as pd
  3. import numpy as np
  4. import matplotlib.pyplot as plt
  5. from matplotlib import gridspec
  6. from rdkit import Chem, DataStructs
  7. from rdkit.Chem.Fingerprints import FingerprintMols
  8. from rdkit.Chem import Draw
  9. # All we need for clustering
  10. from scipy.cluster.hierarchy import dendrogram, linkage

读取和可视化分子

The library contains more than 8 000 000 SMILES codes (size 542.8 M). But we can read the .smiles file as a text file to keep the fist 100 molecules.

  1. # The result of this code will be:
  2. working_library=[]
  3. with open('Enamine_REAL_diversity.smiles','r') as file:
  4. for index,line in enumerate(file):
  5. if 0<index<=1: # Kee the fist smile code as example
  6. print (index, line.split())

1 ['CC(C)C(C)(C)SCC=CBr', 'PV-002312932038']

The result of the above cell is a list which first element (0) is the SMILES codes of the molecule, and the second element (1), the name.
So, we can use the same code to convert SMILES codes to RDKIT molecules.

  1. working_library=[]
  2. with open('Enamine_REAL_diversity.smiles','r') as file:
  3. for index,line in enumerate(file):
  4. if 0<index<=100: # Molecules we want (0 is omitted because the fist line (0) of file is the header, not SMILES code)
  5. mol=Chem.MolFromSmiles(line.split()[0]) # Converting SMILES codes into rdkit mol
  6. mol.SetProp('_Name',line.split()[1]) # Adding the name for each molecule
  7. working_library.append(mol)

Now we have a list of 100 RDKIT type molecules to perform our similarity analysis.
Let´s draw them to see what is inside.

  1. Draw.MolsToGridImage(working_library,molsPerRow=10,subImgSize=(150,150),legends=[mol.GetProp('_Name') for mol in working_library])

image.png

计算分子指纹

For fingerprint similarity analysis, we first need to get the fingerprints for each molecule.
For such purpose we type:

  1. fps= [FingerprintMols.FingerprintMol(mol) for mol in working_library]

As result we have n fingerprints as n molecules:

  1. print(len(working_library))
  2. print(len(fps))

100
100

And we can get the similarity for each pair of molecules.
For instance mol_1 as reference and mol_2 as target.

  1. DataStructs.FingerprintSimilarity(fps[0],fps[1])

0.3438735177865613

Following the above example, we can construct all vs all analysis to see the similarity within our database (Something similar can be found in another Chem-workflow of this site -click here to see that entry-)

  1. size=len(working_library)
  2. hmap=np.empty(shape=(size,size))
  3. table=pd.DataFrame()
  4. for index, i in enumerate(fps):
  5. for jndex, j in enumerate(fps):
  6. similarity=DataStructs.FingerprintSimilarity(i,j)
  7. hmap[index,jndex]=similarity
  8. table.loc[working_library[index].GetProp('_Name'),working_library[jndex].GetProp('_Name')]=similarity

Let´s take a look to the similarity values

  1. table.head(10) # just the first 10 values due to our table in 100 x 100

image.png
We can cluster our compound by similarity applying a linkage hierarchical clustering (HCL) analysis (Something similar can be found in another Chem-workflow of this site -click here to see that entry-).
I won´t discuss the different methods for HCL (e.g. single, complete, average, etc). Because literature is plenty of such discussions, for example, you can check the skit-learn documentation.

HCL clustering

  1. linked = linkage(hmap,'single')
  2. labelList = [mol.GetProp('_Name') for mol in working_library]
  1. plt.figure(figsize=(8,15))
  2. ax1=plt.subplot()
  3. o=dendrogram(linked,
  4. orientation='left',
  5. labels=labelList,
  6. distance_sort='descending',
  7. show_leaf_counts=True)
  8. ax1.spines['left'].set_visible(False)
  9. ax1.spines['top'].set_visible(False)
  10. ax1.spines['right'].set_visible(False)
  11. plt.title('Similarity clustering',fontsize=20,weight='bold')
  12. plt.tick_params ('both',width=2,labelsize=8)
  13. plt.tight_layout()
  14. plt.show()

image.png
Now let´s sort our initial values by the HCL analysis.

  1. # This will give us the clusters in order as the last plot
  2. new_data=list(reversed(o['ivl']))
  3. # we create a new table with the order of HCL
  4. hmap_2=np.empty(shape=(size,size))
  5. for index,i in enumerate(new_data):
  6. for jndex,j in enumerate(new_data):
  7. hmap_2[index,jndex]=table.loc[i].at[j]
  1. figure= plt.figure(figsize=(30,30))
  2. gs1 = gridspec.GridSpec(2,7)
  3. gs1.update(wspace=0.01)
  4. ax1 = plt.subplot(gs1[0:-1, :2])
  5. dendrogram(linked, orientation='left', distance_sort='descending',show_leaf_counts=True,no_labels=True)
  6. ax1.spines['left'].set_visible(False)
  7. ax1.spines['top'].set_visible(False)
  8. ax1.spines['right'].set_visible(False)
  9. ax2 = plt.subplot(gs1[0:-1,2:6])
  10. f=ax2.imshow (hmap_2, cmap='PRGn_r', interpolation='nearest')
  11. ax2.set_title('Fingerprint Similarity',fontsize=20,weight='bold')
  12. ax2.set_xticks (range(len(new_data)))
  13. ax2.set_yticks (range(len(new_data)))
  14. ax2.set_xticklabels (new_data,rotation=90,size=8)
  15. ax2.set_yticklabels (new_data,size=8)
  16. ax3 = plt.subplot(gs1[0:-1,6:7])
  17. m=plt.colorbar(f,cax=ax3,shrink=0.75,orientation='vertical',spacing='uniform',pad=0.01)
  18. m.set_label ('Fingerprint Similarity')
  19. plt.tick_params ('both',width=2)
  20. plt.plot()

image.png
As you can see, our approach was able to identify several clusters based on fingerprint similarity, however. RDKIT can also calculate frequent-used similarity models. For example Tanimoto, Dice, Cosine, Sokal, Russel, Kulczynski, McConnaughey, and Tversky. The implementation is very easy and can be set by doing:
similarity=DataStructs.FingerprintSimilarity(i,j, metric=DataStructs.DiceSimilarity)
where:
metric - Should be one of the different methods described above.