1. Python Script algorithm

STEP1:

The first script can:

1.K-mer count:

Identify all FASTA files under the current folder, take out every K base, count each K base combination, and calculate the percentage.

Output Base Times Percent three-column matrix in CSV format.

2. Merge different samples:

Further, identify all CSV files generated in 1, and take out the Base and Percent columns.

Merge according to Base and change the column name to sample name and merge into a CSV file with base as the row name and percent as the column name.

3. Usage:

Switch cd to the directory where all fasta files are stored; Execute python kmer.py K
Enter the K value to get a mergeall.csv file. For further analysis in STPE2 R.

  1. import argparse
  2. def kmer(k):
  3. import os
  4. from collections import Counter
  5. import pandas as pd
  6. import globs
  7. FileList=os.listdir()
  8. for File in FileList:
  9. if os.path.splitext(File)[1]=='.fasta':
  10. f = open(File, 'r')
  11. lines = f.readlines()
  12. seq = ''
  13. for i in lines:
  14. if i[0] == '>':
  15. continue
  16. else:
  17. seq += i.strip()
  18. Dnuclic = (seq[start:start+k] for start in range(0, len(seq)-k+1))
  19. counts = Counter(Dnuclic)
  20. counts = dict(counts)
  21. counts = list(counts.items())
  22. data = pd.DataFrame(counts, columns=['Base', 'Times'])
  23. print(data.iloc[:, 1].sum())
  24. percent = data.iloc[:, 1] / (data.iloc[:, 1].sum()) * 100
  25. print(percent)
  26. data["Percent"] = percent
  27. data=data.drop(columns='Times')
  28. print(data)
  29. name = os.path.splitext(File)[0] + '.csv'
  30. data.to_csv(name, encoding='gbk')
  31. f.close()
  32. pa=os.getcwd()
  33. file = glob.glob(os.path.join(pa, "*.csv"))
  34. quanname=file[0].split("/")[-1]
  35. name=os.path.splitext(quanname)[0]
  36. df1=pd.read_csv(file[0],names=['Base',name+'Percent'],index_col=0)
  37. for f in file:
  38. qname=f.split("/")[-1]
  39. oname=os.path.splitext(qname)[0]
  40. df=pd.read_csv(f,names=['Base',name+'Percent'],index_col=0)
  41. df1=pd.merge(df1,df,on='Base')
  42. print(df1)
  43. df1.to_csv("mergeall.csv")
  44. if __name__ == "__main__":
  45. parser = argparse.ArgumentParser()
  46. parser.add_argument("k", type=int, help="import your designed k value here")
  47. args = parser.parse_args()
  48. kmer(args.k)

STEP2:

Script in the second R language can carry out correlation analysis and heat map visualization for CSV containing different samples and different K combinations after combination

library(readr)
library(corrplot)
library(pheatmap)
mergeall <- read_csv("mergeall.csv")
df=as.data.frame(mergeall)
rownames(df)=df[,1]
df2=df[,-1]
M <- cor(df2)
g <- corrplot(M,method = "color")
pheatmap(M,display_numbers=T,fontsize=15) 
pheatmap(df2,show_colnames =T,show_rownames = T, cluster_cols = T,display_numbers=T,fontsize=10)
pheatmap(scale(df2),show_colnames =T,show_rownames = T, cluster_cols = T,display_numbers=T,fontsize=10)

STPE3:

Results analysis

1. Correlation of different samples:

image.png

The higher the correlation, the more similar the genome was.

2. Clustering analysis of different samples K combined heat map:

We can see here:
1. In the genes of human chromosomes, The AAA and TTT sequences are significantly higher than those of other species, which is consistent with the PolyA sites in the human genome.
2. In human chromosome genes, the sequences that appear blue and are less than those of other species are base combinations with more GC: TCG, ACG, CGA, CGC, CCG, GGC, GCC.
image.png

2. R language N-Gram algorithm Package:

library(ngram)
s=readLines("./Desktop/caulobacterNA1000.txt")
s2=splitter(concatenate(s,collapse =""), split.char = TRUE )
ng=ngram(s3,n=3)
M=get.phrasetable(ng)

Output matrix:
image.png