基因集变异分析(GSVA)

方法:

基因集变异分析在传统转录组分析时就已经使用了,针对单细胞数据,其实是一样的。首先通过GSVA函数将细胞-基因表达矩阵转换为细胞-基因集表达矩阵,然后针对该矩阵计算两组或者多组之间的差异,并通过热图展示该差异。

单细胞的分组可能有多个层面,例如不同细胞活着亚类分组,对照组和实验组,不同时间段的样本等等。因此最好先按照不同的分组将表达量矩阵和细胞分组信息提取出来,以便后续分析

注意事项

  1. 做GSVA分析你应该明确使用哪个矩阵?RNA or Integrated?整合矩阵在数据整合过程中可能会将某些生物学事件的差异消除,因此,建议采用RNA矩阵执行GSVA 分析,下面的例子采用的是整合后的矩阵,是因为这里的整合没有过度消除多样本之间的差异,且为了提高计算速度(整合矩阵只有2000个基因)。
  2. GSVA 分析有两种策略,一种是为每个细胞计算ESscore,最后计算ESscore的平均值;另一种方法是先计算平均表达,再对平均表达计算ESscore。你应该始终执行两种方法,选择最符合生物学意义的结果。
  3. 想要实现GSVA富集后的矩阵做差异分析,需要为每个cluster内的每个细胞计算ESscore,因此面临计算速度缓慢的困难。

需要的R包:

  • GSEAbase
  • GSVA
  • limma
  • Seurat
  • Biobase
  • genefilter
  • dplyr
  • pystr (git)
  • purrr
  • tidyr
  • tibble

提取不同分组的表达矩阵:

相同细胞分类下不同时间样本的表达矩阵和细胞分组

  1. library(Seurat)
  2. library(dplyr)
  3. ##我这里有7个cluster,5个时间阶段
  4. for (i in 1:7){
  5. name<-paste0("./cluster_",i,"/cluster_",i,".afterclu.rds")
  6. endo<-readRDS(name)
  7. time<-endo@meta.data$case
  8. time[which(grepl("^pf_",time))]<-"PF"
  9. time[which(grepl("^po3_",time))]<-"PO3"
  10. time[which(grepl("^po7_",time))]<-"PO7"
  11. time[which(grepl("^po11_",time))]<-"PO11"
  12. time[which(grepl("^rif_",time))]<-"RIF"
  13. endo@meta.data$time<-time
  14. endo[['Cell.type']]<-endo[['seurat_clusters']]
  15. endo2<-subset(endo,subset=time%in%c("PF","PO3","PO7","PO11"))
  16. for (cls in unique(endo2$Cell.type)){
  17. exam<-subset(endo2,idents=cls)
  18. DefaultAssay(exam)<-"integrated"
  19. exprMat<-GetAssayData(exam,slot="data")
  20. exprMat<-as.matrix(exprMat)
  21. cellInfo<-exam@meta.data%>%select(Cell.type,time)
  22. write.table(exprMat,paste0("./cluster_",i,"/cluster_",i,"_",cls,".exprMat_.txt"),sep="\t",quote=FALSE)
  23. write.table(cellInfo,paste0("./cluster_",i,"/cluster_",i,"_",cls,".cellInfo_.txt"),sep="\t",quote=FALSE)
  24. }}

相同时间不同细胞的矩阵和信息

  1. library(Seurat)
  2. library(dplyr)
  3. for (i in 1:7){
  4. name<-paste0("./cluster_",i,"/cluster_",i,".afterclu.rds")
  5. endo<-readRDS(name)
  6. time<-endo@meta.data$case
  7. time[which(grepl("^pf_",time))]<-"PF"
  8. time[which(grepl("^po3_",time))]<-"PO3"
  9. time[which(grepl("^po7_",time))]<-"PO7"
  10. time[which(grepl("^po11_",time))]<-"PO11"
  11. time[which(grepl("^rif_",time))]<-"RIF"
  12. endo@meta.data$time<-time
  13. endo[['Cell.type']]<-endo[['seurat_clusters']]
  14. for (ti in c("PF","PO3","PO7","PO11")){
  15. exam<-subset(endo,subset=time==ti)
  16. DefaultAssay(exam)<-"integrated"
  17. exprMat<-GetAssayData(exam,slot="data")
  18. exprMat<-as.matrix(exprMat)
  19. cellInfo<-exam@meta.data%>%select(Cell.type,time)
  20. write.table(exprMat,paste0("./cluster_",i,"/cluster_",i,".exprMat_",ti,".txt"),sep="\t",quote=FALSE)
  21. write.table(cellInfo,paste0("./cluster_",i,"/cluster_",i,".cellInfo_",ti,".txt"),sep="\t",quote=FALSE)
  22. }
  23. }

去GSEA官网下载基因集合gmt文件

GSEA Download

这里有多个基因集,具体的介绍参考:
https://www.gsea-msigdb.org/gsea/msigdb/index.jsp

整合流程完成GSEA分析和差异分析

一些注意事项:

接着,做GSVA分析,获取GSVA矩阵,做GSVA分析可以参考传统bulk-RNA分析的方法

这里还需要注意几个问题:

  • 处理普通RNA数据需要预先过滤,但是单细胞数据取自Seurat对象,已经预先过滤好了
  • 如果输入是原始counts值,需要设置参数kcdf="Possion",但如果是TPM值,默认就好,因为我们输入是标准化后的数据,所以用默认参数
  • 默认参数mx.diff=TURE,结果是一个类似于负二项分布,因为后面要做差异分析,所以需要使用该参数,如果设置mx.diff=FALSE,则为高斯分布

先计算相同时间不同细胞的基因集差异

  1. library(GSEABase)
  2. library(Biobase)
  3. library(genefilter)
  4. library(limma)
  5. library(RColorBrewer)
  6. library(GSVA)
  7. library(dplyr)
  8. library(pystr)
  9. ##基因集这里看你用哪个了
  10. geneSets <- getGmt("/dellfsqd2/ST_LBI/USER/panxiaoguang/SingleCell/liangxue_data/cluster_1/GSVA/DataBase/c7.all.v7.0.symbols.gmt")
  11. #geneSets <- getGmt("/dellfsqd2/ST_LBI/USER/panxiaoguang/SingleCell/liangxue_data/cluster_1/GSVA/DataBase/c3.tft.v7.0.symbols.gmt")
  12. ###因为分析的是亚类,这里是针对细胞大类,意思是这
  13. ###里针对的是细胞大类1的所有亚类进行的分析
  14. clus<-1
  15. ###定义不同时间点
  16. time<-c("PF","PO3","PO7","PO11")
  17. ###输入一个列表或向量,进行重新编组
  18. re_grp<-function(x,s){
  19. purrr::map_chr(x,function(haha){if (haha==s){"A"}else{"B"}})
  20. }
  21. ##计算差异并返回差异数据框,x=分组信息,cellInfo=细胞分类信息(必须是数据框,
  22. ##行名为细胞,列名为Cell.type,里面是分组信息),res_es=GSVA表达矩阵,
  23. ##num=想要留下top-n个基因集
  24. cal_Diff<-function(x,cellInfo,res_es,num){
  25. need<-cellInfo%>%transmute(new_group=re_grp(Cell.type,x))%>%.$new_group
  26. grouP<-as.factor(need)
  27. desigN<-model.matrix(~grouP+0)
  28. rownames(desigN)<-rownames(cellInfo)
  29. comparE<-makeContrasts(grouPA-grouPB,levels=desigN)
  30. fiT <- lmFit(res_es, desigN)
  31. fiT2 <- contrasts.fit(fiT, comparE)
  32. fiT3 <- eBayes(fiT2)
  33. Diff<-topTable(fiT3,p.value=0.05,coef=1,num=num)
  34. if (nrow(Diff)>0){
  35. Diff$cluster<-x
  36. Diff<-Diff%>%tibble::rownames_to_column(var="Name")%>%tbl_df()%>%select(Name,logFC,t,P.Value,adj.P.Val,cluster)
  37. }else{
  38. Diff<-tibble::tibble(Name=NA,logFC=NA,t=NA,P.Value=NA,adj.P.Val=NA,cluster=NA)
  39. }
  40. }
  41. ###计算不同分组的平均表达量,mat=表达量矩阵,info=cellInfo
  42. Average_express<-function(mat,info){
  43. new_res<-mat%>%tibble::rownames_to_column(var="Name")%>%tidyr::gather(cell,value,-Name)
  44. new_cellinfo<-info
  45. new_cellinfo<-tibble::rownames_to_column(new_cellinfo,var="cell")
  46. all<-left_join(new_res,new_cellinfo,by="cell")
  47. fuck<-all%>%group_by(Name,Cell.type)%>%summarise(Mean=mean(value))%>%tidyr::spread(Cell.type,Mean)
  48. fuck
  49. }
  50. for(i in time){
  51. expMat<- read.table(pystr_format("cluster_{1}.exprMat_{2}.txt",clus,i),sep="\t",header=TRUE,row.names=1,check.names=FALSE,stringsAsFactors=FALSE)
  52. cellInfo<-read.table(pystr_format("cluster_{1}.cellInfo_{2}.txt",clus,i),sep="\t",header=TRUE,row.names=1,check.names=FALSE,stringsAsFactors=FALSE)
  53. mydata<-as.matrix(expMat)
  54. all_cell<-unique(cellInfo$Cell.type)
  55. res_es <- gsva(mydata, geneSets, min.sz=10, max.sz=500, verbose=FALSE, parallel.sz=8)
  56. res_es<-as.data.frame(res_es)
  57. write.table(res_es,pystr_format("./cell_diff/cluster_{1}_{2}_c7.txt",clus,i),sep="\t",quote=FALSE)
  58. fin<-do.call('rbind',lapply(all_cell,cal_Diff,cellInfo=cellInfo,res_es=res_es,num=10))
  59. fin<-fin%>%filter(!is.na(cluster))
  60. write.table(fin,pystr_format("./cell_diff/cluster_{1}_{2}_c7_diff_geneset.txt",clus,i),sep="\t",quote=FALSE,row.names=FALSE)
  61. heihei2<-Average_express(res_es,cellInfo)
  62. heihei_need<-filter(heihei2,Name%in%unique(fin$Name))
  63. if (nrow(heihei_need)>0){
  64. houhou2<-tibble::column_to_rownames(heihei_need,var="Name")
  65. pheatmap::pheatmap(houhou2,file=pystr_format("./cell_diff/cluster_{1}_{2}_c7_diff_heat.pdf",clus,i),fontsize=8)
  66. }else{
  67. next
  68. }
  69. }

再计算相同细胞不同时间的差异基因集

  1. library(GSEABase)
  2. library(Biobase)
  3. library(genefilter)
  4. library(limma)
  5. library(RColorBrewer)
  6. library(GSVA)
  7. library(dplyr)
  8. library(pystr)
  9. geneSets <- getGmt("/dellfsqd2/ST_LBI/USER/panxiaoguang/SingleCell/liangxue_data/cluster_1/GSVA/DataBase/c7.all.v7.0.symbols.gmt")
  10. #geneSets <- getGmt("/dellfsqd2/ST_LBI/USER/panxiaoguang/SingleCell/liangxue_data/cluster_1/GSVA/DataBase/c3.tft.v7.0.symbols.gmt")
  11. ##clus是细胞亚类分型,这个不能多写
  12. clus<-c(0,1,2,3)
  13. ##细胞大类的名字
  14. big_cls<-1
  15. re_grp<-function(x,s){
  16. purrr::map_chr(x,function(haha){if (haha==s){"A"}else{"B"}})
  17. }
  18. cal_Diff<-function(x,cellInfo,res_es,num){
  19. need<-cellInfo%>%transmute(new_group=re_grp(Cell.type,x))%>%.$new_group
  20. grouP<-as.factor(need)
  21. desigN<-model.matrix(~grouP+0)
  22. rownames(desigN)<-rownames(cellInfo)
  23. comparE<-makeContrasts(grouPA-grouPB,levels=desigN)
  24. fiT <- lmFit(res_es, desigN)
  25. fiT2 <- contrasts.fit(fiT, comparE)
  26. fiT3 <- eBayes(fiT2)
  27. Diff<-topTable(fiT3,p.value=0.05,coef=1,num=num)
  28. if (nrow(Diff)>0){
  29. Diff$cluster<-x
  30. Diff<-Diff%>%tibble::rownames_to_column(var="Name")%>%tbl_df()%>%select(Name,logFC,t,P.Value,adj.P.Val,cluster)
  31. }else{
  32. Diff<-tibble::tibble(Name=NA,logFC=NA,t=NA,P.Value=NA,adj.P.Val=NA,cluster=NA)
  33. }
  34. }
  35. Average_express<-function(mat,info){
  36. new_res<-mat%>%tibble::rownames_to_column(var="Name")%>%tidyr::gather(cell,value,-Name)
  37. new_cellinfo<-info
  38. new_cellinfo<-tibble::rownames_to_column(new_cellinfo,var="cell")
  39. all<-left_join(new_res,new_cellinfo,by="cell")
  40. fuck<-all%>%group_by(Name,Cell.type)%>%summarise(Mean=mean(value))%>%tidyr::spread(Cell.type,Mean)
  41. fuck
  42. }
  43. for(i in clus){
  44. expMat<- read.table(pystr_format("cluster_{1}_{2}.exprMat_.txt",big_cls,i),sep="\t",header=TRUE,row.names=1,check.names=FALSE,stringsAsFactors=FALSE)
  45. cellInfo<-read.table(pystr_format("cluster_{1}_{2}.cellInfo_.txt",big_cls,i),sep="\t",header=TRUE,row.names=1,check.names=FALSE,stringsAsFactors=FALSE)
  46. colnames(cellInfo)<-c("haha","Cell.type")
  47. all_time<-unique(cellInfo$Cell.type)
  48. mydata<-as.matrix(expMat)
  49. res_es <- gsva(mydata, geneSets, min.sz=10, max.sz=500, verbose=FALSE, parallel.sz=8)
  50. res_es<-as.data.frame(res_es)
  51. write.table(res_es,pystr_format("./time_diff/cluster_{1}_{2}_c7.txt",big_cls,i),sep="\t",quote=FALSE)
  52. fin<-do.call('rbind',lapply(all_time,cal_Diff,cellInfo=cellInfo,res_es=res_es,num=10))
  53. fin<-fin%>%filter(!is.na(cluster))
  54. write.table(fin,pystr_format("./time_diff/cluster_{1}_{2}_c7_diff_geneset.txt",big_cls,i),sep="\t",quote=FALSE,row.names=FALSE)
  55. heihei2<-Average_express(res_es,cellInfo)
  56. heihei_need<-filter(heihei2,Name%in%unique(fin$Name))
  57. if (nrow(heihei_need)>0){
  58. houhou2<-tibble::column_to_rownames(heihei_need,var="Name")
  59. pheatmap::pheatmap(houhou2,file=pystr_format("./time_diff/cluster_{1}_{2}_c7_diff_heat.pdf",big_cls,i),fontsize=8)
  60. }else{
  61. next
  62. }
  63. }

经过以上两轮的分析,你将每个层面得到三个结果

  • GSVA富集分数矩阵,正态分布
  • 每次比对的差异Top10基因集
  • 一个多组差异热图

转录调控分析(SCENIC)

按照SCENIC的官方教程,跑完整个流程(这里耗时很长,所以不能直接跑,要提交作业或者screen,避免中间间断
SCENIC 基因互作数据库需要提前下载,人类的一共两个,每个1.2G
hg19-500bp-upstream-7species.mc9nr.feather
hg19-tss-centered-10kb-7species.mc9nr.feather

由于R-Genie计算网络需要随机森林,所以耗时非常长,一个6K细胞的矩阵大概需要数天,所以,我们分三步计算

  • 提取表达量矩阵和细胞分组信息
  • 计算SCENIC网络
  • 计算差异并输出最后结果

需要的R包

  • SCENIC
  • AUCell

提取表达矩阵

SCENIC 把整个矩阵一起打分,避免多次计算网络,消耗时间

  1. library(Seurat)
  2. library(dplyr)
  3. for (i in 1:7){
  4. name<-paste0("./cluster_",i,"/cluster_",i,".afterclu.rds")
  5. endo<-readRDS(name)
  6. time<-endo@meta.data$case
  7. time[which(grepl("^pf_",time))]<-"PF"
  8. time[which(grepl("^po3_",time))]<-"PO3"
  9. time[which(grepl("^po7_",time))]<-"PO7"
  10. time[which(grepl("^po11_",time))]<-"PO11"
  11. time[which(grepl("^rif_",time))]<-"RIF"
  12. endo@meta.data$time<-time
  13. endo[['Cell.type']]<-endo[['seurat_clusters']]
  14. exam<-subset(endo,subset=time%in%c("PF","PO3","PO7","PO11"))
  15. DefaultAssay(exam)<-"integrated"
  16. cellInfo<-exam@meta.data%>%select(Cell.type,time)
  17. exprMat<-GetAssayData(exam,slot="data")
  18. exprMat<-as.matrix(exprMat)
  19. write.table(exprMat,paste0("cluster_",i,".exprMat_noRIF.txt"),sep="\t",quote=FALSE)
  20. write.table(cellInfo,paste0("./cluster_",i,"/cluster_",i,".cellInfo_noRIF.txt"),sep="\t",quote=FALSE)
  21. }

计算TF网络

  1. library(SCENIC)
  2. library(AUCell)
  3. library(pystr)
  4. ##填写不同的clus,对不同细胞进行分析
  5. clus<-1
  6. exprMat<-read.table(pystr_format("cluster_{1}.exprMat_noPO7.txt",clus),sep="\t",header=TRUE,row.names=1,check.names=FALSE,stringsAsFactors=FALSE)
  7. exprMat<-as.matrix(exprMat)
  8. cellInfo<-read.table(pystr_format("cluster_{1}.cellInfo_noPO7.txt",clus),sep="\t",header=TRUE,row.names=1,check.names=FALSE,stringsAsFactors=FALSE)
  9. cellInfo<-dplyr::select(cellInfo,Cell.type)
  10. colnames(cellInfo)<-"CellType"
  11. dir.create("int")
  12. saveRDS(cellInfo, file="int/cellInfo.Rds")
  13. org="hgnc"
  14. ##这个是你自己下载数据库存放的位置
  15. dbDir="/dellfsqd2/ST_LBI/USER/panxiaoguang/SingleCell/SCENIC/cisTarget_databases"
  16. myDatasetTitle="cluster_1_SCENIC_analysis"
  17. data(defaultDbNames)
  18. dbs <- defaultDbNames[[org]]
  19. scenicOptions <- initializeScenic(org=org, dbDir=dbDir, dbs=dbs, datasetTitle=myDatasetTitle, nCores=4)
  20. scenicOptions@inputDatasetInfo$cellInfo <- "int/cellInfo.Rds"
  21. saveRDS(scenicOptions, file="int/scenicOptions.Rds")
  22. exprMat_filtered <- exprMat
  23. runCorrelation(exprMat_filtered, scenicOptions)
  24. runGenie3(exprMat_filtered, scenicOptions)
  25. scenicOptions@settings$verbose <- TRUE
  26. scenicOptions@settings$nCores <- 4
  27. scenicOptions@settings$seed <- 123
  28. runSCENIC_1_coexNetwork2modules(scenicOptions)
  29. runSCENIC_2_createRegulons(scenicOptions)
  30. runSCENIC_3_scoreCells(scenicOptions, exprMat)

批量计算针对每个细胞类型的不同分组的差异TF

  1. library(dplyr)
  2. library(SCENIC)
  3. library(AUCell)
  4. library(pystr)
  5. library(limma)
  6. re_grp<-function(x,s){
  7. purrr::map_chr(x,function(haha){if (haha==s){"A"}else{"B"}})
  8. }
  9. cal_Diff<-function(x,cellInfo,res_es,num){
  10. need<-cellInfo%>%transmute(new_group=re_grp(Cell.type,x))%>%.$new_group
  11. grouP<-as.factor(need)
  12. desigN<-model.matrix(~grouP+0)
  13. rownames(desigN)<-rownames(cellInfo)
  14. comparE<-makeContrasts(grouPA-grouPB,levels=desigN)
  15. fiT <- lmFit(res_es, desigN)
  16. fiT2 <- contrasts.fit(fiT, comparE)
  17. fiT3 <- eBayes(fiT2)
  18. Diff<-topTable(fiT3,p.value=0.05,coef=1,num=num)
  19. if (nrow(Diff)>0){
  20. Diff$cluster<-x
  21. Diff<-Diff%>%tibble::rownames_to_column(var="Name")%>%tbl_df()%>%select(Name,logFC,t,P.Value,adj.P.Val,cluster)
  22. }else{
  23. Diff<-tibble::tibble(Name=NA,logFC=NA,t=NA,P.Value=NA,adj.P.Val=NA,cluster=NA)
  24. }
  25. }
  26. Average_express<-function(mat,info){
  27. new_res<-mat%>%tibble::rownames_to_column(var="Name")%>%tidyr::gather(cell,value,-Name)
  28. new_cellinfo<-info
  29. new_cellinfo<-tibble::rownames_to_column(new_cellinfo,var="cell")
  30. all<-left_join(new_res,new_cellinfo,by="cell")
  31. fuck<-all%>%group_by(Name,Cell.type)%>%summarise(Mean=mean(value))%>%tidyr::spread(Cell.type,Mean)
  32. fuck
  33. }
  34. ###不同的clus针对不同的细胞类
  35. clus<-1
  36. scenicOptions <- readRDS("int/scenicOptions.Rds")
  37. regulons <- loadInt(scenicOptions, "regulons")
  38. regulons2 <- loadInt(scenicOptions, "aucell_regulons")
  39. regulonAUC <- loadInt(scenicOptions, "aucell_regulonAUC")
  40. regulonAUC <- regulonAUC[onlyNonDuplicatedExtended(rownames(regulonAUC)),]
  41. cellInfo<-read.table(pystr_format("cluster_{1}.cellInfo_noPO7.txt",clus),sep="\t",header=TRUE,row.names=1,check.names=FALSE,stringsAsFactors=FALSE)
  42. for (i in c("PF","PO3","PO7","PO11")){
  43. cellInfo2<-cellInfo[which(cellInfo$time==i),]
  44. exp<-as.data.frame(getAUC(regulonAUC))[,rownames(cellInfo2)]
  45. exp2<-t(scale(t(exp), center = T, scale=T))
  46. fin<-do.call('rbind',lapply(unique(cellInfo2$Cell.type),cal_Diff,cellInfo=cellInfo2,res_es=exp2,num=10))
  47. fin<-fin%>%filter(!is.na(cluster))
  48. write.table(fin,pystr_format("cluster_{1}_{2}_TF_diff_geneset.txt",clus,i),sep="\t",quote=FALSE,row.names=FALSE)
  49. haha<-Average_express(exp,cellInfo2)%>%tibble::column_to_rownames(var="Name")%>%as.data.frame()
  50. haha<-haha[unique(fin$Name),]
  51. if (nrow(haha)>0){
  52. haha_scale<-t(scale(t(haha), center = T, scale=T))
  53. pheatmap::pheatmap(haha_scale,file=pystr_format("cluster_{1}_{2}_TF_heatmap.pdf",clus,i))
  54. }else{
  55. next
  56. }
  57. all_tf<-t(purrr::map_dfr(regulons2[unique(fin$Name)],function(x) {stringr::str_c(x,collapse=",")}))
  58. write.table(all_tf,pystr_format("cluster_{1}_{2}_TF_duiying.txt",clus,i),sep="\t",quote=FALSE,col.names=FALSE)
  59. }
  60. for (i in unique(cellInfo$Cell.type)){
  61. cellInfo2<-cellInfo[which(cellInfo$Cell.type==i),]
  62. colnames(cellInfo2)<-c("haha","Cell.type")
  63. exp<-as.data.frame(getAUC(regulonAUC))[,rownames(cellInfo2)]
  64. exp2<-t(scale(t(exp), center = T, scale=T))
  65. fin<-do.call('rbind',lapply(unique(cellInfo2$Cell.type),cal_Diff,cellInfo=cellInfo2,res_es=exp2,num=10))
  66. fin<-fin%>%filter(!is.na(cluster))
  67. write.table(fin,pystr_format("cluster_{1}_{2}_TF_diff_geneset.txt",clus,i),sep="\t",quote=FALSE,row.names=FALSE)
  68. haha<-Average_express(exp,cellInfo2)%>%tibble::column_to_rownames(var="Name")%>%as.data.frame()
  69. haha<-haha[unique(fin$Name),]
  70. if (nrow(haha)>0){
  71. haha_scale<-t(scale(t(haha), center = T, scale=T))
  72. pheatmap::pheatmap(haha_scale,file=pystr_format("cluster_{1}_{2}_TF_heatmap.pdf",clus,i))
  73. }else{
  74. next
  75. }
  76. all_tf<-t(purrr::map_dfr(regulons2[unique(fin$Name)],function(x) {stringr::str_c(x,collapse=",")}))
  77. write.table(all_tf,pystr_format("cluster_{1}_{2}_TF_duiying.txt",clus,i),sep="\t",quote=FALSE,col.names=FALSE)
  78. }

经过以上分析后,同样会得到三个文件:

  • 一个不同组间差异的top10转录因子基因集合
  • 一个不同分组的差异热图
  • 一个TF对应的所有基因表格

批量计算针对每个细胞类型的不同分组的差异TF

  1. library(dplyr)
  2. library(SCENIC)
  3. library(AUCell)
  4. library(pystr)
  5. library(limma)
  6. re_grp<-function(x,s){
  7. purrr::map_chr(x,function(haha){if (haha==s){"A"}else{"B"}})
  8. }
  9. cal_Diff<-function(x,cellInfo,res_es,num){
  10. need<-cellInfo%>%transmute(new_group=re_grp(Cell.type,x))%>%.$new_group
  11. grouP<-as.factor(need)
  12. desigN<-model.matrix(~grouP+0)
  13. rownames(desigN)<-rownames(cellInfo)
  14. comparE<-makeContrasts(grouPA-grouPB,levels=desigN)
  15. fiT <- lmFit(res_es, desigN)
  16. fiT2 <- contrasts.fit(fiT, comparE)
  17. fiT3 <- eBayes(fiT2)
  18. Diff<-topTable(fiT3,p.value=0.05,coef=1,num=num)
  19. if (nrow(Diff)>0){
  20. Diff$cluster<-x
  21. Diff<-Diff%>%tibble::rownames_to_column(var="Name")%>%tbl_df()%>%select(Name,logFC,t,P.Value,adj.P.Val,cluster)
  22. }else{
  23. Diff<-tibble::tibble(Name=NA,logFC=NA,t=NA,P.Value=NA,adj.P.Val=NA,cluster=NA)
  24. }
  25. }
  26. Average_express<-function(mat,info){
  27. new_res<-mat%>%tibble::rownames_to_column(var="Name")%>%tidyr::gather(cell,value,-Name)
  28. new_cellinfo<-info
  29. new_cellinfo<-tibble::rownames_to_column(new_cellinfo,var="cell")
  30. all<-left_join(new_res,new_cellinfo,by="cell")
  31. fuck<-all%>%group_by(Name,Cell.type)%>%summarise(Mean=mean(value))%>%tidyr::spread(Cell.type,Mean)
  32. fuck
  33. }
  34. ###不同的clus针对不同的细胞类
  35. clus<-1
  36. scenicOptions <- readRDS("int/scenicOptions.Rds")
  37. regulons <- loadInt(scenicOptions, "regulons")
  38. regulons2 <- loadInt(scenicOptions, "aucell_regulons")
  39. regulonAUC <- loadInt(scenicOptions, "aucell_regulonAUC")
  40. regulonAUC <- regulonAUC[onlyNonDuplicatedExtended(rownames(regulonAUC)),]
  41. cellInfo<-read.table(pystr_format("cluster_{1}.cellInfo_noPO7.txt",clus),sep="\t",header=TRUE,row.names=1,check.names=FALSE,stringsAsFactors=FALSE)
  42. for (i in c("PF","PO3","PO7","PO11")){
  43. cellInfo2<-cellInfo[which(cellInfo$time==i),]
  44. exp<-as.data.frame(getAUC(regulonAUC))[,rownames(cellInfo2)]
  45. exp2<-t(scale(t(exp), center = T, scale=T))
  46. fin<-do.call('rbind',lapply(unique(cellInfo2$Cell.type),cal_Diff,cellInfo=cellInfo2,res_es=exp2,num=10))
  47. fin<-fin%>%filter(!is.na(cluster))
  48. write.table(fin,pystr_format("cluster_{1}_{2}_TF_diff_geneset.txt",clus,i),sep="\t",quote=FALSE,row.names=FALSE)
  49. haha<-Average_express(exp,cellInfo2)%>%tibble::column_to_rownames(var="Name")%>%as.data.frame()
  50. haha<-haha[unique(fin$Name),]
  51. if (nrow(haha)>0){
  52. haha_scale<-t(scale(t(haha), center = T, scale=T))
  53. pheatmap::pheatmap(haha_scale,file=pystr_format("cluster_{1}_{2}_TF_heatmap.pdf",clus,i))
  54. }else{
  55. next
  56. }
  57. all_tf<-t(purrr::map_dfr(regulons2[unique(fin$Name)],function(x) {stringr::str_c(x,collapse=",")}))
  58. write.table(all_tf,pystr_format("cluster_{1}_{2}_TF_duiying.txt",clus,i),sep="\t",quote=FALSE,col.names=FALSE)
  59. }
  60. for (i in unique(cellInfo$Cell.type)){
  61. cellInfo2<-cellInfo[which(cellInfo$Cell.type==i),]
  62. colnames(cellInfo2)<-c("haha","Cell.type")
  63. exp<-as.data.frame(getAUC(regulonAUC))[,rownames(cellInfo2)]
  64. exp2<-t(scale(t(exp), center = T, scale=T))
  65. fin<-do.call('rbind',lapply(unique(cellInfo2$Cell.type),cal_Diff,cellInfo=cellInfo2,res_es=exp2,num=10))
  66. fin<-fin%>%filter(!is.na(cluster))
  67. write.table(fin,pystr_format("cluster_{1}_{2}_TF_diff_geneset.txt",clus,i),sep="\t",quote=FALSE,row.names=FALSE)
  68. haha<-Average_express(exp,cellInfo2)%>%tibble::column_to_rownames(var="Name")%>%as.data.frame()
  69. haha<-haha[unique(fin$Name),]
  70. if (nrow(haha)>0){
  71. haha_scale<-t(scale(t(haha), center = T, scale=T))
  72. pheatmap::pheatmap(haha_scale,file=pystr_format("cluster_{1}_{2}_TF_heatmap.pdf",clus,i))
  73. }else{
  74. next
  75. }
  76. all_tf<-t(purrr::map_dfr(regulons2[unique(fin$Name)],function(x) {stringr::str_c(x,collapse=",")}))
  77. write.table(all_tf,pystr_format("cluster_{1}_{2}_TF_duiying.txt",clus,i),sep="\t",quote=FALSE,col.names=FALSE)
  78. }

经过以上分析后,同样会得到三个文件:

  • 一个不同组间差异的top10转录因子基因集合
  • 一个不同分组的差异热图
  • 一个TF对应的所有基因表格

批量对差异的TF集合做GO注释

  1. library(readr)
  2. library(dplyr)
  3. library(clusterProfiler)
  4. library(pystr)
  5. library(org.Hs.eg.db)
  6. TF_annotation <- function(x) {
  7. data <- read_delim(x, delim = "\t", col_names = FALSE)
  8. data <- dplyr::rename(data, TF = X1, Genes = X2)
  9. test <-
  10. data %>% tibble::column_to_rownames(var = "TF") %>% as.data.frame() %>% apply(., 1, function(x)
  11. strsplit(x, ","))
  12. test <- lapply(test, function(x)
  13. x$Genes)
  14. new_test <-
  15. lapply(test, function(x) {
  16. bitr(x,
  17. fromType = "SYMBOL",
  18. toType = "ENTREZID",
  19. OrgDb = org.Hs.eg.db)$ENTREZID
  20. })
  21. wocao <-
  22. compareCluster(
  23. geneClusters = new_test,
  24. fun = "enrichGO",
  25. OrgDb = org.Hs.eg.db,
  26. ont = "BP",
  27. pAdjustMethod = "BH",
  28. pvalueCutoff = 0.01,
  29. qvalueCutoff = 0.05
  30. )
  31. name <- stringr::str_replace(x, ".txt", "")
  32. write.table(
  33. wocao@compareClusterResult,
  34. pystr_format("{1}.annotation.txt", name),
  35. row.names = FALSE,
  36. quote = FALSE,
  37. sep = "\t"
  38. )
  39. }
  40. fs<-Sys.glob("*_duiying.txt")
  41. lapply(fs,TF_annotation)