单细胞基础分析

image.png
image.png
左边芯片数据,右边TCGA转录组 数据
image.png下载方式
找了多个 芯片GEO数据找到了多少个上下调基因。一般 文章 中用的padjust比较多
TCGA使用了 这个癌症数据
然后 差异基因取交集82个上下调基因
然后GO/KEGG/PPI
然后 hub基因的KM-PLOT和ROC曲线
最后单细胞验证hub基因,PCR和TIMER验证
六个数据的PCA图
image.png

放宽标准,6个 数据有2个数据是交集 即可
image.png
image.png
image.png
image.png
网页工具是专用数据库借鉴,
image.png
tsne1和tsne2比PCA更加复杂 的降维方式
每一个点是一个 细胞,上面是marker基因,发现这两个圈圈里的细胞是这些基因的 表达比较高一点
右边是一个热图,展示不同簇的marker基因是哪些,很明显有4个簇,0123,表达量高的基因是哪些
最下面是对0-4注释是哪些细胞类型
就知道了12对应癌细胞,03对应非癌细胞
image.png
把B中 的 6个基因的表达量作为颜色 变量 投射到细胞点 上 ,都是 在tumor上 表达 高 ,在 非癌表达低
image.png
image.png
网页工具叫timer,免疫细胞丰度,展示是这些基因与6种免疫细胞的相关性程度,第一个是肿瘤纯度矫正
image.png

Seurat标准流程

括号前的英文单词是函数 ,
data/是文件夹 ,里面包含标准的输入数据
barcode相当于 样本ID
genes是ensbel和symbol的对应
第三个matrix是矩阵
高变基因是指方差 大的那些基因

image.png
image.png

1.数据和R包准备

代码:https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html
数据:https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

  1. rm(list = ls())
  2. options(stringsAsFactors = F)
  3. library(dplyr)
  4. library(Seurat)
  5. library(patchwork)

2.读取数据

10X的输入数据是固定的三个文件,在工作目录下新建01_data/,把三个文件放进去。

  1. ##Read10X函数读取数据
  2. pbmc.data <- Read10X(data.dir = "01_data/")
  3. dim(pbmc.data)
  4. ## [1] 32738 2700
  5. 2700个细胞 32738个基因
  6. ##CreateSeuratObject读进来的数据创建为serut对象并过滤,min.cells 至少在3个细胞里面表达的基因
  7. ##和min.features是一个细胞要有200及以上的基因表达量才保留细胞,
  8. pbmc <- CreateSeuratObject(counts = pbmc.data,
  9. project = "pbmc3k",
  10. min.cells = 3,
  11. min.features = 200)
  12. pbmc
  13. ## An object of class Seurat
  14. ## 13714 features across 2700 samples within 1 assay
  15. ## Active assay: RNA (13714 features, 0 variable features)
  16. # 查看表达矩阵
  17. ##具体怎么提取要自己根据对象来探索更改
  18. exp = pbmc@assays$RNA@counts;dim(exp)
  19. ## [1] 13714 2700
  20. exp[30:34,1:4]##列名当成细胞的编号
  21. ## 5 x 4 sparse Matrix of class "dgCMatrix"
  22. ## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
  23. ## MRPL20 1 . 1 .
  24. ## ATAD3C . . . .
  25. ## ATAD3B . . . .
  26. ## ATAD3A . . . .
  27. ## SSU72 . 1 . 3
  28. boxplot(as.matrix(exp[,1:20]))
  29. ##简单画一下前20个基因的表达量

image.png

3.质控

这里是对细胞进行的质控,指标是:
线粒体基因含量不能过高;
nFeature_RNA 不能过高或过低

为什么? nFeature_RNA是每个细胞中检测到的基因数量。nCount_RNA是细胞内检测到的分子总数。nFeature_RNA过低,表示该细胞可能已死/将死或是空液滴。太高的nCount_RNA和/或nFeature_RNA表明“细胞”实际上可以是两个或多个细胞。结合线粒体基因count数除去异常值,即可除去大多数双峰/死细胞/空液滴,因此它们过滤是常见的预处理步骤。 参考自:https://www.biostars.org/p/407036/

  1. ## PercentageFeatureSet专用来处理的函数 ,^以什么开头,以MT开头就是线粒体基因
  2. #percent.mt包含了每一个细胞里包含的线粒体比例
  3. pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
  4. head(pbmc@meta.data, 3)
  5. ## orig.ident nCount_RNA nFeature_RNA percent.mt
  6. ## AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759
  7. ## AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958
  8. ## AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363
  9. ##然后画一个图看看VlnPlot
  10. VlnPlot(pbmc,
  11. features = c("nFeature_RNA",
  12. "nCount_RNA",
  13. "percent.mt"),
  14. ncol = 3)

image.png
根据这个三个图,确定了这个数据的过滤标准:
nFeature_RNA在200~2500之间;线粒体基因占比在5%以下。

3.2 三个指标之间的相关性

  1. plot1 <- FeatureScatter(pbmc,
  2. feature1 = "nCount_RNA",
  3. feature2 = "percent.mt")
  4. plot2 <- FeatureScatter(pbmc,
  5. feature1 = "nCount_RNA",
  6. feature2 = "nFeature_RNA")
  7. plot1 + plot2

image.png

3.4 过滤

  1. ##subset函数进行过滤,标准就是前面筛选的
  2. pbmc <- subset(pbmc,
  3. subset = nFeature_RNA > 200 &
  4. nFeature_RNA < 2500 &
  5. percent.mt < 5)
  6. dim(pbmc)
  7. ## [1] 13714 2638

4.找高变基因(HVG)

  1. ##默认参数是找2000个
  2. pbmc <- NormalizeData(pbmc)
  3. pbmc <- FindVariableFeatures(pbmc)
  4. top10 <- head(VariableFeatures(pbmc), 10);top10
  5. ## [1] "PPBP" "LYZ" "S100A9" "IGLL5" "GNLY" "FTL" "PF4" "FTH1"
  6. ## [9] "GNG11" "S100A8"
  7. 这里选了2000个,把前十个在图上标记出来。
  8. plot1 <- VariableFeaturePlot(pbmc)
  9. plot2 <- LabelPoints(plot = plot1,
  10. points = top10,
  11. repel = TRUE)
  12. plot1 + plot2

一个是只画了高变基因有哪些,另一个标记了前 10个名字
image.png

5. 标准化和降维

  1. all.genes <- rownames(pbmc)
  2. pbmc <- ScaleData(pbmc, features = all.genes)
  3. pbmc[["RNA"]]@scale.data[30:34,1:3]##简单看下
  4. ## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
  5. ## MRPL20 1.50566156 -0.56259931 1.24504892
  6. ## ATAD3C -0.04970561 -0.04970561 -0.04970561
  7. ## ATAD3B -0.10150202 -0.10150202 -0.10150202
  8. ## ATAD3A -0.13088200 -0.13088200 -0.13088200
  9. ## SSU72 -0.68454728 0.58087748 -0.68454728

5.1 线性降维PCA

  1. pbmc <- RunPCA(pbmc, features = VariableFeatures(pbmc))
  2. # 查看前5个主成分由哪些feature组成
  3. ##只展示前 5个
  4. print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
  5. ## PC_ 1
  6. ## Positive: CST3, TYROBP, LST1, AIF1, FTL
  7. ## Negative: MALAT1, LTB, IL32, IL7R, CD2
  8. ## PC_ 2
  9. ## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
  10. ## Negative: NKG7, PRF1, CST7, GZMB, GZMA
  11. ## PC_ 3
  12. ## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
  13. ## Negative: PPBP, PF4, SDPR, SPARC, GNG11
  14. ## PC_ 4
  15. ## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
  16. ## Negative: VIM, IL7R, S100A6, IL32, S100A8
  17. ## PC_ 5
  18. ## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY
  19. ## Negative: LTB, IL7R, CKB, VIM, MS4A7
  20. ##展示自己的基因使用VizDimLoadings
  21. VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

image.png

  1. #每个主成分对应基因的热图
  2. DimHeatmap(pbmc, dims = 1:15, cells = 500)

image.png

  1. # 应该选多少个主成分进行后续分析:方法一
  2. ElbowPlot(pbmc)

image.png

  1. :方法二:选P<0.05
  2. pbmc <- JackStraw(pbmc, num.replicate = 100)
  3. pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
  4. JackStrawPlot(pbmc, dims = 1:20)

image.png

  1. #PC1和2
  2. 画一下PCA
  3. DimPlot(pbmc, reduction = "pca")+ NoLegend()

image.png

  1. # 结合JackStrawPlot和ElbowPlot,挑选10个PC,所以这里dims定义为1:10
  2. 一般 选多一点比较好 resolution = 0.5是分辨率,数值大小会关系着分群的数的多少,0.1-1
  3. pbmc <- FindNeighbors(pbmc, dims = 1:10)
  4. pbmc <- FindClusters(pbmc, resolution = 0.5) #分辨率
  5. ## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
  6. ##
  7. ## Number of nodes: 2638
  8. ## Number of edges: 95927
  9. ##
  10. ## Running Louvain algorithm...
  11. ## Maximum modularity in 10 random starts: 0.8728
  12. ## Number of communities: 9
  13. ## Elapsed time: 0 seconds
  14. # 结果聚成几类,用Idents查看
  15. length(levels(Idents(pbmc)))
  16. ## [1] 9

5.2 UMAP 和t-sne

PCA是线性降维,这两个是非线性降维

  1. pbmc <- RunUMAP(pbmc, dims = 1:10)
  2. DimPlot(pbmc, reduction = "umap")

image.png

6.找marker基因

啥叫marker基因呢。和差异基因里面的上调基因有点类似,某个基因在某一簇细胞里表达量都很高,在其他簇表达量很低,那么这个基因就是这簇细胞的象征。
找全部cluster的maker基因

  1. pbmc.markers <- FindAllMarkers(pbmc,
  2. only.pos = TRUE, # 只返回positive基因
  3. min.pct = 0.25) #只计算至少在(两簇细胞总数的)25%的细胞中有表达的基因
  4. pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
  5. ## # A tibble: 18 x 7
  6. ## # Groups: cluster [9]
  7. ## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
  8. ## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
  9. ## 1 3.75e-112 1.09 0.912 0.592 5.14e-108 0 LDHB
  10. ## 2 9.57e- 88 1.36 0.447 0.108 1.31e- 83 0 CCR7
  11. ## 3 0 5.57 0.996 0.215 0 1 S100A9
  12. ## 4 0 5.48 0.975 0.121 0 1 S100A8
  13. ## 5 1.06e- 86 1.27 0.981 0.643 1.45e- 82 2 LTB
  14. ## 6 2.97e- 58 1.23 0.42 0.111 4.07e- 54 2 AQP3
  15. ## 7 0 4.31 0.936 0.041 0 3 CD79A
  16. ## 8 9.48e-271 3.59 0.622 0.022 1.30e-266 3 TCL1A
  17. ## 9 5.61e-202 3.10 0.983 0.234 7.70e-198 4 CCL5
  18. ## 10 7.25e-165 3.00 0.577 0.055 9.95e-161 4 GZMK
  19. ## 11 3.51e-184 3.31 0.975 0.134 4.82e-180 5 FCGR3A
  20. ## 12 2.03e-125 3.09 1 0.315 2.78e-121 5 LST1
  21. ## 13 7.95e-269 4.83 0.961 0.068 1.09e-264 6 GZMB
  22. ## 14 3.13e-191 5.32 0.961 0.131 4.30e-187 6 GNLY
  23. ## 15 1.48e-220 3.87 0.812 0.011 2.03e-216 7 FCER1A
  24. ## 16 1.67e- 21 2.87 1 0.513 2.28e- 17 7 HLA-DPB1
  25. ## 17 9.25e-186 7.29 1 0.011 1.27e-181 8 PF4
  26. ## 18 1.92e-102 8.59 1 0.024 2.63e- 98 8 PPBP

6.1 比较某个基因在几个cluster之间的表达量

小提琴图

  1. VlnPlot(pbmc, features = c("MS4A1", "CD79A"))

image.png

  1. #可以拿count数据画
  2. VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)

image.png

  1. umap图上标记
  2. FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))

image.png

  1. ##把每簇里面logfc排名前10的基因挑出来
  2. top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)

6.2 marker基因的热图

  1. DoHeatmap(pbmc, features = top10$gene) + NoLegend()

image.png

7. 根据marker基因确定细胞

  1. ###每一种细胞都是有几个基因高表达是什么细胞,直接手动写
  2. new.cluster.ids <- c("Naive CD4 T",
  3. "CD14+ Mono",
  4. "Memory CD4 T",
  5. "B",
  6. "CD8 T",
  7. "FCGR3A+ Mono",
  8. "NK",
  9. "DC",
  10. "Platelet")
  11. names(new.cluster.ids) <- levels(pbmc)
  12. pbmc <- RenameIdents(pbmc, new.cluster.ids)
  13. DimPlot(pbmc,
  14. reduction = "umap",
  15. label = TRUE,
  16. pt.size = 0.5) + NoLegend()

image.png

GSE67980

  1. a = read.delim("GSE67980_readCounts.txt.gz")
  2. 只有pr开头的是细胞
  3. colnames(a)
  4. ## [1] "ID" "Entrez.GeneID" "uniGene" "symbol"
  5. ## [5] "name" "Pr10.1.2" "Pr10.1.3" "Pr10.2.1"
  6. ## [9] "Pr11.1.2" "Pr11.2.1" "Pr11.2.3" "Pr12.1.1"
  7. ## [13] "Pr12.1.2" "Pr13.1.1" "Pr13.1.2" "Pr14.3.2"
  8. ## [17] "Pr14.3.3" "Pr15.1.2" "Pr16.1.1" "Pr16.1.2"
  9. ## [21] "Pr16.1.3" "Pr16.1.4" "Pr16.1.6" "Pr18.1.2"
  10. ## [25] "Pr18.1.3" "Pr18.1.9" "Pr18.2.4" "Pr20.1.2"
  11. ## [29] "Pr22.1.1" "Pr22.1.3" "Pr9.1.5" "Pr9.1.7"
  12. ## [33] "Pr9.2.1" "Pr9.2.2" "Pr9.3.2" "Pr9.3.3"
  13. ## [37] "Pr9.3.4" "Pr9.3.5" "Pr1.1.1" "Pr2.1.1"
  14. ## [41] "Pr2.1.2" "Pr3.1.2" "Pr4.1.1" "Pr4.1.2"
  15. ## [45] "Pr5.1.1" "Pr5.1.2" "Pr5.1.3" "Pr6.1.3"
  16. ## [49] "Pr6.1.4" "Pr8.1.2" "Pr10.2.2" "Pr11.1.1"
  17. ## [53] "Pr11.2.2" "Pr11.3.1" "Pr11.3.2" "Pr11.3.3"
  18. ## [57] "Pr11.4.1" "Pr11.4.2" "Pr11.4.3" "Pr11.4.4"
  19. ## [61] "Pr11.4.5" "Pr11.4.6" "Pr14.1.1" "Pr14.2.1"
  20. ## [65] "Pr14.2.2" "Pr14.2.3" "Pr14.2.4" "Pr14.2.5"
  21. ## [69] "Pr14.2.6" "Pr14.3.1" "Pr14.3.4" "Pr14.3.5"
  22. ## [73] "Pr14.3.6" "Pr15.1.1" "Pr17.1.1" "Pr17.1.2"
  23. ## [77] "Pr17.2.1" "Pr17.2.2" "Pr9.1.1" "Pr9.1.2"
  24. ## [81] "Pr9.1.3" "Pr9.1.4" "Pr9.1.6" "Pr9.3.1"
  25. ## [85] "Pr9.3.6" "Pr9.3.7" "Pr9.3.8" "Pr18.1.1"
  26. ## [89] "Pr18.1.4" "Pr18.1.5" "Pr18.1.6" "Pr18.1.7"
  27. ## [93] "Pr18.1.8" "Pr18.2.1" "Pr18.2.2" "Pr18.2.3"
  28. ## [97] "Pr19.1.2" "Pr19.1.3" "Pr19.1.4" "Pr19.1.5"
  29. ## [101] "Pr20.1.1" "Pr21.1.1" "Pr21.1.10" "Pr21.1.11"
  30. ## [105] "Pr21.1.12" "Pr21.1.2" "Pr21.1.3" "Pr21.1.4"
  31. ## [109] "Pr21.1.5" "Pr21.1.6" "Pr21.1.7" "Pr21.1.8"
  32. ## [113] "Pr21.1.9" "Pr22.1.10" "Pr22.1.11" "Pr22.1.12"
  33. ## [117] "Pr22.1.2" "Pr22.1.4" "Pr22.1.5" "Pr22.1.6"
  34. ## [121] "Pr22.1.7" "Pr22.1.8" "Pr22.1.9" "Pr3.1.1"
  35. ## [125] "Pr6.1.1" "Pr6.1.2" "Pr6.1.5" "PriTum1"
  36. ## [129] "PriTum2" "PriTum3" "PriTum4" "PriTum10"
  37. ## [133] "PriTum11" "PriTum12" "PriTum5" "PriTum6"
  38. ## [137] "PriTum7" "PriTum8" "PriTum9" "DU145.1"
  39. ## [141] "DU145.2" "DU145.3" "DU145.4" "DU145.5"
  40. ## [145] "LNCaP.1" "LNCaP.2" "LNCaP.3" "LNCaP.4"
  41. ## [149] "LNCaP.5" "LNCaP.D.1" "LNCaP.D.2" "LNCaP.D.3"
  42. ## [153] "LNCaP.D.4" "LNCaP.D.5" "LNCaP.R.1" "LNCaP.R.2"
  43. ## [157] "LNCaP.R.3" "LNCaP.R.4" "LNCaP.R.5" "PC3.1"
  44. ## [161] "PC3.2" "PC3.3" "PC3.4" "PC3.5"
  45. ## [165] "VCaP.1" "VCaP.2" "VCaP.3" "VCaP.4"
  46. ## [169] "VCaP.5" "Pr23.1.w1" "Pr23.1.w2" "HD1.1.w1"
  47. ## [173] "HD1.1.w2" "HD1.1.w3"
  48. b = rio::import("GSE67980_sampleProperties.txt.gz")
  49. b = b[b$characteristics..donor.type=="CRPC",]
  50. table(b$characteristics..donor)
  51. ##
  52. ## Pr10 Pr11 Pr12 Pr13 Pr14 Pr15 Pr16 Pr17 Pr18 Pr19 Pr20 Pr21 Pr22 Pr23 Pr9
  53. ## 4 14 2 2 13 2 5 4 13 4 2 12 12 2 17
  54. a = a[a$symbol!="",]
  55. a = a[!duplicated(a$symbol),]
  56. a2 = a[,b$title]
  57. rownames(a2) = a$symbol
  58. rownames(b) = b$title
  59. library(Seurat)
  60. a2就是整理好的表达矩阵了,要加上临床信息b
  61. pbmc <- CreateSeuratObject(counts = a2,
  62. meta.data = b,
  63. project = "a",
  64. min.cells = 3,
  65. min.features = 50)
  66. exp = pbmc[["RNA"]]@counts;dim(exp)
  67. ## [1] 14798 108
  68. exp[1:4,1:4]
  69. ## 4 x 4 sparse Matrix of class "dgCMatrix"
  70. ## Pr10.1.2 Pr10.1.3 Pr10.2.1 Pr11.1.2
  71. ## A1BG . . . .
  72. ## ADA . . . .
  73. ## CDH2 . . . .
  74. ## AKT3 1 . . .
  75. pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
  76. head(pbmc@meta.data, 3)
  77. ## orig.ident nCount_RNA nFeature_RNA title source.name
  78. ## Pr10.1.2 a 3338244 2517 Pr10.1.2 candidate PCa CTC
  79. ## Pr10.1.3 a 4291749 1776 Pr10.1.3 candidate PCa CTC
  80. ## Pr10.2.1 a 7100153 3353 Pr10.2.1 candidate PCa CTC
  81. ## organism characteristics..donor characteristics..donor.type
  82. ## Pr10.1.2 Homo sapiens Pr10 CRPC
  83. ## Pr10.1.3 Homo sapiens Pr10 CRPC
  84. ## Pr10.2.1 Homo sapiens Pr10 CRPC
  85. ## characteristics..enzalutamide molecule percent.mt
  86. ## Pr10.1.2 single cell in lysis buffer 0
  87. ## Pr10.1.3 single cell in lysis buffer 0
  88. ## Pr10.2.1 single cell in lysis buffer 0
  89. VlnPlot(pbmc, group.by= "characteristics..donor",
  90. features = c("nFeature_RNA",
  91. "nCount_RNA",
  92. "percent.mt"),
  93. ncol = 3)

image.png

  1. pbmc <- NormalizeData(pbmc)
  2. pbmc@assays$RNA[30:34,1:3]
  3. ## 5 x 3 sparse Matrix of class "dgCMatrix"
  4. ## Pr10.1.2 Pr10.1.3 Pr10.2.1
  5. ## HOTAIR . . .
  6. ## ZGLP1 . . .
  7. ## FAM86JP . . .
  8. ## GHRLOS . . .
  9. ## LOC100128164 . . .
  10. # 有三种算法:vst、mean.var.plot、dispersion;默认选择2000个HVG
  11. pbmc <- FindVariableFeatures(pbmc,
  12. nfeatures = 1500)
  13. top10 <- head(VariableFeatures(pbmc), 10);top10
  14. ## [1] "HBB" "PLA2G2A" "KLK3" "PPBP" "ANXA11" "LUZP6" "PDLIM1"
  15. ## [8] "RAB11A" "TSC22D1" "TOX4"
  16. plot1 <- VariableFeaturePlot(pbmc)
  17. plot2 <- LabelPoints(plot = plot1,
  18. points = top10,
  19. repel = TRUE)
  20. plot1 + plot2

image.png

  1. ### 标准化和降维
  2. all.genes <- rownames(pbmc)
  3. pbmc <- ScaleData(pbmc, features = all.genes)
  4. pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
  5. print(pbmc[["pca"]], dims = 1:2, nfeatures = 5)
  6. ## PC_ 1
  7. ## Positive: HBB, PF4, NRGN, ALAS2, PPBP
  8. ## Negative: KLK3, AR, DENND4B, CCDC14, RCE1
  9. ## PC_ 2
  10. ## Positive: CX3CR1, RAB7L1, TFAP4, GZMB, PTGER2
  11. ## Negative: FTH1, NLRP9, SLCO1B3, ZNF70, TSR1
  12. VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

image.png

  1. #每个主成分对应基因的热图
  2. DimHeatmap(pbmc, dims = 1:4, balanced = TRUE)

image.png

  1. # 应该选多少个主成分进行后续分析
  2. pbmc <- JackStraw(pbmc, num.replicate = 100)
  3. pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
  4. JackStrawPlot(pbmc, dims = 1:20)

image.png

  1. ElbowPlot(pbmc)

image.png

  1. pbmc <- FindNeighbors(pbmc, dims = c(1:3,11))
  2. pbmc <- FindClusters(pbmc, resolution = 0.5) #分辨率
  3. ## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
  4. ##
  5. ## Number of nodes: 108
  6. ## Number of edges: 2817
  7. ##
  8. ## Running Louvain algorithm...
  9. ## Maximum modularity in 10 random starts: 0.6789
  10. ## Number of communities: 2
  11. ## Elapsed time: 0 seconds
  12. # 结果聚成几类,用Idents查看
  13. length(levels(Idents(pbmc)))
  14. ## [1] 2
  15. pbmc <- RunTSNE(pbmc, dims = c(1:3,11))
  16. DimPlot(pbmc,reduction = "tsne")

image.png

  1. library(tidyverse)
  2. pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
  3. pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
  4. ## # A tibble: 4 x 7
  5. ## # Groups: cluster [2]
  6. ## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
  7. ## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
  8. ## 1 0.00000110 4.26 0.889 0.537 0.0163 0 GNLY
  9. ## 2 0.00206 4.40 0.944 0.963 1 0 DYNLRB1
  10. ## 3 0.0000000162 3.04 0.963 0.796 0.000239 1 TRIB1
  11. ## 4 0.000142 3.35 0.741 0.611 1 1 AMACR
  12. top6 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 6, wt = avg_log2FC)
  13. DoHeatmap(pbmc, features = top6$gene) + NoLegend()

image.png

  1. DotPlot(pbmc, features = top6$gene) + RotatedAxis()

image.png

  1. VlnPlot(pbmc, features = top6$gene)

image.png

  1. FeaturePlot(pbmc, features = top6$gene,cols = c("green","red"))

image.png```

  1. # 注释
  2. SingleR可以帮你自己注释免疫细胞,BlueprintEncodeData就是,人和小鼠应该可以做
  3. library(celldex)
  4. library(SingleR)
  5. ref <- BlueprintEncodeData()
  6. ref
  7. ## class: SummarizedExperiment
  8. ## dim: 19859 259
  9. ## metadata(0):
  10. ## assays(1): logcounts
  11. ## rownames(19859): TSPAN6 TNMD ... LINC00550 GIMAP1-GIMAP5
  12. ## rowData names(0):
  13. ## colnames(259): mature.neutrophil
  14. ## CD14.positive..CD16.negative.classical.monocyte ...
  15. ## epithelial.cell.of.umbilical.artery.1
  16. ## dermis.lymphatic.vessel.endothelial.cell.1
  17. ## colData names(3): label.main label.fine label.ont
  18. ref <- celldex::HumanPrimaryCellAtlasData()
  19. library(BiocParallel)
  20. scRNA = pbmc
  21. scRNA@assays$RNA@data表达矩阵数据,ref 是参考数据
  22. pred.scRNA <- SingleR(test = scRNA@assays$RNA@data,
  23. ref = ref,
  24. labels = ref$label.main,
  25. clusters = scRNA@active.ident)
  26. pred.scRNA$pruned.labels
  27. ## [1] "NK_cell" "Epithelial_cells"
  28. #查看注释准确性
  29. plotScoreHeatmap(pred.scRNA, clusters=pred.scRNA@rownames, fontsize.row = 9,show_colnames = T)

颜色代表注释准确性
image.png

  1. new.cluster.ids <- pred.scRNA$pruned.labels
  2. names(new.cluster.ids) <- levels(scRNA)
  3. levels(scRNA)
  4. ## [1] "0" "1"
  5. scRNA <- RenameIdents(scRNA,new.cluster.ids)
  6. levels(scRNA)
  7. ## [1] "NK_cell" "Epithelial_cells"
  8. TSNEPlot(object = scRNA, pt.size = 0.5, label = TRUE)

image.png###

03复现

image.png
33991个细胞,然后有几个群的细胞继续向下注释,然后继续找出来marker基因,当做差异表达基因
然后就是找训练集和测试集做一些分析,包括cox,kmplot,生存分析
image.png
image.png
image.png
image.png
image.png
image.png
image.png
image.png
image.png
image.png
Seurat4.0系列教程1:标准流程- 简书(jianshu.com)
单细胞测序(sc-RNA seq)分析:Seurat4.0系列教程、高级分析、文章复现)