- Cellranger
- 具体格式
is string of characters and/or - variables. The variables can be one of: $ac
- - accession, $si spot id, $sn spot
- name, $sg spot group (barcode), $sl spot
- length in bases, $ri read number, $rn
- read name, $rl read length in bases. ‘[]’
- could be used for an optional output: if
- all vars in [] yield empty values whole
- group is not printed. Empty value is empty
- string or for numeric variables. Ex:
- @$sn[$rn]/$ri ‘$rn’ is omitted if name
- is empty
- 举例子
- 查看
- fastqc查看数据质量情况
- cellranger使用
Cellranger
运行环境
Date:2021-4-9CentOS Linux release 7.8.2003 (Core)Architecture: x86_64CPU op-mode(s): 32-bit, 64-bitByte Order: Little EndianCPU(s): 144Model name: Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz
软件安装与数据下载
需要**SRAtool**、**Aspera**、**cellranger**、**bcl2fastq**、**tree**
软件安装
#SRAtool安装v-2.9.2wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.2/sratoolkit.2.9.2-centos_linux64.tar.gz#下载后解压直接修改环境变量即可
#Aspera安装v-3.7.4wget http://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gztar -zxvf aspera-connect-3.7.4.147727-linux-64.tar.gz#安装,安装结束后家目录如果有.aspera证明安装成功./aspera-connect-3.7.4.147727-linux-64.sh
#cellranger安装直接安装最新版的6.0.0,软件官网:https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latestwget -O cellranger-6.0.0.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-6.0.0.tar.gz?Expires=1618175827&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZi4xMHhnZW5vbWljcy5jb20vcmVsZWFzZXMvY2VsbC1leHAvY2VsbHJhbmdlci02LjAuMC50YXIuZ3oiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkFXUzpFcG9jaFRpbWUiOjE2MTgxNzU4Mjd9fX1dfQ__&Signature=BZ1xzJU1NGM-eikz5V59t0JIM2o6m2bXxM9N5Ce3GfuBkxLI2YJZskHeSFIBE9B9xiJpG22yFUBxSQOIAE2ABuyf4BvbKOBm4iRnPvCiePumiOQG9D-ocvCmhIQkFvFVkqR0700OazZSkdw-IyF6ywXNDdKuWBMq~mKYJnVq~nSC-cftxb5jco1wJvYD2PtC6vEYefY7G18XsSYrRif6HTa13cgGvK5MBsT6qachS6FOGafAalxlY5EGj428FruAVk2v-lnJXIDz97iIegyNPR6qfo4Qr9YwYrWw-zYRwmhWpWpW~HwyI-rOqioe6IzTQIjJ1mOAwAT7HT5bNFMmgw__&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA"#解压tar -zxvf cellranger-6.0.0.tar.gz#修改环境变量后即可使用#测试,可以输出系统配置信息cellranger sitecheck
#bcl2fastq安装#将下载好的rpm包进行解压unzip bcl2fastq2-v2-20-0-linux-x86-64.zip#因为没有root权限所以要直接提取rpm包中的文件rpm2cpio bcl2fastq2-v2.20.0.422-Linux-x86_64.rpm |cpio -idvm#提取出来后当前目录会出现usr目录,修改环境变量(将bin目录加入)/home/liuzihao/study/scrna/usr/local/bin:(加入环境变量)#测试(base) [liuzihao@bogon scrna]$ bcl2fastq -vBCL to FASTQ file converterbcl2fastq v2.20.0.422Copyright (c) 2007-2017 Illumina, Inc.
#tree安装wget ftp://mama.indstate.edu/linux/tree/tree-1.6.0.tgztar -zxvf tree-1.6.0.tgz#修该Makefile,指定安装路径(第19行)19 prefix = /home/liuzihao/tree#安装make -j 4 && make install
数据下载
参考基因组索引下载
#10x官方已经构建好的人基因组索引wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
测序数据下载(GSE117988)
#prefech无法调用ASPERA未解决,所以用了迅雷替代,只下载了一半,另一半实在是太大,网费emmmmmprefetch -t ascp -a "~/aspera/connet/bin/ascp|/~/aspera/connet/etc/asperaweb_id_dsa.openssh" --option-file ./data/list.txt#最终得到以下数据6个SRA文件,后续进行拆分与定量-rw-r--r--. 1 liuzihao student 2.6G Apr 11 10:37 SRR7722937-rw-r--r--. 1 liuzihao student 3.7G Apr 11 11:12 SRR7722938-rw-r--r--. 1 liuzihao student 1.8G Apr 11 10:54 SRR7722939-rw-r--r--. 1 liuzihao student 2.0G Apr 11 11:02 SRR7722940-rw-r--r--. 1 liuzihao student 1016M Apr 11 10:38 SRR7722941-rw-r--r--. 1 liuzihao student 1.9G Apr 11 11:04 SRR7722942
SRA文件拆分
fastq-dump工具
使用**fastq-dump**命令拆分所有SRA文件,
**--gzip / --bzip2**:输出文件压缩,输出fastq.gz**--split-file**:将双端测序分为两份,放在不同的文件,但是对于一方有而一方没有的reads直接丢弃https://www.biostars.org/p/186741/**--split-spot**:将双端测序分为两份,放在不同的文件,但是对于一方有而一方没有的reads会单独放在一个文件夹里**--split-3**:将双端测序分为两份,放在不同的文件,但是对于一方有而一方没有的reads会单独放在一个文件夹里(优先)**--split-file**:**-A**:指定输出文件的名字**-I | --readids**:双端reads加.1和.2区分 不需要**--defline-seq <fmt>**:指定序列名称的显示格式,如,--defline-qual '+'第三行则只显示加号**--defline-qual <fmt>**:指定质量信息名称的显示格式**<fmt>**具体格式:?$ac:序列编号$si: spot id$sD: spot group$sl:碱基长度$ri:reads编号,$rn:reads名称,$rl:reads长度```shell具体格式
is string of characters and/or variables. The variables can be one of: $ac
- accession, $si spot id, $sn spot
name, $sg spot group (barcode), $sl spot
length in bases, $ri read number, $rn
read name, $rl read length in bases. ‘[]’
could be used for an optional output: if
all vars in [] yield empty values whole
group is not printed. Empty value is empty
string or for numeric variables. Ex:
@$sn[$rn]/$ri ‘$rn’ is omitted if name
is empty
举例子
fastq-dump —gzip —split-files —defline-qual ‘+’ —defline-seq ‘>\$ac-\$si/\$ri’ ./SRR7722941查看
zless -SN SRR7722941_1.fastq.gz | head
\SRR7722941-\1/\1 CTAGGTGA + GGGGGIII \SRR7722941-\2/\1 CTAGGTGA + GGGGGIII \SRR7722941-\3/\1 CTAGGTGA ```
批量拆分脚本
1小时左右,脚本名:fastq.sh
#!/bin/bash#######################################################Author:liuzihao#Date:2021-4-11#CentOS Linux release 7.8.2003 (Core)#Architecture: x86_64#CPU op-mode(s): 32-bit, 64-bit#Byte Order: Little Endian#CPU(s): 144#Model name: Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz######################################################function file_num(){count=0total=$(wc -l ls.txt | cut -d " " -f 1)}ls S* > ls.txtfile_numfor i in $(cat ls.txt)docount=$(( $count + 1 ))echo -ne "共${total}个文件,正在拆分第${count}个\r"fastq-dump --gzip --split-files -A ${i} ${i} &>> log2.txt && echo "${i}_Success" >> ./log.txt || echo "${i}_Fail" >> ./log.txtdonerm -rf ls.txt#创建fastq文件存放位置mkdir fastqdatamv ./*.gz ./fastqdata
拆分结果查看
(base) [liuzihao@bogon fastqdata]$ la | head -n 6total 19G-rw-r--r--. 1 liuzihao student 417M Apr 11 17:07 SRR7722937_1.fastq.gz-rw-r--r--. 1 liuzihao student 889M Apr 11 17:07 SRR7722937_2.fastq.gz-rw-r--r--. 1 liuzihao student 2.4G Apr 11 17:07 SRR7722937_3.fastq.gz
每个SRA文件被拆分成3个fastq文件,通过**zless -NS <FILE>**查看文件内容,zless与less用法一致
#<file>_1文件中每个序列均只有8bp可能为barcode或UMI或index(详细见后)(base) [liuzihao@bogon fastqdata]$ zless -SN SRR7722937_1.fastq.gz | head -n 6@SRR7722937.1 SN367:911:HKMNCBCXY:1:1103:1119:1866 length=8TTTCATGA+SRR7722937.1 SN367:911:HKMNCBCXY:1:1103:1119:1866 length=8GGGGGIII@SRR7722937.2 SN367:911:HKMNCBCXY:1:1103:1182:1935 length=8TTTCATGA#<file>_2文件中每个序列均只有26bp可能为barcode或UMI或index(详细见后)(base) [liuzihao@bogon fastqdata]$ zless -SN SRR7722937_2.fastq.gz | head -n 6@SRR7722937.1 SN367:911:HKMNCBCXY:1:1103:1119:1866 length=26AGCAGCCGTGACTACTGTATTGCGAC+SRR7722937.1 SN367:911:HKMNCBCXY:1:1103:1119:1866 length=26AGGGGIIIIIGGIIIIIIIIIIIIGG@SRR7722937.2 SN367:911:HKMNCBCXY:1:1103:1182:1935 length=26GCTCCTAAGACACTAAGGCCTGTACC#<file>_3文件中每个序列均只有98bp,应该为测序reads(详细见后)(base) [liuzihao@bogon fastqdata]$ zless -SN SRR7722937_3.fastq.gz | head -n 6@SRR7722937.1 SN367:911:HKMNCBCXY:1:1103:1119:1866 length=98NNNNNCTGTAATCCCAGCCAGGAGGACTGCTTGAACCCGGGAGGCAGAGGTTTCAGTGAGCTGAGTGCCACTGCACTCCAGCCTGGGTGACAGAGTGA+SRR7722937.1 SN367:911:HKMNCBCXY:1:1103:1119:1866 length=98#####<.<G<GGAAGGA.AGGGGGGIIIGAAGGGAAGG<AAAAAGGGAAA<GGGGGGGAAGGGAGGGA<GGGGIGGGAAAGGGGGGGGGGAGG..AAA@SRR7722937.2 SN367:911:HKMNCBCXY:1:1103:1182:1935 length=98GCGAAAAGTCCTAATAGTAGAAGAACCCTCCATAAACCTGGAGTGACTATATGGATGCCCGCCACCCTACCACACATTCGAAGAGCCCGTATACATAA
10x文库特征
(文库特征与白洋的土壤细菌宏培养组设计相似,index区分96孔板,barcode区分单菌16s)

**P5/P7接头**:与Flowcell上的p5p7接头互补,使reads可以固定于Flowcell上进行桥式PCR扩增
**index**:用于区分不同样本(文库)的序列,96孔板的每个孔中都加入了4种不同的index oligos混合,后续拆分序列可根据index进行区分,后续将具有相同4种oligo的fq文件组合在一起,表示来自同一个样本(文库)。
**barcode**:用于区分同样本(文库)中的cell
**UMI**:UMI就是分子标签(Unique Molecular Identifier),由4-10个随机核苷酸组成,在mRNA反转录后,进入到文库中,每一个mRNA随机连上一个UMI,结果可以计数不同的UMI,最终统计mRNA的数量,
#UMI原理: 就是给每一条原始DNA片段**加上一段**特有的**标签序列,经文库构建及PCR扩增后一起进行测序。这样,根据**不同的标签序列**我们就可以区分**不同来源#的DNA模板,分辨哪些是PCR扩增及测序过程中的随机错误造成的假阳性突变。
**Poly(dT)VN**: 随机引物,mRNA可以与其结合序列:TTTTTTTTT

- 首先,1-26个cycle就是测序得到了26个碱基,先是16个Barcode碱基,然后是10个UMI碱基;
- 然后,27-34这8个cycle得到了8个碱基,就是i7的sample index;
- 最后,35-132个cycle得到了98个碱基,就是转录本reads
#根据reads长度,上述所拆分出的3个fastq文件分别为,index、barcode、reads#<file>_1: 8bp>>>index#<file>_2: 26bp>>>barcode(16bp)+UMI(10bp)#<file>_3: 98bp >>>reads
fastq重命名
根据10x官方说明,需要对文件名进行修改

I1:index序列R1:barcode序列R2:reads序列
如,
SRR7722937_S1_L001_R1_001.fastq.gz
SRR7722937_S1_L001_R2_001.fastq.gz
SRR7722937_S1_L001_I1_001.fastq.gz
使用以下脚本批量重命名:(rename.sh)
#!/bin/bash#######################################################Author:liuzihao#Date:2021-4-12#CentOS Linux release 7.8.2003 (Core)#Architecture: x86_64#CPU op-mode(s): 32-bit, 64-bit#Byte Order: Little Endian#CPU(s): 144#Model name: Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz#######################################################获取各个序列类型的文件列表echo "或取文件列表"ls -l | awk '{print $9}' | fgrep "_2" | cut -d "_" -f 1 > ls2.txtls -l | awk '{print $9}' | fgrep "_1" | cut -d "_" -f 1 > ls1.txtls -l | awk '{print $9}' | fgrep "_3" | cut -d "_" -f 1 > ls3.txtecho "开始重命名...."#重名名for a in $(cat ls1.txt)domv ${a}_1.fastq.gz ${a}_S1_L001_I1_001.fastq.gzdonefor b in $(cat ls2.txt)domv ${b}_2.fastq.gz ${b}_S1_L001_R1_001.fastq.gzdonefor c in $(cat ls3.txt)domv ${c}_3.fastq.gz ${c}_S1_L001_R2_001.fastq.gzdoneecho "命名结束,删除临时文件...."rm -rf ls*.txt#重命名结果如下(base) [liuzihao@bogon fastqdata]$ la S* | head -n 3-rw-r--r--. 1 liuzihao student 417M Apr 11 17:07 SRR7722937_S1_L001_I1_001.fastq.gz-rw-r--r--. 1 liuzihao student 889M Apr 11 17:07 SRR7722937_S1_L001_R1_001.fastq.gz-rw-r--r--. 1 liuzihao student 2.4G Apr 11 17:07 SRR7722937_S1_L001_R2_001.fastq.gz
fastqc查看数据质量情况
#使用fastqc对R1和R2碱基质量查看,主要为R2以为其为真实readsfastqc ./*R2* -o ./fastqc/ -t 64#multicqc整合fastqc结果multicqc ./fastqc/
结果查看
cellranger使用
软件自检
cellranger testrun --id=tiny#以下信息自检成功Waiting 6 seconds for UI to do final refresh.Pipestance completed successfully!
mkfastq
#官方资料#https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/mkfastq#example_data
mkfastq基本原理与特点
mkfastq可以将混合测序的样品通过index信息将Illumina 测序仪的下机数据转换为fastq文件,本功能类似于Illumina 的bcl2fastq,特性如下,
- 将10X 样本index名称与四种寡核苷酸对应起来,比如A1孔是样本
SI-GA-A1,然后对应的寡核苷酸是GGTTTACT, CTAAACGG, TCGGCGTC, and AACCGTAA,那么程序就会去index文件中将存在这四种寡核苷酸的fastq组合到A1这个样本 - 提供质控结果,包括barcode 质量、总体测序质量如Q30、R1和R2的Q30碱基占比、测序reads数等
- 可以使用10X简化版的样本信息表
不同文库在同一条Lane测序
同一文库在不同Lane测序

mkfastq命令选项
**--run**:BCL文件的位置(路径)
**--id**:mkfastq创建的输出目录的名称
**--samplesheet**:Illumina测序数据文件的相关信息文件的位置,包括样品名、index信息、lane信息,**--sample-sheet**与上述一致
**--csv**:仅包括样品名、index信息、lane等信息的简单版samplesheet,**--simple-csv**与上述一致
**--filter-dual-index**: 仅拆分i5和i7确定的reads
**--filter-single-index**:仅拆分i7确定的reads
**--lanes**:指定拆分哪些lanse的reads
**--output-dir**:设定输出目录,代替默认的flowcell_id/outs/fastq_path
**--localcores**:限定使用cpu数量
**--localmem**:限定使用内存数量,GB
……

mkfastq用法
下载官网提供的3个示例文件,
#BCL文件wget https://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-1.2.0.tar.gz#samplesheet,10x简化版wget https://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-simple-1.2.0.csv#samplesheet,Illumina版wget https://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-samplesheet-1.2.0.csv
samplesheet,Illumina版

10x简化版

#使用简化版samplesheet对bcl进行拆分cellranger mkfastq --id=tin-bcl --run=./cellranger-tiny-bcl-1.2.0 --csv=./cellranger-tiny-bcl-simple-1.2.0.csv#运行结束后会输出各个文件的存放位置Outputs:- FASTQ output folder: /home/liuzihao/study/scrna/exbcl/tin-bcl/outs/fastq_path- Interop output folder: /home/liuzihao/study/scrna/exbcl/tin-bcl/outs/interop_path- Input samplesheet: /home/liuzihao/study/scrna/exbcl/tin-bcl/outs/input_samplesheet.csv(base) [liuzihao@bogon fastq_path]$ la *.gz-rw-r--r--. 1 liuzihao student 20M Apr 13 09:17 Undetermined_S0_L001_I1_001.fastq.gz-rw-r--r--. 1 liuzihao student 50M Apr 13 09:17 Undetermined_S0_L001_R1_001.fastq.gz-rw-r--r--. 1 liuzihao student 146M Apr 13 09:17 Undetermined_S0_L001_R2_001.fastq.gz
Illumina版
cellranger mkfastq --id=illumia --run=./cellranger-tiny-bcl-1.2.0 --samplesheet=cellranger-tiny-bcl-samplesheet-1.2.0.csv#运行结束后会输出各个文件的存放位置Outputs:- FASTQ output folder: /home/liuzihao/study/scrna/exbcl/illumia/outs/fastq_path- Interop output folder: /home/liuzihao/study/scrna/exbcl/illumia/outs/interop_path- Input samplesheet: /home/liuzihao/study/scrna/exbcl/illumia/outs/input_samplesheet.csv(base) [liuzihao@bogon fastq_path]$ la *.gz-rw-r--r--. 1 liuzihao student 20M Apr 13 09:17 Undetermined_S0_L001_I1_001.fastq.gz-rw-r--r--. 1 liuzihao student 50M Apr 13 09:17 Undetermined_S0_L001_R1_001.fastq.gz-rw-r--r--. 1 liuzihao student 146M Apr 13 09:17 Undetermined_S0_L001_R2_001.fastq.gz
counts
使用**cellranger count**进行定量(整个定量过程包含了比对(STAR)、质控、定量)
counts选项参数
https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count
--id:指定输出文件目录的名称
--transcriptome:指定参考基因组索引位置
--fastqs:指定fastq文件位置
--expect-cells:预期得到cell数目,默认3000
--sample:指定样品名称,fastq文件的前缀中的sample保持一致,作为软件识别的标志,不指定则分析所有fastq文件
--nosecondary:不进行下游降维聚类等分析
--r1-length / --r2-length:将输入的R2/R1序列截短到指定长度
--localcores / --localmem:指定处理器数目和内存使用量
counts用法
cellranger count --id=count --transcriptome=/home/liuzihao/study/scrna/ref/refdata-gex-GRCh38-2020-A --fastqs=/home/liuzihao/study/scrna/exbcl/test --expect-cells=3000 --nosecondary##输出文件说明与位置Outputs:- Run summary HTML: /home/liuzihao/study/scrna/exbcl/test/count/outs/web_summary.html- Run summary CSV: /home/liuzihao/study/scrna/exbcl/test/count/outs/metrics_summary.csv- BAM: /home/liuzihao/study/scrna/exbcl/test/count/outs/possorted_genome_bam.bam- BAM index: /home/liuzihao/study/scrna/exbcl/test/count/outs/possorted_genome_bam.bam.bai- Filtered feature-barcode matrices MEX: /home/liuzihao/study/scrna/exbcl/test/count/outs/filtered_feature_bc_matrix- Filtered feature-barcode matrices HDF5: /home/liuzihao/study/scrna/exbcl/test/count/outs/filtered_feature_bc_matrix.h5- Unfiltered feature-barcode matrices MEX: /home/liuzihao/study/scrna/exbcl/test/count/outs/raw_feature_bc_matrix- Unfiltered feature-barcode matrices HDF5: /home/liuzihao/study/scrna/exbcl/test/count/outs/raw_feature_bc_matrix.h5- Secondary analysis output CSV: /home/liuzihao/study/scrna/exbcl/test/count/outs/analysis- Per-molecule read information: /home/liuzihao/study/scrna/exbcl/test/count/outs/molecule_info.h5- CRISPR-specific analysis: null- Loupe Browser file: /home/liuzihao/study/scrna/exbcl/test/count/outs/cloupe.cloupe- Feature Reference: null- Target Panel File: nullWaiting 6 seconds for UI to do final refresh.Pipestance completed successfully!#最重要的三个文件(base) [liuzihao@bogon filtered_feature_bc_matrix]$ latotal 2.8M-rw-r--r--. 1 liuzihao student 13K Apr 13 14:24 barcodes.tsv.gz-rw-r--r--. 1 liuzihao student 326K Apr 13 14:24 features.tsv.gz-rw-r--r--. 1 liuzihao student 2.5M Apr 13 14:24 matrix.mtx.gz
web_summary.html:官方说明 summary HTML filemetrics_summary.csv:CSV格式数据摘要possorted_genome_bam.bam:比对文件possorted_genome_bam.bam.bai:索引文件- **
filtered_gene_bc_matrices:是重要的一个目录,下面又包含了 barcodes.tsv.gz、features.tsv.gz、matrix.mtx.gz,是下游Seurat、Scater、Monocle等分析的输入文件 filtered_feature_bc_matrix.h5:过滤掉的barcode信息HDF5 formatraw_feature_bc_matrix:原始barcode信息raw_feature_bc_matrix.h5:原始barcode信息HDF5 formatanalysis:数据分析目录,下面又包含聚类clustering(有graph-based & k-means)、差异分析diffexp、主成分线性降维分析pca、非线性降维tsnemolecule_info.h5:下面进行aggregate使用的文件cloupe.cloupe:官方可视化工具Loupe Cell Browser 输入文件
mkgtf和mkref
使用**cellranger mkgtf**和**cellranger mkref**筛选GTF文件(构建新的gtf文件)以及构建参考基因组索引,也可以使用awk等命令自行提取
- 参考序列只能有很少的 overlapping gene annotations,因为reads比对到多个基因会导致流程检测的分子数更少(它只要uniquely mapped的结果)
- FASTA与GTF比对和STAR兼容,GTF文件的第三列(feature type)必须有exon
- 可以自己增加基因注释信息,按照gtf的格式添加,gtf具体信息参考
RNA-seq分析
eg:mylocus annotation exon 100 200 . + . gene_id “mygene”; transcript_id “mygene”;
mkgtf
#参考基因组下载wget ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz#参考基因组注释下载wget ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
使用**cellranger mkgtf**对GTF进行筛选:cellranger mkgtf <input_gtf> <output_gtf> [--attribute=KEY:VALUE...]
#!/bin/bashcellrangerr mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \--attribute=gene_biotype:protein_coding \--attribute=gene_biotype:lincRNA \--attribute=gene_biotype:antisense \--attribute=gene_biotype:IG_LV_gene \--attribute=gene_biotype:IG_V_gene \--attribute=gene_biotype:IG_V_pseudogene \--attribute=gene_biotype:IG_D_gene \--attribute=gene_biotype:IG_J_gene \--attribute=gene_biotype:IG_J_pseudogene \--attribute=gene_biotype:IG_C_gene \--attribute=gene_biotype:IG_C_pseudogene \--attribute=gene_biotype:TR_V_gene \--attribute=gene_biotype:TR_V_pseudogene \--attribute=gene_biotype:TR_D_gene \--attribute=gene_biotype:TR_J_gene \--attribute=gene_biotype:TR_J_pseudogene \--attribute=gene_biotype:TR_C_gene
筛选结果查看
#假定只筛选TR_C_genecellranger mkgtf Homo_sapiens.GRCh38.84.gtf filtered.gtf --attribute=gene_biotype:TR_C_gene#查看筛选后的GTFcat filtered.gtf |grep -v "#" |awk -v FS='gene_biotype ' 'NF>1{print $2}'|awk -F ";" '{print $1}'|sort | uniq -c#只有1种125 "TR_C_gene"#查看原gtf文件cat Homo_sapiens.GRCh38.84.gtf |grep -v "#" |awk -v FS='gene_biotype ' 'NF>1{print $2}'|awk -F ";" '{print $1}'|sort | uniq -c128 "3prime_overlapping_ncrna"45662 "antisense"24 "bidirectional_promoter_lncrna"213 "IG_C_gene"33 "IG_C_pseudogene"152 "IG_D_gene"76 "IG_J_gene"......
mkref
使用**cellranger mkref**构建参考基因组索引
--genome: 输出的索引目录名称
--fasta:参考基因组序列
--genes:参考基因组GTF文件
--nthreads:线程数
--memgb:内存限制数量
--ref-version:Reference version string to include with reference.?
# 构建参考基因组indexcellranger mkref --genome=testout --fasta=Homo_sapiens.GRCh38.dna.primary_assembly.fa --genes=Homo_sapiens.GRCh38.ensembl.filtered.gtf --nthreads=32#以下提示构建成功>>> Reference successfully created! <<<#构建的索引目录结构testout/├── fasta│ ├── genome.fa│ └── genome.fa.fai├── genes│ └── genes.gtf.gz├── reference.json└── star├── chrLength.txt├── chrNameLength.txt├── chrName.txt├── chrStart.txt├── exonGeTrInfo.tab├── exonInfo.tab├── geneInfo.tab├── Genome├── genomeParameters.txt├── SA├── SAindex├── sjdbInfo.txt├── sjdbList.fromGTF.out.tab├── sjdbList.out.tab└── transcriptInfo.tab#自建索引测试cellranger count --id=count37 --fastqs=./srr37/ --transcriptome=/home/liuzihao/study/scrna/refseq/testout --expect-cells=3000 --nosecondary &> log2.txt &#成功Waiting 6 seconds for UI to do final refresh.Pipestance completed successfully!2021-04-13 23:20:55 Shutting down.
多物种整合
#没有gtf可以自己根据NCBI上的序列信息自己造一个https://mp.weixin.qq.com/s/_VGFGmYBJmYm_4KLc9zamgcellranger mkref --genome=hg19 --fasta=hg19.fa --genes=hg19-filtered-ensembl.gtf \--genome=mm10 --fasta=mm10.fa --genes=mm10-filtered-ensembl.gtf
aggr
aggr选项参数
--id:输出结果目录名
--csv:样品信息文件,详细见下
当处理多个生物学样本或者一个样本存在多个重复/文库时,最好的操作就是先分别对每个文库进行单独的count定量,然后将定量结果利用cellranger aggr组合起来
#分别进行3个定量的流程(下载的SRA数据,37 38 39 分别跑)cellranger count --id=count37 --fastqs=./srr37/ --transcriptome=/home/liuzihao/study/scrna/ref/refdata-gex-GRCh38-2020-A --expect-cells=3000 --nosecondary &>log.txt &cellranger count --id=count38 --fastqs=./srr38/ --transcriptome=/home/liuzihao/study/scrna/ref/refdata-gex-GRCh38-2020-A --expect-cells=3000 --nosecondary &>log2.txt &cellranger count --id=count39 --fastqs=./srr39/ --transcriptome=/home/liuzihao/study/scrna/ref/refdata-gex-GRCh38-2020-A --expect-cells=3000 --nosecondary &>log3.txt &
构建分组信息,csv文件,包括sample_id和molecule_h5文件的位置
(base) [liuzihao@bogon fastqdata]$ cat agg.csvsample_id,molecule_h5count37,/home/liuzihao/study/scrna/rawdata/fastqdata/count37/outs/molecule_info.h5count38,/home/liuzihao/study/scrna/rawdata/fastqdata/count38/outs/molecule_info.h5count39,/home/liuzihao/study/scrna/rawdata/fastqdata/count39/outs/molecule_info.h5
合并文件
cellranger aggr --id=counts --csv=agg.csv#提示以下信息则代表合并成功Outputs:- Copy of the input aggregation CSV: /home/liuzihao/study/scrna/rawdata/fastqdata/count/outs/aggregation.csv- Aggregation metrics summary HTML: /home/liuzihao/study/scrna/rawdata/fastqdata/count/outs/web_summary.html- count:Aggregation metrics summary JSON: /home/liuzihao/study/scrna/rawdata/fastqdata/count/outs/count/summary.jsonSecondary analysis output CSV: /home/liuzihao/study/scrna/rawdata/fastqdata/count/outs/count/analysisFiltered feature-barcode matrices MEX: /home/liuzihao/study/scrna/rawdata/fastqdata/count/outs/count/filtered_feature_bc_matrixFiltered feature-barcode matrices HDF5: /home/liuzihao/study/scrna/rawdata/fastqdata/count/outs/count/filtered_feature_bc_matrix.h5Unfiltered feature-barcode matrices MEX: /home/liuzihao/study/scrna/rawdata/fastqdata/count/outs/count/raw_feature_bc_matrixUnfiltered feature-barcode matrices HDF5: /home/liuzihao/study/scrna/rawdata/fastqdata/count/outs/count/raw_feature_bc_matrix.h5Loupe Browser file: /home/liuzihao/study/scrna/rawdata/fastqdata/count/outs/count/cloupe.cloupe- vdj_t: null- vdj_b: null- V(D)J reference: nullWaiting 6 seconds for UI to do final refresh.Pipestance completed successfully!
