9 不同 vcf 注释结果转 maf - 《肿瘤外显子数据分析指南》

ANNOVAR
GATK
VEP
SNPEFF

ANNOVAR

ANNOVAR 注释后拿到的结果，还需要做一个处理，就是在每一个 ${id}.hg38_multianno.txt文件中增加两列，为 Tumor_Sample_Barcode 和 Matched_Norm_Sample_Barcode，而最后的一列 Otherinfo 我们暂时丢弃掉（如果你需要就不要丢弃），添加上这两列信息之后再把所有的 vcf 文件合并成一个 annovar_merge.vcf。处理的方法如下：

cat config | while read id
do 
    grep -v '^Chr' ./7.annotation/annovar/${id}.hg38_multianno.txt | cut -f 1-20 | awk -v T=${id} -v N=${id:0:5}_germline '{print $0"\t"T"\t"N}'  >./7.annotation/annovar/${id}.annovar.vcf 
done
head -1 ./7.annotation/annovar/case1_biorep_A_techrep.hg38_multianno.txt| sed 's/Otherinfo/Tumor_Sample_Barcode\tMatched_Norm_Sample_Barcode/' >./7.annotation/annovar/header
cat ./7.annotation/annovar/header ./7.annotation/annovar/*annovar.vcf >./7.annotation/annovar/annovar_merge.vcf

最后可以用 R 包maftools中一个函数annovarToMaf将 ANNOVAR 注释的结果转为 maf。请注意，前面的操作我们都是用 Linux 或 shell 脚本处理，在这里我们第一次用到 R 语言，后面的分析也会将 Linux 与 R 交替使用。

rm(list = ls())
require(maftools)
options(stringsAsFactors = F)
## annovar
annovar.laml <- annovarToMaf(annovar = "./7.annotation/annovar/annovar_merge.vcf", 
                     refBuild = 'hg38',
                     tsbCol = 'Tumor_Sample_Barcode', 
                     table = 'refGene',
                     MAFobj = T)
## 统计一下case1_biorep_A_techrep样本的突变类型
table(annovar.laml@data[annovar.laml@data$Tumor_Sample_Barcode=='case1_biorep_A_techrep',]$Func.knownGene)
## exonic intergenic 
##   172          1
table(annovar.laml@data[annovar.laml@data$Tumor_Sample_Barcode=='case1_biorep_A_techrep',]$Variant_Type)
## DEL INS SNP 
##  5   0 168
table(annovar.laml@data[annovar.laml@data$Tumor_Sample_Barcode=='case1_biorep_A_techrep',]$Variant_Classification)
##   Frame_Shift_Del 
##                 4 
##   Frame_Shift_Ins 
##                 0 
##      In_Frame_Del 
##                 1 
##      In_Frame_Ins 
##                 1 
## Missense_Mutation 
##               153 
## Nonsense_Mutation 
##                14 
##  Nonstop_Mutation 
##                 0

最后拿到的 annovar.laml，就是一个 maftools 对象了，可以进行后续的画图操作，这部分我们留到下一节再讲。

GATK

GATK 的 Funcotator 虽然注释上了很多数据库信息，但是很多结果为__UNKNOWN__。注释后直接输出的 maf 文件也是有缺陷的，Tumor_Sample_Barcode 和 Matched_Norm_Sample_Barcode这两列在 maf 文件中也是 __UNKNOWN__，可能是我的参数设置有问题，但是看了帮助文档：https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.4.0/org_broadinstitute_hellbender_tools_funcotator_Funcotator.php，并没有发现相关的参数，知道的朋友可以留言。在此，我只能暂时手动添加一下了，方法如下：

## 添加Barcode
cat config | while read id
do 
    grep -v '^#' ./7.annotation/funcotator/${id}_funcotator.tmp.maf| grep -v '^Hugo_Symbol'| awk -v T=${id} -v N=${id:0:5}_germline 'BEGIN{FS="\t";OFS="\t"}{$16=T;$17=N;print $0}' >./7.annotation/funcotator/${id}_funcotator.maf 
done
## 取出一个列名
grep 'Hugo_Symbol' ./7.annotation/funcotator/case1_biorep_A_techrep_funcotator.tmp.maf >./7.annotation/funcotator/header
## 删除掉旧文件
rm ./7.annotation/funcotator/*tmp.maf
## 合并所有的样本的maf
cat ./7.annotation/funcotator/header ./7.annotation/funcotator/*maf >./7.annotation/funcotator/funcotator_merge.maf

这样每个样本就得到一个新的 maf 文件，当然也拿到了最后的合并的 funcotator_merge.maf，新的 maf 文件里面的每一条记录都增加了 Barcode 的信息，如 case1_biorep_A_techrep_funcotator.maf：

$ head -1 ./7.annotation/funcotator/case1_biorep_A_techrep_funcotator.maf 
CHD5    26038    __UNKNOWN__    hg38    chr1    6146376    6146376    +    Missense_Mutation    SNP    G    G    T    __UNKNOWN__    __UNKNOWN__    case1_biorep_A_techrep    case1_germline    __UNKNOWN__    __UNKNOWN__    __UNKNOWN__    __UNKNOWN__    __UNKNOWN__    __UNKNOWN__    __UNKNOWN__    __UNKNOWN__    __UNKNOWN____UNKNOWN__    NA    NA    __UNKNOWN__    __UNKNOWN__    __UNKNOWN__    g.chr1:6146376G>T    ENST00000262450.7    -    11    1738    c.1638C>A    c.(1636-1638)aaC>aaA    p.N546K    NM_015557.2    NP_056372    O00258    WRB_HUMAN  chromodomain    helicase    DNA    binding    protein    5    __UNKNOWN__    __UNKNOWN__    __UNKNOWN__    __UNKNOWN__    __UNKNOWN__    tail-anchored    membrane   protein    insertion    into    ER    membrane    (GO:0071816)    endoplasmicreticulum    (GO:0005783)|integral    component    of    membrane    (GO:0016021)|nucleus    (GO:0005634)    biliary_tract(2)|breast(48)|central_nervous_system(44)|kidney(92)|large_intestine(37)|lung(1)|pancreas(22)|upper_aerodigestive_tract(120)    366    __UNKNOWN__    __UNKNOWN__    __UNKNOWN__    __UNKNOWN__CATCCATGTCGTTCTTTCTTT    0.5860349127182045    __UNKNOWN__    __UNKNOWN__    __UNKNOWN__    0.058    3    97    0    75    AL117491    HGNC:16816    Approved    gene    with    protein    product    protein-coding    gene    1p36.31    2016-10-05 AF425231    ENSG00000116254    11889561,    12592387    NM_015557    1305|88    NuRcomplex|PHD    finger    proteins    CCDS57    OTTHUMG00000000952    610771    NM_015557    Q8TDI0    ENSG00000116254    uc001amb.3    A8KAP8|A8MQ44|D3DSH9|O60740    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    false    26038    false    184    1    22.56    -1.888e+00    1.000e-05    0.00    -4.514e+01    5.41

在第 9 列和第 10 列分别记录了 Variant_Classification 和 Variant_Type，我们也可以适当统计一下：

$less ./7.annotation/funcotator/case1_biorep_A_techrep_funcotator.maf | cut -f 9 | sort | uniq -c
      1 3'UTR
      4 Frame_Shift_Del
      1 In_Frame_Del
     12 Intron
    142 Missense_Mutation
     14 Nonsense_Mutation
     60 Silent
      1 Splice_Site
      1 START_CODON_SNP
$ less ./7.annotation/funcotator/case1_biorep_A_techrep_funcotator.maf| cut -f 10 | sort | uniq -c
      5 DEL
      1 DNP
    230 SNP

这里有一个 DNP，我们就顺便解释一下关于几种变异类型：

INS - The variant allele is some kind of insertion.
DEL - The variant allele is some kind of deletion.
SNP - The variant allele is a single nucleotide polymorphism.
DNP - The variant allele is a di-nucleotide polymorphism.
TNP - The variant allele is a tri-nucleotide polymorphism.
ONP - The variant allele is an oligo-nucleotide polymorphism (Synonymous with MNP).
MNP - The variant allele is a multi-nucleotide polymorphism (Synonymous with ONP).
NA - The variant allele type cannot be determined.

VEP

VEP 的注释结果拿到了 vcf 文件，也可以转成 maf 文件，不过要用到一个软件，叫 vcf2maf，安装和使用的方法见曾老师教程：把vcf文件转换为maf格式或者官网 vcf2maf：https://github.com/mskcc/vcf2maf，也很简单：

export VCF2MAF_URL=`curl -sL https://api.github.com/repos/mskcc/vcf2maf/releases | grep -m1 tarball_url | cut -d\" -f4`
curl -L -o mskcc-vcf2maf.tar.gz $VCF2MAF_URL
tar -zxf mskcc-vcf2maf.tar.gz
cd mskcc-vcf2maf-*

转换格式主要用到的是一个软件的一个 perl 脚本vcf2maf.pl，同样的我们还是写一个 shell 脚本调用 perl 程序，shell 脚本 vcf2maf.sh 如下，输入文件为 vep 注释后的 vcf，输出文件为 maf 文件：

cat config | while read id
do
    sample=./7.annotation/vep/${id}_vep.vcf
    normal=${id%%_*}_germline
    perl ~/biosoft/vcf2maf/vcf2maf.pl \
    --input-vcf  ${sample}   \
    --output-maf ./7.annotation/vep/${id}_vep.maf  \
    --ref-fasta ~/wes_cancer/data/Homo_sapiens_assembly38.fasta \
    --tumor-id  ${id}  \
    --normal-id ${normal}  \
    --ncbi-build GRCh38
done

最后得到的 maf 文件，在第 9 列和第 10 列分别记录了 Variant_Classification 和 Variant_Type，同样查看一下样本 case1_biorep_A_techrep 转换后的结果：

$ less ./7.annotation/vep/case1_biorep_A_techrep_vep.maf |cut -f 9|sort |uniq -c
      4 Frame_Shift_Del
      1 In_Frame_Del
    153 Missense_Mutation
     15 Nonsense_Mutation
     59 Silent
      3 Splice_Region
      1 Translation_Start_Site
      1 Variant_Classification
$less ./7.annotation/vep/case1_biorep_A_techrep_vep.maf | cut -f 10 | sort | uniq -c
      5 DEL
      1 DNP
    230 SNP

注释后的 maf 文件的记录和原先的 vcf 文件记录的条目是一致的，都是 236 条记录。

最后也是把所有的 maf 文件合并到一起，拿到 VEP_merge.maf

cat ./7.annotation/vep/*maf | grep -v '^#'| grep -v '^Hugo_Symbol' >./7.annotation/vep/tmp 
grep 'Hugo_Symbol' case1_biorep_A_techrep_vep.maf >./7.annotation/vep/header
cat ./7.annotation/vep/header ./7.annotation/vep/tmp >./7.annotation/vep/vep_merge.maf

SNPEFF

SNPEFF 的注释结果和 vep 的注释结果非常相似，都是在原来的 vcf 文件的 INFO 列添加了许多注释信息。但是，vcf2maf 似乎不支持 snpeff 的注释结果，无法转成 maf，搜索了很久也都没有找到有效的方法，按理来说稍稍改一下 vcf2maf.pl 的脚本就可以了，但是我没学过 perl，面对上千行的perl 脚本，无从下手，不知道各位是否已经造了轮子，如果有的话，希望可以邮件分享一下 Nickier0510@163.com，谢谢~
（注：评论区唐医生给出了 snpeff 注释后的 vcf 转为 maf 格式的解决方案）