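If you work in bioinformatics and want to learn Docker, or want to package your own data analysis workflow with Docker + Pipeline, you should not miss the tutorial that Boqiang Hu wrote in 2017, "[Docker] Analyzing High-Throughput Sequencing Data with Alibaba Cloud + Docker: RNA-Seq and ChIP-Seq". The tutorial was also published as a post on the 生信技能树 WeChat official account on 2017-03-21; look it up there if you are interested.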

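Over the past couple of days I ran a test that followed the tutorial and used the original test data shipped in the tangEpiNGSInstall repository, and I ran into a small problem.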

    $ git clone https://github.com/shenweiyan/tangEpiNGSInstall.git
    $ tree
    .
    └── tangEpiNGSInstall
        ├── Dockerfile
        ├── README.md
        ├── settings
        │   ├── run_chipseq.py
        │   ├── run_chipseq.sh
        │   ├── run_mRNA.py
        │   ├── run_mRNA.sh
        │   ├── scripts_chipseq.py
        │   └── scripts_mRNA.py
        ├── src
        │   └── run_sample.sh
        ├── test_fq
        │   ├── H3K4me3
        │   │   ├── test.1.fq.gz
        │   │   └── test.2.fq.gz
        │   ├── Input
        │   │   ├── test.1.fq.gz
        │   │   └── test.2.fq.gz
        │   └── sample.tab.xls
        └── test_fq_RNA
            ├── SampleA1
            │   ├── test.1.fastq.gz
            │   └── test.2.fastq.gz
            └── sample.tab.xls

    8 directories, 17 files

    $ mkdir -p results database_ChIP/mm10
    $ chmod 777 results database_ChIP/mm10 # avoiding Permission issue
    $ tree
    .
    ├── database_ChIP
    │   └── mm10
    ├── results
    └── tangEpiNGSInstall
        ├── Dockerfile
        ├── README.md
        ├── settings
        │   ├── run_chipseq.py
        │   ├── run_chipseq.sh
        │   ├── run_mRNA.py
        │   ├── run_mRNA.sh
        │   ├── scripts_chipseq.py
        │   └── scripts_mRNA.py
        ├── src
        │   └── run_sample.sh
        ├── test_fq
        │   ├── H3K4me3
        │   │   ├── test.1.fq.gz
        │   │   └── test.2.fq.gz
        │   ├── Input
        │   │   ├── test.1.fq.gz
        │   │   └── test.2.fq.gz
        │   └── sample.tab.xls
        └── test_fq_RNA
            ├── SampleA1
            │   ├── test.1.fastq.gz
            │   └── test.2.fastq.gz
            └── sample.tab.xls

    11 directories, 17 files

    $ docker pull hubq/tanginstall:latest
    $ docker run \
        -v /data/docker/train/tangEpiNGSInstall/test_fq:/fastq \
        -v /data/docker/train/results:/home/analyzer/project \
        -v /data/docker/train/database_ChIP/mm10:/home/analyzer/database_ChIP/mm10 \
        -v /data/docker/train/tangEpiNGSInstall/settings/:/settings/ \
        --env ref=mm10 --env type=ChIP hubq/tanginstall:latest
    INFO @ 2021-06-10 03:29:48,154: Begin checking input files.
    INFO @ 2021-06-10 03:29:48,154: Input database files were all put in /home/analyzer/database_ChIP/mm10.
    INFO @ 2021-06-10 03:29:48,154: Input fasta /home/analyzer/database_ChIP/mm10/mm10.fa not find. Now download from UCSC
    INFO @ 2021-06-10 03:45:01,604: /home/analyzer/database_ChIP/mm10/mm10.fa generation done!
    INFO @ 2021-06-10 03:45:01,605: Fasta were not indexed.
    INFO @ 2021-06-10 03:45:02,105: Now build index using bwa.
    INFO @ 2021-06-10 03:48:20,683: Building index done!
    INFO @ 2021-06-10 03:48:20,683: Genome GTF file were not found.
    INFO @ 2021-06-10 03:48:21,184: Now download refGene file from UCSC.
    INFO @ 2021-06-10 05:03:38,768: Generate refGene done!
    INFO @ 2021-06-10 05:03:38,768: RepeatMask file were not found.
    INFO @ 2021-06-10 05:03:39,269: Now download rmsk file from UCSC.
    INFO @ 2021-06-10 05:06:03,592: Generate RepeatMask done!
    Traceback (most recent call last):
      File "/home/analyzer/module/ChIP/run_chipseq.py", line 148, in <module>
        main()
      File "/home/analyzer/module/ChIP/run_chipseq.py", line 126, in main
        samp_peak.get_idr_stat()
      File "/home/analyzer/module/ChIP/frame/module02_call_peaks.py", line 244, in get_idr_stat
        mod_Stat.IDR_Stat()
      File "/home/analyzer/module/ChIP/frame/module00_StatInfo.py", line 113, in IDR_Stat
        f_idr_out = open(file_idr_out,"w")
    IOError: [Errno 2] No such file or directory: '/home/analyzer/project/ChIP_test/StatInfo/IDR_result./home/analyzer/project/ChIP_test/sample.tab.xls'
    cp: cannot stat `03.2.Peak_mrg/*/*_treat_minus_control.sort.norm.bw': No such file or directory
    cp: cannot stat `03.3.Peak_idr/*/*.conservative.regionPeak.gz*': No such file or directory
    cp: cannot stat `StatInfo/*': No such file or directory

docker-error.jpg
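One possible reading of the path in the `IOError` is that an output prefix (`.../StatInfo/IDR_result.`) was joined directly to the absolute path of `sample.tab.xls`, which produces a nested path whose parent directories do not exist. The sketch below only illustrates that reading in shell. The variable names are hypothetical, this is not the pipeline's actual code, and taking the `basename` is just one plausible way to end up with a valid file name.

```shell
#!/usr/bin/env bash
# Illustration only: variable names are hypothetical, not taken from the pipeline.
stat_dir=/home/analyzer/project/ChIP_test/StatInfo
sample_tab=/home/analyzer/project/ChIP_test/sample.tab.xls

# Joining the prefix with the full path reproduces the path shown in the IOError.
broken="${stat_dir}/IDR_result.${sample_tab}"

# Using only the file name keeps the output file directly inside StatInfo/.
fixed="${stat_dir}/IDR_result.$(basename "$sample_tab")"

echo "broken: $broken"
echo "fixed:  $fixed"
```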

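For the sake of learning and tinkering, I made a few small adjustments on top of the hubq/tanginstall:latest image to address this problem, repackaged the result as a new image named shenweiyan/tanginstall:latest, and pushed it to Docker Hub. It is offered as a starting point for others to study and build on.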
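Assuming the repackaged image accepts the same mounts and environment variables as the original `docker run` command (an assumption, not something confirmed here), a run against the test data could be assembled as follows. `BASE` and `build_run_cmd` are hypothetical names; the function only prints the command so it can be reviewed before it is executed.

```shell
#!/usr/bin/env bash
# Sketch: print the docker run command for the repackaged image.
# Assumption: same mounts and --env variables as the hubq/tanginstall run.
BASE=${BASE:-/data/docker/train}   # host working directory used in the original command

# Usage: build_run_cmd IMAGE REF TYPE
build_run_cmd() {
    printf 'docker run'
    # The format string is reused for each host:container pair.
    printf ' -v %s:%s' \
        "$BASE/tangEpiNGSInstall/test_fq" /fastq \
        "$BASE/results" /home/analyzer/project \
        "$BASE/database_ChIP/$2" "/home/analyzer/database_ChIP/$2" \
        "$BASE/tangEpiNGSInstall/settings/" /settings/
    printf ' --env ref=%s --env type=%s %s\n' "$2" "$3" "$1"
}

build_run_cmd shenweiyan/tanginstall:latest mm10 ChIP
```

Pull the image first with `docker pull shenweiyan/tanginstall:latest`, then paste the printed command to run it.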

docker-chip.jpg

A few details about this image:

  1. The image is fairly large, about 7.37GB in total, so pulling it may be slow.

    docker-images.jpg

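  2. If no reference genome (hg19/hg38 or mm9/mm10) is present, the image first downloads it at run time, then unpacks and merges the per-chromosome FASTA files, and builds the index.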

     - db01.DownloadRef.sh

       ```
       $ cat db01.DownloadRef.sh
       ref=$1
       dir_database=/home/analyzer/database_ChIP/$ref
       dir_path=/home/analyzer/module/ChIP

       cd $dir_database

       wget http://hgdownload.soe.ucsc.edu/goldenPath/${ref}/bigZips/chromFa.tar.gz

       tar -zxvf $dir_database/chromFa.tar.gz

       for i in {1..22} X Y M
       do
           cat $dir_database/chr$i.fa
       done >$dir_database/${ref}.fa && rm $dir_database/chr*fa
       ```

     - db02.RefIndex.sh

       ```
       $ cat db02.RefIndex.sh
       ref=$1
       dir_database=/home/analyzer/database_ChIP/$ref
       bwa_exe=/software/install_packages/bwa-0.7.5a/bwa
       samtools_exe=/software/install_packages/samtools-0.1.18/samtools
       div_bins_exe=/home/analyzer/module/ChIP/bin/div_bins/bed_read

       $samtools_exe faidx $dir_database/${ref}.fa

       $bwa_exe index $dir_database/${ref}.fa

       $dix_bins_exe -b 100 $dir_database/${ref}.fa.fai $dir_database/columns.100.bed
       $dix_bins_exe -b 1000 $dir_database/${ref}.fa.fai $dir_database/columns.1kb.bed

       cut -f 1-2 $dir_database/${ref}.fa.fai >$dir_database/${ref}.fa.len
       ```

     - db03.RefGene.sh

       ```
       $ cat db03.RefGene.sh
       ref=$1
       dir_database=/home/analyzer/database_ChIP/$ref
       bedtools_exe=/software/install_packages/bedtools2/bin/bedtools
       ucsc_dir=/software/install_packages/UCSC
       bin=/home/analyzer/module/ChIP/bin
       dir_path=/home/analyzer/module/ChIP

       cd $dir_database
       wget http://hgdownload.soe.ucsc.edu/goldenPath/${ref}/database/refGene.txt.gz

       # remove chromosome fragments(unassembled).
       for i in {1..22} X Y M
       do
           zcat $dir_database/refGene.txt.gz | grep -w chr$i
       done >$dir_database/tmp
       mv $dir_database/tmp $dir_database/refGene.txt

       # refGene.bed
       cat $dir_database/refGene.txt |\
       awk '{
           tag="noncoding";
           if($4~/^NM/){tag="protein_coding"};
           OFS="\t";
           print $3,$5,$6,$2,$4,$10,$11,tag,$13
       }' /dev/stdin |\
       python $bin/s03_genePred2bed.py /dev/stdin |\
       $bedtools_exe sort -i /dev/stdin >$dir_database/refGene.bed &&\

       # region.Intragenic.bed
       # For novo lncRNA detection
       $bin/find_ExonIntronIntergenic/find_ExonIntronIntergenic \
           $dir_database/refGene.bed \
           $dir_database/${ref}.fa.fai >$dir_database/pos.bed &&\

       grep -v "Intergenic" $dir_database/pos.bed |\
       awk '{OFS=" ";print $1,$2,$3,"Intragenic"}' /dev/stdin \
           >$dir_database/region.Intragenic.bed &&\

       # refGene.gtf
       # For mapping
       zcat $dir_database/refGene.txt.gz |\
       cut -f 2- |\
       $ucsc_dir/genePredToGtf file stdin /dev/stdout |\
       grep -w exon |\
       $bedtools_exe sort -i /dev/stdin >$dir_database/refGene.gtf &&\
       cat $dir_path/database/ERCC.gtf >>$dir_database/refGene.gtf
       ```

     - db04.rmsk.sh

       ```
       $ cat db04.rmsk.sh
       ref=$1
       dir_database=/home/analyzer/database_ChIP/$ref
       bedtools_exe=/software/install_packages/bedtools2/bin/bedtools
       ucsc_dir=/software/install_packages/UCSC
       bin=/home/analyzer/module/ChIP/bin
       dir_path=/home/analyzer/module/ChIP

       cd $dir_database
       wget http://hgdownload.soe.ucsc.edu/goldenPath/${ref}/database/rmsk.txt.gz

       zcat $dir_database/rmsk.txt.gz |\
       awk '{
           OFS="\t";
           print $6,$7,$8,$2,".",".",".","("$9")",$10,$11,$12 "/" $13,$14,$15,$16,$17
       }' /dev/stdin |\
       tail -n +2 /dev/stdin >$dir_database/chrom.bed

       for i in {1..22} X Y M
       do
           grep -w chr$i $dir_database/chrom.bed
       done >$dir_database/tmp
       mv $dir_database/tmp $dir_database/chrom.bed

       $bedtools_exe sort -i $dir_database/chrom.bed >$dir_database/chrom.sort.bed
       ```

  3. To save download time, it is best to prepare ${ref}.fa in advance. If you do not have it, you can at least download the following files beforehand.

     ```
     # db01.DownloadRef.sh
     wget http://hgdownload.soe.ucsc.edu/goldenPath/${ref}/bigZips/chromFa.tar.gz

     # db03.RefGene.sh
     wget http://hgdownload.soe.ucsc.edu/goldenPath/${ref}/database/refGene.txt.gz

     # db04.rmsk.sh
     wget http://hgdownload.soe.ucsc.edu/goldenPath/${ref}/database/rmsk.txt.gz
     ```

  4. bwa index (db02.RefIndex.sh) is very time-consuming: on my server with 4 cores and 16 GB of memory it took about 2.5 hours.
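The three UCSC downloads listed above (chromFa.tar.gz, refGene.txt.gz, rmsk.txt.gz) can be fetched ahead of time into the host directory that is later mounted at `/home/analyzer/database_ChIP/${ref}`. The following is a minimal sketch under that assumption. `ref_urls` and `prefetch_ref` are hypothetical helper names, and whether the image reuses pre-downloaded archives instead of fetching them again is not confirmed here.

```shell
#!/usr/bin/env bash
# Sketch: pre-download the UCSC files used by db01/db03/db04.
# ref_urls and prefetch_ref are hypothetical helper names.
UCSC=http://hgdownload.soe.ucsc.edu/goldenPath

# Print the three download URLs for a reference name such as mm10 or hg38.
ref_urls() {
    echo "$UCSC/$1/bigZips/chromFa.tar.gz"
    echo "$UCSC/$1/database/refGene.txt.gz"
    echo "$UCSC/$1/database/rmsk.txt.gz"
}

# Usage: prefetch_ref REF TARGET_DIR
# wget -c resumes a partial file and does nothing for a complete one;
# -P saves into TARGET_DIR under the original file name.
prefetch_ref() {
    mkdir -p "$2" || return 1
    ref_urls "$1" | while read -r url; do
        wget -c -P "$2" "$url"
    done
}

# Example (not run automatically):
#   prefetch_ref mm10 database_ChIP/mm10
```

Keeping the URL list in its own function makes it easy to print and check the links for a given reference before starting a long download.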