pysradb is a Python package for interacting with SRAdb and downloading datasets from SRA. See the official documentation.

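Command-line usage

pysradb supports command-line usage. The documentation is a work in progress. See cmdline for some quick usage instructions, and the quickstart for a description of each subcommand.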

  $ pysradb
  usage: pysradb [-h] [--version] [--citation]
                 {metadb,metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
                 ...

  pysradb: Query NGS metadata and data from NCBI Sequence Read Archive.
  version: 0.9.0.
  Citation: 10.12688/f1000research.18676.1

  optional arguments:
    -h, --help    show this help message and exit
    --version     show program's version number and exit
    --citation    how to cite

  subcommands:
    {metadb,metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
      metadb        Download SRAmetadb.sqlite
      metadata      Fetch metadata for SRA project (SRPnnnn)
      download      Download SRA project (SRPnnnn)
      search        Search SRA for matching text
      gse-to-gsm    Get GSM for a GSE
      gse-to-srp    Get SRP for a GSE
      gsm-to-gse    Get GSE for a GSM
      gsm-to-srp    Get SRP for a GSM
      gsm-to-srr    Get SRR for a GSM
      gsm-to-srs    Get SRS for a GSM
      gsm-to-srx    Get SRX for a GSM
      srp-to-gse    Get GSE for a SRP
      srp-to-srr    Get SRR for a SRP
      srp-to-srs    Get SRS for a SRP
      srp-to-srx    Get SRX for a SRP
      srr-to-gsm    Get GSM for a SRR
      srr-to-srp    Get SRP for a SRR
      srr-to-srs    Get SRS for a SRR
      srr-to-srx    Get SRX for a SRR
      srs-to-gsm    Get GSM for a SRS
      srs-to-srx    Get SRX for a SRS
      srx-to-srp    Get SRP for a SRX
      srx-to-srr    Get SRR for a SRX
      srx-to-srs    Get SRS for a SRX
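
The conversion subcommands above all follow a "source-to-target" naming pattern. As a rough illustration, the sketch below builds the command line for a conversion from an accession and a target type. The helper conversion_command is hypothetical (it is not part of pysradb), and it assumes the accession type can be read from the first three letters of the ID.

```python
# Conversion subcommand names copied from the help output above.
CONVERSIONS = {
    "gse-to-gsm", "gse-to-srp", "gsm-to-gse", "gsm-to-srp", "gsm-to-srr",
    "gsm-to-srs", "gsm-to-srx", "srp-to-gse", "srp-to-srr", "srp-to-srs",
    "srp-to-srx", "srr-to-gsm", "srr-to-srp", "srr-to-srs", "srr-to-srx",
    "srs-to-gsm", "srs-to-srx", "srx-to-srp", "srx-to-srr", "srx-to-srs",
}


def conversion_command(accession, target):
    """Return the argv list for converting `accession` to `target` IDs.

    Hypothetical helper: assumes the accession type is its first three letters.
    """
    source = accession[:3].lower()
    subcommand = f"{source}-to-{target.lower()}"
    if subcommand not in CONVERSIONS:
        raise ValueError(f"pysradb has no '{subcommand}' subcommand")
    return ["pysradb", subcommand, accession]


print(conversion_command("SRP075720", "gse"))
# ['pysradb', 'srp-to-gse', 'SRP075720']
```

The returned list could be handed to subprocess.run. Not every pair is available: the help output lists no srs-to-srp subcommand, for example, so the helper raises ValueError for that combination.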

Downloading SRAmetadb (optional)

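pysradb can use a SQLite database file that contains preprocessed metadata provided by the SRAdb project. As of version 0.9.5, however, this database file is not a hard requirement for any operation.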
SRAmetadb can be downloaded using:

  wget -c https://starbuck1.s3.amazonaws.com/sradb/SRAmetadb.sqlite.gz && gunzip SRAmetadb.sqlite.gz

Alternatively, you can download it with pysradb, which saves it to the current working directory by default:

  pysradb metadb
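
On systems without gunzip, the decompression step of the wget command can be approximated with the Python standard library. This is a minimal sketch, not something pysradb provides; the function name gunzip is chosen here only for illustration.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(gz_path):
    """Decompress `gz_path` next to itself and return the output path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # drops the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so memory use stays small
    return out_path
```

Calling gunzip("SRAmetadb.sqlite.gz") would write SRAmetadb.sqlite in the same directory. Unlike the gunzip command, this sketch leaves the compressed file in place.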

Using pysradb

Mode: SRAmetadb or SRAweb

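Early versions of pysradb relied entirely on the SRAmetadb.sqlite file provided by the SRAdb project; we call this the SRAmetadb mode. As of pysradb 0.9.5, the dependency on the SQLite file is optional. When the SQLite file is absent, operations are performed using NCBI's esearch and esummary interfaces; we call this the SRAweb mode. All operations except search can be performed without downloading the SQLite file. Note: SRAweb mode does not yet fully support additional flags such as --desc, --detailed, and --expand; support is planned for a future release. However, all basic functionality for converting one ID to another is available in both SRAweb and SRAmetadb modes.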
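
The choice between the two modes comes down to whether a local SQLite file is present. The sketch below illustrates that decision only; choose_mode is a hypothetical helper, and nothing here reflects how pysradb itself selects a mode internally.

```python
from pathlib import Path


def choose_mode(sqlite_path=None):
    """Return 'SRAmetadb' if a local SQLite file exists, else 'SRAweb'."""
    if sqlite_path is not None and Path(sqlite_path).is_file():
        return "SRAmetadb"
    return "SRAweb"


print(choose_mode())  # no path given, so this prints: SRAweb
```

A script could use such a check to decide whether to pass a local database file or fall back to the web interfaces.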

Search

Search for all projects that mention "ribosome profiling" (a local database can also be specified on the command line):

  pysradb search "ribosome profiling" | head

Getting SRA metadata

  pysradb metadata --db ./SRAmetadb.sqlite SRP000941 --assay --desc --expand | head
  study_accession experiment_accession sample_accession run_accession library_strategy batch biomaterial_provider biomaterial_type cell_type collection_method differentiation_method differentiation_stage disease donor_age donor_ethnicity donor_health_status donor_id donor_sex line lineage medium molecule passage sample_term_id sex source_name tissue tissue_depot tissue_type
  SRP000941 SRX006235 SRS004118 SRR018454 ChIP-Seq NaN cellular dynamics international cell line NaN NaN none none none NaN NaN NaN NaN NaN h1 embryonic stem cell mteser genomic dna between 30 and 50 efo_0003042 male NaN NaN NaN NaN
  SRP000941 SRX006236 SRS004118 SRR018456 ChIP-Seq NaN cellular dynamics international cell line NaN NaN none none none NaN NaN NaN NaN NaN h1 embryonic stem cell mteser genomic dna between 30 and 50 efo_0003042 male NaN NaN NaN NaN
  SRP000941 SRX006237 SRS004118 SRR018455 ChIP-Seq NaN cellular dynamics international cell line NaN NaN none none none NaN NaN NaN NaN NaN h1 embryonic stem cell mteser genomic dna between 30 and 50 efo_0003042 male NaN NaN NaN NaN
  SRP000941 SRX006239 SRS004213 SRR019072 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
  SRP000941 SRX006239 SRS004213 SRR019080 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
  SRP000941 SRX006239 SRS004213 SRR019081 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
  SRP000941 SRX006239 SRS004213 SRR019082 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
  SRP000941 SRX006239 SRS004213 SRR019083 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
  SRP000941 SRX006239 SRS004213 SRR019084 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN

Converting SRP to GSE

  $ pysradb srp-to-gse SRP075720
  study_accession study_alias
  SRP075720 GSE81903

Note: see the official documentation for more conversions.

Downloading an entire project

pysradb makes it easy to download datasets from SRA.

  $ pysradb download --out-dir ./pysradb_downloads -p SRP063852

Downloads are organized by SRP/SRX/SRR, mirroring the hierarchy of SRA projects.

Downloading only certain samples of interest
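
To make the SRP/SRX/SRR layout concrete, the sketch below computes the directory that would hold the runs of one experiment. The helper run_directory is hypothetical, and the names of the files pysradb writes inside that directory are not specified here.

```python
from pathlib import Path


def run_directory(out_dir, srp, srx):
    """Directory that would hold the runs (SRR) of one experiment (SRX)."""
    return Path(out_dir) / srp / srx


# Accessions taken from the metadata example earlier in this document.
print(run_directory("pysradb_downloads", "SRP000941", "SRX006239").as_posix())
# pysradb_downloads/SRP000941/SRX006239
```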

  pysradb metadata SRP000941 --assay | grep 'study\|RNA-Seq' | pysradb download

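This downloads all RNA-seq samples from the project using aspera-client, if available. Alternatively, it can use wget.

Using pysradb from Python

Use case 1: Fetch the metadata table (SRA run table)

The simplest use case for pysradb is when you already know the SRA project ID (SRP) and just want to fetch the metadata associated with it. This is generally what the SraRunTable.txt file you get from the NCBI website contains. See the example SraRunTable.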
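
The grep step in the pipeline keeps the header line (which contains "study") and every row that mentions RNA-Seq. The same filter can be sketched in Python as follows; the function name and the sample rows are illustrative, and the tab-separated layout is an assumption about the command's output.

```python
import re

# Same alternation as the grep pattern 'study\|RNA-Seq'.
PATTERN = re.compile(r"study|RNA-Seq")


def keep_header_and_rnaseq(lines):
    """Yield only the lines that the grep pattern would keep."""
    for line in lines:
        if PATTERN.search(line):
            yield line


sample = [
    "study_accession\texperiment_accession\tlibrary_strategy",
    "SRP000941\tSRX006235\tChIP-Seq",
    "SRP000941\tSRX_PLACEHOLDER\tRNA-Seq",  # placeholder accession, not a real ID
]
for line in keep_header_and_rnaseq(sample):
    print(line)
```

As with grep, the match is on the whole line, so a data row that happens to contain the word "study" in any column would also be kept.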

  from pysradb import SRAdb

  # Open the local SRAmetadb SQLite file (SRAmetadb mode).
  db = SRAdb('SRAmetadb.sqlite')
  # Fetch the metadata table for the project.
  df = db.sra_metadata('SRP098789')
  df.head()

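Use case 2: Download an entire project, arranged experiment-wise

Once you have fetched the metadata and confirmed that this is the project you are looking for, you will want to download everything at once. NCBI follows the hierarchy SRP => SRX => SRR. Each SRP (project) has multiple SRX (experiments), and each SRX in turn has multiple SRR (runs). We want to mimic this hierarchy in our downloads. The reason is simple: in most cases you care most about the SRX and want to "merge" its SRRs in one way or another. Keeping this hierarchy ensures that your downstream code can handle such cases easily, without having to work out which runs (SRR) need to be merged.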

  from pysradb import SRAdb

  db = SRAdb('SRAmetadb.sqlite')
  # Fetch the project's metadata, then download every run it lists.
  df = db.sra_metadata('SRP017942')
  db.download(df)
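
Deciding which runs to merge amounts to grouping SRR accessions by their SRX. The sketch below does this for plain dictionaries; runs_by_experiment is a hypothetical helper, and the column names are those shown in the metadata output earlier.

```python
from collections import defaultdict


def runs_by_experiment(rows):
    """Map each experiment (SRX) to the list of its runs (SRR)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["experiment_accession"]].append(row["run_accession"])
    return dict(grouped)


# Rows copied from the SRP000941 metadata output shown earlier.
rows = [
    {"experiment_accession": "SRX006235", "run_accession": "SRR018454"},
    {"experiment_accession": "SRX006239", "run_accession": "SRR019072"},
    {"experiment_accession": "SRX006239", "run_accession": "SRR019080"},
]
print(runs_by_experiment(rows))
# {'SRX006235': ['SRR018454'], 'SRX006239': ['SRR019072', 'SRR019080']}
```

With a pandas DataFrame, the equivalent would likely be df.groupby("experiment_accession")["run_accession"].apply(list).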

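Use case 3: Download a subset of experiments

Often you only need to process a small subset of the samples in a project (SRP). Consider project SRP000941, whose data span four assays.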

  df = db.sra_metadata('SRP000941')
  # List the distinct library strategies present in the project.
  print(df.library_strategy.unique())
  # ['ChIP-Seq' 'Bisulfite-Seq' 'RNA-Seq' 'WGS' 'OTHER']

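However, you might be interested only in analyzing the RNA-seq samples and want to download just that subset. This is simple with pysradb, because the metadata can be subset just like any other data in pandas.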

  # Keep only the RNA-Seq rows, then download just those runs.
  df_rna = df[df.library_strategy == 'RNA-Seq']
  db.download(df=df_rna, out_dir='./pysradb_downloads')

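Use case 4: Get cell type/treatment information from sample attributes

Cell type and tissue information is usually hidden in the sample_attributes column, which can be expanded: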

  from pysradb.filter_attrs import expand_sample_attribute_columns

  df = db.sra_metadata('SRP017942')
  # Split the packed sample attributes into one column per attribute.
  expand_sample_attribute_columns(df).head()
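
To give a feel for what expanding the attributes involves, here is a minimal sketch that splits one packed attribute string into key/value pairs. It assumes the attributes are stored as "key: value" pairs separated by "||"; that format, the function name, and the sample string are assumptions for illustration, and this is not the implementation of expand_sample_attribute_columns.

```python
def parse_sample_attribute(text):
    """Split 'key: value || key: value' into a dict (assumed format)."""
    attributes = {}
    for part in text.split("||"):
        key, sep, value = part.partition(":")
        if sep:  # skip fragments that lack a 'key: value' shape
            attributes[key.strip()] = value.strip()
    return attributes


print(parse_sample_attribute("cell line: h1 || molecule: genomic dna"))
# {'cell line': 'h1', 'molecule': 'genomic dna'}
```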

Use case 5: Search for datasets

Another common operation on SRA is a plain-text search.

To find all projects whose description mentions ribosome profiling:

  df = db.search_sra(search_str='"ribosome profiling"')
  df.head()
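
As a rough picture of what a plain-text search over a SQLite table looks like, the sketch below runs a substring match against a toy in-memory table. The table name, columns, and rows are made up for illustration; this is neither the SRAmetadb schema nor the query that search_sra runs.

```python
import sqlite3


def search_descriptions(conn, phrase):
    """Return accessions whose description contains `phrase`."""
    query = (
        "SELECT accession FROM studies "
        "WHERE description LIKE ? ORDER BY accession"
    )
    return [row[0] for row in conn.execute(query, (f"%{phrase}%",))]


# Toy in-memory table with made-up rows; not the SRAmetadb schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE studies (accession TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO studies VALUES (?, ?)",
    [
        ("STUDY_A", "Ribosome profiling of yeast"),
        ("STUDY_B", "Whole genome sequencing"),
    ],
)
print(search_descriptions(conn, "ribosome profiling"))  # ['STUDY_A']
```

SQLite's LIKE is case-insensitive for ASCII letters by default, which is why the lowercase phrase matches. Note that % and _ inside the phrase act as wildcards in this sketch.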