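A Python package for interacting with SRAdb and downloading datasets from SRA. See the official documentation.
Command-line usage
pysradb supports command-line usage. The documentation is a work in progress. See cmdline for some quick usage instructions, and quickstart for a list of instructions for each subcommand.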
$ pysradb
usage: pysradb [-h] [--version] [--citation]
{metadb,metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
...
pysradb: Query NGS metadata and data from NCBI Sequence Read Archive.
version: 0.9.0.
Citation: 10.12688/f1000research.18676.1
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
--citation how to cite
subcommands:
{metadb,metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
metadb Download SRAmetadb.sqlite
metadata Fetch metadata for SRA project (SRPnnnn)
download Download SRA project (SRPnnnn)
search Search SRA for matching text
gse-to-gsm Get GSM for a GSE
gse-to-srp Get SRP for a GSE
gsm-to-gse Get GSE for a GSM
gsm-to-srp Get SRP for a GSM
gsm-to-srr Get SRR for a GSM
gsm-to-srs Get SRS for a GSM
gsm-to-srx Get SRX for a GSM
srp-to-gse Get GSE for a SRP
srp-to-srr Get SRR for a SRP
srp-to-srs Get SRS for a SRP
srp-to-srx Get SRX for a SRP
srr-to-gsm Get GSM for a SRR
srr-to-srp Get SRP for a SRR
srr-to-srs Get SRS for a SRR
srr-to-srx Get SRX for a SRR
srs-to-gsm Get GSM for a SRS
srs-to-srx Get SRX for a SRS
srx-to-srp Get SRP for a SRX
srx-to-srr Get SRR for a SRX
srx-to-srs Get SRS for a SRX
Downloading SRAmetadb (optional)
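pysradb can use a SQLite database file containing preprocessed metadata made available by the SRAdb project. However, as of version 0.9.5, this database file is not a hard requirement for any of the operations.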
SRAmetadb can be downloaded using:
wget -c https://starbuck1.s3.amazonaws.com/sradb/SRAmetadb.sqlite.gz && gunzip SRAmetadb.sqlite.gz
Alternatively, you can download it with pysradb, which by default saves it to the current working directory:
pysradb metadb
Using pysradb
Mode: SRAmetadb or SRAweb
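The initial versions of pysradb depended entirely on the SRAmetadb.sqlite file provided by the SRAdb project; we refer to this as SRAmetadb mode. As of pysradb 0.9.5, however, the dependence on the SQLite file is optional. In the absence of the SQLite file, operations are performed using NCBI's esearch and esummary interfaces, a mode we refer to as SRAweb mode. All operations except search can be performed without downloading the SQLite file. Note: additional flags such as --desc, --detailed and --expand are currently not fully supported in SRAweb mode; support is planned for a future release. However, all the basic functionality for converting one ID to another is available in both SRAweb and SRAmetadb modes.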
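The choice between the two modes can be pictured as a check on whether the SQLite file is present. The sketch below is illustrative only: the helper name `pick_mode` and its default path are hypothetical and are not part of pysradb's API.

```python
import os


def pick_mode(sqlite_path="SRAmetadb.sqlite"):
    """Return the mode that would apply for a (hypothetical) local database path."""
    # SRAmetadb mode needs the local SQLite file; without it, the fallback
    # is SRAweb mode, which queries NCBI's esearch and esummary interfaces.
    if os.path.isfile(sqlite_path):
        return "SRAmetadb"
    return "SRAweb"


print(pick_mode("no_such_file.sqlite"))
```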
Search
Search for all projects containing "ribosome profiling":
A local database file can also be supplied to the command.
pysradb search "ribosome profiling" | head
Getting SRA metadata
pysradb metadata --db ./SRAmetadb.sqlite SRP000941 --assay --desc --expand | head
study_accession experiment_accession sample_accession run_accession library_strategy batch biomaterial_provider biomaterial_type cell_type collection_method differentiation_method differentiation_stage disease donor_age donor_ethnicity donor_health_status donor_id donor_sex line lineage medium molecule passage sample_term_id sex source_name tissue tissue_depot tissue_type
SRP000941 SRX006235 SRS004118 SRR018454 ChIP-Seq NaN cellular dynamics international cell line NaN NaN none none none NaN NaN NaN NaN NaN h1 embryonic stem cell mteser genomic dna between 30 and 50 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006236 SRS004118 SRR018456 ChIP-Seq NaN cellular dynamics international cell line NaN NaN none none none NaN NaN NaN NaN NaN h1 embryonic stem cell mteser genomic dna between 30 and 50 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006237 SRS004118 SRR018455 ChIP-Seq NaN cellular dynamics international cell line NaN NaN none none none NaN NaN NaN NaN NaN h1 embryonic stem cell mteser genomic dna between 30 and 50 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006239 SRS004213 SRR019072 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006239 SRS004213 SRR019080 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006239 SRS004213 SRR019081 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006239 SRS004213 SRR019082 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006239 SRS004213 SRR019083 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
SRP000941 SRX006239 SRS004213 SRR019084 Bisulfite-Seq #2 thomson laboratory cell line NaN NaN na embryonic stem cell none NaN NaN NaN NaN NaN h1 na tesr genomic dna 27 efo_0003042 male NaN NaN NaN NaN
Converting SRP to GSE
$ pysradb srp-to-gse SRP075720
study_accession study_alias
SRP075720 GSE81903
Downloading an entire project
pysradb makes it very easy to download datasets from SRA.
$ pysradb download --out-dir ./pysradb_downloads -p SRP063852
Downloads are organized as SRP/SRX/SRR, mirroring the hierarchy of SRA projects.
Downloading only certain samples of interest
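As a rough illustration of the SRP/SRX/SRR layout, the sketch below builds the path for a single run. The helper name `run_path` and the `.sra` suffix are assumptions made for this example; the actual file names written by pysradb may differ.

```python
from pathlib import PurePosixPath


def run_path(out_dir, srp, srx, srr):
    """Build the location of one run under an SRP/SRX/SRR directory layout."""
    # The ".sra" suffix is an assumption for illustration only.
    return PurePosixPath(out_dir) / srp / srx / f"{srr}.sra"


# Accessions taken from the SRP000941 metadata output shown in this document.
print(run_path("pysradb_downloads", "SRP000941", "SRX006235", "SRR018454"))
```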
pysradb metadata SRP000941 --assay | grep 'study\|RNA-Seq' | pysradb download
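This downloads all RNA-seq samples from the project using aspera-client, if available. Alternatively, it can also use wget.
Usage from Python
Use Case 1: Fetching the metadata table (SRA run table)
The simplest use case of pysradb is when you know the SRA project ID (SRP) in advance and simply want to fetch the metadata associated with it. This is generally reflected in the SraRunTable.txt that you get from the NCBI website. See an example SraRunTable.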
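To make the role of the `grep` step explicit: it keeps the header line (which contains "study") and every row mentioning RNA-Seq, so the piped table is still a valid metadata table. A minimal Python sketch of the same filter follows. The function name `keep_rna_seq` is hypothetical, the tab separator is an assumption, and the RNA-Seq row uses placeholder accessions rather than real ones.

```python
def keep_rna_seq(lines):
    """Keep the header line and the RNA-Seq rows of a metadata table."""
    # Same effect as: grep 'study\|RNA-Seq'
    return [line for line in lines if "study" in line or "RNA-Seq" in line]


rows = [
    "\t".join(["study_accession", "experiment_accession", "run_accession", "library_strategy"]),
    "\t".join(["SRP000941", "SRX006235", "SRR018454", "ChIP-Seq"]),
    "\t".join(["SRP000941", "SRX006239", "SRR019072", "Bisulfite-Seq"]),
    # Placeholder accessions, not real ones:
    "\t".join(["SRP000941", "SRX_EXAMPLE", "SRR_EXAMPLE", "RNA-Seq"]),
]

filtered = keep_rna_seq(rows)
for line in filtered:
    print(line)
```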
from pysradb import SRAdb
db = SRAdb('SRAmetadb.sqlite')
df = db.sra_metadata('SRP098789')
df.head()
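Use Case 2: Downloading an entire project, arranged experiment-wise
Once you have fetched the metadata and made sure this is the project you were looking for, you will want to download everything at once. NCBI follows this hierarchy: SRP => SRX => SRR. Each SRP (project) has multiple SRX (experiments), and each SRX in turn has multiple SRR (runs). We want to mimic this hierarchy in our downloads. The reason is simple: in most cases you care about the SRX most and want to "merge" your SRRs in one way or another. Having this hierarchy ensures that your downstream code can handle such cases easily, without worrying about which runs (SRR) need to be merged.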
from pysradb import SRAdb
db = SRAdb('SRAmetadb.sqlite')
df = db.sra_metadata('SRP017942')
db.download(df)
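The SRP => SRX => SRR hierarchy matters downstream because runs belonging to the same experiment are usually merged. A minimal sketch of that grouping step, using (experiment, run) pairs from the SRP000941 metadata output shown earlier; the function name `group_runs_by_experiment` is hypothetical and not part of pysradb.

```python
from collections import defaultdict

# (experiment, run) pairs taken from the SRP000941 metadata shown earlier.
pairs = [
    ("SRX006235", "SRR018454"),
    ("SRX006236", "SRR018456"),
    ("SRX006237", "SRR018455"),
    ("SRX006239", "SRR019072"),
    ("SRX006239", "SRR019080"),
    ("SRX006239", "SRR019081"),
]


def group_runs_by_experiment(pairs):
    """Map each SRX accession to the list of SRR runs that belong to it."""
    grouped = defaultdict(list)
    for srx, srr in pairs:
        grouped[srx].append(srr)
    return dict(grouped)


runs_by_srx = group_runs_by_experiment(pairs)
for srx, srrs in runs_by_srx.items():
    print(srx, srrs)
```

Each key then corresponds to one experiment directory, and its list names the runs to merge.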
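Use Case 3: Downloading a subset of experiments
Often, you only need to process a small subset of the samples in a project (SRP). Consider this project, which has data spanning four assays.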
df = db.sra_metadata('SRP000941')
print(df.library_strategy.unique())
['ChIP-Seq' 'Bisulfite-Seq' 'RNA-Seq' 'WGS' 'OTHER']
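However, you might be interested in analyzing only the RNA-seq samples and want to download just that subset. This is simple with pysradb, since the metadata can be subset just as you would subset a DataFrame in pandas.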
df_rna = df[df.library_strategy == 'RNA-Seq']
db.download(df=df_rna, out_dir='/pysradb_downloads')
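Use Case 4: Getting cell-type/treatment information from sample_attributes
Cell type/tissue information is usually hidden in the sample_attributes column, which can be expanded: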
from pysradb.filter_attrs import expand_sample_attribute_columns
df = db.sra_metadata('SRP017942')
expand_sample_attribute_columns(df).head()
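To give a feel for what expanding the attribute column involves, here is a rough sketch that splits one attribute string into separate fields. It assumes the attributes are stored as `key: value` pairs joined by `||`; both the format and the helper name `parse_sample_attribute` are assumptions for illustration, and this is not the implementation used by `expand_sample_attribute_columns`.

```python
def parse_sample_attribute(raw):
    """Split 'key: value || key: value' into a dict with snake_case keys."""
    attrs = {}
    for field in raw.split("||"):
        if ":" not in field:
            continue  # skip fragments that carry no key
        key, value = field.split(":", 1)
        attrs[key.strip().lower().replace(" ", "_")] = value.strip()
    return attrs


# Made-up string in the assumed format, not copied from a real record.
example = "source_name: embryonic stem cell || cell type: cell line || line: h1"
parsed = parse_sample_attribute(example)
print(parsed)
```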
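Use Case 5: Searching for datasets
Another common operation performed on SRA is search: plain text search.
To find all projects where "ribosome profiling" appears somewhere in the description: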
df = db.search_sra(search_str='"ribosome profiling"')
df.head()