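When using tools such as kallisto that quantify reads against a cDNA reference, the rows of the resulting gene expression matrix are labeled with transcript IDs, so these IDs need to be converted to gene IDs.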
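To make the goal concrete, here is a minimal sketch of what the conversion is used for: once a transcript-to-gene table exists, transcript-level rows can be summed into gene-level rows. The matrix, the IDs, and the names `tx_counts` and `tr2g_toy` are made up for illustration and are not produced by kallisto or BUSpaRse.

```r
# Toy transcript-level matrix: rows are transcripts, columns are cells/samples.
tx_counts <- matrix(
  c(5, 0,
    3, 2,
    1, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ENST0001", "ENST0002", "ENST0003"), c("cell1", "cell2")))

# Hypothetical transcript-to-gene table (illustrative IDs only).
tr2g_toy <- data.frame(
  transcript = c("ENST0001", "ENST0002", "ENST0003"),
  gene       = c("ENSG0001", "ENSG0001", "ENSG0002"),
  stringsAsFactors = FALSE)

# Look up the gene for every row, then sum the rows that share a gene.
gene_of_row <- tr2g_toy$gene[match(rownames(tx_counts), tr2g_toy$transcript)]
gene_counts <- rowsum(tx_counts, group = gene_of_row)
gene_counts
```

Summing is only one possible way to collapse transcripts; whether it is appropriate depends on the units of the matrix (counts versus TPM, for example).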

Load the R package

    # devtools::install_github("BUStools/BUSpaRse")
    library(BUSpaRse)

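Get the data

Ensembl can be hard to reach from within China. The biomaRt package that BUSpaRse relies on internally supports mirror sites, so switching to a mirror is worth trying when time allows.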
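A minimal sketch of the mirror idea, using biomaRt directly: `biomaRt::useEnsembl()` takes a `mirror` argument, and the helper below tries several mirrors in order. The function name `connect_with_fallback` and the mirror order are assumptions made for this example, and which mirrors are actually online changes over time.

```r
# Try each Ensembl mirror in turn and return the first connection that works.
# `connect` is a function argument so the fallback logic can be exercised
# without network access; the default calls biomaRt::useEnsembl().
connect_with_fallback <- function(
    mirrors = c("asia", "useast", "www"),
    connect = function(m) biomaRt::useEnsembl(
      biomart = "genes", dataset = "hsapiens_gene_ensembl", mirror = m)) {
  for (m in mirrors) {
    # A failed connection raises an error; turn it into NULL and move on.
    mart <- tryCatch(connect(m), error = function(e) NULL)
    if (!is.null(mart)) {
      message("Connected via mirror: ", m)
      return(mart)
    }
  }
  stop("Could not reach any of the mirrors: ", paste(mirrors, collapse = ", "))
}

# Example call (needs the biomaRt package and network access):
# human_mart <- connect_with_fallback()
```

Mirrors generally serve only the current Ensembl release, so this approach may not combine with a pinned archive version such as release 100.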

    # Build the transcript-to-gene table for human and mouse from Ensembl release 100
    tr2g <- transcript2gene(
      c("Homo sapiens", "Mus musculus"),
      type = "vertebrate",
      ensembl_version = 100,
      kallisto_out_path = "./")


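Getting the mapping with biomaRt

The method above also queries the Ensembl API through the biomaRt package, so using biomaRt directly may be the better option.

Get the list of genes in the genome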

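    # Shell: write one "gene_id <TAB> length" line per gene record in the GTF.
    # GTF coordinates are 1-based and inclusive, so the length is end - start + 1.
    awk -F'\t' '$3 == "gene" { split($9, a, ";"); gsub(/gene_id |"/, "", a[1]); print a[1] "\t" $5 - $4 + 1 }' Homo_sapiens.GRCh38.101.gtf | sort -u > Homo_sapiens.GRCh38.101.genelength.tsv
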
    library(biomaRt)
    library(curl)

    # The TSV has no header row, so name the columns explicitly
    genelist <- read.table("Homo_sapiens.GRCh38.101.genelength.tsv",
                           header = FALSE, sep = "\t",
                           col.names = c("Geneid", "Length"))

    # Connect to the human gene dataset on the main Ensembl site
    human_mart <- useMart(host = "https://www.ensembl.org",
                          biomart = "ENSEMBL_MART_ENSEMBL",
                          dataset = "hsapiens_gene_ensembl")

    # Fetch gene and transcript identifiers for every gene in the list
    human_gene_all <- getBM(attributes = c("ensembl_gene_id",
                                           "entrezgene_id",
                                           "external_gene_name",
                                           "ensembl_transcript_id",
                                           "ensembl_transcript_id_version",
                                           "transcript_biotype",
                                           "description"),
                            filters = "ensembl_gene_id",
                            values = genelist$Geneid,
                            mart = human_mart)
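The query returns both `ensembl_transcript_id` and `ensembl_transcript_id_version`, which matters because transcript IDs taken from an Ensembl cDNA FASTA usually carry a version suffix. The sketch below shows one possible lookup: try the versioned ID first and fall back to the unversioned one. The helper name `map_tx_to_gene` and the toy table are hypothetical; a real `getBM()` result has more columns and real IDs.

```r
# Toy stand-in for the getBM() result (illustrative IDs only).
mapping <- data.frame(
  ensembl_gene_id               = c("ENSG0001", "ENSG0001", "ENSG0002"),
  ensembl_transcript_id         = c("ENST0001", "ENST0002", "ENST0003"),
  ensembl_transcript_id_version = c("ENST0001.2", "ENST0002.1", "ENST0003.5"),
  stringsAsFactors = FALSE)

# Map transcript IDs to gene IDs: exact versioned match first, then fall back
# to the unversioned ID, since the version in the matrix may differ from the
# Ensembl release that was queried. Unknown transcripts map to NA.
map_tx_to_gene <- function(tx_ids, mapping) {
  idx <- match(tx_ids, mapping$ensembl_transcript_id_version)
  unversioned <- sub("\\.[0-9]+$", "", tx_ids)
  fallback <- match(unversioned, mapping$ensembl_transcript_id)
  idx[is.na(idx)] <- fallback[is.na(idx)]
  mapping$ensembl_gene_id[idx]
}

genes <- map_tx_to_gene(
  c("ENST0001.2", "ENST0003.4", "ENST0002", "ENST9999.1"), mapping)
genes
```

Falling back to the unversioned ID is a convenience, not a guarantee: a transcript whose version changed between releases may also have changed its annotation, so rows resolved this way may deserve a second look.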

Reference

  1. transcript2gene usage
  2. BUSpaRse GitHub repository