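Preface

I wrote several algorithms of my own to try to reproduce the functionality of bedtools. Julia itself is fast enough, but none of my implementations came close to bedtools in efficiency, so in the end I fell back on an existing library, GenomicFeatures.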

Code

    using CSV
    using DataFrames
    using Tables
    using GenomicFeatures

    # Read a 4- or 5-column BED file into a sorted IntervalCollection.
    # Note: BED coordinates are 0-based and half-open, while GenomicFeatures
    # intervals are 1-based and end-inclusive. Start and End are passed through
    # unchanged here, so intervals that merely touch are reported as overlapping.
    function fs2NameTuple(fs::String)
        query = DataFrame(CSV.File(fs, delim="\t", header=false))
        if ncol(query) == 5
            rename!(query, :Column1 => :Chromosome, :Column2 => :Start, :Column3 => :End, :Column4 => :Name, :Column5 => :Score)
        elseif ncol(query) == 4
            rename!(query, :Column1 => :Chromosome, :Column2 => :Start, :Column3 => :End, :Column4 => :Name)
        else
            # Fail here instead of printing and returning nothing,
            # which would only surface later as a MethodError in getInterval.
            error("please confirm your bed files: expected 4 or 5 columns")
        end
        # One NamedTuple per row.
        rows = Tables.rowtable(query)
        # Strand is unknown ('?'); Name is kept as metadata; `true` sorts the intervals.
        return IntervalCollection([Interval(qy.Chromosome, qy.Start, qy.End, '?', qy.Name) for qy in rows], true)
    end

    # Print every overlapping pair (A interval, B interval) as one tab-separated line.
    function getInterval(A::IntervalCollection, B::IntervalCollection)
        for i in eachoverlap(A, B)
            println(i[1].seqname, "\t", i[1].first, "\t", i[1].last, "\t", i[1].metadata, "\t", i[2].seqname, "\t", i[2].first, "\t", i[2].last, "\t", i[2].metadata)
        end
    end

    getInterval(fs2NameTuple("test1.bed"), fs2NameTuple("test2.bed"))

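How it works

First, the BED file is read into memory with CSV. The DataFrame is then converted into a vector of NamedTuples, and finally into an IntervalCollection.
The built-in eachoverlap function then returns all intersections almost instantly; the performance is already on par with C.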
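
To see the overlap step in isolation, here is a minimal sketch that skips the file reading and builds two small collections by hand. The intervals and their names (a1, b1, and so on) are made up for illustration, and the snippet assumes the same GenomicFeatures API as the code above (Interval, IntervalCollection, eachoverlap, and the metadata field); other releases of the package may name these differently.

```julia
using GenomicFeatures

# Hypothetical intervals, not taken from test1.bed or test2.bed.
a = IntervalCollection([
    Interval("chr1", 100, 200, '?', "a1"),
    Interval("chr1", 500, 600, '?', "a2"),
    Interval("chr2", 50, 80, '?', "a3"),
], true)  # `true` sorts the intervals on construction

b = IntervalCollection([
    Interval("chr1", 150, 250, '?', "b1"),
    Interval("chr2", 10, 40, '?', "b2"),
    Interval("chr2", 70, 90, '?', "b3"),
], true)

# eachoverlap yields one (interval_from_a, interval_from_b) tuple per overlap.
hits = Tuple{String,String}[]
for (x, y) in eachoverlap(a, b)
    push!(hits, (x.metadata, y.metadata))
end

println(hits)
```

With this data, only a1/b1 on chr1 and a3/b3 on chr2 share positions, so those are the two pairs collected. Because both collections are sorted, the overlaps can be found in a single coordinated pass rather than by comparing every interval against every other, which is likely why the step feels instantaneous.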