- Coding Exercise
- LeviTech - DNA Motif Finder
- Enter the Organism name (e.g., Ecoli): Ecoli
Attempt to Load Protein Table (list of genes) for organism: Ecoli
Attempt to open filename: Ecoli.ptt
Number of lines (excl. header) in Ecoli.ptt is: 4289
Protein table Ecoli.ptt successfully opened and loaded.
Attempt to Load DNA for organism: Ecoli
Attempt to open filename: Ecoli.fna
Number of lines of DNA (excl. header) in Ecoli.fna is: 54425
DNA Ecoli.fna successfully opened and loaded.
What motif would you like to find >> CCCGCACCTGACAGTG
CCCGCACCTGACAGTG found at bp 280
In between genes:
UPSTREAM Name: thrL Start: 190 End: 255 PID: 1786182
and
DOWNSTREAM Name: thrA Start: 337 End: 2799 PID: 1786183 - Do you want to search for another motif? (y/n): n
- Enter the Organism name (e.g., Ecoli): Ecoli
- Do you want to try another organism? (y/n): n
Coding Exercise
Create a program, which, given a valid sequence of rolls for one line of American Ten-Pin Bowling, produces the total score for the game.
We can briefly summarize the scoring for this form of bowling:
- Each game, or “line” of bowling, includes ten turns, or “frames” for the bowler. (每个游戏包括10局或者场景)
- In each frame, the bowler gets up to two tries to knock down all the pins. (每一局,可以有两次尝试击倒所有瓶子的机会)
- If in two tries, he fails to knock them all down, his score for that frame is the total number of pins knocked down in his two tries. (在两次尝试中,如果没有全部击倒,那么他的得分就是两次一共击倒的瓶子个数)
- If in two tries he knocks them all down, this is called a “spare” and his score for the frame is ten plus the number of pins knocked down on his next throw (in his next turn). (如果两次尝试击倒了所有的瓶子,这种情况被叫做”spare” 并且他的得分为10 + 下一局中击倒瓶子的个数)
- If on his first try in the frame he knocks down all the pins, this is called a “strike”. His turn is over, and his score for the frame is ten plus the simple total of the pins knocked down in his next two rolls. (如果他在第一次尝试的时候击倒了所有的瓶子,这被叫做”strike”,轮到他的回合结束了,他的分局是10分加上他接下来两次投球中击倒的木瓶的总数。)
- If he gets a spare or strike in the last (tenth) frame, the bowler gets to throw one or two more bonus balls, respectively. These bonus throws are taken as part of the same turn. If the bonus throws knock down all the pins, the process does not repeat: the bonus throws are only used to calculate the score of the final frame. (在最后一句的时候如果他得了一个spare或者strike,投球手可以再投一个或者两个加成球,这些加值投掷被视为同一回合的一部分。如果奖励掷下了所有的木瓶,这个过程不会重复:奖励掷出只用于计算最后一格的分数。)
- The game score is the total of all frame scores.(游戏最后的分数是所有局的总分数)
Invalid input should be handled appropriately. (无效的输入应该有适当的处理)
Notes:
When scoring “X” indicates a strike, “/” indicates a spare, “-” indicates a miss (成绩X 表示一个strike “/“表示一个spare, -表示没有击倒)
- X X X X X X X X X X X X (12 rolls: 12 strikes) = 10 frames * 30 points = 300
- 9- 9- 9- 9- 9- 9- 9- 9- 9- 9- (20 rolls: 10 pairs of 9 and miss) = 10 frames * 9 points = 90
- 5/ 5/ 5/ 5/ 5/ 5/ 5/ 5/ 5/ 5/5 (21 rolls: 10 pairs of 5 and spare, with a final 5) = 10 frames * 15 points = 150
Rules for submission: (提交规则)
- Your submission must run without errors and be compiled and built by us. (您的提交必须运行没有错误,并由我们编译和建立。)
- Ensure your test coverage is sufficient and validates your program against sample input (确保您的测试覆盖率是充分的,并根据示例输入验证您的程序)
- You are not permitted to use any external libraries to solve this problem. You may use external libraries in your tests only, such as a mocking framework and related components. (您不允许使用任何外部库来解决此问题。您只能在测试中使用外部库,如mock框架和相关组件)
- Use an appropriate build tool – Maven, Gradle etc (使用合适的构建工具——Maven, Gradle等等)
- We are looking at the way you design and write your code using sound principles and conventions. (我们正在研究如何使用合理的原则和惯例来设计和编写代码。)
- Clean code is especially important to the way we build software(干净的代码对于我们构建软件的方式尤为重要)
LeviTech - DNA Motif Finder
This exercise requires you to implement a set of classes that provide an interface for scientists who wish to search for motifs within a genome.Using the class interface to IOrganism and IGene, your (main) program will prompt the user for an organism, open the DNA and PTT files for that
organism, continually prompt the user for their “favorite motifs”, each time searching the DNA for the user’s “favourite motif” and reporting the
initial occurrence of that motif (if any) as well as the closest “upstream” and “downstream” gene.(这个联系需要你你创建一个类,这个类提供一个接口提供给寻找基因图形的科学家, 使用接口类 IOrganism 和 IGene,你的主程序将会提供给用户一个有机体参数(organism),打开DNA和PTT文件为了识别这个有机体,每次搜索DNA给用户最想要的片段 以及与它最近的上下游的基因片段)
The exercise will test your knowledge of basic OOP technques such as modelling and encapsulation, string processing, and dynamic memory management.(这个练习是测试你的基本的面向对象编程的技能,例如建模和封装,线程,动态存储管理的知识)
As you are now aware, the DNA for many genomes exists in (text) files, for example, here are the first few lines of the genome for E.coli (the K12 type of E.coli) from Ecoli.fna:(正如你意识到的,例如下面的大肠杆菌的基因序列,很多DNA基因组存在在文件当中)
gi|16127994|ref|NC_000913.1| Escherichia coli K12, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
Once adequately stored in a data structure, you can search for motifs (words) of DNA in much the same manner that you might search for words in a story, play, or poem. DNA is double-stranded. One strand is called the “direct strand” (+) and the other strand is called the “complementary strand” (-). Imagine that the two strands of a genome are split apart as shown below. Because both strands are “complements” of each other, only the DNA in the direct(+) strand is actually stored in the .fna file. Thus, the nucleotides (letters of DNA) that you see in Ecoli.fna file is the DNA on the direct strand only. Imagine that the direct (+) strand of DNA is stretched into a single sequence. Now imagine that this single sequence of nucleotides is packed 70 per line (followed by a newline) into a file. Thus, the DNA for a genome is stored in a .fna file (e.g., Ecoli.fna) with a starting header line beginning with the character “>” followed by 70 nucleotides per line are stored in FASTA format. See the diagram below.(一旦能够再数据结构中有效的存储,你可以以基本相似的方式搜索到这个DNA对应的所有基因片段,就像你可以再故事,戏剧或者诗里面搜索文字一样。DNA是双螺旋结构,一个链被称为”direct strand”(+)另一个链称之为 “complementary strand”(-)。想象一个基因序列的两条链被分开就像下面展示的一样,因为两条链互相之间是互补的,只有DNA的direct(+)链实际上存储在基因序列文件里面的, 因此,你刚才看到的大肠杆菌的DNA文件里面的核苷酸只是DNA正向链上的,假设一下这个direct(+)链被演唱至一个单独的序列当中,现在,假设这个核苷酸序列每70个一行(后跟一个换行符)被打包到一个文件中,因此,一个组织序列的DNA被打包到一i个.fna文件里面用一个”>”作为行开头)
In addition to a .fna file that contains all the DNA, databases of genomes also provide other files. One of those files is called the “protein table”; this file ends with a .ptt file extension. The protein table (or gene table; in this lingo, “protein” = “gene”) is a file that lists all the information about the genes (remember, gene = proteins) that are found in a genome. What is not immediately obvious is that not all of a genome is made up of
genes. In fact, in the human genome, only about 10-15% of the DNA makes up the genes; the remaining 85-90% of the DNA is “between the genes” or intergenic DNA. In the diagram below, the sequence in bold are genes; the other sequence is “between the genes”.(除了包含所有DNA的。fna文件外,基因组数据库还提供其他文件。其中一个文件叫做“蛋白质表”;该文件以.ptt文件扩展名结尾。蛋白质表(或基因表)在这个术语中,“蛋白质”=“基因”)是一个文件,它列出了在基因组中发现的关于基因(记住,基因=蛋白质)的所有信息。现在还不清楚的是,并不是所有的基因组都是由基因组成的。事实上,在人类基因组中,只有10-15%的DNA组成了基因;剩下的85-90%的DNA是“基因间”或基因间DNA。在下面的图表中,加粗的是基因序列;另一个序列是“基因之间”。)
Each line in the protein table for a genome (e.g., Ecoli.ptt) tells us exactly where in the genome (where in the Ecoli.fna file) a particular gene starts, ends, which strand of the DNA double-helix where this gene was found, the protein length of the gene (that is, once it is converted from DNA to a protein), the protein identification number (or pid), the gene name, etc. The last portion of a line gives a short description of what we think that gene produces for a product, e.g., we believe the gene named “thrB” produces the particular protein product “homoserine kinase”. The diagram below shows the first three lines of Ecoli.ptt; thus, each line of information represents the information for one gene. Notice that the very first gene (thrL) starts at position (or base pair, bp) 190 and ends at base pair position) 255. Also note that since nucleotides (bases) are one letter each, each nucleotide takes up two bits in memory. Thus a position of 190 in the file means that the start of this gene can be found 190 bits from the start of the genome (not counting the header line of course).(一个基因组的蛋白质表中的每一行(例如,Ecoli.ptt)都告诉我们基因组的确切位置(Ecoli的确切位置)。(fna文件)一个特定的基因开始,结束,DNA双螺旋的哪一条链,这个基因在哪里被发现,基因的蛋白质长度(即,一旦它被从DNA转换成蛋白质),蛋白质鉴定编号(或pid),基因名称,等等。文章的最后一部分简要描述了我们认为基因会产生的一种产品,例如,我们认为名为“thrB”的基因会产生一种特殊的蛋白质产品“同丝氨酸激酶”。下面的图表显示了Ecoli.ptt的前三条线;因此,每一行信息代表一个基因的信息。注意第一个基因(thrL)开始于碱基对位置(bp) 190,结束于碱基对位置(255)。还要注意,因为核苷酸(碱基)每个是一个字母,每个核苷酸在内存中占用两个比特。因此,在文件中190的位置意味着该基因的起始位置可以从基因组的起始位置找到190位(当然不包括头行)。)
Escherichia coli K12, complete genome. - 1..3809716
4140 proteins
Location Strand Length PID Gene Synonym Code COG Product
190..255 + 21 1786182 thrL b0001 - - thr operon leader peptide
337..2799 + 820 1786183 thrA b0002 - - Bifunctional aspartokinase/homoserine dehydrogenas
e 1
2801..3733 + 310 1786184 thrB b0003 - - homoserine kinase
stdin – Your program will prompt the user for the name of an organism until a valid organism is found (.fna and .ppt file are found and opened
and loaded successfully), then continually prompt the user for a motif (DNA word) to search for.
organism_name.fna
organism_name.ptt
the entire Ecoli genome (Ecoli.fna) and entire protein table (Ecoli.ptt) can be downloaded from the following links:
INPUT
Text files
Complete Genome
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_K_12_substrMG1655_uid225/U00096.fna
Protein Table
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_K_12_substrMG1655_uid225/U00096.ptt
stdout - A sample run is shown below:
==========================================================================
Enter the Organism name (e.g., Ecoli): Ecoli
Attempt to Load Protein Table (list of genes) for organism: Ecoli
Attempt to open filename: Ecoli.ptt
Number of lines (excl. header) in Ecoli.ptt is: 4289
Protein table Ecoli.ptt successfully opened and loaded.
Attempt to Load DNA for organism: Ecoli
Attempt to open filename: Ecoli.fna
Number of lines of DNA (excl. header) in Ecoli.fna is: 54425
DNA Ecoli.fna successfully opened and loaded.
What motif would you like to find >> CCCGCACCTGACAGTG
CCCGCACCTGACAGTG found at bp 280
In between genes:
UPSTREAM Name: thrL Start: 190 End: 255 PID: 1786182
and
DOWNSTREAM Name: thrA Start: 337 End: 2799 PID: 1786183
Do you want to search for another motif? (y/n): n
Do you want to try another organism? (y/n): n
Note that in the sample run, the motif CCCGCACCTGACAGTG was found to start between the first two genes in the Ecoli protein table. Since the
motif CCCGCACCTGACAGTG starts at location 280 and the initial gene (thrL starts at: 190 and ends at 255) and the second gene (thrA starts at
337 and ends at 2799): 255 < 280 < 337.
Clearly, you will need to store the genes (protein table) in a bitmap and search that bitmap to determine the relative locations of the start of the
motif and the nearby genes. If a motif location starts within a gene, report:
motif found at bp xxx
WITHIN GENE Name: thrL Start: 190 End: 255 PID: 1786182
stderr - Your program should print the any errors on opening files, allocating memory on the heap, or other problems to standard error.
- Ask questions early.
OUTPUT
OUTPUT
Notes - So why are we giving you all this “starter code”? Most importantly, it is critical that you see a clean class interface. True, you might not have
written the interface in this way, however, rarely in the real world do you get to design things the way you want. This exercise includes some
not so subtle lessons in good software engineering. - Submit your answer to hr@digicompass.com
interface IGene {
long getStartBP(void);
long getEndBP(void);
string getStrand(void);
long getProteinLength(void);
string getPID(void);
string getName(void);
string getProduct(void);
void setStartBP(const long sbp);
void setEndBP(const long ebp);
void setStrand(const string s);
void setProteinLength(const long l);
void setPID(const string p);
void setName(const string n);
void setProduct(const string p);
};
interface IOrganism {
// ——————- organism info ————————————————————————-
string getOrganismName(void);
void setOrganismName(string s);
// ——————- protein(gene) info ——————————————————————
string getPttFilename(void);
long getNgenes(void);
Cgene getPtt(void);
void setPttFilename(string f);
void setNgenes(long n);
void setPtt(Cgene* p);
bool loadProteinTable(string organismName);
void printProteinTable2stdout(void);
void printProteinTable2stdout(long n);
// ——————- DNA info ———————————————————-
string getDNAFilename(void);
long getNnucleotides(void);
string getHeaderLine(void);
string getDNA(void);
void setDNAFilename(string f);
void setNnucleotides(long n);
void setHeaderLine(string h);
void setDNA(string d);
bool loadDNA(string organismName);
void headDNA2stdout(long n);
void bsearch();
};
