Assignment 6. Almost Markov Modeling
In this coding assignment you will use 3rd order statistic to model the putative genes "ManyYeastGenes.txt" deposited in Masterhit.
Using the model you will then predict whether some sequences are genes are not. The sequences are only pieces of DNA and do not contain the usual stop and stop sequences.
You should turn in the code (remember to concatenating all *.java files into one *.java file or zip them) and the output of the program in a separate *.doc file.
- First output: the model

To build the model you will read in sequences in fasta format. From this data you should compute the third-order statistics, i.e. frequency of each of the 64 codons, or triple of nucleotides. In building this model you are given the correct reading frame. Your program should output these frequencies in a readable way, for example:
aaa 16
aac 21 etc.
- Second output: the predictions of the Model on the strings used to build the model.

To score a sequence versus the model compute the correlation of the codon distributions. More specifically let x be the codon frequencies of the model and y codon frequencies of the string. Each of these is a length 64 vector. The correlation of x and y is x dot y/ squareRoot((x dot x) * (y dot y)). Dot is the inner product of two vectors.
Did you notice anything unusually in the output?
- Third output:
On masterhit there will five sequences, Unknowns.txt in fasta format. Give the scores relative to your model for each sequence. Since the reading frame is not known for the sequence, you need to compute the score for each reading frame. You need not compute the scores for the reverse complement, so only three reading frames need to be consider. From these scores, identify which sequences are likely to be genes and which are not.