Assignment 4. Multiple Sequence Alignment
You can search for ClustalW or use the program at http://www.ebi.ac.uk/clustalw/. Do not change any of the default settings. ClustalW being unavailable when you wanted to use it is not an acceptable excuse, so do this assignment early.
  1. Using the same viral sequences as in the last assignment, run the ClustalW program to produce a multiple alignment. In your *.doc file, include the block of the alignment of the 10 viruses that ends at position 300. This is only part of the entire ClustalW output.
  2. Each column of the alignment is marked with a star, a colon, a period, or a blank. Select the columns corresponding to the first occurrence of each of these four symbols and form the Probability Weight Matrix (PWM) for them. The full PWM for these 4 columns would have 20 rows, one per amino acid, but you may simplify your answer to include only those amino acids whose probabilities vary, plus one extra row for all the rest.
  3. What is the entropy of each column? Entropy was defined in ics171 as the sum over i of -p_i*log(p_i), where the log is taken base 2 and p_i is the probability of entry i, in this case of amino acid i. Recall that we define 0*log(0) as 0, since the limit of e*log(e) as e goes to 0 is 0.
  4. What is the match score of the sequence NNNN with this PWM? Recall that the match score is the sum of the corresponding probabilities, one from each column.
  5. What sequence of four amino acids would yield the highest score and what is that score?
  6. Is the ClustalW program guaranteed to find the optimal multiple alignment?
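
The calculations in items 2 through 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four columns below are made-up toy data with 10 residues each and no gap characters, not columns from any real ClustalW output, and the function names (column_probs, entropy, match_score, best_sequence) are hypothetical helpers chosen for this sketch. The match score follows the definition given in item 4, a plain sum of probabilities.

```python
from collections import Counter
from math import log2

# Hypothetical toy columns, one string per alignment column (10 sequences).
# These are invented for illustration and are NOT real ClustalW output.
columns = ["NNNNNNNNNN", "KKKKKRRRRR", "SSSSSSSTTA", "ACDEFGHIKL"]


def column_probs(col):
    """Return {amino acid: probability} for one alignment column."""
    counts = Counter(col)
    n = len(col)
    return {aa: c / n for aa, c in counts.items()}


def entropy(probs):
    """Sum of -p*log2(p); amino acids with p == 0 contribute 0 by convention."""
    return sum(-p * log2(p) for p in probs.values() if p > 0)


def match_score(seq, pwm):
    """Sum, over positions, of the probability of seq's residue in that column."""
    return sum(col.get(aa, 0.0) for aa, col in zip(seq, pwm))


def best_sequence(pwm):
    """Pick the most probable residue in each column (ties: first one seen)."""
    best = "".join(max(col, key=col.get) for col in pwm)
    return best, match_score(best, pwm)


# The PWM here is a list of per-column dictionaries; amino acids that never
# occur in a column are simply absent and are treated as probability 0.
pwm = [column_probs(c) for c in columns]

for i, col in enumerate(pwm, start=1):
    print(f"column {i}: probs={col} entropy={entropy(col):.4f}")

print("match score of NNNN:", match_score("NNNN", pwm))
print("best sequence and score:", best_sequence(pwm))
```

Storing only the amino acids that actually occur in a column mirrors the simplification allowed in item 2, where all remaining amino acids share a single row. With real alignment columns you would also have to decide how to treat gap characters, which this sketch does not address.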