COSMOS Summer 2006
Lab Exercise #6: Finding Similar Sequences in Real Data

Today's exercise

Recently, we've learned how to connect our Python programs to real-world data, by writing a function that reads DNA sequences from a FASTA file and returns a list of those sequences. Today, we'll connect our sequence similarity algorithm from class to real data, running it on all pairs of DNA sequences in a FASTA file to find the pair of sequences that has the highest similarity score.

Alternatively, if you'd prefer to spend this time working on your project, you may choose to do that instead. With the deadline approaching, there's no time for games and movies today, so please stay on task.

Useful functions

Below are the functions we've written previously in lecture to read sequences from a FASTA file and to calculate sequence similarity for a pair of DNA sequences.

You may also find a use for other programs or functions you've written during previous lab sessions.

Today's program

This morning, we'll just be building one program, which does the following:

• First, it asks the user to supply the name of a FASTA file.
• Next, it reads the DNA sequences out of that file, storing them in a list. (Note that our readSequencesFromFastaFile does this whole job, though you'll have to call it.)
• Then, it runs our sequence similarity algorithm on every pair of sequences. (See below for some help on how to set that up.)
• Finally, print the sequence numbers of the pair with the highest similarity score. For the purposes of reporting the result, count the sequences from 1, so that the first sequence is number 1, the second is number 2, and so on.

Try to write the program step-by-step, as we've been doing in class, testing the steps as you go. For testing purposes, it would be smart to create your own FASTA file with a few short sequences in it, so that you'll be able to work out the intended result by hand and verify that the program is working at each step.

Looping through all pairs of sequences

A common pattern in programming is to iterate over all pairs of elements in some collection. In our case, we want to iterate through the pairs of DNA sequences in a list of DNA sequences. Recall how we used two "nested" loops (one inside the other) to iterate through the cells of our matrix in the sequence similarity algorithm; this allowed us to reach every cell, even though the matrix was two-dimensional. Looping over pairs is similar, except that you don't want to handle pairs more than once each. The following is an example of printing out each pair of numbers in a list of numbers:

numbers = [10, 30, 20, 50, 40]

for i in range(0, len(numbers)):
for j in range(i+1, len(numbers)):
print numbers[i], numbers[j]

Type this code into the Python interpreter and notice how all the pairs are shown, but no pair of numbers appears twice. (For example, the pair "10 20" is shown, but "20 10" is not.)

You'll need to use a similar pattern when you loop through the pairs of sequences, calling the sequence similarity algorithm on each pair instead of printing them out.

Finished early?

If you're done with the assignment, spend the remaining time working on your projects.

• Originally written by Alex Thornton, Summer 2006.