COSMOS Summer 2006
Lab Exercise #3: Finding Genes

Today's exercise

Yesterday in lecture, we discussed in some detail an algorithm (a recipe, or set of steps, for solving a problem) for finding candidate genes in a strand of DNA. In a previous lab exercise, we explored how to find a start codon, which indicates the beginning of a candidate gene. The solution to that problem forms the basis for the solution to the problem of finding candidate genes. Today, I'd like you to implement the algorithm we talked about in lecture yesterday for finding candidate genes.

Pair up! (Optional)

As always, you're welcome and encouraged to pair up, though it is optional, so if you prefer to work alone, you can. If you pair up, work in the same way you have previously: one computer shared between the two of you, one person "driving," the other watching, and the keyboard changing hands at least once every fifteen minutes.

One change for this week

Last week, all of the Python programs we wrote operated under the assumption that DNA strands are represented by a sequence of lowercase letters: 'a', 'c', 't', and 'g'. This is a reasonable representation and there's nothing wrong with it. However, many real databases in which biologists store DNA information represent DNA strands using sequences of uppercase letters instead. Since we're gradually working toward being able to write programs that process sequences of DNA stored in these real databases, it would be a good idea for us to write our programs to expect uppercase letters from now on; this way, we'll be able to reuse our programs later this week when we start reading our input from external sources.

(Note that, while it seems obvious to us that 'a' and 'A' have the same meaning, it's not at all obvious to Python. As far as Python is concerned, 'a' and 'A' are different characters, so when we compare a character to 'a' like this:

    c == 'a'

the comparison will return False if c's value is 'A'.)
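
Here is a short sketch of that behavior, along with one way to convert a lowercase strand to uppercase. The variable names and the example strand are made up for illustration, and converting with upper() is just one possible approach, not something today's problems require:

```python
# Python treats lowercase and uppercase letters as different characters.
c = 'A'
print(c == 'a')    # prints False
print(c == 'A')    # prints True

# If a strand arrives in lowercase, str.upper() returns an uppercase copy.
# (The original string is left unchanged.)
strand = 'atgcct'  # hypothetical example strand
upper_strand = strand.upper()
print(upper_strand)  # prints ATGCCT
```
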

For the rest of this course, we'll use uppercase letters to denote nucleotides and amino acids.

Today's problems

During the course of today's lab session, I'd like you to work on the following problems:

Testing your solutions

One of the challenges of writing programs is knowing whether they work when you're done. The most common way to approach that problem is to test them for various inputs and see whether the results match our expectations. Rather than providing you with test data today, work on designing some of your own test data to verify that your programs work. As examples, some things to think about when testing your candidate gene finder are:

The course staff is glad to help you design your test data if you're not sure how to approach it. The emphasis, in general, is not on making data up randomly; the idea is to think of data that stands a good chance of finding a mistake in your program (if there is one).
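
As a rough illustration of that idea, the sketch below tests a small start codon finder rather than a full candidate gene finder. The function find_start_codon is a hypothetical stand-in written for this example (it is not the official solution to any lab problem), and it assumes the start codon is 'ATG' and that a result of -1 means "not found." Each test case is chosen because it probes a spot where mistakes tend to hide:

```python
def find_start_codon(strand):
    """Return the index of the first 'ATG' in strand, or -1 if there is none.

    Hypothetical stand-in for illustration; it looks at every position,
    not just positions that are multiples of three.
    """
    return strand.find('ATG')


# Each test case targets a situation where a mistake is likely to hide.
assert find_start_codon('') == -1         # empty strand
assert find_start_codon('AT') == -1       # too short to hold a codon
assert find_start_codon('CCGTA') == -1    # no start codon at all
assert find_start_codon('ATGCCC') == 0    # start codon at the very beginning
assert find_start_codon('CCCATG') == 3    # start codon at the very end
assert find_start_codon('atgccc') == -1   # lowercase input is not matched

print('All tests passed')
```

If any assert fails, Python stops with an AssertionError that points at the failing line, which tells you which kind of input your program mishandles.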

Have fun!

Enjoy! Solutions will appear on the web site within the next day or so.