COSMOS Summer 2006, Lab Exercise 3: Finding Genes

Today's exercise

Yesterday in lecture, we discussed in some detail an algorithm (a recipe, or set of steps, for solving a problem) for finding candidate genes in a strand of DNA. In a previous lab exercise (and in a previous lab), we explored how to find a start codon, which indicates the beginning of a possible candidate gene. The solution to that problem forms the basis for the solution to the problem of finding candidate genes. Today, I'd like you to implement the algorithm we talked about in lecture yesterday for finding candidate genes.

Pair up! (Optional)

As always, you're welcome and encouraged to pair up, though it is optional, so if you prefer to work alone, you can. If you pair up, work in the same way you have previously: one computer shared between the two of you, one person "driving," the other watching, and the keyboard changing hands at least once every fifteen minutes.

One change for this week

Last week, all of the Python programs we wrote operated under the assumption that DNA strands are represented by a sequence of lowercase letters 'a,' 'c,' 't,' and 'g.' This is a reasonable representation and there's nothing wrong with it. However, many real databases in which biologists store DNA information represent DNA strands using sequences of uppercase letters instead. Since we're gradually working toward being able to write programs that process sequences of DNA stored in this real databases, it would be a good idea for us to start writing our programs to expect uppercase letters instead from now on; this way, we'll be able to reuse our programs later this week when we start reading our input from external sources.

(Note that, while it seems obvious to us that 'a' and "A" have the same meaning, it's not at all obvious to Python. As far as Python is concerned, 'a' and 'A' are different characters, so when we compare a character to 'a' like this:

    c == 'a'

the comparison will return False if c's value is 'A'.)

For the rest of this course, we'll use uppercase letters to denote nucleotides and amino acids.

Today's problems

During the course of today's lab session, I'd like you to work on the following problems:

Write a program that finds candidate genes in a single DNA strand represented by a sequence of uppercase letters 'A,' 'C,' 'T,' and 'G.' For our purposes, we'll define a candidate gene as a sequence of codons, beginning with a start codon ('ATG') and ending with a stop codon ('TAA,' 'TAG,' or 'TGA'), that is at least 20 codons long (including the start and end codons). For each candidate gene found, your program should print "Candidate gene found," as well as the position in the sequence where the 'ATG' begins and the number of codons in the candidate gene. (Remember, when reporting the starting position, that, unlike Python, biologists count from 1 rather than 0.)
Using the program that we wrote in lecture to find the reverse complement of a DNA strand and your solution to the previous problem, write a new program that has, as input, one strand of DNA, but finds all candidate genes that appear on either the given strand or its reverse complement. It should behave similarly to the previous program, but when it finds a candidate gene, it should say either "Candidate gene found on given strand" or "Candidate gene found on reverse strand."
- Here is a link to the program we wrote in lecture yesterday to find the reverse complement of a DNA strand.
Write a program that uses a dictionary to translate a DNA strand to a sequence of amino acids. Each amino acid is to be represented as a capital letter, as shown in this table. Use the character '!' (an exclamation mark) to indicate the amino acids marked "End" in the table.

Testing your solutions

One of the challenges of writing programs is knowing whether they work when you're done. The most common way to approach that problem is to test them for various inputs and see whether the results match our expectations. Rather than providing you with test data today, work on designing some of your own test data to verify that your programs work. As examples, some things to think about when testing your candidate gene finder are:

What happens when there's only one ATG and only one stop codon in the DNA strand?
What happens when there are multiple candidate genes that do not overlap one another?
What happens when there are multiple candidate genes that overlap off-frame (e.g., "ATGCATG...", where the second ATG is within the first candidate gene but does not appear "on frame" as a codon in that candidate gene)?
What happens when there are multiple candidate genes that overlap on-frame?
What happens when there are no candidate genes at all?

The course staff is glad to help you design your test data if you're not sure how to approach it. The emphasis, in general, is not on making data up randomly; the idea is to think of data that stands a good chance at finding a mistake in your program (if there is one).

Have fun!

Enjoy! Solutions will appear on the web site within the next day or so.