| Introduction |
This programming assignment is designed to ensure that you know how to write
functions and scripts (using functions) that manipulate lists of data.
You will also continue gaining experience using the standard control structures
in Python: blocks, ifs, for loops, while loops,
break, try/except, return, and raise statements.
In the first part of the assignment, you will write a variety of generally useful list-processing functions in the listlib.py module (which are used in the next two parts); in the second part you will a small script (and some more functions) in the secretary.py module, which solves a problem in operations research; in the third part you will write a small script (and some more functions) in the dnamaker.py script, which uses shot-gun sequencing to assemble DNA fragments into a strand. Import whatever modules/functions you think appropriate, using whatever form of import you think appropriate (try to use both forms appropriately). Use short (but not overly short) well-chosen names. For indefinite loops, feel free to code directly with while boolean-expresssion: loops, or continue writing while True: loops with if/break statements in their blocks; always strive to write the simplest loops. You will download the program5 project folder, unzip it, move it to your workspace, create a project from it, and write the three required modules in this project folder. Submit each module separately in Checkmate. Only one student should submit all parts of the the assignment, but both student's names should appear in the comments at the top of each submitted .py file. It should look something like # Romeo Montague, Lab 1 # Juliet Capulet, Lab 1 # We certify that we worked cooperatively on this programming # assignment, according to the rules for pair programmingPlease turn in each script as you finish it, so that I can accurately assess the progress of the class as a whole during this assignment. Print this document and carefully read it, marking any parts that contain important detailed information that you find (for review before you turn in the files). Reread the section on Time Management from Programming Assignment 0 before starting this assignment. |
| A Module of General List-processing Functions |
Write a module named listlib.py that mirrors the library modules in
the courselib folder: it contains the following functions, each of
which might be useful in a variety of scripts, and all of which will be used
somewhere in this assignment to get an important job done.
Note that the listlib.py module itself contains the
count_predicate and rank functions already written in the class
notes, and some code to test the functions you must write.
Test/debug each function immediately after you define it.
Note that we use the terminolgy "a value satisfies a predicate" to mean that
the predicate returns True when passed that value as an argument.
TestingThe listlib.py module itself includes the standard tests. Recall, the code in the botom if is executed if we run this module as a script (not if it is imported by another module). You can certainly add more tests here if you wish. |
| Secretary Hiring Simulation |
Write a script that defines various functions that simulate different aspects
of "The Secretary Hiring Problem", defined as follows.
Suppose that we can interview up to 100 people to hire a secretary (that many have applied for the job). After interviewing each person, we must decide either to hire them or remove them from consideration for the job. What is a good strategy such that we will end up hiring a good secretary? Some books cast this problem in terms of choosing someone to marry: once you say no, they are never coming back. The first question to ask is, "What is the desired outcome?" One possibility would be to maximize the percentage of times the best person (highest ranked) is hired; another would be to maximize the percentage of times one of the best 5 people (highest 5 ranked) is hired; or generally, to maximize the percentage of time one of the best B people (highest B ranked) are hired. Given such a desired outcome, the next questions are how to achieve that outcome the highest percentage of the time and what will that percentage be? The strategy that we will use is to pre-interview some number (N) of candidates, not hiring any of them no matter how good they are. Then we continue interviewing more candidates, but now in earnest (start real hiring), hiring the first one who is ranked better than the best of the first N candidates that we pre-interviewed: if no one exceeds the ranking of the first N candidates, we will have to hire the last person we interview, no matter what their rank. So, the first N candidates just give us a sampling/feeling for how well qualified the pool of candidates is. So the key question is, "How big should N be?" If N is very low, then we are likely to see FEW high-ranked candidates, so the rank of the one that we ultimately hire (who must be better ranked than the first N) will likely also be low. But likewise, if N is very large, we are likely to see ALL the top-ranked candidates before making real hiring starts, so the rank of the candidate we hire will also be low (since we will have seen the top-ranked candidate, we will hire the last person applying for the job). So, we need to chose an N high enough so that we see some high-ranked candidates, but not so high that we see too many (so some are still left to hire). So this is a problem of picking an optimal N, that maximizes the probability of the outcome we are looking for. Many optimization problems can be stated and solved in a similar manner. Define your functions/script in the secretary.py module.
Write lambdas for the is_legal parameter for the prompts: the
number of candidate must be >= 10; the number of candidates to
pre-interview must be >=1 and < the number of candidates entered above
it (and that number, here 100, should appear in the prompt text).
TestingTest each function on small amounts of data as you write it; then write the script and run it, calling (and therefore testing) these same functions with more complex data. My initial script code was 7 lines, with another 6 lines to handle the iteration over different pre-interview values. |
| DNA Assembly |
Suppose that we are trying to sequence (determine the bases in) a large DNA
strand (typically containing tens of thousands of bases up to bilions in
the human genome).
Although we can directly sequence small strands (hundreds of bases), we cannot
directly sequence bigger strands.
But there is a "divide and conquer" (computationally intense) process, called
shotgun sequencing, that allows us to use the power of the computer
to recover a large DNA strand by matching/combining its smaller constituent
strands (called fragments).
In shotgun sequencing, first we clone (duplicate) the strand. Then, we break the resulting clones into smaller fragments, each containing hundreds of bases. Breaking is accomplished via restriction enzymes (which cut DNA strands at certain combinations of bases), thermal dissociation (heating), or mechanical agitation (shaking) or combinations of all three. Using such methods, each DNA strand will break at different (and random) locations. Because these resulting fragments are so small, we can sequence them directly. Once we have sequenced them all, we enter each fragment's sequence into a large file that our program can read and process. Each line in the file is a string, with each DNA base represented by one of the characters, 'a', 'c', 'g', or 't'. The job of our program is to first read all these fragments (which contain lots of redundant information from the clones) and them reassemble them into the original, DNA strand. Our program starts by reading these fragment sequences into a list (so each element in the list is a string). Next it repeated searches for overlapping pairs of fragments (discussed in detail below), removes each individual fragment, combines the pair into a larger fragment, and places the larger fragment back into the list. This combining process continues until the program finds no more fragments that overlap. With a bit of further processing (see the details below) we try to reduce all remaining fragments into one fragment, which is actually the original strand.
Here is an example of a small DNA sequence and some sample fragments broken at
random locations from four clones: a-d.
I have labelled fragments of each close for easier reference: e.g., a0-a2 is
the entire original sequence.
Notice that we can re-assemble the original sequence from these clones as
follows (the actual algorithm is explained below via functions):
d0 overlaps a1, and a1 overlaps b2 as follows.
Real shotgun sequencing performs exactly these operations, but on much messier data than we will use: our data is computer generated; our programs can assume perfect breaking with no loss or corruption of information. This assumption does not hold in real-world experiments, so the real-world assembly programs must be more sophisticated (beyond the capacity of 7th week introductory programming students :). Nevertheless, we will get a tremendous amount of insight into the problem by writing a program that operates on simplfied data. Define your functions/script in the dnamaker.py module.
nter file to process: trivialdna.txt Enter minimum overlap: 2 Tracing on?[True]: assembling with 3 fragments ['aacc', 'ccgg', 'ggtt'] Removing upper = 'aacc' and lower = 'ccgg' Adding 'aaccgg' assembling with 2 fragments ['ggtt', 'aaccgg'] Removing upper = 'aaccgg' and lower = 'ggtt' Adding 'aaccggtt' assembling with 1 fragments ['aaccggtt'] Assembled with 1 strand(s) remaining aaccggtt Answer written into file: assembled-trivialdna.txt TestingTest this script on the supplied data files. Note that each as an answer file:e.g., trivialdna.txt and trivialdnaanswer.txt.
|