H23 Homeworks: Typicallyl these homeworks have a theory portion and a design or coding portion.

Warning: Only the homework for the next assigment is guaranteed to be correct and complete. Look at the assignment early and ask questions if you do not understand what you are supposed to do.

NOTE: Due Time: Coding assignments are due (i.e. need to be deposited) by 10pm on tuesday of the week that the homework is due. Work that should be handed in is due at the beginning of class on Wednesday of the week that homework is due. Late homeworks will be marked down by 20% for each day that it is late.

Special Homework 0: Due wednesday by 5pm.

1. Tell me about any program that you have written purely for fun, i.e. not for any class assignment. If none, say so.

2. Tell me about a program that you would like to write.

Homework 1: Goal: Review + practice object-oriented design + simple Gui interface

Read chapter 1. chapter 1 has some useful program examples as well as a review of some basic mathematical techniques.

1. . This is a finite induction problem.

Prove: sum(i = 1 to i = N ) { 2*i -1} = N^2.

2. This part requires no coding, only design. Suppose that you are coding a very limited banking systems. Customers can only: open a savings account, close a savings account, add money to an account, and withdrawn money from an account. You may assume that this is a new bank which begins with no customers. In the style of the lectures, define appropriate objects and methods. You should only turn in the your final design, not the steps you used to create the design. Also include the driver program, ie. the program which puts the objects together. This should look like a code, but would not execute since the needed objects have not been defined. Some people like to start by writing the driver program, guessing what objects they might need.

3. This part of the assigment involves writing a simple GUI interface. From the user get a file name. The file you will use is WM.txt which is in the class folder XX. Your program will display: a) the total number of letters (characters from a through z), b) for each letter the number of times it occurs, and c) a graphical plot of the frequency of each letter. More specifically, you plot (i-th letter, frequency of i-th letter). Upper case letters should be "normalized" to lower case letters. This can be conveniently done by transforming every read String into a String of lower case letter via a String method.

Homework 2: Goal: Reinforce familiarity with O notation + Proofs

1. Do problem 5.14, only part a. ( 6 code fragments to analyze)

2. Suppose T(N) = O(f(N)) and S(N) = O(g(N)).

a) What is T(N)*S(N)?

b) Prove it.

c) What is T(N)+S(N)?

d) Prove it.

3. Definition: A set S is convex if whenever points p and q belong to the set, then for any x between 0 and 1, the point x*p + (1-x)*q belongs to S.

Prove that if S1 and S2 are convex, then S1 intersect S2 is convex. The proof is straightforward, as long as you don't lose your head. Do not attempt a geometric proof, although that might provide insight/confidence in the result. Algebraic proofs are usually easier, once you know what might be true.

Homework 3: Goal: Use Gui Interface and standard Collection Classes

This assignment is somewhat similar to assignment 1, but adds the uses of Collection classes. In this assignment you will build a Gui interface that performs a simple statistical analysis of a text document. Use the swing classes as the awt library is deprecated. Using BorderLayout, in the north panel ask the user for a file to be processed. In the west panel display, in sorted order, the frequency of all the words in the document. In the center panel display a graph of the 100 most frequent words, in sorted order. The y-axis measures frequency of occurrence as a percentage of the i-th most frequent word. So you plot the points (i-th word, frequences of i-th word). On the north panel display the total number of words and the total number of different words.

You will read the file WM.txt again. One important question is: What constitutes a word? For the purposes of this assignment a word is a sequence of characters surrounded by white-space where the first and last characters are letters. Also you need to remove final (not internal) punctuation. Hence Jackson7 is not a word and won't be counted. Use StringTokenizer to identify candidate words. Use a Hashtable or HashMap, provided in the Collections package, to keep track of the number of occurences of every word. Then use TreeSet, also provided in the Collections package, to sort the words. You will sort the words by frequency so you need to define a Comparator.

Start this assignment early. Do not write this is one fell swoop. Instead, start by writing a few of the simplest throw-away programs that you can imagine that develop your understanding and confidence in individual features of the Java language.

To help you with this assignment, I'm giving you a class I wrote called FileTokenizer that you might find useful. Conceptually it is much like StringTokenizer on a file. It basically provides an iterator, but I did not make it implement Iterator. Why not? If you find errors or improvements in this code, please let me know.

import java.util.*;

class FileTokenizer
{
String line;
StringTokenizer stok;

FileTokenizer(String s)

{
try
{
stok = new StringTokenizer(line);
}
catch(IOException ioe) {};

}

String nextWord()
{
return stok.nextToken();
}
boolean hasWord()
{
try{
if (stok.hasMoreElements()) return true;

if (line == null) return false;

stok = new StringTokenizer(line); // misses blank lines
while ( !stok.hasMoreElements())
{
if (line == null) return false;
stok = new StringTokenizer(line);
}
return true;

}
catch(IOException ioe){};
return false;
}

}

Homework 4
: Goal: Review and compare use of lists, stacks and arrays.

Due Date extended to May 8, 10pm

Read chapter 3+ 4.3 Chapter 3 is on lists, stacks, and queues. 4.3 covers binary trees. Also read handout on Collections.

In this assignment you will implement a Cache (the inteface will be defined) in 5 different ways: namely as an array, a linked list, an ordered tree, as a linked list from the collections package (in java.util), and as a tree from the collections package. For each implementation give an O-notation analysis. Additional compare the implementation by generating 1,000,000 random numbers and putting them into a cache of size 1000. If these numbers are too small (depends on your computer) you may increase them. Or run the experiment multiple times. Which implementation of cache is performs best? What can you say about memory use, i.e. how much memory does each technique require (give more than an O(n) analysis).

A Cache is a bounded, ordered container for objects. Since it only stores objects, you will actually store Double objects. A concrete cache has a single constructor, cache(int bound) where bound is the number of elements to be stored. Only objects that are comparable are allowed to be entered into a cache. Duplicated objects (in this case doubles) are not stored.

The interface Cache has only one required method.

void add(Object o) .. may store the comparable object o. Object o is added if either the cache is not full or the object o is greater than some object currently stored.

The constructor is Cache(int bound) where bound is the maximum number of elements to be stored.

It is suggested that you also define

Iterator iterator() .. this will allow you to view see what is stored. For this, define remove() to do nothing. This is not trivial to define an Iterator() for Trees.

import java.util.*;
interface Cache
{
void add(Comparable o); // note that wrappers such as Double, Integer and the String class all implement Comparable.
}

You may defined additional methods, such as isFull() or remove(Object o) or whatever if you find them useful.

You may use your Cache class in later assignments.

Note: one homework has been cancelled.

Homework 5: Goals: Processing files, using hashtables and a cache. Different data structures for different goals.

Read chapter 5 or chapter on hashing

The program finds interesting kmers. A kmer is contiguous string of exactly k letters. Your program will read in two files and the size of k. Each file consists of characters from the alphabet {a,c,g,t}. The first file we will call the family.The second file we will call the background. To create a filereader from the Masterhit directory, you should use new File("\\\\Masterhit\\Instructional\\ics-h23\\files\\nit.txt"); You may think of the background as dna strings from the normal population and the family file as dna from people with some genetic disease.Your program goal is to find unusual (really statistically significant) kmers that occur suprisingly often in the family with respect to the background. For the purposes of this homework we define the unusualness of a kmer in the family as:

(number of times kmer occurs in family) - ( number of times kmer occurs in background)* (size of family)/(size of background).

This is the difference between he actual number of occurrences and the expected number of occurrences.

There are more appropriate statistical founded definitions, but involve more work and arrive at nearly the same results.

This value defines how to kmers should be compared.

To count the number of times every kmer in a family occurs use a hashtable which is part of the collections package (in java.util.*). For computing the background counts, you should only count the kmers that occur in the family. Again a hashtable is suitable. If you define it properly, you can use the same hashtable as before. For example an entry in the hashtable could consist of a pair of integers (family count, background count) and a real-values score of the unusualness.

There are three major steps (b,c,d) to do this task, plus a few minor ones (a,e).

a) You need to read into memory two files and store them as strings. (This isn't entirely necessary, but otherwise you will have to worry about substrings wrapping around the end of one line and the beginning of the next line). Both files are in Masterhit\Instructional\icsh23\files. The family file is in "nit.txt". This file is in standard "Fasta" format, the form that molecular biologists use to store information about genes or their surrounding regions.To process this file you need "skip" the comment lines. The net effect is that you read this file and form a single long string, which should have 3500 characters. The second file is the complete chromosome I of yeast (which has 16 chromosomes) and is called "chri_230203.txt". You can guess how many characters it has. Again you need to read the file into a single very long string. You can define the same reader class to process either file. (I will provide that little bit of code).

b) computing the number of times each kmer occurs in the family ( use a Hashtable or a HashMap). You need to define a class I call entry. Its what goes into the hashtable. The key to the entry is the kmer, a string. Since string have good predefined hashcode you don't need to define anything special. The class entry require at least two fields: int family count and int background count. To process the family file you consider each kmer in turn, and either enter it into the hashtable or update the family count if it is already their. After you process the family file, the family count will hold the number of occurrences in the family and the background count will be zero. The complexity of all this should be linear. The general rule is that you also avoid processing a file. Twice through a file is once too many.

The entry into the hashtable might have the fields: (string kmer, int familyCount, int backgroundCount, double score).

c) computing the number of times kmers in the family occur in the background (use the same hashtable). Processing this file is a a little different. For each kmer in the file, check to see if it occurs in the hashtable. If it isn't there you don't care. Otherwise you update the background count.

d) computing unusualness and sorting the kmers by this value. Luckily hashtables have enumerators and Hashmaps have iterators associated with them. Now you go thru the hashtable and enter the best scoring kmers into your cache (sorted, bounded) from a previous assignment.

e) finally you print out the top 20 kmers from your cache with kmersize of 6 (minor step) Instead of using your cache class, you may use TreeSet from the collections package. Actually you should print out the entry associated with each of these kmers. So the output would look something like:

Kmer # of times in family # of times in background Score

aaaaaa 13 121 .... (not the real answer)

etc.

Here is the code that will concatenate all the upstream regions into a single string. If you write this with s += br.readLine() you will have a huge (unacceptable) cost overhead. Instead (and in fact better) you could just read in and process each upstream region. This code has worked for me, but no guarantees that is errorfree. Complaints/improvements welcomed.

import java.io.*;

{
String data;

{
try
{
File file = new File(fileName);
StringBuffer sbuf = new StringBuffer((int)file.length());

while (line != null)
{
if (line.charAt(0) != '>')
sbuf.append(line);
}

data = new String(sbuf);
bf.close();
}

catch(IOException e)
{
}

}

}

.

Homework 6: Goal: Dynamic programming: the Needleman-Wunsch Algorithm.

Read chapter 10.3 Not covered in depth in either text.

The Needleman-Wunsch algorithm has many applications. The most famous application is helping to discover the function of proteins by finding similar proteins with known function. It could also be applied to spelling correction. It is the basis of time-warping algorithms in speech recognition. Your task also includes extending the algorithm so that it also produces an alignment. You can do this either with a graphical interface or a command line interface or using terminal io. The input to the program consists of two strings. The output is a) the Needleman-Wunsch similarity score and the alignment. The alignment can be illustrated via dashes, as in the following example: If using graphics, use a fixed size font.

Here's an example:

input string1: heagawghee

input string2: pawheae

output: score = -1

heagawghe-e

--p-aw-heae

A dash indicates that the character was skipped. The final score is unique, but their may be several alignments that achieve that score.

The Needleman-Wunsch algorithm will be discussed in class. To compute the alignment I suggest using a separate two dimensional array to record the "backpointers", although this is not necessary. One can also implement the algorithm to use linear space, but that takes more effort and care.

Last Assignment due June 5: Two week assignment. A Competition!

Top 5 performers, scored by length of the path produced, will get double grades, i.e grade will also replace another homework score, assuming it is a better grade.

Homework 7: Goal: Use local improvements algorithm on the traveling salesman problem. And Graphics.

Read Chapter 10.1 and 10.2 The topic is greedy algorithms

This assigment requires a graphical display. The inputs to the program are two integers. The first integer is a seed for the random number generator and the second integer is the number of cities. The code to generate the 2-d Points is provided below. Each point will have values that range from 0 to 100. Your program should display the initial path together with its length. There are a number of local improvements methods that you might try. At the minimum you should implement the "uncrossing" heuristic. You may add other operators or approaches as you choose. Your code will be evaluated on a random set of 40 cities.

The best "operator" I've found for improving a tour is to "remove crosses". Other operators, that are also useful , are swapping a pair of cities or moving a single city to a new point in the tour. You need only implement the "remove crosses" heuristic. A formula for detecting crosses is fairly simple. Let d[i][j] represent the distance from city i to city j. For convenience let i' and j' be the next city in the tour (you need to worry about wrap around). Then a "cross" exists between i and i' and j and j' if:

d[i][j]+d[i'][j'] < d[i][i'] + d[j][j'].

If you implement this with doubly linked lists, you can uncross the path with a few pointer moves. If you use an array to store the cities, then you will need to swap a number of cities. You may use the collection class.

The performance measure is the length of the tour that your program finds. Your program is constrained to not take too long, say not more than 2 cpu minutes for either problem.

Note the grade on the assignment is determined in the standard way, i.e. the code is correct and clean. However for the competition, its no holds bar. The only thing that counts is the length of the tour you find. However you would not want to double a poor grade. :)

Code for Generating an array of Points. Note this is like the Math functions, i.e. it is really just a long name for a function. may have This code guarantees that each city have unique coordinates.

=================== Code ===================

import java.util.*; // for the class Random
import java.awt.*; // for the class Point

class PointGenerator
{
Point[] pts;
PointGenerator(int seed, int size)
{
pts = new Point[size];
int [] xcoord = shuffle(new Random(seed));
int [] ycoord = shuffle(new Random(seed+1));
for (int i = 0; i<size; i++)
pts[i] = new Point(xcoord[i],ycoord[i]);
}

int[] shuffle(Random r)
// returns 100 random integers with no repeats from 0..99
{
int[] ans = new int[100];
for (int i = 0; i<100; i++)
ans[i] = i;
for (int i = 99, bound = 100; i>0; i--, bound--)
swap(ans, i, modulo(r.nextInt(),bound));
return ans;
}

int modulo(int i, int m) // because % doesn't compute modulo correctly
{
int temp = i%m;
if (temp<0) return temp+m;
return temp;
}

void swap(int[] a, int i, int j)
{
int temp = a[i];
a[i]= a[j];
a[j] = temp;
}

Point[] getPoints()
{
return pts;
}
}

================end of code =======================

Final Exam: June 11

Quiz 2 gives a reasonable idea of the form of the final, except it will be longer and cover the entire course material.

Homework Z:

Read Chapter 10.1 and 10.2 The topic is greedy algorithms

This assigment requires a graphical display. The input to the program is the number of cities. Each city will be placed randomly at (i,j) where i and j are between 0-100. Display the random path together with its length. Implement two ways of solving the problem: an exhaustive method and a local improvement method. The user should be able to specify which method. Each time a method improves the path, show the new path and the new length. Compare the effectiveness of each method on problems of size 10. To do this run each method on 10 problems. Of course the exhaustive method will find the best solution, but how good is the heuristic approach. Also record the amount of time each method makes. Also try both methods on larger problems. When does the exhaustive approach fail - i.e.when is it unable to solve the problem in a reasonable amount of time (reasonable = 1 minute)? Run the heuristic 10 times on the same problem of size 50, but randomize the initial ordering of the cities before each run.

The best "operator" I've found for improving a tour is to "remove crosses". Other operators, that are also useful , are swapping a pair of cities and moving a single city to a new point in the tour. You need only implement the "remove crosses" heuristic. A formula for detecting is fairly simple. Let d[i][j] represent the distance from city i to city j. For convenience let i' and j' be the next city in the tour (you need to worrry about wrapping around). Then a "cross" exists between i and i' and j and j' if:

d[i][j]+d[i'][j'] < d[i][i'] + d[j][j'].

If you implement this with doubly linked lists, you can uncross the path with a few pointer moves. If you use an array to store the cities, then you will need to swap a number of cities.

Homework X: N-Queens Problem

This assignment requires a graphics. Optionally you may let the user choose a board size, but you can set it at some reasonable number between 20 and 100. Display the board with a random placement of N-queens. Implement a greedy local improvement algorithm. For each cycle of the algorithm display the number of queen moves. Provide appropriate summary information once a solution is reached.

Homework X: Goals: Graphics, binary search