ICS23 Summer Homeworks: Typically these homeworks have a theoretical portion and a coding portion. For all methods of all classes, you are required to provide a O-notation space and time complexity analysis unless the space and time are O(1).

Warning: Only the homework for the next assigment is guaranteed to be correct and complete. Look at the assignment early and ask questions if you do not understand what you are supposed to do.

NOTE: Due Time: Coding assignments are due (i.e. need to be deposited) by NOON of the first class day of the week that the homework is due. Another handout will explain how this should be done. Work that should be handed in is due at the beginning of class of the week that homework is due. Late homeworks will be marked down by 20% for each day that it is late.

In general, if questions are asked about your code, you should answer them in the documentation. For example you should document the time and space complexity of all methods when not O(1).

Note: Submit only *.java and *.vep files, no *.class files (which are large). See Visual Cafe Lab guide on Li Zhang's web site.

Homework 1: Goal: O-notation+ review of arrays, lists and unbalanced trees.

Read chapter 1. chapter 1 has some useful program examples as well as a review of some basic mathematical techniques.

1. Do problem 5.14, only part a. ( 6 code fragments to analyze). The solutions to these problems can be submittted in a separate file *.txt file. This should be fairly short, such as: // analysis: 1. O(n^3) 2. O(nlogn) 3. O(n^100) etc. Note; not real answers. The name should be analysis.txt

The quiz will contain problem similar to these.

chapter 3+ 4.3 Chapter 3 is on lists, stacks, and queues. 4.3 covers binary trees.

In this assignment you will implement a Cache in 4 different ways: namely as an array, a linked list, as binary tree, and as a tree (new TreeSet()) from the Collection package (in Java.util). The goal of a cache is to store the best values. Teh best values are the highest values. They do not have to be stored in order, but this is sometimes convenient. Your program will have a single input: an integer that defines the seed for a general random niumber generator, i.e. Random r = new Random(seed) which in the util.* library. Then you can get random integers view r.nextInt(). For each implementation give an O-notation analysis. Additional compare your implementations by generating 10,000 random integers and putting them into a cache of size 20. Duplicated integers can be stored or not, as your choice. Just comment which you do. Which implementation of cache performs best? What can you say about memory use, i.e. how much memory does each technique require? For each algorithm print out the final state of the cache. Should the caches be identical?

Note: To insure that each cache is given the same problem, you should reinitialize the random number generate with the given seed for each Cache that you define. Also, to get timing results, you should the function System.currentTimeMillis(), which returns a long.

Summary: input: single integer that defines the seed for the Random.

Output: for each cache algorithm, it's time and the final elements in the cache.

Questions should be answered in the documentation for your code. In particular give the O analysis for the cache where the number of data elements to be examined is N and the size of the Cache is K. The answer should be O(some function in terms of K and N).

You may use your Cache class in later assignments.

Code Hints: TreeSet only allows you add objects, hence you may want to use the wrapper Integer to store the values. Look at the methods on TreeSet so that you don't store too many elements.

Homework 2. Goal: Practice creating and using Trees. Huffman trees are discussed in 12,1-12,4. Chapter 18 discusses trees. Key sections are 18.4 on AVL trees and 18.7 on B-trees.

In this assignment you will create optimal codes for the characters a,b,....z by generated a Huffman Tree. You may use any of the classes in util that you find useful. The input to your program is the name of a text file. The text file will be used to evaluate your program. Uppercase letters should be counted as their lowercase equivalent. Numbers, punctuations etc are not to be counted. I suggest that you create several simple text files to check that your program is working correctly. The output of your program is

a) 26 lines consisting of the letter and its frequency of the letter, in alphabetic order

b) 26 lines consisting of the letter and its code. The letters may be in any order.

. These entries should be separated by a tab.

So an example output might look:

a 132

b 15

etc

and

a 001

e 10001

etc.

Summary: input: name of a file

output: list of letters a..z with their frequence and a list of letters with their Huffman codes.

Coding Hints: The main two activities are reading a file, counting the frequency of the characters and building a Huffman Tree. Hence I expect that there are two primary classes. Here would be my plan. You do not need to follow my plan.

Monday: write the driver program. Hence I would have at least two auxiliary classes: CountCharacters(String fileName) and HuffmanTree(int[] array).

Tuesday: write the CountCharacter class and test. Note if br is a BufferedReader, then br.read() returns -1 if you are at the end of the file. By using char ch = (char)br.read() you will get the next character in the file. The Character method isLetter(char c) will tell you if it is a letter. The Character method toLowerCase(char c) returns the lower case equivalent of the letter. Assume ch is a lower case letter, then (ch-'a') will be 0 and (ch)('a'+i) will be the ith character of the alphabet. Note that you can test this class independent of creating the Huffman class.

Wed/Thurs: write the Huffman class, which generates codes for each character. Since the Huffman codes are based on building a special type of tree, I expect to have additional classes such as HuffmanNode. These trees are built bottom-up, which is not the usual way. You may find the LinkedList or Vector class useful here. You should think carefully about what information you store in each node. Also note that you could write and test this class without writing CountCharacters, e.g. by inputting simple test arrays where you know what the answer is, such as:

int[] test = {2,2,2,2} and then running HuffmanTree(test).

I find that when I make a logical error in coding, its very difficult to find the same day. Hence it's best for me to get a night's sleep and retry the next day. That happened to me in coding the Huffman tree.

Homework 3. Goals: Processing files, using hashtables and a cache. Using different data structures for different goals.

Read chapter 5 or chapter on hashing

The program finds interesting kmers. A kmer is contiguous string of exactly k letters. For example in the string ACTACTA there are 4 4-mers, namely ACTA, CTAC, TACT, and ACTA. Note that we allow overlaps. Your program will read in two files and the size of k. Each file consists of characters from the alphabet {a,c,g,t}. The first file we will call the family.The second file we will call the background. To create a filereader from the Masterhit directory, you should use new File("\\\\Masterhit\\Instructional\\ics-23\\files\\nit.txt"); You may think of the background as dna strings from the normal population and the family file as dna from people with some genetic disease.Your program goal is to find unusual (really statistically significant) kmers that occur suprisingly often in the family with respect to the background. For the purposes of this homework we define the unusualness of a kmer in the family as:

(size of background)*(number of times kmer occurs in family) - ( number of times kmer occurs in background)* (size of family)

This is the difference between the actual number of occurrences and the expected number of occurrences, normalized to be an integer.

There are more appropriate statistical well-founded definitions of surprisingness, but involve more work and arrive at nearly the same results.This value defines how to kmers should be compared. The size of the background is the length of the string that the FastaReader generates. Same for the size of the family.

To count the number of times every kmer in a family occurs use a hashtable which is part of the collections package (in java.util.*). For computing the background counts, you should only count the kmers that occur in the family. Again a hashtable is suitable. If you define it properly, you can use the same hashtable as before. For example an entry in the hashtable could consist of a pair of integers (family count, background count) and a real-values score of the unusualness.

The first step: Make a schedule. Write the driver the first day and then allocate days for the rest. If you do not write the driver by Wednesday, then I think you are very far behind schedule. By writing the driver you can focus your questions.

There are three major steps (b,c,d) to do this task, plus a few minor ones (a,e).

a) You need to read into memory two files and store them as strings. (This isn't entirely necessary, but otherwise you will have to worry about substrings wrapping around the end of one line and the beginning of the next line). Both files are in Masterhit\Instructional\icsh23\files. The family file is in "nit.txt". This file is in standard "Fasta" format, the form that molecular biologists use to store information about genes or their surrounding regions.To process this file you need "skip" the comment lines. The net effect is that you read this file and form a single long string, which should have 3500 characters. The second file is the complete chromosome I of yeast (which has 16 chromosomes) and is called "chri_230203.txt". You can guess how many characters it has. Again you can do this by reading the file into a single very long string. You can define the same reader class to process either file. (I will provide that little bit of code).

b) computing the number of times each kmer occurs in the family ( use a Hashtable or a HashMap). You need to define a class I call entry. Its what goes into the hashtable. The key to the entry is the kmer, a string. Since strings have well predefined hashcodes you don't need to define anything special. The class entry requires at least two fields: int family count and int background count. To process the family file you consider each kmer in turn, and either enter it into the hashtable or update the family count if it is already there. After you process the family file, the family count will hold the number of occurrences in the family and the background count will be zero. The complexity of all this should be linear. The general rule is that you also avoid processing a file. Twice through a file is once too many.

The entry into the hashtable might have the fields: (string kmer, int familyCount, int backgroundCount, long score). Note: The Java hashtable uses rehashing, so the table size will automatically be expanded as needed. However since you may have as many as 3000 entries, you could use new Hashtable(6000) to initialize the hashtable with size 6000.

c) computing the number of times kmers in the family occur in the background (use the same hashtable). Processing this file is a a little different. For each kmer in the file, check to see if it occurs in the hashtable. If it isn't there you don't care. Otherwise you update the background count.

d) computing unusualness and sorting the kmers by this value. Luckily hashtables have enumerators and Hashmaps have iterators associated with them. Now you go thru the hashtable and enter the best scoring kmers into your cache (sorted, bounded) from a previous assignment. It is not necessary to use a cache.

e) finally you print out ( to the console) the top 20 kmers from your cache with kmersize of 6 (minor step) Instead of using your cache class, you may use TreeSet from the collections package. Actually you should print out the entry associated with each of these kmers. So the output would look something like:

Kmer # of times in family # of times in background Score

aaaaaa 13 121 ... (not the real answer)

etc.

Here is the code that will concatenate all the upstream regions into a single string. If you write this with s += br.readLine() you will have a huge (unacceptable) cost overhead. Instead (and in fact better) you could just read in and process each upstream region. This code has worked for me, but no guarantees that it is errorfree. Students have used it without problems. Complaints/improvements welcomed.

import java.io.*;

{
String data;

{
try
{
File file = new File(fileName);
StringBuffer sbuf = new StringBuffer((int)file.length());

while (line != null)
{
if (line.charAt(0) != '>')
sbuf.append(line);
}

data = new String(sbuf);
bf.close();
}

catch(IOException e)
{
}

String getData()

{

return data;

}

}

Homework 4: Comparison of Sorting Routines (Last homework!)

Note: As always form a work plan. If you implement one sorting routine and the driver, you will have a good estimate for thw work involved.

Use Random r= new Random(0) for constructing the random number generator. Using this r, construct 4 test arrays of size 1000, 2000, 4000 and 8000 filling them using r.nextInt(). You will implement four sorting algorithms and time their performance on each array. The four sorting algorithms are: BubbleSort, Mergesort, HeapSort, and QuickSort. However the text code is for arrays of objects and you will be sorting an array of integers. The text code provides a strong outline for the code, but you should be able to greatly simplify it. Or you can write your own code for the algorithms.. Verify the O-notation formula for the running time of each algorithm by creating the following table for each algorithm where n is the number of data items. Use this data to estimate the constant in the O-notation for the running time, i.e instead of saying O(n^2) say 23.5*n^2.

BubbleSort time/1k time/2k time/4k time/8k meaning the running time for to sort 1k, 2k, 4k and 8k entries divided by the size of the array.

time/klogk time/2klog(2k) time/4klog(4k) time/8klog*8k)

time/k^2 time /4k^2 time/(16k^2) time/64k^2

If some row has an approximately constant value, then you can hypothesize that the running is time is that constant*(appropriate polynomial).

output: For each algorithm, a table of its performance.

Also (in the document of the code) the precise polynomial you estimate for the running time of each algorithm.

Besides the Driver program, this programshould have the obvious 4 classes, one for each sorting routine. It may be convenient to have additional, auxiliary classes for the sorting routines or for measuring performance.