Program 1

Iteration and Major Data Types:
List, Tuple, Set, and Dict (and Open for files)

ICS-33: Intermediate Programming


Introduction This programming assignment is designed to ensure that you know how to use combinations of Python's most important data types to model and compactly write code that solves a wide variety of different programming problems. The kind of abstract thinking that goes into modeling solutions to these programming problems with these data types (and iteration over them) is critical to your development as computer scientists.

There are five parts to this assignment (and and extra credit part at the end). In each you will be asked to write a module that contains a few functions and a script at the bottom, which ties these functions together to solve the problem.

You should download the program1 project folder and use it to create an Eclipse project. You will create each modules in this project, and submit each module separately in Checkmate. The project folder contains no modules, but it does contain all the data files you need to test/debug you modules.

You should work on this assignment in pairs, with someone in your lab section. Try to find someone who lives near you, with similar programming skills, and work habits/schedule: e.g., talk about whether you prefer to work mornings, nights, or weekends; what kind of commitment you will make to submit program early. If you believe that it is impossible for you to work with someone, because of some special reason(s), you should send me email stating them and asking for special permission to work alone (which I do grant, but not frequently).

Only one student should submit all parts of the the assignment, but both student's names should appear in the comments at the top of each submitted .py file. It should look something like


# Romeo Montague, Lab 1
# Juliet Capulet, Lab 1
# We certify that we worked cooperatively on this programming
#   assignment, according to the rules for pair programming
Please turn in each program as you finish it, so that I can accurately assess the progress of the class as a whole during this assignment.

Print this document and carefully read it, marking any parts that contain important detailed information that you find (for review before you turn in the files). The code you write should be as compact and elegant as possible, using appropriate Python idioms. You should familiarize yourselves with the safe_open function in the goody module and all the functions in the prompt module, both of which you should have installed in your courselib folder as part of the Eclipse/Python installation. Recall how to use the sep and end parameters in the print function. Finally, reread the section on Time Management from Programming Assignment 0 before starting this assignment.


Problem #1: Reachability

Problem Summary:

  • Write a script that prompts the user to enter the name of a file representing a graph.
  • Read the information in the file, storing the graph in a dictionary.
  • Print the graph.
  • Repeatedly prompt the user for a starting node in the graph, and compute and print all the nodes that are reachable from it by following edges in the graph.

Input and Output

Read a file of pairs of node names (representing edges) in a directed graph, building a dict whose key is a str source node and whose value is a set of str destination nodes that are each reachable from the source node key. We write this informally as dict[str] -> set{str}

Two nodes appear on each line: first the source node, then the destination node, with these node names separated by one semicolon character. For example, the input file graph1.txt contains the following lines (which could appear in this order, or any other):

  c;f
  b;d
  a;c
  c;e
  d;g
  a;b
  e;d
  f;g
  f;d
which represent the graph

Print the graph, one source node per line (the source nodes are printed alphabetically) followed by the set of all the destination nodes that the source can immediately reach. The graph above would print as

  Graph: source -> {destination} edges
     a -> {'c', 'b'}
     b -> {'d'}
     c -> {'f', 'e'}
     d -> {'g'}
     e -> {'d'}
     f -> {'g', 'd'}

Note that the source nodes are sorted alphabetically, but the set of desintation nodes does not have to be sorted: in fact it makes no sense to talk about sorted sets; we could talk about a sorted list whose contents came from a set. Note that because node g is not a source node (it is only a destination node), it does not appear first on any line (and appears only in the sets for source nodes d and f.

Note that there are multiple data files for this program: graph1.txt, graph2.txt and graph3.txt; test/debug your program on the first file; when you are done, test it on the last two. Draw the graph represented by each for to ensure that your code correctly prints it and computes the nodes reachable from any source node (which you can do by eyeballing the graphs: they are small).

Repeatedly prompt the user for a starting node in the graph (until quit is entered) and compute and print all the nodes that are reachable from it by following edges in the graph. Reject any node not present in the graph. An example interaction (processing the graph above) might be

  Enter starting node: e
  From e the reachable nodes are {'g', 'e', 'd'}

  Enter starting node: a
  From a the reachable nodes are {'g', 'f', 'e', 'd', 'c', 'b', 'a'}

  Enter starting node: quit

Functions and Script

  • read_graph has an open (file) parameter; it returns the dict representing the graph (mine is 9 lines).
  • print_graph has a dict parameter (representing the graph); it returns nothing, but it prints the graph in the appropriate form (mine is 4 lines).
  • reachable has a dict parameter (representing the graph) and a str start node in the graph (technically a key in the dict; it returns a set of all the nodes reachable from it by following edges in the graph (mine is 9 lines).
  • Write a script at the bottom of this module that calls these functions to solve the problem (mine is 7 lines).

    I am supplying these number of lines not as requirements, but ballpark estimate of the amount of code you should write.

    Here is the basic algorithm for computing reachability; it is simple to explain and not (very) complicated to implement. But, you have to understand these instructions and carefully translate them into Python. You should hand-simulate this algorithm using the graph above, and verify that it produces the results you expect before coding it. You might be tempted to use recursion, but please don't: unless recursion is done very carefully, it will run forever on graphs with cycles: one of the input files is a graph with cycles.

    1. To compute all the reachable nodes in a graph, create a set (initially empty) of reached nodes and a list (initially containing the parameter start node) of nodes that we are going to explore (to find nodes they can reach).

    2. While the exploring list still has nodes, remove the first one (recall the pop method for lists) and put it into the reached set; for all its destination nodes that are not already in the reached set, put them in the exploring list.

    3. When the exploring list becomes empty (can you argue that this always will happen -there is no infinite looping), return the reached set.

    Print the set containing all these node labels. When debugging this algorith, print the set and list after every interesting change, or use the debugger to see how these change.

Sample Interaction

The program, as specified, will have the following interaction: user-typed information appears in italics. Your output should "match" this one (sets will match if they have the same contents, independent of their order). You should also check that it works for other starting nodes, and a variety of starting nodes in the other graphs.
  Enter file with graph: graph1.txt

  Graph: source -> {destination} edges
    a -> {'c', 'b'}
    b -> {'d'}
    c -> {'f', 'e'}
    d -> {'g'}
    e -> {'d'}
    f -> {'g', 'd'}

  Enter starting node: e
  From e the reachable nodes are {'g', 'e', 'd'}

  Enter starting node: x
    Entry Error: 'x';  Not a source node
    Please enter a legal String

  Enter starting node: a
  From a the reachable nodes are {'g', 'f', 'e', 'd', 'c', 'b', 'a'}

  Enter starting node: quit

Problem #2: Instant Runoff Voting

Problem Summary:

  • Write a script that prompts the user to enter the name of a file representing the preference of a sequence of voters.
  • Read the information in the file, storing it in a dictionary.
  • Print the voter preference.
  • Repeatedly display the vote count for ballots (sorted both by candidate and numerically), eliminating from the election the candidate receiving the fewest votes, until one candidate (the winner) or no candidates (a tie) remain.
This form of election is known as instant runoff voting. Every voter submits a ballot that ranks all the candidates the election, from from most favorite candidate to least favorite (we will use a list for this purpose: earlier in the list means more favored). During the first ballot, votes are counted for each of the candidates according to the rankings of the voters. Then the candidate(s) with the fewest number of votes are removed from the election: if more than one candidate receives the least number of votes, they all are removed from the election. During the second ballot, votes are tallied for the remaining candidates (there are at least 1 fewer candidates); if a voter's first ranked candidate is not still in the election, then his/her second ranked candidate should receive the vote; but if his/her second ranked cnadidate has been removed from the election, then his/her third ranked candidate should receive the vote ...). This ballot process continues until either 1 candidate remains, or 0 candidates remain (meaning that all the remaining candidates tallied the same number of votes).

Input and Output

Read a file of voters and their ranking of the candidates, separated by semicolons, building a dict whose key is each voter and whose value is a list of candidates ranked by that voter (they appear in the file in order, from most favorite to least favorit) We write this informally as dict[str] -> list[str].

For example, the input file votepref1.txt contains the following lines (which could appear in this order, or any other):

  A;X;Y;Z
  B;Y;Z;X
  C;Y;Z;X
  D;Z;Y;X
  E;Z;Y;X
The first line means, voter A ranks candidate X first, candidate Y second, and candidate Z third. The second line means, voter B ranks candidate Y first, candidate Z second, and candidate X third. Each line will have a unique voter and a permutation of all the candidates running.

Print all the associations in this dict, one per line (the voters are printed alphabetically) using the following form. Each line contains the voter and his/her complete ranking of the candidates. For example, the file above would produce:

  Voter Preferences
     A  ->  ['X', 'Y', 'Z']
     B  ->  ['Y', 'Z', 'X']
     C  ->  ['Y', 'Z', 'X']
     D  ->  ['Z', 'Y', 'X']
     E  ->  ['Z', 'Y', 'X']

Note that the voter names are sorted alphabetically, but the list of preference appear the same order they appeared in the file. There are multiple data files for this program: votepref1.txt, votepref2.txt and votepref3.txt; test/debug your program on the first file; when you are done, test it on the rest.

Start with all the candidates. Evaluate the ballot to determine how many votes each candidate received. Print this vote count two ways: sorted alphabetically and sorted numerically (in decreasing order). Remove the candidate(s) receiving the fewest votes, and repeat this process until only one or no candidates remain. Finally, print the outcome of the election: a single candidate winner or a tie. An example interaction (processing the graph above) might be

  Vote count on ballot #1 with candidates (alphabetically) = {'X', 'Y', 'Z'}
    X  ->  1
    Y  ->  2
    Z  ->  2

  Vote count on ballot #1 with candidates (numerical) = {'Y', 'X', 'Z'}
    Y  ->  2
    Z  ->  2
    X  ->  1

  Vote count on ballot #2 with candidates (alphabetically) = {'Y', 'Z'}
    Y  ->  3
    Z  ->  2

  Vote count on ballot #2 with candidates (numerical) = {'Y', 'Z'}
    Y  ->  3
    Z  ->  2

  Winner is  {'Y'}
The first ballot consisted of all three candidates, X, Y, and Z. For this ballot, the votes were counted and printed; candidate X received the fewest number of votes so is eliminated from the next ballot. The second ballot consisted of two candidates, Y and Z. For this ballot, the votes were counted and printed; candidate Z received the fewest number of votes so is eliminated from the next ballot. There is only one candidate remaining so Y is declared the winner. An alternative outcome might be No winner: election is a tie among candidate remaining on the last ballot

Functions and Script

  • print_dict has a str title, any kind of dict, a function(default None) and bool (default False)as parameters and returns nothing; but it prints the title followed by the dictionary in the appropriate form and order (specified by the function and bool). The function determines the ordering and the bool determines whether to reverse it: like the key and reverse parameters used to sort in Python. This function is used to print both the voter preference dict and the vote count dict for each ballot (mine is 4 lines).
  • read_voter_preferences has an open (file) parameter; it returns the dict representing each voter and his/her preferences: dict[str] -> list[str] (mine is 6 lines).
  • evaluate_ballot has a dict of voter preference (dict[str] -> list[str] read above) and a set of the remaining candidates as parameters; it returns a dictionary whose keys are these candidates and whose values are the number of votes they received on this ballot, based on the description of instant runnoff voting: dict[str] -> int. Remember to count only one vote per voter, for his/her highest ranked candidate stil in the election (mine is 8 lines).
  • remaining_candidates has a dict as a parameter whose keys are candidates and whose values are the number of votes they received (dict[str] -> int); it returns a set containing all those candidates remaining in the election (the one(s) receiving the fewest number of votes are absent). Note that if all the candidates receive the same number of votes, then this function returns an empty set. (mine is 3 lines).
  • Write a script at the bottom of this module the calls these functions to solve the problem (mine is 13 lines).

Sample Interaction

The program, as specified, will have the following interaction: user-typed information appears in italics. Your output should match this one.
  Enter file with voter preferences: votepref1.txt

  Voter Preferences
    A  ->  ['X', 'Y', 'Z']
    B  ->  ['Y', 'Z', 'X']
    C  ->  ['Y', 'Z', 'X']
    D  ->  ['Z', 'Y', 'X']
    E  ->  ['Z', 'Y', 'X']
  
  Vote count on ballot #1 with candidates (alphabetically) = {'Z', 'Y', 'X'}
    X  ->  1
    Y  ->  2
    Z  ->  2

  Vote count on ballot #1 with candidates (numerical) = {'Y', 'X', 'Z'}
    Y  ->  2
    Z  ->  2
    X  ->  1

  Vote count on ballot #2 with candidates (alphabetically) = {'Y', 'Z'}
    Y  ->  3
    Z  ->  2

  Vote count on ballot #2 with candidates (numerical) = {'Y', 'Z'}
    Y  ->  3
    Z  ->  2

  Winner is  {'Y'}

You can also try processing the votepref2.txt file (which leads to a No winner result) and votepref3.text.


Problem #3: Finite Automata

Problem Summary:

  • Write a script that prompts the user to enter the name of a file representing a finite automaton: indicating its states and input->state transitions.
  • Read the information in the file, storing it in a dictionary.
  • Print the finite automaton.
  • Prompt the user for to enter the name of a file storing the start-state and inputs to process (each line in the file contains this combination).
  • Repeatedly process these lines computing the results of the finite automaton on each input, and then display the results.
A finite automaton (FA) is an machine that is sometimes called Deterministic Finite Automaton (DFA). An FA is described by its states and its transitions: each transition for a state specifies an input and what state in the FA that input leads to. We can illustrate a FA as a graph with labelled edges (see below).

Input and Output

Read a file that describes a FA: each line contains a state and an arbitrary number of input->state transitions. Build a dict such that each keys is a str state and whose values is another dict specifying of the transitions from that state: this second dict has keys that are str inputs and values are str states. The first token on each line is the str state and the remaining tokens (always coming in pairs) are str inputs and states. All tokens are separated by one semicolon character. We write this informally as dict[str] -> (dict[str] -> str).

For example, the input file faparity.txt contains the following lines (which could appear in this order, or any other):

  even;0;even;1;odd
  odd;0;odd;1;even
Here is a picture of the parity FA. It graphically illustrates the two states (even and odd) and their transitions, using inputs (0 and 1) that always lead back to one of these two states.

Here, the state even (meaning it has seen an even number of 1 inputs so far) is a key in the main dict. It's value is a dict with two key/value pairs 0/even and 1/odd. It means that in the even state, if the input is a 0 the FA stays in the even state; if the input is a 1 the FA goes to the odd state. And similarly (the next line) means that for the odd state, if the input is a 0 the FA stays in the odd state; if the input is a 1 the FA goes back to the even state. So, seeing an input of 0 keeps the FA in the same state; seeing an input of 1 flips the FA into the other state.

Print the finite automaton, one state (and its transitions) per line; the states are printed alphabetically and the transition dict for each state is printed as a list of input/state items (tuples) such that these are printed alphabetically by the inputs.

For example, the file above would produce:

  Finite Automaton Description
    even transitions: [('0', 'even'), ('1', 'odd')]
    odd transitions: [('0', 'odd'), ('1', 'even')]

Note that there are multiple data files for this program: faparity.txt and fadivisibleby3.txt; test/debug your program on the first file; when you are done, test it on the last file. Draw the FA represented by each for to ensure that your code correctly prints and computes with it.

Repeatedly process lines from a second input file, computing the results of the finite automaton for a start-state and its inputs; then print out all the results in a special form. Each line in the file contains a start-state followed by a sequence of inputs. The start-state will be a state in the FA (is a key in the outer dictdict).

For example, the input file fainputparity.txt contains the following three lines:

  even;1;0;1;1;0;1
  odd;1;0;1;1;0;1
  even;1;0;1;1;0;x
For example, the first line means, the start-state is even and the inputs are 1, 0, 1, 1, 0, and 1.

The result of processing each line is to print the start-state, and then each input and the new state it transitions to, and finally print the stop-state. For the parity FA and the first line in this file, it should print

Start state = even
  input = 1; new state = odd
  input = 0; new state = odd
  input = 1; new state = even
  input = 1; new state = odd
  input = 0; new state = odd
  input = 1; new state = even
Stop state = even

Functions and Script

  • read_fa has an open (file) parameter; it returns the dict representing the finite automaton; hint: I used splicing and the zip function to build the inner dicts. (mine is 6 lines).
  • print_fa has a dict parameter (representing the fa); it returns nothing, but it prints the fa in the appropriate form (mine is 4 lines).
  • process has a dict parameter (representing the fa), a str parameter (representing the start-state), and a list parameter (representing a list of str inputs); it returns a list that contains the start-state followed by tuples that show the input and resulting state after each transition. For the example shown above, process returns the following list.
    ['even', ('1', 'odd'), ('0', 'odd'), ('1', 'even'), ('1', 'odd'), ('0', 'odd'), ('1', 'even')]
    Finally, if an input is illegal (is not the key in some transition for the current state), say 'x', then process should terminate with the last tuple in the list indicating a problem: ('x', None) (mine is 10 lines).
  • interpret has a list parameter (the result produced by process; it returns nothing, but it prints the results of processing a fa on an input. See how it prints the list shown above in the output further above. Also see the Sample Interaction below to see how it prints input errors (in the last example) (mine is 9 lines).
  • Write a script at the bottom of this module that calls these functions to solve the problem. Note that the script loops over the lines in the second file (mine is 7 lines).

Sample Interaction

The program, as specified, will have the following interaction: user-typed information appears in italics. Your output should match this one.
  Enter file with finite automaton: faparity.txt

  Finite Automaton
    even transitions: [('0', 'even'), ('1', 'odd')]
    odd transitions: [('0', 'odd'), ('1', 'even')]

  Enter file with start-state and input: fainputparity.txt
  
  Starting new simulation
  Start state = even
    Input = 1; new state = odd
    Input = 0; new state = odd
    Input = 1; new state = even
    Input = 1; new state = odd
    Input = 0; new state = odd
    Input = 1; new state = even
  Stop state = even
  
  Starting new simulation
  Start state = odd
    Input = 1; new state = even
    Input = 0; new state = even
    Input = 1; new state = odd
    Input = 1; new state = even
    Input = 0; new state = even
    Input = 1; new state = odd
  Stop state = odd
  
  Starting new simulation
  Start state = even
    Input = 1; new state = odd
    Input = 0; new state = odd
    Input = 1; new state = even
    Input = 1; new state = odd
    Input = 0; new state = odd
    Input = x; illegal input; terminated
  Stop state = None

Problem #4: Non-Deterministic FA

Problem Summary:

  • Write functions and a script that solve for a Non-Deterministic Finite Automaton the same problem that was solved for a Deterministic Finite Automaton in Problem #3 (above).

    A non-deterministic finite automaton (NDFA) is machine described by its states and its transitions: each transition for a state specifies an input and what state (or states: that what makes it non-deterministic) that input leads to. We can illustrate a NDFA as a graph with labelled edges (see below). The critical difference is that a NDFA can have multiple edges with the same label going to different states (we'll see how to handle such transitions below).

    Input and Output

    Read a file that describes a NDFA: each line contains a state and an arbitrary number of input->state transitions. Build a dict such that each keys is a str state and whose values is another dict specifying of the transitions from that state: this second dict has keys that are str inputs and values are sets of str states: all the states a particular input can lead to. The first token on each line is the str state and the remaining tokens (always coming in pairs) are str inputs and states. All tokens are separated by one semicolon character. We write this informally as dict[str] -> (dict[str] -> set{str}).

    For example, the input file ndfaendin01.txt contains the following lines (which could appear in this order, or any other):

      start;0;start;1;start;0;near
      near;1;end
      end
    Here is a picture of the endin01 NDFA. It graphically illustrates the three states (start, near, and end) and their transitions, using inputs (0 and 1).

    Here, the state start is a key in the main dict. It's value is a dict with two key/value pairs 0 mapping to the set containing start and near and 1 mapping to the set containing just start. It means that in the start state, if the input is a 0 the NDFA can stay in the start state or it can go to the near state; if the input is a 1 the NSFA must stay in the start state. And similarly the next line means that in the near state, if the input is a 1 the NDFA must go into the end state. The last line means that the end state has no transitions out of it.

    Print the ndfa, one state (and its transitions) per line; the states are printed alphabetically and the transition dict for each state is printed as a list of input/set of state items (tuples) such that these are printed alphabetically by the inputs. Note that the state end is a key in the main dict, whose associated transitions are an empty dict

    For example, the file above would produce:

      Non-Deterministic Finite Automaton
        end transitions: []
        near transitions: [('1', {'end'})]
        start transitions: [('0', {'near', 'start'}), ('1', {'start'})]

    Note that there is only one pair of data files for this program.

    Repeatedly process lines from a second input file, computing the results of the non-determinisitc finite automaton for a start-state and its inputs; then print out all the results in a special form. Each line in the file contains a start-state followed by a sequence of inputs. The start-state will be a state in the FA (is a key in the outer dictdict).

    For example, the input file ndfainputendin01.txt contains the following two lines:

      start;1;0;1;1;0;1
      start;1;0;1;1;0;0
    For example, the first line means, the start-state is start and the inputs 1, 0, 1, 1, 0, and 1.

    The result of processing each line is to print the start-state, and then each input and the new states (plural) it could transition to (the could is what makes it non-deterministic), and finally print the stop-states. For the ndfaendin01 NDFA and the first line in this file, it should print

      Start state = start
        Input = 1; new possible states = {'start'}
        Input = 0; new possible states = {'near', 'start'}
        Input = 1; new possible states = {'end', 'start'}
        Input = 1; new possible states = {'start'}
        Input = 0; new possible states = {'near', 'start'}
        Input = 1; new possible states = {'end', 'start'}
      Stop state(s) = {'end', 'start'}

    Note especially that in the start state, if the input is a 0, then the NDFA can either remain in the start state or go into the near state. For this program, we keep track of all states that the NDFA can be in, using a set of new possible states. For the next input, 1, we can be either in the start state (from the start state, an input of 1 allows us to stay in the start state) or the end state (from the near state, an input of 1 allows us to transition to the end state). Thus, we keep track of the set of states the NDFA can be in, and the new set of states the NDFA can be in after processing the next input. In this example, because 'end' is included in the stop-states, this input does end in 01.

    Functions and Script

    • read_ndfa has an open (file) parameter; it returns the dict representing the non-deterministic finite automaton; hint: I used splicing and the zip function to build the inner dicts, and I called the setdefault function for the inner dict (alternatively I could have built them as defaultdicts: see the standard collections module) (mine is 9 lines).
    • print_ndfa has a dict parameter (representing the ndfa); it returns nothing, but it prints the ndfa in the appropriate form (mine is 4 lines).
    • process has a dict parameter (representing the ndfa), a str parameter (representing the start-state), and a list parameter (representing a list of str inputs); it returns a list that contains the start-state followed by tuples that show the input and resulting st of states after each transition. For the example shown above, process returns the following list.
        ['start', ('1', {'start'}), ('0', {'start', 'near'}), ('1', {'start', 'end'}), ('1', {'start'}),
          ('0', {'start', 'near'}), ('1', {'start', 'end'})]
      Finally, if an input is illegal (is not the key in some transition for the current state), just ignore it (mine is 11 lines; 7 using a double-comprehension).
    • interpret has a list parameter (the result produced by process; it returns nothing, but it prints the results of processing a ndfa on an input. See how it prints the list shown above in the output further above (mine is 5 lines).
    • Write a script at the bottom of this module that calls these functions to solve the problem. Note that the script loops over the lines in the second file (mine is 7 lines).

    Sample Interaction

    The program, as specified, will have the following interaction: user-typed information appears in italics. Your output should "match" this one (recall the order of values in sets is not important).
      Enter file with non-deterministic finite automaton: ndfaendin01.txt
    
      Non-Deterministic Finite Automaton
        end transitions: []
        near transitions: [('1', {'end'})]
        start transitions: [('0', {'near', 'start'}), ('1', {'start'})]
    
      Enter file with start-state and input: ndfainputendin01.txt
    
      Starting new simulation
      Start state = start
        Input = 1; new possible states = {'start'}
        Input = 0; new possible states = {'near', 'start'}
        Input = 1; new possible states = {'end', 'start'}
        Input = 1; new possible states = {'start'}
        Input = 0; new possible states = {'near', 'start'}
        Input = 1; new possible states = {'end', 'start'}
      Stop state(s) = {'end', 'start'}
      
      Starting new simulation
      Start state = start
        Input = 1; new possible states = {'start'}
        Input = 0; new possible states = {'near', 'start'}
        Input = 1; new possible states = {'end', 'start'}
        Input = 1; new possible states = {'start'}
        Input = 0; new possible states = {'near', 'start'}
        Input = 0; new possible states = {'near', 'start'}
      Stop state(s) = {'near', 'start'}

  • Problem #5: Word Generator

    Problem Summary:

    • Write a script that prompts the user to enter the order statistic (a positive number) and the name of a file of text.
    • Read the file of text, storing a special corpus dictionary.
    • Print the corpus dictionary.
    • Prompt the user to enter the orderstatistic number of words, and the number of words to generate, then print a list of that many words randomly generated from the corpus.
    Your program will "learn" the word pattern of an author (based on some "order statistic" and reading a large sample of the author's writing) and then generate random text following the author's word patterns.

    Input and Output

    After prompting for the order statistic, read a file of words, building a dict storing data described as: dict[tuple[strn]] -> list[str]. Here the dict's keys are tuples of n words (n is the order statistic) and each key's value is a list of all the words in the text that follow these words: e.g., if n were 2, the dict would contain a key for every pair of words appearing next to each other in the text, and a value that is a list of all the words following this key (no matter where the pair occurs, with NO DUPLICATES allowed in the values list).

    The easiest way to read the words one at a time is to use an iterator over the result returned by the function goody.read_file_values, which is passed an open object to read from. Build the dict by "prereading" n words (by calling next) into a list (assume that this is always possible; how might it not be?); then read the next word and put it in as a value associated with the list of pre-read words; then, drop the oldest word in the list, and add this next word to the end of the list (always keeping the list length at n); read the next word, and repeat the process for this word, continuing until there are no words to read. You can use a normal for iterator over the values remaining. Remember to convert this list to a tuple before using it as a key in the dict.

    For a simple example, the file ndfainput1.txt contains the following lines (it could have all this information on one line or more lines):

      a b c b a d c b a d
      c a a b a a d

    Print all the associations in the dict, one per line in standard lexical order; after printing all associations, print the length of the smallest and largest list that is a value in the dict. Each line contains an n word tuple, followed by the list of unique words that follow them in the text. In standard lexical order, the keys appear in order relative to the first word in the tuple; for all first words that are the same, they appear in order relative to the second word in the tuple; for all first and second words that are the same, they appear in order relative to the thrid word in the tuple; etc. (see the example below).

    For example, the file above would produce:

      Corpus
        ('a', 'a') can be followed by any of ['b', 'd']
        ('a', 'b') can be followed by any of ['c', 'a']
        ('a', 'd') can be followed by any of ['c']
        ('b', 'a') can be followed by any of ['d', 'a']
        ('b', 'c') can be followed by any of ['b']
        ('c', 'a') can be followed by any of ['a']
        ('c', 'b') can be followed by any of ['a']
        ('d', 'c') can be followed by any of ['b', 'a']
      min/max = 1/2

    For example, ('a','d') appears three times in the text above, twice followed by 'c' and once followed by nothing (at the end of the file); ('a','b') appears twince in the file above, first followed by 'c' and second followed by 'a'.

    Prompt the user for the words to start with (there are order statistic number of them) and the number of random words after that to generate. Produce the list of words and print it.

    A random 10 word list, after the words a and d might print as

        Random text = ['a', 'd', 'c', 'a', 'a', 'd', 'c', 'a', 'a', 'd', 'c', 'b']
    In the result we start with a d (specified by the user), we know only c can come next; then using d c we know that either b or a must come next; it randomly chooses a...

    Functions and Script

    • read_corpus has an order statistic parameter and and open (file) parameter; it returns the dict representing the corpus of words in a file (mine is 9 lines).
    • print_corpus has a dict parameter (representing the corpus); it returns nothing, but it prints the corpus in the appropriate form followed the min and max value list lengths (mine is 8 lines).
    • produce text has a dict parameter (representing the corpus), a list parameter (representing the starting words), and an int count parameter (representing the number of additional words to generate); it returns a list that contains the the staring words followed by the generated words. Hint: use two lists (list[str]) both starting out with these starting words words. The first will always contain the current n words (to be coverted to a tuple and used as a key in the dict); the second will contain all the generated words. Generate a random next word from the dict using the choice function in the random modulesb: add it to both lists; then, drop the first word from the first list, so it remains a list of length n; repeat until you have generated the required number of words. Warning: you might have to stop prematurely if you generate the last n words in the text, and if these words occur nowhere else. That is because in this case, there is no random word to generate following them; in this case append a None to the end of the list of words and immediately return that list (mine is 11 lines). A slightly more elegant solution in Python uses only one list, copying the last order-statistic values of it into a tuple when needed for a key to the dictionary (e.g., l[-os:]; mine using this approach is 9 lines).
    • Write a script at the bottom of this module that calls these functions to solve the problem. (mine is 8 lines).

    Sample Interaction

    The program, as specified, will have the following interaction: user-typed information appears in italics. Your output should match this one.
      Enter order statistic: 2
      Enter file to process: wginput1.txt
      Corpus
        ('a', 'a') can be followed by any of ['b', 'd']
        ('a', 'b') can be followed by any of ['c', 'a']
        ('a', 'd') can be followed by any of ['c']
        ('b', 'a') can be followed by any of ['d', 'a']
        ('b', 'c') can be followed by any of ['b']
        ('c', 'a') can be followed by any of ['a']
        ('c', 'b') can be followed by any of ['a']
        ('d', 'c') can be followed by any of ['b', 'a']
      min/max = 1/2
      
      Enter 2 words to start with
      Enter word 1: a
      Enter word 2: d
      Enter # of words to generate: 10
      Random text = ['a', 'd', 'c', 'a', 'a', 'd', 'c', 'a', 'a', 'd', 'c', 'b']

    You can also try reading a much larger file included in this project folder huck.txt, Mark Twain's, "The Adventures of Huckleberry Finn". I tried it with an order statistic of 3. The corpus has over 90,000 entries; the biggest key triple had an associated value with 46 uniqeue words in it. The key was ('out', 'of', 'the') and its associated value was the list

      ['window', 'face', 'woods', 'fourth', 'front', 'jacket', 'hole', 'canoe', 'middle',
       "ferryboat's", 'cottonwood', "captain's", 'river', 'fog', 'range', 'presbyterian', 'tree',
       'nest', 'wagon-troughs', 'reach', 'store', 'way', 'wigwam', 'ark', 'room', 'corner', 'grave',
       'nonesuch', 'trouble', 'kitchen', 'old', 'first', 'hardest', 'nigger-patch', 'sugar-bowl',
       'window-hole', 'brass', 'spoon', 'house', 'tooleries', 'bag', 'office', 'post-office', 'cabin',
       'path', 'chains']

    With the appropriate modification, we can use this same program to read/generate music or DNA sequences or any other data made of symbols.


    Extra Credit Write up all three answers in and drop them off on checkmate.
    1. Specify a fa that is equivalent to the ndfa that finds inputs that end in 01. Ending in which state signifies inputs ending in 01?
    2. What is interesting about the results of the election using the votepref3.txt input file? How could we add another voter, such that his/her preferences allow X to win on the first ballot?
    3. In the word generator program we used a dict value that was a list of words that was to contain no duplicates? Why can't we just use a set (what would break in our code)? If we wanted to use a set how could we modify the code to work correctly (what is the smallest and/or fastest modification)?