Program 1

Iteration and Major Data Types:
List, Tuple, Set, and Dict (and Open for files)

ICS-33: Intermediate Programming


Introduction This programming assignment is designed to ensure that you know how to use combinations of Python's most important data types to model and compactly write code that solves a wide variety of different programming problems. The kind of abstract thinking that goes into modeling solutions to these programming problems with these data types (and iteration over them) is critical to your development as computer scientists.

There are five parts to this assignment. In each you will be asked to write a module that contains a few functions and a script at the bottom, which ties these functions together to solve the problem.

You should download the program1 project folder and use it to create an Eclipse project. The project folder contains files for all the modules in which to write your functions and scripts; it also contains all the data files that you need to test/debug your modules; finally, it contains all the batch self-check files I will use when grading your programs. In your modules, you may import additional standard/courselib modules and you may write additional helper functions.

I recommend that you work on this assignment in pairs, and I recommend that you work with someone in your lab section (so that you have 4 hours each week of scheduled time together). These are just recommendations. Try to find someone who lives near you, with similar programming skills, and work habits/schedule: e.g., talk about whether you prefer to work mornings, nights, or weekends; what kind of commitment you will make to submit program early.

Only one student should submit all parts of the the assignment, but both students' UCInetID and name should appear in a comment at the top of each submitted .py file. A special grading program reads this information. The format is a comment starting with Submitter and Partner (when working with a partner), followed by a colon, followed by the student's UCInetID (in all lower-case), followed by the student's name in parentheses (last name, comma, first name -capitalized appropriately). If you omit this information, or do not follow this exact form, it will require extra work for us to grade your program, so we will deduct points.

For example if Romeo Montague (whose UCInetID is romeo1) submitted a program that he worked on with his partner Juliet Capulet (whose UCInetID is jcapulet) the comment at the top of each .py file would appear as:

# Submitter: romeo1(Montague, Romeo)
# Partner  : jcapulet(Capulet, Juliet)
# We certify that we worked cooperatively on this programming
#   assignment, according to the rules for pair programming
If you do not know what the terms cooperatively and/or rules for pair programming mean, please read about Pair Programming before starting this assignment. Please turn in each program as you finish it, so that I can more accurately assess the progress of the class as a whole during this assignment.

Print this document and carefully read it, marking any parts that contain important detailed information that you find (for review before you turn in the files). The code you write should be as elegant and compact as possible, using appropriate Python idioms. You should familiarize yourselves with the safe_open function in the goody module and all the functions in the prompt module, both of which you should have installed in your courselib folder as part of the Eclipse/Python installation. Recall how to use the sep and end parameters in the print function.

Reread the section on Time Management from Programming Assignment 0 before starting this assignment.

IMPORTANT 1: Before starting this assignment, download the xref project folder which contains a small Python script xref.py that produces a cross-reference of all the words (converted to lower case) in a file (where words appear with spaces between them: see xrefin.txt for an example): the words are listed in alphabetic order followed by a set (i.e., no duplicates) of the line numbers it appears on (listed in increasing numeric order). Before solving the problems in this programming assignment, ensure you understand all the details of how this program works: look at features and functions like safe_open, defaultdict (and how it is used), enumerate, rstrip and lower, split and join, sorted, for loops with two (unpacked) indexes, the two comprehensions (in the call to max and join), and format. These are the building blocks for many parts of this assignment; explore and experiment with this code to understand how all the parts work together to achieve the desired result. Run this code on more complicated data files.

IMPORTANT 2: This assignment has 5 parts: pairs should work on each part together, not split them up and do them separately. Parts 1-3 are going to be worth 12 points each; parts 4-5 are to be worth 7 points each. This skewing of points towards the simpler parts means students finishing the first 3 parts correctly will yield a 72% average; those finishing the first 4 parts correctly will have an 86% average; but to get an A on this assignment requires solving all parts correctly. Parts 1-2 have a due date before Parts 3-5. I strongly recommend finishing the first part by the weekend, and then finishing another part every few days. In the past, many students waited until the last few days and then tried to write all their solutions: that is a recipe for learning little and getting a poor grade (or worse, cheating and being caught). So, now I have two different due dates for Parts 1-2 and Parts 3-5. Remember that I'm going to be running MOSS on all the parts of this assignment, checking for duplicate solutions.

IMPORTANT 3: I will mostly grade all these programs automatically, using the batch self-check files provided in the download. Use the driver program (explored in Programming Assignment #0) to run the batch-self check files in this assignment; debug any errors that they produce. But the TAs (with some automated tools) will also look at/run the code in some of your scripts: so the scripts need to follow exactly what is shown in the Sample Interactions part for each problem. I suggest testing your code first to match the scripts; when those results are correct, test it using the batch self-check files. Finally, if a submitted Python module contains even one syntax error or bad import, it will fail all its batch self-checks; ensure that you submit modules with no syntax or bad import errors (Python sometimes adds strange imports at the top of your file; ensure that all your imports are reasonable).


Problem #1: Influencers

Problem Summary:

Write the required functions and script that prompts the user for the name of a file representing a graph; reads the file (storing the graph in a dictionary); prints the graph/dictionary in a special form; computes an approximation to the minimmum influencers (a special set of node names in the graph); repeatedly prompts the user for a set of node names in the graph (rejecting sets that specify any node name not in the graph); computes and prints all the nodes that are "influenced" (see below) by the node names in the set.

We can use a graph to describe friendships or professional relationships, as Facebook and LinkedIn do. Each node in a friendshiop graph is the name of a person, and that person has edges (arrows) leading from his/her node to the nodes of all his/her friends. In this model, friendship is symmetric (goes both ways): if a is b's friend (there is an arrow from a to b), then b is a's friend (there is an arrow from b to a). For simplicity, we will represent these two arrows by one double arrow: a single line with arrowheads at both ends. Such a graph, with all double-arrow edges, is called undirected. The undirected graph below shows all the friendships among the names a-k: note tha there k is edgeless, meaning he/she has no friends.

So, for example, in the above graph c has 4 friends: a, b, d, and g. Likewise g has 3 friends: c, h, and j. Of course, because c has g as a friend, g has c as a friend: see the double arrow. Note that h has just 1 friend: g. Finally k has 0 (no) friends.

Assume that people in a friendship graph can influence their friends: specifically, assume in this example that if half or more of a person's friends like a song, then that person will decide (be influenced) to like the song too. For example, c has 4 friends, so if 2 or more of c's friends like a song, then c will be influenced to like it too; it doesn't matter which of c's friends like the song, so long as 2 or more of his/her friends like it.

How does this rule apply if a person has an odd number of friends? For example, g has 3 friends, so half for g's friend would be 3/2 = 1.5 friends; in this case, we round upward to an integer and require 2 or more of g's 3 friends to like the song in order to influence g to like it.

In fact Python's math.ceil (ceiling) function computes and returns the smallest integer >= to its argument: so, calling math.ceil(1.5) returns 2, and calling math.ceil(2) also returns 2. Therefore, math.ceil(#friends/2), whether #friends is even or odd, correctly computes the minimum number of friends (an int) needed to influence that person. Remember ceil: you will use it in your code; it is already imported into the module you will download, into which you will write your code.

Let's see how a few people's influence can spread through this friendship graph. To start, let's suppose that that person b and e like a song. Here is how they can influence their friends, illustrated and explained below.

Computing which friends are influenced by b and e:

  1. At the start, only b and e like a song; if a person likes a song, we will write a * after their name in the graphs above, so the node names b* and e* appear in the first graph.
  2. a now likes the song, because 1 of a's 2 friends (>= half) like the song: b.
  3. c now likes the song, because 2 of c's 4 friends (>= half) like the song: a and b.
  4. d now likes the song, because 2 of d's 3 friends (>= half) like the song: c and e.
  5. f now likes the song, because 1 of f's 1 friends (>= half) like the song: d.
As this point, no other people are influenced by enough of their friends to like the song; specifically, g needs 2 of his/her 3 friends to to inluence him/her to like the song, but only 1 of his/her friend likes it. So b and e were not influential enough to make everyone in the graph like a song, but they did influence a, c, d, and f.

In fact, no pair of peopole wields enough influence for everyone in this graph to like a song (note that no one can influence k to like it). But if d, g, and k like a song, they can influence everyone else to like it too (check that this statement is true, to ensure that you understand the influencing process). This is the minimum number of people needed to influence everyone in this friendship graph. In this problem, we will learn how to represent friendship graphs in Python and implement an algorithm that computes a small set of nodes that can influence all the others. If we wanted to create a viral marketing campaign to promote say a new song (or any belief), we would concentrate our efforts on this set, because they could convince everyone else in their social network.

Input and Output:

We will store this graph in a dictionary: each person's/node's name will be a key (str) whose associated value is a set of str of the people/node names of his/her friends.

Read a file where each line is either one node name (a person with no identifiable friends) or a pair of node names (representing an undirected friendship edge) in the graph. Note that node names may be any number of characters (not just the single characters used in this example), separated by one semicolon character. Build a dictionary whose keys are str node names, and whose associated values are sets of str node names that are friends. We annotate this dictionary as {str:{str}}.

For example, the input file graph1.txt contains the following lines (which could appear in this order, or any other order, and still produce the same dictionary). By convention, we will put all friendless names at the end of the file.

  a;b
  a;c
  b;c
  c;d
  d;e
  d;f
  c;g
  g;h
  g;j
  i;j
  k
which represent the original friendship graph we examined above.

Print the graph, one node name per line followed by the set of all his/her friend's node names (alphabetically). This graph above would print as

  Graph: node -> list of all friend nodes
    a -> ['b', 'c']
    b -> ['a', 'c']
    c -> ['a', 'b', 'd', 'g']
    d -> ['c', 'e', 'f']
    e -> ['d']
    f -> ['d']
    g -> ['c', 'h', 'j']
    h -> ['g']
    i -> ['j']
    j -> ['g', 'i']
    k -> []

Note that the node names must be sorted alphabetically; and, the set of associated node names must appear in a list whose values are sorted alphabetically: we use a list, because it makes no sense to talk about sorted sets. Note that because node k appears in the input file on a line by itself, it has no friends. Remember that friendship is symmetric, so the line a;b means b is a's friend and a is b's friend; this means the node name b appears in node a's associated set and the node name a appears in node b's associated set.

There are multiple data files for this program: graph1.txt (shown above), graph2.txt and graph3.txt; test/debug your program on the first file; when you are done, test it on the remaining files. Draw the graph represented by each file to ensure that your code correctly prints it and computes the set of influencer nodes (which you can do by eyeballing the graphs: they are small).

Here is a description of the Influencer algorithm for computing a small set of nodes that can influence every node in the graph. It is guaranteed to produce a small set of nodes with the desired property, but it will not necessarily compute the smallest such set in all cases (that is a much harder problem). You must implement this Influencer algorithm, as it is described below.

It is fairly straightforward to specify the Influencer algorithm, which is straightforward to implement in Python. But, first you must understand these English instructions, and only then can you carefully translate them into Python code. You should hand-simulate this algorithm using the data above, and verify that it produces the results that you expect, before coding it in Python.

  1. Make a dictionary (I'll call it infl_dict here) whose keys are all the node names in the graph and whose associated values are 3-lists. Initially, each 3-list stores
    • at index 0: the number of friends of the node minus the number of friends needed to influence the node (this second value is the ceil of half the number of its friends); but, if a node contains no friends, the value stored at index 0 should be -1.
    • at index 1: the number of friends of the node
    • at index 2: the node name itself (duplicating the key)
    Index 0 for a node specifies its number of friends that can be removed while still allowing the node to be influenced by its remaining friends. For example Index 0 for g is computed as 3-ceil(3/2) = 1 meaning that node c can have 1 of its friends removed and still be influenced by its (2) remaining friends.

    If Index 0 stores a negative value, it cannot be influenced by its friends, so it must be part of the returned set of influencers: see the last pair of lines below.

  2. Repeat the following process until termination:
    1. Create a list of the 3-tuple values currently stored as 3-list values in infl_dict, but only if their index 0 values are non-negative; these are candidates for removal from the infl_dict, since they can still be influenced by their friends.
    2. Terminate the Influencer algorithm if there are no values in this list; otherwise ...
    3. Find the smallest 3-tuple in this list: use the min function (not via sorting). Because 3-tuples are unique, the minimum 3-tuple is unique.
    4. Remove the specified node name (see index 2) from infl_dict.
    5. For every friend of this node in the graph that is also still in infl_dict, decrement its the index 0 and index 1 values by 1 in its associated 3-list.

Upon termination, the keys remaining in infl_dict represent a small (but not necessarily the smallest) set of influencers for the entire friendship graph. Note that although infl_dict changes, the original dictionary storing the friendship graph remains unchanged.

Read these instructions carefully, a few times. Hand simulate these instructions to ensure that you understand the Influencer algorithm; use the data above, which is automatically traced in the example below. Do not attempt to write any Python code to solve this problem until you understand this algorithm and can apply it to the data specified above. Eventually you will write your Python code to produce such a trace conditionally.

Here is a trace (see the 2nd parameter to the find_influencers function described below, which activiates the trace) for the friendship graph specified above. The order of values in the dictionaries and lists are arbitrary; I have written these data structures on multiple lines for formating purposes..

  
  influencer dictionary = {'a': [1, 2, 'a'], 'b': [1, 2, 'b'], 'c': [2, 4, 'c'], 'd': [1, 3, 'd'],
                           'e': [0, 1, 'e'], 'f': [0, 1, 'f'], 'g': [1, 3, 'g'], 'h': [0, 1, 'h'],
                           'j': [1, 2, 'j'], 'i': [0, 1, 'i'], 'k': [-1, 0, 'k']}
  removal candidates    = [(1, 2, 'a'), (1, 2, 'b'), (2, 4, 'c'), (1, 3, 'd'), (0, 1, 'e'),
                           (0, 1, 'f'), (1, 3, 'g'), (0, 1, 'h'), (1, 2, 'j'), (0, 1, 'i')]
  (0, 1, 'e') is the smallest candidate
  Removing e as key from influencer dictionary; decrementing friend's values there

  influencer dictionary = {'a': [1, 2, 'a'], 'b': [1, 2, 'b'], 'c': [2, 4, 'c'], 'd': [0, 2, 'd'],
                           'f': [0, 1, 'f'], 'g': [1, 3, 'g'], 'h': [0, 1, 'h'], 'j': [1, 2, 'j'],
                         'i': [0, 1, 'i'], 'k': [-1, 0, 'k']}
  removal candidates    = [(1, 2, 'a'), (1, 2, 'b'), (2, 4, 'c'), (0, 2, 'd'), (0, 1, 'f'),
                           (1, 3, 'g'), (0, 1, 'h'), (1, 2, 'j'), (0, 1, 'i')]
  (0, 1, 'f') is the smallest candidate
  Removing f as key from influencer dictionary; decrementing friend's values there

  influencer dictionary = {'a': [1, 2, 'a'], 'b': [1, 2, 'b'], 'c': [2, 4, 'c'], 'd': [-1, 1, 'd'],
                           'g': [1, 3, 'g'], 'h': [0, 1, 'h'], 'j': [1, 2, 'j'], 'i': [0, 1, 'i'],
                           'k': [-1, 0, 'k']}
  removal candidates    = [(1, 2, 'a'), (1, 2, 'b'), (2, 4, 'c'), (1, 3, 'g'), (0, 1, 'h'),
                           (1, 2, 'j'), (0, 1, 'i')]
  (0, 1, 'h') is the smallest candidate
  Removing h as key from influencer dictionary; decrementing friend's values there

 influencer dictionary = {'a': [1, 2, 'a'], 'b': [1, 2, 'b'], 'c': [2, 4, 'c'], 'd': [-1, 1, 'd'],
                          'g': [0, 2, 'g'], 'j': [1, 2, 'j'], 'i': [0, 1, 'i'], 'k': [-1, 0, 'k']}
 removal candidates    = [(1, 2, 'a'), (1, 2, 'b'), (2, 4, 'c'), (0, 2, 'g'), (1, 2, 'j'),
                           (0, 1, 'i')]
  (0, 1, 'i') is the smallest candidate
  Removing i as key from influencer dictionary; decrementing friend's values there

  influencer dictionary = {'a': [1, 2, 'a'], 'b': [1, 2, 'b'], 'c': [2, 4, 'c'], 'd': [-1, 1, 'd'],
                           'g': [0, 2, 'g'], 'j': [0, 1, 'j'], 'k': [-1, 0, 'k']}
  removal candidates    = [(1, 2, 'a'), (1, 2, 'b'), (2, 4, 'c'), (0, 2, 'g'), (0, 1, 'j')]
  (0, 1, 'j') is the smallest candidate
  Removing j as key from influencer dictionary; decrementing friend's values there

  influencer dictionary = {'a': [1, 2, 'a'], 'b': [1, 2, 'b'], 'c': [2, 4, 'c'], 'd': [-1, 1, 'd'],
                           'g': [-1, 1, 'g'], 'k': [-1, 0, 'k']}
  removal candidates    = [(1, 2, 'a'), (1, 2, 'b'), (2, 4, 'c')]
  (1, 2, 'a') is the smallest candidate
  Removing a as key from influencer dictionary; decrementing friend's values there

  influencer dictionary = {'b': [0, 1, 'b'], 'c': [1, 3, 'c'], 'd': [-1, 1, 'd'], 'g': [-1, 1, 'g'],
                           'k': [-1, 0, 'k']}
  removal candidates    = [(0, 1, 'b'), (1, 3, 'c')]
  (0, 1, 'b') is the smallest candidate
  Removing b as key from influencer dictionary; decrementing friend's values there

  influencer dictionary = {'c': [0, 2, 'c'], 'd': [-1, 1, 'd'], 'g': [-1, 1, 'g'], 'k': [-1, 0, 'k']}
  removal candidates    = [(0, 2, 'c')]
  (0, 2, 'c') is the smallest candidate
  Removing c as key from influencer dictionary; decrementing friend's values there

  influencer dictionary = {'d': [-2, 0, 'd'], 'g': [-2, 0, 'g'], 'k': [-1, 0, 'k']}
  removal candidates    = []

When the algorithm terminates, the small influencer set is {'d', 'g', 'k'}: the node names remaining in the infl_dict whose associated 3-list stores a negative number at index 0.

Now, repeatedly prompt the user for a set of node names in the graph (until the user enters quit) and compute and print all the nodes these influence (as well as the percentage of nodes in the graph influenced): the default value for the prompt should be all the nodes computed above, which when entered should influence 100% of the nodes in the friendship graph (if they were computed correctly). Reject any set that contains a node name that is not a key in the graph.

Call the function all_influenced to compute this value: it uses the friendship graph and the set of node names the user enters to compute all the influenced nodes in the friendship graph as follows.

  1. Create a dictionary associating every node in the graph with a bool value telling whether it is currently influenced: initiallyl True if and only if it is in the set supplied to the influencers parameter; also compute the number of keys in this dictionary whose associated value is True (which initially is the length of the set).

  2. Repeat the following process until termination:
    1. For every item in the dictionary produced in Step 1, if it is not yet influenced (its associated bool is False) check to see whether it has enough friends (who are influenced) to influence it as well; if so, change its associated dictionary value to True.

      You will need to treat friendless nodes specially: they are never influenced because they have no friends.

    2. If the number of currently influenced nodes has not changed (from the previous time this value was computed), terminate by returning a set of all the dictionary keys whose associated value is True: all the influenced nodes.
An example interaction (processing the graph above) might be
  Enter influencers set (or else quit)[{'d', 'g', 'k'}]: {'b','d'}
  All Influenced (54.54545454545455% of graph)= {'c', 'b', 'a', 'f', 'e', 'd'}

  Enter influencers set (or else quit)[{'d', 'g', 'k'}]: {'a','x'}
    Entry Error: '{'a','x'}';  
    Please enter a legal String

  Enter influencers set (or else quit)[{'d', 'g', 'k'}]: 
  All Influenced (100.0% of graph)= {'h', 'c', 'b', 'a', 'i', 'f', 'j', 'e', 'd', 'g', 'k'}

  Enter influencers set (or else quit)[{'d', 'g', 'k'}]: quit

Functions and Script:

Write the following functions and script. I am providing line counts not as requirements, but to indicate the lengths of well-written Pythonic code.
  • read_graph has an open (file) parameter; it returns the dictionary representing the undirected friendship graph (body is 11 lines).

  • graph_as_str has a dictionary parameter (representing the friendship graph); it returns a multi-line string (each line is ended by '\n'), which when printed shows the contents of the graph in the appropriate textual form (body is 4 lines; can you do it in 1?).

  • find_influencers has a dictionary parameter (representing the graph), as well as a tracing parameter whose default value is False. This function uses the Influencer algorithm described above to compute/return a small set of nodes that can influence everyone in the friendship graph; if tracing is True it creates a trace in the form of the example trace shown above (body is 13 lines, but only 11 lines without tracing code).

  • all_influenced has a dictionary parameter (representing the graph) and a set of initially infuenced nodes. This function uses the algorithm described above to compute/return the set of nodes that can be influenced by this set (body is 10 lines).

  • Write a script at the bottom of this module (in if __name__ == '__main__':) that prompts the user to enter the file storing friendship graph; reads this file and creates the required dictionary; labels and prints the dictionary (using graph_as_str); prompts the user about whether to trace the algorithm, then computes (using find_influencers) and prints the set of influencer nodes; repeatedly prompts the user to enter a set of nodes (rejecting any set that contains a node not in the graph) or the word quit; calls all_influenced using the entered set, and prints the required information (body is 11 lines).

Sample Interaction:

The program, as specified, will have the following interaction: user-typed information appears in italics. Your output should "match" this one (sets will match if they have the same contents, independent of their order). You should also check that it works for other graphs.
  Enter a file storing a friendship graph: graph.txt: 
  Graph: node -> list of all friend nodes
    a -> ['b', 'c']
    b -> ['a', 'c']
    c -> ['a', 'b', 'd', 'g']
    d -> ['c', 'e', 'f']
    e -> ['d']
    f -> ['d']
    g -> ['c', 'h', 'j']
    h -> ['g']
    i -> ['j']
    j -> ['g', 'i']
    k -> []

  Trace the Algorithm[True]: False
  Influencers = {'g', 'k', 'd'}

  Enter influencers set (or else quit)[{'g', 'k', 'd'}]: 
  All Influenced (100.0% of graph)= {'i', 'g', 'f', 'k', 'c', 'j', 'h', 'a', 'b', 'd', 'e'}

  Enter influencers set (or else quit)[{'g', 'k', 'd'}]: quit


Problem #2: Stable Marriage

Problem Summary:

Write the required functions and script that prompts the user for the names of two files: a file representing the marriage preferences of a sequence of men, and a file representing the marriage preferences of a sequence of women; reads these files (storing this information in special data structures: dictionaries storing preference as lists); prints the dictionary/preferences of the men and women in a special form; runs the Gale/Shapley algorithm for finding a stable marriage (tracing its progress, if required); prints the stable marriage as a set of man/woman tuples.

Suppose that N men and N women want to match in a heterosexual marriage. Each produces a list of his/her preferences, ranking all members of the opposite gender in highest to lowest order of acceptability as a partner. The Gale/Shapley algorithm (described in detail below) matches men and women in stable marriages. Marriages are stable when we cannot find a man and woman, who prefer each other to their match. This scenario can be used to find stable matches in other contexts. For example, this algorithm is used when medical school graduates match with hospitals for their residencies: the students and institutions rank each other and then the algorithm is run, processing these rankings. In this case, the residents propose (act as the men in the description above) and the hospitals accept/reject the proposals (act as the women).

The fundamental data structure used throughout this process (as both arguments to functions and the results returned by functions) is characterized by {str:[str,[str]]}, which describes a dictionary whose keys are associated with 2-lists. The dictionary keys are the names of men/women (str). Each man/woman is associated with a 2-list, whose first index is the current match of that person (str), and whose second index is a list ranking other possible matches (list of str), from highest to lowest preference.

For example, the following dictionary represents information about three men participating in marriages (m1, m2, m3):

  {'m1': [None, ['w3', 'w2', 'w1']],
   'm2': [None, ['w3', 'w1', 'w2']],
   'm3': [None, ['w2', 'w1', 'w3']]}
Here, m1 is currently matched with no one (None) and ranks the women, in order of preference, as follows w3 (hightes ranking) followed by w2 followed by w1 (lowest ranking). The other lines in this dictionary are interpreted similarly. In this example (and the ones below) I have printed each key/value pair (alphabetically by key) on its own line; of course, if we print a dictionary in Python, it can print its key/value pairs in any order.

The following dictionary represents information about three women participating in marriages (w1, w2, and w3):

  {'w1': [None, ['m1', 'm2', 'm3']],
   'w2': [None, ['m2', 'm1', 'm3']],
   'w3': [None, ['m3', 'm2', 'm1']]}
Here, w2 is currently matched with no one (None) and ranks the men, in order of preference, as follows m2 (highest ranking) followed by m1 followed by m3 (lowest ranking).

After running the Gale/Shapley algorithm (with men proposing to women, and women accepting or rejecting their proposals: more on these details later), these dictionaries are mutated to

  {'m1': ['w2', ['w3', 'w2', 'w1']],
   'm2': ['w3', ['w3', 'w1', 'w2']],
   'm3': ['w1', ['w2', 'w1', 'w3']]}

  {'w1': ['m3', ['m1', 'm2', 'm3']],
   'w2': ['m1', ['m2', 'm1', 'm3']],
   'w3': ['m2', ['m3', 'm2', 'm1']]}
Note the following invariant: if a man is matched to a woman in the man's dictionary, then that same woman must be matched to that man in the woman's dictionary. Verify that this is true above.

Are these marriages stable? First, let's look just at m1, who is matched to w2. By his preferences, he would rather marry w3, but she prefers m2 (her match) to m1. Now, let's look just at w1, who is matched to m3. By her preferences, she would rather marry m1, but he prefers w2 (his match) to w1; also w1 would prefer to marry m2, but he also prefers w3 (his match) to w1. If you check all the other men and women (do it) you will find that you can find no pair who would both rather marry each other, rather than their current matches, so these marriages are considered stable.

Input and Output:

Read files of men and women and their rankings of all members of the opposite gender (highest to lowest preference), separated by semicolons, building a dictionary like the ones above (where each match is initially the special value None). As described above, we annotate the structure of this dictionary as {str:[str,[str]]}.

In the file, the person's name appears first, followed by the names of all members of the opposite gender in highest to lowest preference, separated by one semicolon character. For example, the input file men0.txt contains the following lines: these line could appear in this order, or any other, but the each man's preferences must appear in decleasing order of preference.

  m1;w3;w2;w1
  m2;w3;w1;w2
  m3;w2;w1;w3
The first line means, m1 ranks the members of the opposite gender in the order of preference from w3, w2, and w1 in decreasing order of preference Each line is guaranteed to start with a unique name, which is guaranteed to be followed by all the names of all members of the opposite gender, each appearing once; and all names are separated by semicolons.

When you print such information, print each person on a separate line, followed by his/her match and preferences. For example, the file above would print as:

  m1 -> [None, ['w3', 'w2', 'w1']]
  m2 -> [None, ['w3', 'w1', 'w2']]
  m3 -> [None, ['w2', 'w1', 'w3']]

Note that the names on the lines must be sorted in alphabetical order; the list of preferences must appear in the same order they appeared in the file. There are multiple pairs of data files for this program, all named like men0.txt and women0.txt; Test/debug your program on the first file; when you are done, test it on the remaining files.

Here is a description of the Gale/Shapley Algorithm for computing a stable marriage. You must implement this algorithm, as it is described below. There might be other stable marriages, but this algorithm will always compute the same one. This algorithm is not symmetric: here men get to propose to women and women get to accept/reject men. If we ran the algorithm the other way (with women proposing to men, and men accepting or rejecting women, we would also get a stable marriage, but the matches might be different.

It is fairly straightforward to specify the Gale/Shapley algorithm, which is straightforward to implement in Python. But, first you must understand these English instructions, and only then can you carefully translate them into Python code. You should hand-simulate this algorithm using the data above, and verify that it produces the results that you expect, before coding the algorithm in Python.

  1. Make a copy of only the men's data structure: create a new dictionary that copies all data, including copying the lists, from the original dictionary. We make a copy because the algorithm below mutates the men's data structure, but we don't want its matching argument to change. In the steps below, muctate the copy, not the parameter.

  2. Make an unmatched set that contains the names of all unmatched men. Initially, all men are unmatched, so this set will contain all the men in the men's dictionary.

  3. Repeat the following process until there are no more unmatched men.
    1. Remove (see the pop operation on sets) any man from the set of unmatched men.

    2. Determine the woman that is highest on his preference list and remove that woman from his preference list (see the pop operation on lists). This man will try to propose to that woman.
      • If that woman is unmatched: match this man and that woman.
      • If that woman is matched, but pefers this man to her current match: unmatch that woman and her current match and add the man that she previously matched (he is now unmatched) to the set of unmatched men. Then, match this man and that woman.
      • If that woman is matched, and pefers her current match to this man, just add this man (still unmatched) back to the set of unmatched men.
Read these instructions carefully, a few times. Do not attempt to write any Python code to solve this problem until you understand this algorithm and can apply it to the data specified below. Hand simulate these instructions to ensure that you understand the algorithm; use the data above, which is automatically traced in the example below. Eventually you will write your Python code to produce such a trace conditionally.

Here is a trace (see the 3rd parameter to the make_match function below) for the men and women data structures specified above.

  Women Preferences (unchanging)
    w1 -> [None, ['m1', 'm2', 'm3']]
    w2 -> [None, ['m2', 'm1', 'm3']]
    w3 -> [None, ['m3', 'm2', 'm1']]

  Men Preferences (current)
    m1 -> [None, ['w3', 'w2', 'w1']]
    m2 -> [None, ['w3', 'w1', 'w2']]
    m3 -> [None, ['w2', 'w1', 'w3']]
 
  unmatched men = {'m2', 'm3', 'm1'} 

  m2 proposes to w3; the unmatched woman accepts the proposal

  Men Preferences (current)
    m1 -> [None, ['w3', 'w2', 'w1']]
    m2 -> ['w3', ['w1', 'w2']]
    m3 -> [None, ['w2', 'w1', 'w3']]
 
  unmatched men = {'m3', 'm1'} 

  m3 proposes to w2; this unmatched woman accepts the proposal

  Men Preferences (current)
    m1 -> [None, ['w3', 'w2', 'w1']]
    m2 -> ['w3', ['w1', 'w2']]
    m3 -> ['w2', ['w1', 'w3']]
 
  unmatched men = {'m1'} 

  m1 proposes to w3; this matched woman rejects the proposal (likes current match better)

  Men Preferences (current)
    m1 -> [None, ['w2', 'w1']]
    m2 -> ['w3', ['w1', 'w2']]
    m3 -> ['w2', ['w1', 'w3']]
 
  unmatched men = {'m1'} 

  m1 proposes to w2; this matched woman accepts the proposal, rejecting match with m3 

  Men Preferences (current)
    m1 -> ['w2', ['w1']]
    m2 -> ['w3', ['w1', 'w2']]
    m3 -> [None, ['w1', 'w3']]
   
  unmatched men = {'m3'} 

  m3 proposes to w1; this unmatched woman accepts the proposal

  algorithm terminated with matches = {('m1', 'w2'), ('m3', 'w1'), ('m2', 'w3')}
The resulting matches form stable marriages (the ones discussed above, when we discussed the meaning of stability). When this algorithm terminates, the local copy of the men's data structure has become
  m1 -> ['w2', ['w1']]
  m2 -> ['w3', ['w1', 'w2']]
  m3 -> ['w1', ['w3']]

Note that each man's preference list shows only the women he did not propose to. Finally, if we used the same algorithm but let the women propose to the men, who accept of reject the proposals, we would get the following matches.

algorithm terminated with matches = {('w2', 'm2'), ('w1', 'm1'), ('w3', 'm3')}

These matches are all different, but the marriages are all still stable. So, who proposes to whom can determine the results of the algorithm: we can run the program, swapping the men's/women's files, to see if it produces an alternative stable matching.

Functions and Script:

Write the following functions and script. I am providing line counts not as requirements, but to indicate the lengths of well-written Pythonic code.
  • read_match_preferences has an open (file) parameter; it returns the dictionary representing each man (or women, depending on which file is read) and his/her match (initially None) and preferences (body is 6 lines).

  • dict_as_str has a men or women dictionary, key function (default None) and bool (default False) as parameters; it returns a multi-line string (each line is ended by '\n'), which when printed shows the contents of the dictionary in the appropriate textual form. The key function determines the ordering and the bool determines whether to reverse it: like the key and reverse parameters used for sort/sorted in Python. This function is used to print both the men's/women's dictionaries, in the form dicussed above in the Input/Output section.

    Important: The key function (and its use when iterating over the dictionary in dict_as_str) must assume that its argument is a key in the dictionary, not an item; otherwise the batch self-check test will fail even if your code works. (body is 4 lines; can you do it in 1?).

  • who_prefer has a list (of str) of preferences and two values (str) that are in the list; it returns the value with the higher preference: e.g., who_prefer(['w3','w1','w2'], 'w2', 'w3') returns w3 -the one present earlier in the list. Hint: I used this function in make_match, defined below (body is 1 line).

  • extract_matches has a men dictionary as a parameter; it returns a set of 2-tuples: each has a match with the man in index 0 and the woman in index 1. Hint: I used this function in make_match defined below (body is 1 line).

  • make_match has a men and women dictionary as parameters, as well as a tracing parameter whose default value is False; it returns a set of 2-tuples: each has a match with the man in index 0 and the woman in index 1. This function uses the Gale/Shapley algorithm described above to find the match; if tracing is True it creates a trace in the form the example trace shown above (body is 25 lines, but only 18 lines without tracing code).

  • Write a script at the bottom of this module (in if __name__ == '__main__':) that prompts the user to enter the files storing the men and women preferences; reads these files and creates the required dictionaries; labels and prints both dictionaries (using dict_as_str); prompts the user about whether to trace the matching, then computes (using make_match) and prints the stable matches.

Sample Interaction:

The program, as specified, will have the following interaction: user-typed information appears in italics. Your output should match this one.
  Enter a file storing preferences of men: men0.txt
  Enter a file storing preferences of women: women0.txt

  Men Preferences
    m1 -> [None, ['w3', 'w2', 'w1']]
    m2 -> [None, ['w3', 'w1', 'w2']]
    m3 -> [None, ['w2', 'w1', 'w3']]

  Women Preferences
    w1 -> [None, ['m1', 'm2', 'm3']]
    w2 -> [None, ['m2', 'm1', 'm3']]
    w3 -> [None, ['m3', 'm2', 'm1']]

  Trace the Algorithm[True]: False

  matches = {('m1', 'w2'), ('m3', 'w1'), ('m2', 'w3')}

Note that if the user specified True for tracing the algorithm, the program would also print all the information shown above in the example of tracing the Gale/Shapley algorithm. Finally, you can also try processing the men1.txt/women1.txt and men2.txt/women2.txt pairs of files. You can print these data files and hand-simulate the Gale/Shapely algorithsm on them to compute their stable matches. You can also feed the women file in as the men file, and the men file in as the women file, to see the match that results from letting women propose and men accept or reject: it can produce different matching pairs, but the matching pairs it produces will be stable.


Problem #3: Finite Automata

Problem Summary:

Write the required functions and script that prompts the user for the name of a file representing a finite automaton: indicating its states and input->state transitions; reads the information in the file (storing the finite automaton in a dictionary); prints the finite-automaton/dictionary in a special form; prompts the user for the name of a file storing the start-state and inputs to process (each line in the file contains this combination); repeatedly processes these lines, computing the results of the finite automaton on each input, and then prints the results. Note that a finite automaton is really a program; in this problem we are reading a program from a file and then executing it (running the finite automaton) on various inputs. So, we are really writing a compiler/interpreter for a small programming language.

A finite automaton (FA) is an machine that is sometimes called a Deterministic Finite Automaton (DFA; see the next problem for an NDFA: a non-deterministic finite automaton). An FA is described by its states and its transitions: each transition for a state specifies an input and what state in the FA that input leads to. We can illustrate an FA as a graph with state labels in circles and edge labels for transitions (see below).

Input and Output:

Read a file that describes a FA: each line contains a state and an arbitrary number of input->state transitions. Build a dictionary such that each key is a str state and whose associated value is another dictionary specifying all the transitions from that state: this second dictionary has keys that are str inputs and associated values are str states. The first token on each line is the str state and the remaining tokens (always coming in pairs) are str inputs and their resulting states. All tokens (which can comprise any number of characters) are separated by one semicolon character. We annotate this dictionary as {str:{str:str}}.

For example, the input file faparity.txt contains the following lines (which could appear in this order, or any other and still specify the same FA):

  even;0;even;1;odd
  odd;0;odd;1;even
Here is a picture of the parity FA. It graphically illustrates the two states (even and odd) and their transitions, using inputs (0 and 1) that always lead back to one of these two states.

Here, the state even (meaning it has seen an even number of 1 inputs so far) is a key in the main dictionary. It's value is a dictionary with two key/value pairs 0->even and 1->odd. It means that in the even state, if the input is a 0 the FA stays in the even state; if the input is a 1 the FA goes to the odd state. And similarly (the next line) means that for the odd state, if the input is a 0 the FA stays in the odd state; if the input is a 1 the FA goes back to the even state. So, seeing an input of 0 keeps the FA in the same state; seeing an input of 1 flips the FA into the other state.

Print the finite automaton, one state (and its transitions) per line; the states are printed alphabetically and the transition dictionary for each state is printed as a list of input/state items (tuples) such that these are printed alphabetically by the inputs.

For example, the file above would print as:

  The Description of this Finite Automaton
    even transitions: [('0', 'even'), ('1', 'odd')]
    odd transitions: [('0', 'odd'), ('1', 'even')]

Note that there are multiple data files for this program: faparity.txt and fadivisibleby3.txt; test/debug your program on the first file; when you are done, test it on the last file. Draw the FA represented by each for to ensure that your code correctly prints and computes with it. Important: This task is not to write a Python code that simulates the Parity FA; it is to write code that simulates any FA, whose description it reads from a file.

Next, repeatedly read and process lines from a second input file, computing the results of the finite automaton running on the specified start-state and its inputs; then print out the results in a special form. Each line in the file contains a start-state followed by a sequence of inputs (all separated by semicolons). The start-state will be a state in the FA (it is a key in the outer dictionary) the inputs may specify legal or illegal transitions (may or may not be keys in some inner dictonary).

For example, the input file fainputparity.txt contains the following three lines:

  even;1;0;1;1;0;1
  even;1;0;1;1;0;x
  odd;1;0;1;1;0;1
The first line means, the start-state is even and the inputs are 1, 0, 1, 1, 0, and 1.

The result of processing each line is to print the start-state, and then each input and the new state it transitions to, and finally print the stop-state. For the parity FA and the first line in this file, it should print

Start state = even
  Input = 1; new state = odd
  Input = 0; new state = odd
  Input = 1; new state = even
  Input = 1; new state = odd
  Input = 0; new state = odd
  Input = 1; new state = even
Stop state = even

Note that the second line contains an input x which is not a legal input allowed in any state; any such input should stop the simulation for that line only, continuing to start a new simulation for all following lines (as illustrated in the Sample Interaction).

Functions and Script:

Write the following functions and script. I am providing line counts not as requirements, but to indicate the lengths of well-written Pythonic code.
  • read_fa has an open (file) parameter; it returns the dictionary representing the finite automaton; hint: I used splicing and the zip function to build the inner dictionaries. (body is 6 lines).

  • fa_as_str has a dictionary parameter (representing the FA); it returns a multi-line string (each line is ended by '\n'), which when printed shows the contents of the FA in the appropriate textual form (body is 4 lines; can you do it in 1?).

  • process has a dictionary parameter (representing the FA), a str parameter (representing the start-state), and a list parameter (representing a list of str inputs); it returns a list that contains the start-state followed by tuples that show the input and resulting state after each transition. For the example shown above, process returns the following list.
    ['even', ('1', 'odd'), ('0', 'odd'), ('1', 'even'), ('1', 'odd'), ('0', 'odd'), ('1', 'even')]
    Finally, if an input is illegal (is not the key in some transition for the current state), say 'x', for the parity FA, then process should terminate with the last tuple in the list indicating a problem: ('x', None) (body is 9 lines).

  • interpret has a list parameter (the list result produced by process); it returns a multi-line string (each line is ended by '\n'), which when printed illustrates the results of processing an FA on an input in the appropriate textual form. See how it prints the example list argument shown above in the output further above. Also see the Sample Interaction below to see how it prints input errors: see the middle example (body is 9 lines).

  • Write a script at the bottom of this module (in if __name__ == '__main__':) that prompts the user to enter the file describing the FA, prints it, prompts the user to enter the file containing lines of start-states and input, simulates the FA on each line, printing the results in the appropriate textual form (body is 7 lines).

Sample Interaction:

The program, as specified, will have the following interaction: user-typed information appears in italics. Your output should match this one.
  Enter a file storing a finite automaton: faparity.txt

  The Description of this Finite Automaton 
    even transitions: [('0', 'even'), ('1', 'odd')]
    odd transitions: [('0', 'odd'), ('1', 'even')]

  Enter a file storing a start-state and its inputs: fainputparity.txt
  
  Starting up a new FA simulation
  Start state = even
    Input = 1; new state = odd
    Input = 0; new state = odd
    Input = 1; new state = even
    Input = 1; new state = odd
    Input = 0; new state = odd
    Input = 1; new state = even
  Stop state = even
  
  Starting up a new FA simulation
  Start state = even
    Input = 1; new state = odd
    Input = 0; new state = odd
    Input = 1; new state = even
    Input = 1; new state = odd
    Input = 0; new state = odd
    Input = x; illegal input: simulation terminated
  Stop state = None

  Starting up a new FA simulation
  Start state = odd
    Input = 1; new state = even
    Input = 0; new state = even
    Input = 1; new state = odd
    Input = 1; new state = even
    Input = 0; new state = even
    Input = 1; new state = odd
  Stop state = odd

You can also try the fadivisibleby3.txt finite automaton file, which determines whether an integer (sequence of digits) is divisible by 3: it is divisible if the finite automaton stops in state rem0. Its input file fainputdivisibleby3.txt tries the number 12,435,711, which is divisible by 3 and number 823, which is not divisible by 3: dividing 823 by 3 leaves a remainder of 1.


Problem #4: Non-Deterministic FA

Problem Summary:

Write the required functions and script that solve, for a Non-Deterministic Finite Automaton, the same problem that was solved for a Deterministic Finite Automaton in Problem #3 (above). Read about the differences between these two automata (below). Hint: Adapt your code for the FA problem to solve the more general NDFA problem.

A non-deterministic finite automaton (NDFA) is machine described by its states and its transitions: each transition for a state specifies an input and a set of states (more than one is allowed) that input can lead to: sets with more than one states is what makes it non-deterministic. We can illustrate a NDFA as a graph with state labels in circles and edge labels for transitions (see below). The critical difference between an FA and an NDFA is that an NDFA can have multiple edges with the same label going to different states (we'll see how to represent and simulate such transitions below).

Input and Output:

Read a file that describes a NDFA: each line contains a state and an arbitrary number of input->state transitions. Build a dictionary such that each key is a str state and whose associated value is another dictionary specifying all the transitions from that state: this second dictionary has keys that are str inputs and associated values that are sets of str states: all the states a particular input can lead to. The first token on each line is the str state and the remaining tokens (always coming in pairs) are str inputs and states. All tokens (which can comprise any number of characters) are separated by one semicolon character. We annotate this dictionary as {str:{str:{str}}}.

For example, the input file ndfaendin01.txt contains the following lines (which could appear in this order, or any other and still specify the same NDFA):

  start;0;start;1;start;0;near
  near;1;end
  end
Here is a picture of the endin01 NDFA. It graphically illustrates the three states (start, near, and end) and their transitions, using inputs (0 and 1).

Here, the state start is a key in the main dictionary. It's value is a dictionary with two key/value pairs: 0 mapping to the set containing start and near, and 1 mapping to the set containing just start. It means that in the start state, if the input is a 0 the NDFA can stay in the start state or it can go to the near state; if the input is a 1 the NDFA must stay in the start state. And similarly the next line means that in the near state, if the input is a 1 the NDFA must go into the end state. The last line means that the end state has no transitions out of it.

Print the NDFA, one state (and its transitions) per line; the states are printed alphabetically and the transition dictionary for each state is printed as a list of input/set of state items (2-tuples) such that these are printed alphabetically by the inputs, and the set of states for each input is printed as an alphabetically sorted list (e.g., near comes before start). Note that the state end is a key in the main dictionary, whose associated transitions are an empty dictionary.

For example, the file above would produce:

  The Description of this Non-Deterministic Finite Automaton 
    end transitions: []
    near transitions: [('1', ['end'])]
    start transitions: [('0', ['near', 'start']), ('1', ['start'])]

Note that there are multiple data files for this program: ndfaendin01.txt and ndfatrain.txt and ndfare.txt;; test/debug your program on the first file; when you are done, test it on the last file. Draw the FA represented by each for to ensure that your code correctly prints and computes with it.

Next, repeatedly read and process lines from a second input file, computing the results of the non-determinisitc finite automaton on the specified start-state and its inputs ; then print out the results in a special form. Each line in the file contains a start-state followed by a sequence of inputs (all separated by semicolons). The start-state will be a state in the FA (it is a key in the outer dictionary) the inputs may specify legal or illegal transitions (may or may not be keys in some inner dictionary).

For example, the input file ndfainputendin01.txt contains the following two lines:

  start;1;0;1;1;0;1
  start;1;0;1;1;0;0
For example, the first line means, the start-state is start and the inputs 1, 0, 1, 1, 0, and 1.

The result of processing each line is to print the start-state, and then each input and the new states (plural) it could transition to (the could is what makes it non-deterministic), and finally print the stop-states. For the ndfaendin01 NDFA and the first line in this file, it should print

  Start state = start
    Input = 1; new possible states = ['start']
    Input = 0; new possible states = ['near', 'start']
    Input = 1; new possible states = ['end', 'start']
    Input = 1; new possible states = ['start']
    Input = 0; new possible states = ['near', 'start']
    Input = 1; new possible states = ['end', 'start']
  Stop state(s) = ['end', 'start']

Note that the set of states it might be in are printed as an alphabetized list. Also note especially that in the start state, if the input is a 0, then the NDFA can either remain in the start state or go into the near state. For this program, we keep track of all states that the NDFA can be in, using a set of new possible states. For the next input, 1, we can be either in the start state (from the start state; an input of 1 allows us to stay in the start state) or the end state (from the near state; an input of 1 allows us to transition to the end state). Thus, we keep track of the set of states the NDFA can be in, and the new set of states the NDFA can be in after processing the next input. In this example, because 'end' is included in the stop-states, this input does end in 01.

For any state that does not have a transition specifying the current input, ignore that input for that state. For example, if near is one of the possible states and 0 is the input, ignore the 0 for the near state.

Functions and Script:

Write the following functions and script. I am providing line counts not as requirements, but to indicate the lengths of well-written Pythonic code.
  • read_ndfa has an open (file) parameter; it returns the dictionary representing the non-deterministic finite automaton; hint: I used splicing and the zip function to build the inner dinctionaries, and I called the setdefault function for the inner dict: alternatively I could have built it as defaultdicts from the standard collections module (body is 9 lines).

  • ndfa_as_str has a dictionary parameter (representing the FA); it returns a multi-line string (each line is ended by '\n'), which when printed shows the contents of the NDFA in the appropriate textual form (body is 4 lines; can you do it in 1?).

  • process has a dictionary parameter (representing the NDFA), a str parameter (representing the start-state), and a list parameter (representing a list of str inputs); it returns a list that contains the start-state followed by tuples that show the input and resulting set of states after each transition. For the example shown above, process returns the following list.

      ['start', ('1', {'start'}), ('0', {'near', 'start'}), ('1', {'end', 'start'}), ('1', {'start'}),
        ('0', {'near', 'start'}), ('0', {'near', 'start'})]
    Finally, remember that if an input is illegal for the current state (is not the key in some transition for the current state), just ignore it. But if the input leads to no possible states (the empty set of states) terminate processing there (body is 12 lines).

  • interpret has a list parameter (the list result produced by process); it returns a multi-line string (each line is ended by '\n'), which when printed illustrates the results of processing an NDFA on an input in the appropriate textual form. Note that in this output the sets computed in process appear as lists sorted alphabetically by state. See how it prints the example list argument shown above in the Sample Interaction below (body is 5 lines).

  • Write a script at the bottom of this module (in if __name__ == '__main__':) that prompts the user to enter the file describing the NDFA, prints it, prompts the user to enter the file containing lines of start-states and input, and simulates the NDFA on each line, printing the results in the appropriate textual form (body is 7 lines).

Sample Interaction:

The program, as specified, will have the following interaction: user-typed information appears in italics. Your output should "match" this one.
  Enter a file storing a non-deterministic finite automaton: ndfaendin01.txt

  The Description of this Non-Deterministic Finite Automaton
    end transitions: []
    near transitions: [('1', ['end'])]
    start transitions: [('0', ['near', 'start']), ('1', ['start'])]

  Enter a file storing a start-state and its inputs: ndfainputendin01.txt

  Starting up a new NDFA simulation
  Start state = start
    Input = 1; new possible states = ['start']
    Input = 0; new possible states = ['near', 'start']
    Input = 1; new possible states = ['end', 'start']
    Input = 1; new possible states = ['start']
    Input = 0; new possible states = ['near', 'start']
    Input = 1; new possible states = ['end', 'start']
  Stop state(s) = ['end', 'start']
  
  Starting up a new NDFA simulation
  Start state = start
    Input = 1; new possible states = ['start']
    Input = 0; new possible states = ['near', 'start']
    Input = 1; new possible states = ['end', 'start']
    Input = 1; new possible states = ['start']
    Input = 0; new possible states = ['near', 'start']
    Input = 0; new possible states = ['near', 'start']
  Stop state(s) = ['near', 'start']

In Week #2 of this course we will cover EBNF and regular expressions, which relate to the files below. You can run these files on your code to ensure they produce the correct results.

The ndfatrain.txt file is a non-deterministic finite automaton that determines whether or not a train (a sequence of characters representing different kinds of cars) is a legal train according to Chapter Exercise #7 in the ENBF lecture. Its input file is ndfainputtrain.txt, which starts with a legal train (one that ends with the state done as one possible state) followed by an illegal train (one that does not end with the state done as one possible state).

The ndfare.txt file is a non-deterministic finite automaton translation of the regular expression ((a*|b)cd)+. Its input file is ndfainputre.txt, which starts with a match (one that ends with the state last as one possible state) followed by a non-match (one that does not end with the state last as one possible state).


Problem #5: Google Queries

Problem Summary:

Write the required functions and script that prompts the user to enter the name of a file of text that contains a sequence of Google queries; reads the file (storing it in two special prefix and query dictionaries); repeatedly: prints the dictionaries in a special form; prompts the user to enter a query prefix, displays the top three queries with that prefix; prompts the user to enter a new full query, and then updates the dictionaries as if this new full query had appeared at the end of the file initially read.

Background:

When we type a word (or a few words) into Google's query box, it shows some of the most frequently entered queries starting with those word(s). For example, when I recently typed the word uci into Google, it showed the following as the 3 most frequent queries starting with uci:
  • uci law
  • uci medical center
  • uci women's soccer
I could have clicked on one of these queries to select it, or continued typing more words to specify my own (different) query.

Here we say uci is a prefix, which is the beginning of some full query, like uci medical center.

Google represents a full query as a tuple of str (words). For example, ('uci', 'medical', 'center') is a full query. Google also represents a prefix as a tuple of str (words). For example, ('uci',) is a one-word prefix and ('uci', 'medical') is a two-word prefix of this full query.

From any full query we can compute a set of all its prefixes. For example, the full query ('uci', 'medical', 'center') would compute the prefix set {('uci',), ('uci', 'medical'), ('uci', 'medical', 'center')}. The prefix set includes a tuple of the first word, a tuple of the first two words, ... and finally a tuple of all the words in the full query.

Google stores information (in dictionaries) that allows it to predict the most likely full query from any prefix the user enters in the Google search box (as discussed in the example above). The prediction is based on (1) knowing all the full queries for a prefix and (2) knowing how many times each full query was used. Using this information, Google can show the user the most frequently entered full queries for the prefix he/she typed.

Google stores two dictionaries to accomplish this task. Recall that dictionary keys and set values must be immutable types: tuples are immutable (as are strings and integers) but not lists.

  1. Google stores a prefix dictionary whose key is a prefix (a tuple) and whose associated value is a set of all the full queries that have been entered for that prefix.

  2. Google stores a query dictionary whose key is a full query (again a tuple) and whose associated value is an int: the number of times (the frequency) that that full query was used.
IMPORTANT: Use a defaultdict to store both of these dictionaries.

In this program you will build these dictionaries and then use them it to predict a full query from a prefix entered by the user, and update the dictionaries for any new query.

Input and Output:

After prompting the user for the file of full queries, read the file, building the prefix and query dictionaries (we are dropping the word full now).

For a simple example, the file googleq0.txt contains the following lines (in it, for simplicity and conciseness, we abbreviated b = basketball, c = center, l = law, m = medical, s = soccer, u = uci, and w = women's).

  u m c
  u l
  u w s
  u l
  u w s
  u w b
  u w b
  u w b

The program will first read this file and build the appropriate prefix and query dictionaries; then it will print each dictionary. The prefix dictionary should be sorted by keys, from the shortest to longest prefix, with equal-length prefixes sorted in standard lexical order; of course the associated sets may print their values in any order. The query dictionary should be sorted by associated values (integers), from largest to smallest integer, with equal integers sorted by their keys in standard lexical order.

For example, the file above would produce the following output:

  Prefix dictionary:
    ('u',) -> {('u', 'm', 'c'), ('u', 'l'), ('u', 'w', 'b'), ('u', 'w', 's')}
    ('u', 'l') -> {('u', 'l')}
    ('u', 'm') -> {('u', 'm', 'c')}
    ('u', 'w') -> {('u', 'w', 'b'), ('u', 'w', 's')}
    ('u', 'm', 'c') -> {('u', 'm', 'c')}
    ('u', 'w', 'b') -> {('u', 'w', 'b')}
    ('u', 'w', 's') -> {('u', 'w', 's')}

  Query dictionary:
    ('u', 'w', 'b') -> 3
    ('u', 'l') -> 2
    ('u', 'w', 's') -> 2
    ('u', 'm', 'c') -> 1

In the prefix dictionary ('u',) appears before ('u', 'l') because it has fewer words; and ('u', 'l') appears before ('u', 'w') because in standard lexical order, when 2-tuples have equal first values, they are ordered by their second values, and 'l' comes before 'w'.

In the query dictionary ('u', 'w', 'b') appears before ('u', 'l') because the first tuple's associated value (3) is bigger than the second tuple's (2); and ('u', 'l') appears before ('u', 'w', 's') because when tuples are associated with equal values (2), they are ordered lexically, and ('u', 'l', ...) comes before ('u', 'w', ...) (see the reasoning above).

Now, repeatedly prompt the user for any query prefix and print the top three full queries for the entered prefix: print them in order from most to least frequent full query (with ties printed using the standard lexical ordering; the same ordering used when printing the Query dictionary above). Using the dictionaries above the iteraction would be.

  Enter any prefix sequence (or else quit): u
    Top 3 (no more) matching full queries: [('u', 'w', 'b'), ('u', 'l'), ('u', 'w', 's')]
Finally, prompt the user to enter the full query, and update the dictionaries and reprint them.
  Enter any full query sequence (or else quit): u w s

  Prefix dictionary:
    ('u',) -> {('u', 'm', 'c'), ('u', 'w', 'b'), ('u', 'l'), ('u', 'w', 's')}
    ('u', 'l') -> {('u', 'l')}
    ('u', 'm') -> {('u', 'm', 'c')}
    ('u', 'w') -> {('u', 'w', 'b'), ('u', 'w', 's')}
    ('u', 'm', 'c') -> {('u', 'm', 'c')}
    ('u', 'w', 'b') -> {('u', 'w', 'b')}
    ('u', 'w', 's') -> {('u', 'w', 's')}

  Query dictionary:
    ('u', 'w', 'b') -> 3
    ('u', 'w', 's') -> 3
    ('u', 'l') -> 2
    ('u', 'm', 'c') -> 1
Here, the prefix dictionary stays the same (the full query already was entered once; we could have entered a new full query, which would augment the prefix dictionary), and the full query ('u', 'w', 's') has its query-count increased from 2 to 3.

Functions and Script:

Write the following functions and script. I am providing line counts not as requirements, but to indicate the lengths of well-written Pythonic code.
  • all_prefixes has a tuple of str (words) parameter; it returns a set of tuple of str: all the prefixes of the full query argument. For example, all_prefixes(('a', 'b', 'c')) returns {('a',), ('a', 'b'), ('a', 'b', 'c')}. Hints: comprehension and slicing (body is 1 line).

  • add_query has a prefix dictionary, query dictionary, and new full query (tuple of str) as parameters; it returns None but updates these two dictionaries based on the new full query. It adds the new full query's prefixes to the prefix dictionary (each associated with the new full query) and increments the integer value associated with that full query in the query dictionary (or, if the full query is not in the dictionary, associates that full query with 1) (body is 3 lines).

  • read_queries has an open (file) parameter; it returns a 2-tuple containing the prefix and query dictionaries (in that order) built by reading and processing each full query in this file. (body is 5 lines).

  • dict_as_str has a dictionary, key function (default None) and bool (default False) as parameters; it returns a multi-line string (each line is ended by '\n'), which when printed shows the contents of the dictionary in the appropriate textual form. The key function determines the ordering and the bool determines whether to reverse it: like the key and reverse parameters used for the sort/sorted functions in Python. This function is used to print both the prefix and query dictionaries.

    Important: The key function (and its use when iterating over the dictionary in dict_as_str) must assume that its argument is a key in the dictionary, not an item; otherwise the batch self-check test will fail even if your code works. (body is 4 lines; can you do it in 1?).

  • top_n has a prefix (tuple of str), int, prefix dictionary, and query dictionary as parameters; it returns a list of full queries (tuple of str) whose length is the integer parameter, containing the most frequent full queries with that prefix; if the number of full queries with that prefix is less than that integer parameter, return all the full queries. If no full queries have this prefix, return the empty list. Notes: The dictionaries should not be changed. If multiple full queries occur the same number of times, prefer the full queries that come earlier in the standard lexical ordering: e.g., the same order they are printed in the query dictionary. (body is 3 lines; can you do it in 1?).

  • Write a script at the bottom of this module (in if __name__ == '__main__':) that prompts the user to enter the file storing the queries and builds the prefix and query dictionaries from this file; then repeated: print these dictionaries; prompt the user to enter a prefix; display the top three full queries with that prefix; prompt the user to enter a full query; and finally update the prefix and query dictionaries with that full query (13 lines).

Sample Interaction:

The program, as specified, will have the following interaction: user-typed information appears in italics. Your output should match the form of this one (the order of values in the sets may vary).
Enter a file storing a full query: googleq0.txt

Prefix dictionary:
  ('u',) -> {('u', 'l'), ('u', 'w', 'b'), ('u', 'w', 's'), ('u', 'm', 'c')}
  ('u', 'l') -> {('u', 'l')}
  ('u', 'm') -> {('u', 'm', 'c')}
  ('u', 'w') -> {('u', 'w', 'b'), ('u', 'w', 's')}
  ('u', 'm', 'c') -> {('u', 'm', 'c')}
  ('u', 'w', 'b') -> {('u', 'w', 'b')}
  ('u', 'w', 's') -> {('u', 'w', 's')}

Query dictionary:
  ('u', 'w', 'b') -> 3
  ('u', 'l') -> 2
  ('u', 'w', 's') -> 2
  ('u', 'm', 'c') -> 1

Enter any prefix sequence (or else quit): u
  Top 3 (no more) matching full queries: [('u', 'w', 'b'), ('u', 'l'), ('u', 'w', 's')]

Enter any full query sequence (or else quit): u w s

Prefix dictionary:
  ('u',) -> {('u', 'l'), ('u', 'w', 'b'), ('u', 'w', 's'), ('u', 'm', 'c')}
  ('u', 'l') -> {('u', 'l')}
  ('u', 'm') -> {('u', 'm', 'c')}
  ('u', 'w') -> {('u', 'w', 'b'), ('u', 'w', 's')}
  ('u', 'm', 'c') -> {('u', 'm', 'c')}
  ('u', 'w', 'b') -> {('u', 'w', 'b')}
  ('u', 'w', 's') -> {('u', 'w', 's')}

Query dictionary:
  ('u', 'w', 'b') -> 3
  ('u', 'w', 's') -> 3
  ('u', 'l') -> 2
  ('u', 'm', 'c') -> 1

Enter any prefix sequence (or else quit): u w
  Top 3 (no more) matching full queries: [('u', 'w', 'b'), ('u', 'w', 's')]

Enter any full query sequence (or else quit): a b c

Prefix dictionary:
  ('a',) -> {('a', 'b', 'c')}
  ('u',) -> {('u', 'l'), ('u', 'w', 'b'), ('u', 'w', 's'), ('u', 'm', 'c')}
  ('a', 'b') -> {('a', 'b', 'c')}
  ('u', 'l') -> {('u', 'l')}
  ('u', 'm') -> {('u', 'm', 'c')}
  ('u', 'w') -> {('u', 'w', 'b'), ('u', 'w', 's')}
  ('a', 'b', 'c') -> {('a', 'b', 'c')}
  ('u', 'm', 'c') -> {('u', 'm', 'c')}
  ('u', 'w', 'b') -> {('u', 'w', 'b')}
  ('u', 'w', 's') -> {('u', 'w', 's')}

Query dictionary:
  ('u', 'w', 'b') -> 3
  ('u', 'w', 's') -> 3
  ('u', 'l') -> 2
  ('a', 'b', 'c') -> 1
  ('u', 'm', 'c') -> 1

Enter any prefix sequence (or else quit): quit

You can also try processing the googleq1.txt and googleq2.txt files