Lab Assignment D

Informatics 42 • Winter 2012 • David G. Kay • UC Irvine

Lab Assignment D

This assignment is due at the end of lab on Monday, February 27. This is a pair programming assignment; do it with someone you haven't worked with yet this quarter and make sure Joel knows whom you've paired with.

The problem: The assignment, originally written at Stanford, involves building a program that reads a grammar and generates a specified number of strings (or "sentences") from that grammar. Before you work on your own program, you can get a broader idea of what's possible by trying out the applet at http://www-cs-faculty.stanford.edu/~zelenski/rsg. The "Extension Request" grammar (the default) is pretty funny; also try out "CS assignment," "Programming bug,", and "Math expression," along with any others that strike your fancy. Then follow the "Directory of the collected grammar files" link at the bottom of the page. For a general idea of the kind of input your program will take, pick some of these grammars to look at; "Math expression" is the easiest to follow. (There are differences in format between the grammars shown there and what's shown here in our assignment; theirs should be readable, but use our format for this assignment.)

Entertainment isn't the only application of this program. Sometimes we need to test our software with a lot of data, more than we can conveniently generate by hand. With a program like this, we only need to create a grammar describing our test data and we can produce as much of it as we want.

Background on grammars: A grammar is a collection of substitution rules that describe a set of strings or sentences. Each sentence is a sequence of terminal symbols or just terminals. Different kinds of sentence fragments are represented by nonterminal symbols or variables, with a rule for each variable specifying how it can be replaced by one of a set of possible sequences of variables and terminals. One of the variables is designated as the start variable, which means that it represents an entire sentence.

An example of a grammar follows, to supplement the ones we covered in class. The start variable is A. The variables are A and B, while the terminals are 0, 1, and #.

A → 0A1A | B
B → #

This grammar says that the variable A can be replaced either with the sequence 0A1A or the variable B, while the variable B can only be replaced with #.

From a conceptual point of view, a grammar can be used to generate strings of terminals in the following manner. (We should point out that this isn't precisely how your program will generate its strings, but your program will do something that has an equivalent effect.)

Begin with the start variable.
So long as there are still variables that have not been substituted, pick a variable and a rule with that variable on the left-hand side. Replace the variable with the right-hand side of the rule that you chose.

A sequence of substitutions leading from the start variable to a string of terminals is called a derivation. When the leftmost variable is always replaced at each step, the derivation is called a leftmost derivation. The string 00#1#1# can be generated by the grammar above. The following leftmost derivation — which begins with the start variable, with one substitution made at each step — proves that it can be done; the new part of each step is shown underlined.

A ⇒ 0A1A ⇒ 00A1A1A ⇒ 00B1A1A ⇒ 00#1A1A ⇒ 00#1B1A ⇒ 00#1#1A ⇒ 00#1#1B ⇒ 00#1#1#

Since 00#1#1# can be generated by the grammar, we would say that the string 00#1#1# is in the language of the grammar. In other words, the language of a grammar is the set of all strings that can be generated from it. (Many grammars, including this one, have an infinite number of strings in their languages. This grammar generates an infinite number of strings since the rule A → 0A1A can be used an arbitrary number of times in a derivation.)

In our random sentence generator, a grammar will describe a set of sentences (which may indeed be infinite). Each sentence you generate will be a sequence of characters, with the characters being the terminals in the grammar. The variables in the grammar will describe sentence fragments, with the start variable describing an entire sentence.

The program: You will write a Python program that prompts the user for three things: the name of a file containing a grammar, the name of the start symbol for that grammar, and a number of sentences to generate with that grammar.

Grammar file format: The grammar file will contain a series of grammar rules in the format described below. Our simple cat-and-dog example from class is available in this format; so is a grammar for Facile programs.

Each rule starts with a left curly brace ("{") on its own line and ends with a right curly brace ("}") on its own line. Other lines in the file, outside of the rules, are ignored by your program; they can serve as annotations or comments in the rule file.
After the opening brace, the first line of the rule is the rule's left-hand side. Remember that the left-hand side of a rule is always a single nonterminal (variable). In our grammar files, all nonterminal/variable names are enclosed in square brackets ("[" and "]"); you should remove these before you store the variable names in your program. Variable names will not contain whitespace (spaces, tabs, or newlines).
Each of the remaining lines in the rule (before the closing brace) is an alternative right-hand side (i.e., a different way of rewriting the left-hand side). Each righthand side consists of a sequence of variables (whose names are enclosed in brackets, of course) and strings of terminal symbols (that may not contain whitespace, though you may use the Python conventions of \t for tab, \n for newline, and \x20 (hexadecimal 20) for space). (If you're wondering how to have a grammar for a language that includes brackets in its terminal alphabet, again you can use their hexadecimal escape sequences, \x5C for "[" and \x5D for "]".)

You may assume that the grammar files will always be correctly formatted; you do not have to anticipate or correct any errors in the grammar file.

Stage I: First, make sure you can parse the input file correctly. As we've seen all quarter, an essential part of many programs is converting data from some external form into the internal form we've chosen to use in our program as the model. Parsing text files can be painstaking and tedious (although Python provides good tools to help), and it's not usually very interesting. But until you're comfortable doing it, you won't be able to use Python to write real programs.

For this stage, just produce output that convinces you that your program is recognizing the tokens in the input file correctly; don't try to store the data in your program's ultimate data structures yet. For this (partial) grammar file:

Cat-and-Dog Grammar
{
[S]
[NP] [SPACE] [VP]
}
{
[Art]
a
the
}

your output might look like this:

Rule
   variable --S--
   RightSide
      variable --NP--      
      variable --SPACE--      
      variable --VP--
Rule
   variable --Art--
   RightSide
      terminal --a--
   RightSide
      terminal --the--

You don't have to match the precise appearance of this example, particularly not the indentation (though you may do it that way if you like). You just want to identify each token and what its function is; the dashes are there to make sure you don't have extra whitespace in the wrong places.

Ultimately, you'll remove or comment out this code, since this output isn't part of the program specification. [Note, though, that writing this kind of testing code is a valuable technique; it helps you be sure your parsing works independent of other parts of your program. Don't be inhibited from writing code that's useful during development just because it won't make it into the final product.]

Stage II: Implement these data structures in your program (possibly in classes, possibly in namedtuples, possibly just in Python code):

Grammar: The name of the start symbol (initial variable) plus a collection of rules.
Rule: The name of a nonterminal symbol (variable) and a collection of right-hand sides (alternatives for rewriting the variable) from which you'll choose one alternative at random each time the start symbol occurs in your derivation. (Hint: You might just implement this as a key-value pair.)
Right-hand side: A sequence of terminal and nonterminal symbols; each right-hand side of a rule is one alternative for rewriting a nonterminal symbol.
Symbol: Either nonterminal (which must be rewritten with one of the right-hand sides of that symbol's rules) or terminal (which is a literal character or string to appear in the final sentence).

Then fill these data structures with the grammar data from the input.

Stage III: Handle the derivations of the specified number of sentences. Here's some advice about the sentence generation algorithm. Each of the data structures above has its own way of processing its part of the derivation. (Probably the easiest way to code this is with one generate function that can be called with any of the data structures above; it would use an if-ladder to determine what data structure it's dealing with and handle that structure as described below.)

To generate a sentence from a grammar, look up the rule corresponding to its start variable and apply it (i.e., call generate with it). Ultimately it should return a string (sentence) of terminals generated by the grammar, but each part is handled by a separate call to generate.
To generate from a rule, choose one of its right-hand sides at random and call generate on it.
To generate from a right-hand side, go through each symbol from left to right, whether terminal or nonterminal, and call generate on each.
To generate from a nonterminal, look up the corresponding rule in the grammar and call generate on it, returning the result as a sentence fragment.
To generate from a terminal, simply return the terminal's value as a sentence fragment.

You'll find Python's random module useful; import it so you can use choice() and possibly other functions.

When you're done:

Submit all your Python source code in one .py file via Checkmate. Also submit at least two additional grammar files of your own. Each pair should submit just one solution with both partners' names clearly indicated.
Along the lines of Alex's Facile grammar, make up a grammar for (a tiny subset of) Scheme or Python and see what kinds of programs it generates. You can see full grammars for Scheme and Python; the links are on the course reference page. You don't have to turn this in.
The usual grading criteria for lab assignments apply.
Fill out a partner evaluation at EEE.

Modified by David G. Kay, Winter 2012. Originally written by Alex Thornton (with heavy influence from "The Worst Joke Ever," by Alex Thornton), Winter 2007. Original concept by Mike Cleron of Stanford University; modified and adapted by Allison Hansen, Julie Zelinski, and others.