Informatics 42 • Winter 2012 • David G. Kay • UC Irvine

Lab Assignment D

This assignment is due at the end of lab on Monday, February 27. This is a pair programming assignment; do it with someone you haven't worked with yet this quarter and make sure Joel knows whom you've paired with.

The problem: The assignment, originally written at Stanford, involves building a program that reads a grammar and generates a specified number of strings (or "sentences") from that grammar. Before you work on your own program, you can get a broader idea of what's possible by trying out the applet at http://www-cs-faculty.stanford.edu/~zelenski/rsg. The "Extension Request" grammar (the default) is pretty funny; also try out "CS assignment," "Programming bug," and "Math expression," along with any others that strike your fancy. Then follow the "Directory of the collected grammar files" link at the bottom of the page. For a general idea of the kind of input your program will take, pick some of these grammars to look at; "Math expression" is the easiest to follow. (There are differences in format between the grammars shown there and what's shown here in our assignment; theirs should be readable, but use our format for this assignment.)

Entertainment isn't the only application of this program. Sometimes we need to test our software with a lot of data, more than we can conveniently generate by hand. With a program like this, we only need to create a grammar describing our test data and we can produce as much of it as we want.

Background on grammars: A grammar is a collection of substitution rules that describe a set of strings or sentences. Each sentence is a sequence of terminal symbols or just terminals. Different kinds of sentence fragments are represented by nonterminal symbols or variables, with a rule for each variable specifying how it can be replaced by one of a set of possible sequences of variables and terminals. One of the variables is designated as the start variable, which means that it represents an entire sentence.

An example of a grammar follows, to supplement the ones we covered in class. The start variable is A. The variables are A and B, while the terminals are 0, 1, and #.

A → 0A1A | B
B → #

This grammar says that the variable A can be replaced either with the sequence 0A1A or the variable B, while the variable B can only be replaced with #.

From a conceptual point of view, a grammar can be used to generate strings of terminals in the following manner. (We should point out that this isn't precisely how your program will generate its strings, but your program will do something that has an equivalent effect.)

  1. Begin with the start variable.
  2. So long as there are still variables that have not been substituted, pick a variable and a rule with that variable on the left-hand side. Replace the variable with the right-hand side of the rule that you chose.

A sequence of substitutions leading from the start variable to a string of terminals is called a derivation. When the leftmost variable is always replaced at each step, the derivation is called a leftmost derivation. The string 00#1#1# can be generated by the grammar above. The following leftmost derivation, which begins with the start variable and makes one substitution at each step, proves that it can be done.

A ⇒ 0A1A ⇒ 00A1A1A ⇒ 00B1A1A ⇒ 00#1A1A ⇒ 00#1B1A ⇒ 00#1#1A ⇒ 00#1#1B ⇒ 00#1#1#

Since 00#1#1# can be generated by the grammar, we would say that the string 00#1#1# is in the language of the grammar. In other words, the language of a grammar is the set of all strings that can be generated from it. (Many grammars, including this one, have an infinite number of strings in their languages. This grammar generates an infinite number of strings since the rule A → 0A1A can be used an arbitrary number of times in a derivation.)
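If you'd like to check a leftmost derivation mechanically, here is a minimal sketch for the A/B grammar described above. The names RULES, leftmost_step, and derive are made up for this illustration, and the sketch assumes every symbol is a single character. That holds for this example but not for the grammar files your program will read.

```python
# The example grammar: A -> 0A1A | B, and B -> #
# (RULES, leftmost_step, and derive are illustrative names, not part of the assignment.)
RULES = {
    "A": ["0A1A", "B"],
    "B": ["#"],
}

def leftmost_step(form, choice_index):
    """Replace the leftmost variable in form with the chosen right-hand side."""
    for i, symbol in enumerate(form):
        if symbol in RULES:
            return form[:i] + RULES[symbol][choice_index] + form[i + 1:]
    return form  # no variables left, so nothing to replace

def derive(choices, start="A"):
    """Apply one leftmost substitution per entry in choices; return every step."""
    forms = [start]
    for choice_index in choices:
        forms.append(leftmost_step(forms[-1], choice_index))
    return forms

# Each number says which right-hand side to use for the leftmost variable.
steps = derive([0, 0, 1, 0, 1, 0, 1, 0])
print(" => ".join(steps))
```

The last entry of steps is 00#1#1#, the string derived above.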

In our random sentence generator, a grammar will describe a set of sentences (which may indeed be infinite). Each sentence you generate will be a sequence of characters, with the characters being the terminals in the grammar. The variables in the grammar will describe sentence fragments, with the start variable describing an entire sentence.

The program: You will write a Python program that prompts the user for three things: the name of a file containing a grammar, the name of the start symbol for that grammar, and a number of sentences to generate with that grammar.
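One possible shape for the prompting code is sketched below. The function name read_settings, the prompt wording, and the file name grammar.txt are all assumptions made for this illustration; the assignment doesn't prescribe them. It also assumes the user types the start symbol without brackets. Passing the asking function in as a parameter lets you test the code with canned answers instead of typing them each time.

```python
def read_settings(ask=input):
    """Ask for the grammar file name, the start symbol, and the sentence count."""
    filename = ask("Grammar file name: ").strip()
    start = ask("Start symbol: ").strip()
    count = int(ask("Number of sentences to generate: "))
    return filename, start, count

# Canned answers standing in for what a user would type (hypothetical values).
answers = iter(["grammar.txt", "S", "3"])
settings = read_settings(lambda prompt: next(answers))
print(settings)
```

In the real program you would call read_settings() with no argument so that it uses input.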

Grammar file format: The grammar file will contain a series of grammar rules in the format described below. Our simple cat-and-dog example from class is available in this format; so is a grammar for Facile programs.

You may assume that the grammar files will always be correctly formatted; you do not have to anticipate or correct any errors in the grammar file.

Stage I: First, make sure you can parse the input file correctly. As we've seen all quarter, an essential part of many programs is converting data from some external form into the internal form we've chosen to use in our program as the model. Parsing text files can be painstaking and tedious (although Python provides good tools to help), and it's not usually very interesting. But until you're comfortable doing it, you won't be able to use Python to write real programs.

For this stage, just produce output that convinces you that your program is recognizing the tokens in the input file correctly; don't try to store the data in your program's ultimate data structures yet. For this (partial) grammar file:

Cat-and-Dog Grammar
{
[S]
[NP] [SPACE] [VP]
}
{
[Art]
a
the
}

your output might look like this:

Rule
   variable --S--
   RightSide
      variable --NP--
      variable --SPACE--
      variable --VP--
Rule
   variable --Art--
   RightSide
      terminal --a--
   RightSide
      terminal --the--

You don't have to match the precise appearance of this example, particularly not the indentation (though you may do it that way if you like). You just want to identify each token and what its function is; the dashes are there to make sure you don't have extra whitespace in the wrong places.
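Here is a rough sketch of one way to produce a trace like that. It's based only on what the sample file suggests, so treat each of these as an assumption: lines outside braces are ignored, the first line inside the braces names the rule's variable, every later line is one right side, and a token in square brackets is a variable. The names SAMPLE, TOKEN, and describe are invented for this sketch.

```python
import re

# The partial grammar file from above, held in a string so the sketch is self-contained.
SAMPLE = """\
Cat-and-Dog Grammar
{
[S]
[NP] [SPACE] [VP]
}
{
[Art]
a
the
}
"""

# A token is either [name] (a variable) or a run of other non-space characters (a terminal).
TOKEN = re.compile(r"\[([^\]]+)\]|([^\s\[\]]+)")

def describe(text):
    """Return the trace lines for every rule found between braces."""
    out = []
    inside = False      # are we between { and } ?
    need_left = False   # is the next line the rule's own variable?
    for line in text.splitlines():
        line = line.strip()
        if line == "{":
            inside, need_left = True, True
            out.append("Rule")
        elif line == "}":
            inside = False
        elif inside and need_left:
            out.append("   variable --" + line[1:-1] + "--")
            need_left = False
        elif inside:
            out.append("   RightSide")
            for var, term in TOKEN.findall(line):
                kind, name = ("variable", var) if var else ("terminal", term)
                out.append("      " + kind + " --" + name + "--")
    return out

print("\n".join(describe(SAMPLE)))
```

In your own program you would read the text from the file the user names rather than from a string.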

Ultimately, you'll remove or comment out this code, since this output isn't part of the program specification. [Note, though, that writing this kind of testing code is a valuable technique; it helps you be sure your parsing works independent of other parts of your program. Don't be inhibited from writing code that's useful during development just because it won't make it into the final product.]

Stage II: Implement these data structures in your program (possibly in classes, possibly in namedtuples, possibly just in Python code):

Then fill these data structures with the grammar data from the input.

Stage III: Handle the derivations of the specified number of sentences. Here's some advice about the sentence generation algorithm. Each of the data structures above has its own way of processing its part of the derivation. (Probably the easiest way to code this is with one generate function that can be called with any of the data structures above; it would use an if-ladder to determine what data structure it's dealing with and handle that structure as described below.)
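To make the if-ladder idea concrete, here is a sketch built on hypothetical namedtuples called Variable, Terminal, RightSide, and Rule, with the grammar stored as a dictionary keyed by variable name. These stand in for whatever structures you chose in Stage II and may not match them. The tiny grammar is made up for the example, and joining the pieces with no separator is also an assumption.

```python
import random
from collections import namedtuple

# Hypothetical structures for illustration; your Stage II design may differ.
Variable = namedtuple("Variable", "name")
Terminal = namedtuple("Terminal", "text")
RightSide = namedtuple("RightSide", "symbols")
Rule = namedtuple("Rule", "left right_sides")

def generate(item, grammar):
    """Return the list of terminal strings derived from item."""
    if isinstance(item, Terminal):
        return [item.text]                     # a terminal stands for itself
    elif isinstance(item, Variable):
        return generate(grammar[item.name], grammar)   # look up the variable's rule
    elif isinstance(item, Rule):
        return generate(random.choice(item.right_sides), grammar)  # pick one right side
    elif isinstance(item, RightSide):
        pieces = []
        for symbol in item.symbols:            # expand each symbol, left to right
            pieces.extend(generate(symbol, grammar))
        return pieces
    raise TypeError("unexpected item: " + repr(item))

# A made-up grammar, built by hand rather than read from a file.
grammar = {
    "S": Rule("S", [RightSide([Variable("Art"), Terminal(" "), Variable("N")])]),
    "Art": Rule("Art", [RightSide([Terminal("a")]), RightSide([Terminal("the")])]),
    "N": Rule("N", [RightSide([Terminal("cat")]), RightSide([Terminal("dog")])]),
}

sentence = "".join(generate(Variable("S"), grammar))
print(sentence)
```

Because generate calls itself for each nested piece, a grammar whose rules can refer back to themselves is handled the same way, as long as the random choices eventually reach terminals.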

You'll find Python's random module useful; import it so you can use choice() and possibly other functions.
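As a quick illustration, random.choice returns one item from a non-empty sequence, and random.seed lets you repeat the same "random" picks while you're debugging. The list of options below is just sample data.

```python
import random

options = ["a", "the"]

random.seed(42)   # fix the seed so this run of picks can be repeated
first = [random.choice(options) for _ in range(5)]

random.seed(42)   # same seed again, so the same picks come out
second = [random.choice(options) for _ in range(5)]

print(first == second)
```

Leave the seed out of your finished program so that each run produces different sentences.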

When you're done:

Modified by David G. Kay, Winter 2012. Originally written by Alex Thornton (with heavy influence from "The Worst Joke Ever," by Alex Thornton), Winter 2007. Original concept by Mike Cleron of Stanford University; modified and adapted by Allison Hansen, Julie Zelenski, and others.