Eighth Homework

This assignment is due on Wednesday, March 12. There's not much coding here except for the last part, which you may do in pairs.

(a) Below is some code that implements a finite-state machine.

final int Secret1 = 35;
final int Secret2 = 127;
final int Secret3 = 33;
String[] stateList = {"Init", "GotFirst", "GotSecond", "Success"};
int number;
int count = 1;
String state;

void main()
{
    state = "Init";
    // Read at most three numbers, stopping early on success.
    while ((!state.equals("Success")) && (count <= 3))
    {
        number = getNext();   // getNext() returns the next input integer
        if (state.equals("Init"))
        {
            if (number == Secret1)
                state = "GotFirst";
            else state = "Init";
        }
        else if (state.equals("GotFirst"))
        {
            if (number == Secret2)
                state = "GotSecond";
            else state = "Init";
        }
        else if (state.equals("GotSecond"))
        {
            if (number == Secret3)
                state = "Success";
            else state = "Init";
        }
        count++;
    }
    if (state.equals("Success"))
        System.out.println("Input accepted.");
    else
        System.out.println("Input rejected.");
}

(a.1) Draw the state transition diagram that represents the FSA this program implements. The input tokens here are whole integers, not individual characters.

(a.2) Describe in one brief English sentence what this FSA does. Try to think of a simple, real-world, non-computer-related object that this FSA models.

Intermezzo: State transition diagrams are one way to describe FSAs. Another way (which is easier to represent in a computer) is a transition table. A transition table has a row for each state and a column for each input; the value at each position in the table tells you what state to go to when you read a given input in a given state. Below is a transition table for the program above:

            Secret1     Secret2     Secret3     other
Init        GotFirst    Init        Init        Init
GotFirst    Init        GotSecond   Init        Init
GotSecond   Init        Init        Success     Init
Success

This table says just what the program and the state transition diagram say: If you're in Init and you read Secret1, you go into state GotFirst; if you're in state Init and you read anything else, you stay in Init. If you're in state GotFirst and you read Secret2, you go into state GotSecond; if you read anything else in state GotFirst you go to state Init. And finally, if you're in state GotSecond and you read Secret3, you go to state Success; otherwise, you go to Init. In state Success the machine stops, so you don't make any transitions out of that state; it's the accept state.

As we noted, transition tables make FSAs easy to represent in a computer. In fact, we can write a very simple but very general FSA simulator according to the following pseudocode:

initialize TransitionTable;
state ← initial state;
while there are more tokens:
   get a token;
   state ← TransitionTable[state][token];
if state = accept state
   then accept
   else reject.

Because this code is so simple, it's the preferred way to implement FSAs in programs. The only tricky part is finding a data type that will represent the range of tokens and will at the same time be acceptable as an array index in your programming language. Scheme, for example, handles symbolic names very easily. But in the above Java code, you can't have an array with four columns labeled 35, 127, 33, and 'other'. The cleanest way to deal with this is to have a routine that would translate each token (or category of tokens) to its corresponding column in the transition table--effectively a switch statement or sequence of if-statements that map the tokens (or token categories) to the range 0..3.
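The pseudocode and the column-mapping routine described above might be sketched in Java as follows. This is only one possible arrangement, not the required one: the class name TableDrivenFsa, the method names column and accepts, and the integer numbering of the states are all assumptions made for illustration. The sketch also fills the blank Success row with Success (so the machine simply stays there) and, like the pseudocode, reads every token rather than stopping after three numbers as the program in part (a) does.

```java
import java.util.List;

class TableDrivenFsa {
    // States are numbered so that they can index the rows of the table.
    static final int INIT = 0, GOT_FIRST = 1, GOT_SECOND = 2, SUCCESS = 3;

    // Columns: 0 = Secret1, 1 = Secret2, 2 = Secret3, 3 = other.
    static final int[][] TRANSITION_TABLE = {
        {GOT_FIRST, INIT,       INIT,    INIT},     // Init
        {INIT,      GOT_SECOND, INIT,    INIT},     // GotFirst
        {INIT,      INIT,       SUCCESS, INIT},     // GotSecond
        {SUCCESS,   SUCCESS,    SUCCESS, SUCCESS}   // Success (assumption: stay put)
    };

    // Translates a token to its column in the transition table.
    static int column(int token) {
        switch (token) {
            case 35:  return 0;   // Secret1
            case 127: return 1;   // Secret2
            case 33:  return 2;   // Secret3
            default:  return 3;   // other
        }
    }

    // Runs the machine over all the tokens and reports whether it accepts.
    static boolean accepts(List<Integer> tokens) {
        int state = INIT;
        for (int token : tokens) {
            state = TRANSITION_TABLE[state][column(token)];
        }
        return state == SUCCESS;
    }

    public static void main(String[] args) {
        System.out.println(accepts(List.of(35, 127, 33)));
        System.out.println(accepts(List.of(35, 33, 127)));
    }
}
```

Notice that the loop in accepts never mentions a particular state or token; changing the machine means changing only the table and the column routine.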

(b) Think about the task of extracting words from a stream of text. In Java, StringTokenizer does this for you, but sometimes you need to specify "words" idiosyncratically (as with the DVD information). You can do this kind of input-parsing task much more easily using state machines than by writing code directly.

(b.1) Draw a state transition diagram that accepts words defined as follows: a sequence of non-separator characters whose end is marked by a separator. Separators are symbols that separate English words--space, comma, semicolon, colon, and so on. Note that the hyphen (-), the apostrophe ('), and the percent sign (%) are not separators: treat "mother-in-law," "don't," and "23%" as single words. The end of the line is a separator, unless the last word of the line ends with a hyphen. That way, if a word like mother-in-law is hyphenated across two lines, it will still count as one word. (We will assume that in our input, only words that are always hyphenated will be hyphenated at the end of a line; that is, you should not expect normally-unhyphenated words to be broken across two lines.) Watch for multiple separators in a row--for example, a comma followed by a space is two separators, but there is no word between them.

You could code up this FSA into a method called getNextWord, and call it to parse a stream of input. Coding this isn't a required part of this assignment, though.

(b.2) Write a transition table for the state machine you drew in part (b.1).

(c) Now it's time to think about finite-state machines and the DVD information fields defined in the Sixth Homework.

(c.1) Draw a state transition diagram that accepts DVD information. You should design your machine to accept a single field--maybe a quoted string, maybe an integer, maybe a date--and to go back to the initial state when it encounters a comma (that isn't quoted, of course). This makes processing quite simple so long as you're willing to forgo checking which field is of which type, or that you have the correct number of fields. (In coding, you could easily add actions for some transitions that would maintain a field count. It might also help to assume that there's an input token or character called EOS, for "end of string," that your character-reading routine would return and that your machine could check for.)

(c.2) For extra credit, recode your DVD-parsing program to implement the FSA you designed above.

(d) (This part is optional, but don't stop here; subsequent parts of this homework are required.) Available on the web is a program called JFLAP, written at Duke University (http://www.cs.duke.edu/~rodger/tools/jflaptmp/). You can download this Java application and use it to build and test your own simple FSAs (as well as do other formal-language activities). Other state machine simulators are available on the web; you can find some of them by using search strings like "state machine applet" or "FSA animation." If you'd like to work on building or enhancing tools like these (especially to allow graphical construction of useful FSAs with regular expressions as the transitions), come talk to me some time.

(e) The programming language Lisp (whose name is a contraction of the words "LISt Processing") was invented by John McCarthy in 1958. It was such an advanced language for its time that existing machines could not run it efficiently, and its early use was mostly limited to researchers in artificial intelligence. Today, however, computers are thousands of times faster than they were in the 1950s, and Lisp's power is practical for a very wide range of programming tasks. Scheme and Common Lisp are two modern members of the Lisp family of programming languages.

One of Scheme's attractions is that its syntax is very simple. Unlike Java, which has a few dozen different statements, each with its own grammar and punctuation rules, every program or expression in Scheme is just a list of words surrounded by parentheses. This provides a rich variety of expression because a "word" can be (a) any sequence of characters delimited (separated from other words) by white space, or (b) a parenthesized list of words nested within the outside list. The following are all valid Scheme expressions (each is one line long except the last, which starts with the word define):

(Fee fie fo fum)
(+ 3.14159 1776 -45 quantity)
(equal? (+ 2 2) (+ 3 1))
(define square
   (lambda (x)
      (* x x)))

Novice Scheme programmers sometimes worry about keeping all the parentheses balanced, but most Scheme systems have "syntax-based" text editors that automatically keep track of the parentheses, so that any time you type a right parenthesis it automatically flashes the left parenthesis that matches it. That way you can see effortlessly what matches what. (This idea has found its way into some program editors for Java and other languages, where it's also useful.)

Suppose you decide to write a syntax-based editor for Scheme, and as your first task you want to write some code that checks whether the parentheses are balanced in a Scheme expression. Astutely, you start by designing an FSA. To make it truly a finite-state machine, we have to put an upper limit on the depth to which parentheses can be nested; the example below shows the FSA for an upper limit of three-deep nesting. (In the diagram, "other" means an input symbol other than an open or close parenthesis.)

(e.1) After scanning the entire Scheme expression, in what state should your machine be if the parentheses were correctly balanced?

(e.2) This FSA works fine in theory, but for a realistic nesting depth of a few dozen, the diagram would be tediously repetitious. So you decide to simplify things and encapsulate the state information in a simple integer counter. Then you can have a single state on the page, and all the action happens in the transition steps, where you increment the counter for each left parenthesis and decrement it for each right parenthesis. (Having a variable may appear to violate the definition of a finite-state machine, all of whose information is encapsulated in a finite number of states. But since integer variables on computers (as opposed to integers in mathematics) always have a finite upper bound, we're technically safe. If our machine used a stack to keep track of the unbalanced parentheses (which is what our integer counter is modeling), it would no longer be an FSA--it would be a PDA (push-down automaton), which can accept a broader class of languages.)

The modified (augmented) machine appears below.

In the augmented machine, being in the stop state is not enough to know that the Scheme program has balanced parentheses; the value of the counter must be considered as well. What should the counter's value be if the machine accepts the Scheme source (that is, if the parentheses are correctly balanced)? What must have happened for the machine to end up in the error state?

(e.3) Things are never quite as simple as they first seem. Comments in Scheme programs start with a semicolon and extend to the end of the line. Thus, the following is a valid Scheme expression; everything to the right of the semicolon on each line is a comment. Of course the contents of comments are ignored when checking for balanced parentheses.

(define print-it        ;In this routine we
   (lambda (p)          ; a) accept a parameter,
      (display p)       ; b) display it, and
      (newline)))       ; c) hit carriage return

Draw a new FSA-like machine, similar to the one above, to account for comments correctly; you will have to add more states.

(e.4) And there's one more wrinkle. Literal character strings in Scheme are enclosed in double-quote marks. As in Java, the contents of literal strings are ignored when analyzing the syntax of the program. The following three expressions are valid in Scheme.

(display "Oh; really?")
(list "a)" "b)" "c)" )
(let ((delims ".,;:)(("))) ; This has an extra '(' in quotes

Draw a new FSA-like machine to handle both strings and comments correctly.

(e.5) Write a transition table for the state machine you designed in part (e.4). Note that some of the transitions in some conditions will also increment or decrement the count of parentheses.

(e.6) Test your FSA from part (e.4) thoroughly on paper--devise a thorough test plan and work each test through your FSA.

(f) A grammar is a set of rules that can generate all the strings in a formal language. In the right form, a grammar for a programming language can be used with other software to produce automatically part of a compiler for that language.

Below is a grammar (in Backus-Naur Form, or BNF notation) that describes arithmetic expressions:

<expression> ::= <real> | <variable> | ( <expression> ) |
   <expression> <operator> <expression> |
   ( <variable> = <expression> )
<real> ::= <positive-real> | - <positive-real>
<positive-real> ::= <integer-part> | <integer-part> . <integer-part>
<integer-part> ::= <digit> | <digit> <integer-part>
<variable> ::= <letter>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<letter> ::= a | b | c | d | e | f | g | h | i | j | k | l | m |
   n | o | p | q | r | s | t | u | v | w | x | y | z
<operator> ::= + | - | * | / | %
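To see how grammar rules can translate into code, here is a rough Java sketch of a recognizer for the <real>, <positive-real>, and <integer-part> rules only, with one method per rule. The class name RealRecognizer and its method names are hypothetical, and the sketch assumes that a real is written with no spaces between its characters, which the grammar as printed does not say one way or the other.

```java
class RealRecognizer {
    // <digit> ::= 0 | 1 | ... | 9
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // <integer-part> ::= <digit> | <digit> <integer-part>
    // Returns the index just past the run of digits starting at pos,
    // or -1 if there is no digit at pos.
    static int integerPart(String s, int pos) {
        if (pos >= s.length() || !isDigit(s.charAt(pos))) {
            return -1;
        }
        int rest = integerPart(s, pos + 1);   // the recursive alternative
        return rest == -1 ? pos + 1 : rest;
    }

    // <positive-real> ::= <integer-part> | <integer-part> . <integer-part>
    // Returns the index just past the positive real starting at pos, or -1.
    static int positiveReal(String s, int pos) {
        int end = integerPart(s, pos);
        if (end == -1) {
            return -1;
        }
        if (end < s.length() && s.charAt(end) == '.') {
            int fraction = integerPart(s, end + 1);
            if (fraction != -1) {
                return fraction;
            }
        }
        return end;
    }

    // <real> ::= <positive-real> | - <positive-real>
    // True only if the whole string is generated by the <real> rule.
    static boolean isReal(String s) {
        int start = s.startsWith("-") ? 1 : 0;
        return positiveReal(s, start) == s.length();
    }

    public static void main(String[] args) {
        System.out.println(isReal("3.14"));
        System.out.println(isReal("3."));
    }
}
```

Each method mirrors the alternatives of one rule, and the recursion in integerPart follows the recursion in the <integer-part> rule itself.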

(f.1) Some of the following expressions can be generated by this grammar; others cannot. Indicate which are the valid expressions. (The easiest way to do this might be to photocopy the page, or print it from the on-line version, and circle the valid expressions.)

(f.2) Using the grammar, generate four more expressions that aren't on the above list. Each expression should involve applying at least ten rules. For each expression, show its derivation tree (with <expression> at the root and terminal symbols--i.e., without angle brackets--at the leaves).

(f.3) Give three arithmetic expressions that are syntactically valid in Java but are not generated by this grammar.

(f.4) Modify the grammar to allow multi-letter variable names. This requires changing only one of the existing rules.

(g) Write regular expressions to match each of the following patterns. Note that these are natural language descriptions, so they will certainly be ambiguous; disambiguate them as you see fit and note what decisions you made. In some cases you may not be able to match the described set perfectly; don't obsess over it.

* Comma-separated dollars-and-cents amounts (e.g., $1,234.56 and $17)

* Lines that are empty or all blanks. (The caret ("^") matches the start of a line and the dollar sign matches the end of the line.)

* Email addresses

* URLs in HTML anchor tags (e.g., <A href="http://www.ics.uci.edu/~kay">)

* Lines containing exactly one integer (perhaps surrounded by non-numeric characters)
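If you want to try out your regular expressions in Java, a small test harness along the following lines may help. It is a sketch only: the class name AnchorDemo and the method countMatchingLines are made up for illustration, and the sample pattern (lines consisting entirely of lowercase letters) is deliberately not one of the patterns listed above. The Pattern.MULTILINE flag is what makes the caret and dollar sign match at the start and end of each line rather than only at the start and end of the whole input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Counts the matches of regex in text, with ^ and $ anchored to
    // individual lines instead of to the whole input.
    static int countMatchingLines(String regex, String text) {
        Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "hello\nHello world\nabc\n";
        // Sample pattern: a line made up only of lowercase letters.
        System.out.println(countMatchingLines("^[a-z]+$", text));
    }
}
```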

(h) Write a program that generates random sentences according to a user-supplied grammar, as specified below. The final product doesn't require a lot of code, but it does require careful, thoughtful design in advance. You may do this assignment in pairs. Each member of a pair should turn in the (identical) program via Checkmate; each source code file should have a comment at the top that says something like "// Joint work of Carl Coder and Petra Programmer."

(h.1) Get an idea of what this assignment can do by trying out the applet at http://www-cs-faculty.stanford.edu/~zelenski/rsg/ . The "Extension Request" grammar (the default) is pretty funny; also try out "CS assignment," "Programming bug," and "Math expression," along with any others that strike your fancy.

(h.2) Follow the "Directory of the collected grammar files" link at the bottom of the page. Pick grammars that you chose in part (h.1) ("Math expression" is the easiest to follow) and look at them to get an idea of your program's input.

A grammar file for input to your program contains one or more rules of the following form:

-- Each rule starts with a left brace "{" on its own line and ends with a right brace "}" on its own line.

-- After the opening brace, the first line of the rule is its left-hand side; this non-terminal is a string delimited by angle brackets.

-- Subsequent lines of the rule are alternative productions, different ways of rewriting the left-hand side. Each production consists of non-terminals (enclosed in angle brackets) and terminals (other characters) in any combination, ending with a semicolon.

-- There may be lines of text outside of the braces that delimit the rules; those lines are ignored by the program (and thus can serve as comments in the grammar).

You may assume that the grammar files take this form; you do not have to check for errors.

(h.3) Write code to read grammar files and store the grammars. Use a symbol table (either a hash table or a BST) keyed on the non-terminals; the value of each entry in the table contains the non-terminal's alternative productions.
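As a starting point for thinking about the symbol table, here is a minimal Java sketch of one possible shape for it, using a hash table. It does not read grammar files; the rules are entered by hand. The class name GrammarTable and its two methods are hypothetical, and the three productions in main are taken from the sample output shown in part (h.4), with the production for <Z> inferred from that output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GrammarTable {
    // Key: a non-terminal such as "<start>"; value: its alternative productions.
    private final Map<String, List<String>> rules = new HashMap<>();

    // Records one more alternative production for the given non-terminal.
    void addProduction(String nonTerminal, String production) {
        rules.computeIfAbsent(nonTerminal, key -> new ArrayList<>()).add(production);
    }

    // Returns the alternatives for a non-terminal, or an empty list if none are stored.
    List<String> productionsFor(String nonTerminal) {
        return rules.getOrDefault(nonTerminal, List.of());
    }

    public static void main(String[] args) {
        GrammarTable table = new GrammarTable();
        table.addProduction("<start>", "This <Y> !");
        table.addProduction("<Y>", "<Z> cool");
        table.addProduction("<Z>", "is really");   // inferred from the sample output
        System.out.println(table.productionsFor("<Y>"));
    }
}
```

Storing each production as a single string is only one choice; you might instead split each production into its terminals and non-terminals when you read it, which would save work during sentence generation.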

(h.4) Now, write code to generate sentences from the grammar. Each grammar contains one non-terminal symbol named <start>, which (obviously enough) is the start symbol for each derivation. As your program expands each non-terminal, it chooses at random one of the non-terminal's alternative productions, and so on recursively until every non-terminal is expanded.

You may assume that every non-terminal in the grammar will appear on the left side of exactly one rule; you do not have to check for undefined or multiply-defined non-terminals (though for a bit of extra credit you may check for and handle those issues and a missing start symbol).

Your output should include a hierarchical description of the derivation process as well as the final sentence, as shown in the following example.

Here is a simple grammar:

Here is the output, showing the final generated sentence at the bottom. The first level of indentation shows the first production taken (This <Y> !), the second level shows the expansion of <Y> (<Z> cool), and so on.

<start>
   This
   <Y>
      <Z>
         is really
      cool
   !

This is really cool !

(h.5) Design and build an interface. The simplest would be a console interface that prompts the user for the name of a grammar file and then generates a sentence from that grammar. Enhancements could include letting the user request new sentences repeatedly or specify a new grammar file. Building an applet or GUI application is another alternative.

(h.6) Make up a grammar for (a tiny subset of) Java and see what kinds of programs it generates. (You could copy your random program output into a Java environment like DrJava, not to run it but to get it automatically formatted to make it readable.)

(h.7) Make up at least one other grammar of your choice and generate some sentences with it.

(h.8) On the last day of class, bring a printed copy of your best, cleverest, or funniest grammar and some of the sentences it generates; we can share them (anonymously if you like).

(i) The GUI you may have built as extra credit suggested by the Sixth Homework is due at the same time this assignment is due.

What to turn in: For parts (a) through (g), which involve so many diagrams and tables, you will probably find it easiest to produce and submit your work on paper (clearly marked with your name, of course) and turn it in during section. Checkmate will accept a Word document for those parts, but please use Checkmate for those parts only if everything, including all the diagrams, is included in the electronic copy. Of course you may use Word to produce a printed copy onto which you draw some of your answers by hand, but we need everything in one place, not split between Checkmate and paper.

For part (h), turn in via Checkmate your Java code, your grammar for (partial) Java, and the other grammars you designed.

FSA exercises written by David G. Kay, Winter 1991 (based on materials from 1990 and earlier).
Revised by Joe Hummel, Norman Jacobson, Theresa Millette, Brian Pitterle, Alex Thornton, Rasheed Baqai, Li-Wei (Gary) Chen, and David G. Kay, 1992-1999.
Revised to include BNF grammars by David G. Kay, Spring 1999.
Revised and consolidated by David G. Kay, Winter 2000; revised to add DVD information, Winter 2003.

Random sentence generator original concept by Mike Cleron of Stanford University; modified by Allison Hansen, Julie Zelenski, and others.
Revised and adapted by David G. Kay, Winter 2000 and Winter 2003.
