ICS 142 Winter 2004
Assignment #2

Due date and time: Friday, February 6, 11:59pm


Introduction

In the previous assignment, you were asked to implement a scanner for a hypothetical imperative-style programming language called Monkie2004. After making some simplifications to the language (to keep this assignment from being too large), I would now like you to build a parser for Monkie2004. Because the language has changed, and because I imagine that not everyone completed the scanner entirely and correctly, I will provide a compiled version of my scanner. It has been updated to reflect the changes made to the language since the first assignment, and it also includes some error handling for situations like identifiers with pairs of adjacent underscores or integer literals that are too large.

Remember that the primary job of the parser is twofold: verifying that the input program is syntactically correct and, in the case of a correct program, beginning to discern the meaning of the input program by discovering a parse tree for it. It should be noted that a parser need not actually build and store a parse tree in memory; it just needs to discover the existence of one, then use that discovery to continue analyzing the input program. In future assignments, we'll connect parsers to modules that perform further analysis on the input program, such as type checking, code generation, and optimization.

For this assignment, I'm requiring you to build a hand-coded recursive descent parser in Java, following the pattern that we discussed in lecture. I'll also require you to do some theoretical work that is necessary in order for you to build your parser correctly.


What is a recursive descent parser?

A recursive descent parser is a parser that is constructed from a set of mutually recursive functions, each of which corresponds to one nonterminal symbol in a grammar. For example, consider the following simple grammar G:

    S -> aA | bB
    A -> bc | cB
    B -> cd | dA

The heart of a recursive descent parser for the language of G would be three functions, one for each nonterminal. Using only the next token of input, these functions decide which rule to expand by, then consume tokens and make calls to other functions as needed. For the grammar above, the functions would be implemented according to the following pattern:

parse_S():
    if (next token is 'a')
        // S -> aA
        consumeToken('a')
        parse_A()
    else if (next token is 'b')
        // S -> bB
        consumeToken('b')
        parse_B()
    else
        ERROR!
    
parse_A():
    if (next token is 'b')
        // A -> bc
        consumeToken('b')
        consumeToken('c')
    else if (next token is 'c')
        // A -> cB
        consumeToken('c')
        parse_B()
    else
        ERROR!
    
parse_B():
    if (next token is 'c')
        // B -> cd
        consumeToken('c')
        consumeToken('d')
    else if (next token is 'd')
        // B -> dA
        consumeToken('d')
        parse_A()
    else
        ERROR!
    

The presumption here is that the consumeToken( ) operation does two things:

  1. Check whether the next token of the input is the desired token. If not, signal an error.
  2. Move to the next token of the input.

Calling parse_S( ) and having it run to completion without signaling an error indicates the existence of a valid parse tree for the input program. By including additional code within this framework, the parser can be upgraded to build a parse tree, an abstracted version of a parse tree called an abstract syntax tree, and/or a variety of other forms of output. (A typical compiler will build one or more intermediate representations of the program during parsing, then pass them on to subsequent stages of the compiler.)
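As one possible rendering of this pattern in Java, the sketch below recognizes the language of G, assuming that each character of a string is one token. The class name GParserSketch, the '$' end marker, and the use of IllegalStateException to signal errors are assumptions made for illustration; none of them come from the provided starting point.

```java
// Hypothetical sketch: a recognizer for grammar G, where each character is one token.
class GParserSketch {
    private static final char END = '$';   // assumed end-of-input marker
    private final String input;
    private int pos = 0;                   // index of the next unconsumed token

    GParserSketch(String input) {
        this.input = input;
    }

    // Returns the next token without consuming it.
    private char peek() {
        return pos < input.length() ? input.charAt(pos) : END;
    }

    private IllegalStateException error() {
        return new IllegalStateException("unexpected token at position " + pos);
    }

    // Checks that the next token is the expected one, then moves past it.
    private void consumeToken(char expected) {
        if (peek() != expected) {
            throw error();
        }
        pos++;
    }

    private void parseS() {
        if (peek() == 'a') {            // S -> aA
            consumeToken('a');
            parseA();
        } else if (peek() == 'b') {     // S -> bB
            consumeToken('b');
            parseB();
        } else {
            throw error();
        }
    }

    private void parseA() {
        if (peek() == 'b') {            // A -> bc
            consumeToken('b');
            consumeToken('c');
        } else if (peek() == 'c') {     // A -> cB
            consumeToken('c');
            parseB();
        } else {
            throw error();
        }
    }

    private void parseB() {
        if (peek() == 'c') {            // B -> cd
            consumeToken('c');
            consumeToken('d');
        } else if (peek() == 'd') {     // B -> dA
            consumeToken('d');
            parseA();
        } else {
            throw error();
        }
    }

    // Returns true if the entire input is a sentence of G.
    static boolean accepts(String input) {
        GParserSketch parser = new GParserSketch(input);
        try {
            parser.parseS();
        } catch (IllegalStateException e) {
            return false;
        }
        return parser.pos == input.length();
    }

    public static void main(String[] args) {
        System.out.println(accepts("acdbc"));  // true:  S -> aA, A -> cB, B -> dA, A -> bc
        System.out.println(accepts("ab"));     // false: A -> bc requires a 'c' after the 'b'
    }
}
```

The accepts( ) helper also checks that no input remains after parseS( ) returns, so that a valid sentence followed by extra tokens is rejected.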

Notice that this framework does not use backtracking. Every time it needs to expand some nonterminal, it chooses one of the right-hand sides of that nonterminal's rules; the penalty for choosing a rule that does not correspond to the input is an error, which presumably stops the parser from proceeding. Grammars that can be parsed using this technique are called LL(1) grammars or predictive grammars. (The term LL(1) arises from the fact that such grammars can be parsed by doing a Left-to-right scan of the input and building a Leftmost derivation, using at most 1 token of lookahead.) Naturally, not all grammars can be parsed this way. Here's one example:

This grammar is not LL(1), and thus cannot be parsed using a recursive descent parser, since it has left recursion in it. Specifically, the rule A -> A b can cause a recursive descent parser to go into infinite recursion, as the corresponding parse_A( ) routine might call parse_A( ) without consuming any input, then call parse_A( ) again without consuming input, and so on. (Left recursion comes in two flavors: immediate left recursion, shown above, and indirect left recursion. Techniques for eliminating both forms of left recursion were discussed in class and are discussed in the textbook.)
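To illustrate the effect of the rewrite, suppose a hypothetical grammar with the two rules A -> A b and A -> c (the second rule is an assumption made for this example). Eliminating the immediate left recursion gives A -> c A' and A' -> b A' | epsilon. The Java sketch below parses that rewritten grammar; every recursive call is preceded by consuming a token, so the recursion must terminate. The class and method names are hypothetical, and each character is assumed to be one token.

```java
// Hypothetical sketch: parsing A -> c A', A' -> b A' | epsilon,
// the result of removing left recursion from A -> A b | c.
class LeftRecursionSketch {
    private final String input;
    private int pos = 0;   // index of the next unconsumed token

    LeftRecursionSketch(String input) {
        this.input = input;
    }

    // Returns the next token, or '$' (an assumed end marker) at end of input.
    private char peek() {
        return pos < input.length() ? input.charAt(pos) : '$';
    }

    private void consumeToken(char expected) {
        if (peek() != expected) {
            throw new IllegalStateException("unexpected token at position " + pos);
        }
        pos++;
    }

    // A -> c A'
    private void parseA() {
        consumeToken('c');
        parseATail();
    }

    // A' -> b A' | epsilon
    private void parseATail() {
        if (peek() == 'b') {
            consumeToken('b');   // a token is consumed before recursing
            parseATail();
        }
        // Otherwise expand by epsilon and consume nothing.
    }

    // Returns true if the entire input matches c followed by zero or more b's.
    static boolean accepts(String input) {
        LeftRecursionSketch parser = new LeftRecursionSketch(input);
        try {
            parser.parseA();
        } catch (IllegalStateException e) {
            return false;
        }
        return parser.pos == input.length();
    }

    public static void main(String[] args) {
        System.out.println(accepts("cbb"));  // true
        System.out.println(accepts("b"));    // false
    }
}
```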

Another problem that prevents a grammar from being LL(1) is demonstrated in this grammar:

The right-hand sides of the two rules for S both begin with an a, meaning that it will be impossible to choose one of the rules by looking at only the next token of input. The solution to this problem is left factoring, which we discussed in lecture and which is covered in some detail in the textbook.
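As a small illustration, take a hypothetical grammar S -> a b | a c (assumed for this example). Left factoring pulls out the common prefix, giving S -> a S' and S' -> b | c, after which one token of lookahead is enough to choose a rule. A minimal Java sketch of the factored grammar follows, with hypothetical names and one character per token.

```java
// Hypothetical sketch: parsing S -> a S', S' -> b | c,
// the left-factored form of S -> a b | a c.
class LeftFactoringSketch {
    private final String input;
    private int pos = 0;   // index of the next unconsumed token

    LeftFactoringSketch(String input) {
        this.input = input;
    }

    // Returns the next token, or '$' (an assumed end marker) at end of input.
    private char peek() {
        return pos < input.length() ? input.charAt(pos) : '$';
    }

    private void consumeToken(char expected) {
        if (peek() != expected) {
            throw new IllegalStateException("unexpected token at position " + pos);
        }
        pos++;
    }

    // S -> a S' : the shared prefix is consumed once, with no decision needed.
    private void parseS() {
        consumeToken('a');
        parseSTail();
    }

    // S' -> b | c : the decision is deferred until the tokens actually differ.
    private void parseSTail() {
        if (peek() == 'b') {
            consumeToken('b');
        } else if (peek() == 'c') {
            consumeToken('c');
        } else {
            throw new IllegalStateException("unexpected token at position " + pos);
        }
    }

    // Returns true if the entire input is "ab" or "ac".
    static boolean accepts(String input) {
        LeftFactoringSketch parser = new LeftFactoringSketch(input);
        try {
            parser.parseS();
        } catch (IllegalStateException e) {
            return false;
        }
        return parser.pos == input.length();
    }

    public static void main(String[] args) {
        System.out.println(accepts("ac"));  // true
        System.out.println(accepts("aa"));  // false
    }
}
```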

Left recursion elimination and left factoring are not necessarily enough to make a grammar have the LL(1) property; some languages cannot be expressed with LL(1) grammars. Fortunately, most programming language constructs can be expressed using LL(1) grammars, making them useful in parsers for programming languages.


The updated Monkie2004 language for this assignment

Monkie2004 is a simple imperative-style language. A Monkie2004 program is a sequence of global variable declarations, procedures, and functions. The distinction between procedures and functions is the same as the distinction between a void method in Java and one that returns a value. Each procedure and function consists of a signature (a name, a parameter list, and -- in the case of a function -- a return type), then a block statement, which is one or more statements surrounded by a matched pair of brackets (i.e., '[' and ']'). There are a few kinds of statements in Monkie2004: local variable declarations, assignments, procedure calls, if statements, while loops, and block statements.

A few of the keywords that were present in the scanner are no longer considered a part of the language: and, call, implies, not, or, xor. Two new keywords, true and false, have been added. The <= and >= operators have been removed, while an integer negation operator ~ (analogous to Java's unary minus) has been added. Integer literals may no longer contain negative signs; instead, the integer negation operator should be used to specify negation, so the integer -3 is represented in Monkie2004 as ~3.

It should be noted that Monkie2004 is case-sensitive. Keywords must appear in all lowercase, and the identifiers result and Result are considered different.

What follows is an unambiguous, but not LL(1), grammar for Monkie2004. Nonterminal symbols are indicated by capitalized, italicized words, such as Program or BlockStatement. Terminal symbols are indicated by boldface words or symbols, such as while or (. The start symbol for the grammar is Program.


Part 1: Correcting the provided Monkie2004 grammar (25 points)

The first step in building a recursive descent parser for Monkie2004 is to rewrite the grammar so that it is LL(1). The provided grammar has multiple instances of two kinds of problems in it: immediate left recursion and the need for left factoring. Using the techniques we discussed in lecture (and also in the textbook), rewrite the grammar so that it is an LL(1) grammar that recognizes the same language. You may write the grammar using Microsoft Word and submit it as a .doc file, or you may write it in any other tool you wish and convert it to PDF format instead.


Part 2: Computing FIRST, FOLLOW, and FIRST+ sets for your grammar (25 points)

The next step in building a recursive descent parser is to compute the FIRST, FOLLOW, and FIRST+ sets for your rewritten Monkie2004 grammar. You may use the algorithm given in the textbook, though I suggest using a less structured approach, since these sets can be determined effectively by eyeballing the grammar.

Remember that FIRST, FOLLOW, and FIRST+ sets are defined as follows:

As an example, consider one of the grammars given earlier in the write-up:

As with your rewritten grammar, you may write your FIRST, FOLLOW, and FIRST+ sets using Microsoft Word, or using any other tool so long as you convert them to PDF format before submitting them.
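If you would like to check sets that you worked out by hand, a small program can help. The Java sketch below computes FIRST sets by repeating a pass over the rules until nothing changes, and applies it to grammar G from earlier in this write-up. It is a minimal fixed-point sketch and not necessarily the textbook's algorithm; the class name, the map-based grammar encoding, and the use of the string "epsilon" to stand for the empty string are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: FIRST sets by fixed-point iteration.
// A grammar maps each nonterminal to its right-hand sides; an empty
// right-hand side stands for an epsilon rule.
class FirstSetsSketch {
    static final String EPSILON = "epsilon";   // assumed marker, not a real terminal

    static Map<String, Set<String>> computeFirst(Map<String, List<List<String>>> grammar) {
        Map<String, Set<String>> first = new TreeMap<>();
        for (String nonterminal : grammar.keySet()) {
            first.put(nonterminal, new TreeSet<>());
        }
        boolean changed = true;
        while (changed) {   // repeat until a full pass adds nothing
            changed = false;
            for (Map.Entry<String, List<List<String>>> rule : grammar.entrySet()) {
                Set<String> target = first.get(rule.getKey());
                for (List<String> rhs : rule.getValue()) {
                    if (addFirstOfRhs(rhs, grammar, first, target)) {
                        changed = true;
                    }
                }
            }
        }
        return first;
    }

    // Adds FIRST(rhs) to target; returns true if target grew.
    private static boolean addFirstOfRhs(List<String> rhs,
                                         Map<String, List<List<String>>> grammar,
                                         Map<String, Set<String>> first,
                                         Set<String> target) {
        boolean changed = false;
        for (String symbol : rhs) {
            if (!grammar.containsKey(symbol)) {   // a terminal begins the rest
                if (target.add(symbol)) {
                    changed = true;
                }
                return changed;
            }
            // Copy before iterating, in case symbol's set is the target itself.
            for (String terminal : new ArrayList<>(first.get(symbol))) {
                if (!terminal.equals(EPSILON) && target.add(terminal)) {
                    changed = true;
                }
            }
            if (!first.get(symbol).contains(EPSILON)) {
                return changed;   // this symbol cannot vanish, so stop here
            }
        }
        // Every symbol can derive epsilon (or the right-hand side is empty).
        if (target.add(EPSILON)) {
            changed = true;
        }
        return changed;
    }

    // Grammar G: S -> aA | bB, A -> bc | cB, B -> cd | dA
    static Map<String, List<List<String>>> grammarG() {
        return Map.of(
            "S", List.of(List.of("a", "A"), List.of("b", "B")),
            "A", List.of(List.of("b", "c"), List.of("c", "B")),
            "B", List.of(List.of("c", "d"), List.of("d", "A")));
    }

    public static void main(String[] args) {
        System.out.println(computeFirst(grammarG()));
        // prints {A=[b, c], B=[c, d], S=[a, b]}
    }
}
```

Since G has no epsilon rules, each FIRST set here is simply the set of terminals that begin the right-hand sides of that nonterminal's rules.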


Part 3: Building your recursive descent parser (50 points)

Now that you've rewritten the Monkie2004 grammar to be LL(1) and computed FIRST, FOLLOW, and FIRST+ sets, you have all the information you need to implement your recursive descent parser, using the pattern described in your textbook and discussed in lecture.

Your parser should print, as output, an indication of what procedures, functions, and statements it recognized. For example, given the following input program:

var globalInteger: integer;

procedure program()
[
    var x: integer;
    x <- 10;

    while x > 0 do
    [
        x <- x - 1;
        
        if x == 0 then
        [
            print("done");
        ]
    ]
]

...your parser should produce output in the following form:

variable declaration
procedure
[
    variable declaration
    assignment
    while loop
    [
        assignment
        if statement
        [
            procedure call
        ]
    ]
]

Your parser is not required to (and should not) build a parse tree or any intermediate representation of the program; output should be generated on the fly as statements are recognized. Output should use indentation to convey membership of statements in block statements, as shown in the example output above.
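One way to produce indented output on the fly is to keep a single depth counter that is incremented when a block statement opens and decremented when it closes. The Java sketch below shows the idea; the class and method names are hypothetical, and it collects output in a StringBuilder only so that the result is easy to inspect (a real parser could print each line directly). The demo( ) method replays, by hand, the events a parser would report for the example program above.

```java
// Hypothetical sketch: indentation driven by a single depth counter.
class IndentedOutputSketch {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0;   // number of enclosing block statements

    // Writes one line, indented four spaces per enclosing block.
    void emit(String text) {
        for (int i = 0; i < depth; i++) {
            out.append("    ");
        }
        out.append(text).append('\n');
    }

    // Called when the parser consumes the '[' that opens a block statement.
    void openBlock() {
        emit("[");
        depth++;
    }

    // Called when the parser consumes the matching ']'.
    void closeBlock() {
        depth--;
        emit("]");
    }

    String result() {
        return out.toString();
    }

    // Replays the events for the example program in this write-up.
    static String demo() {
        IndentedOutputSketch o = new IndentedOutputSketch();
        o.emit("variable declaration");
        o.emit("procedure");
        o.openBlock();
        o.emit("variable declaration");
        o.emit("assignment");
        o.emit("while loop");
        o.openBlock();
        o.emit("assignment");
        o.emit("if statement");
        o.openBlock();
        o.emit("procedure call");
        o.closeBlock();
        o.closeBlock();
        o.closeBlock();
        return o.result();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```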

As in the previous assignment, you are required to provide a driver program in a class called Driver in a file called Driver.java, so that we can compile and run your program with the following sequence of commands:

    javac *.java
    java Driver inputfile.m

...where inputfile.m is the name of a Monkie2004 program. I've provided such a Driver program in the starting point, which you may modify if you wish, though it must still behave in a way that allows us to run your program using the commands specified above.

Unlike in the previous assignment, we will attempt to run your program using Monkie2004 programs that have errors in them. In the case of an erroneous input program, your parser may print an error message and quit as soon as the first error is discovered. It is not necessary to provide an error message that indicates the nature of the problem (though you may, if you'd like). You are required, however, to print the line and column of the token that caused the problem. I suggest throwing a ParserError (a class I've provided), catching it in your driver class, and doing this:

    System.out.println(e.getMessage());
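The overall shape of that throw-and-catch arrangement might look like the Java sketch below. Note that SketchParserError is a hypothetical stand-in written for this example; the constructor and message format of the provided ParserError class may differ, so check the provided class before copying anything. The parse( ) method here only pretends to parse, failing at a made-up line and column.

```java
// Hypothetical sketch: reporting the line and column of the offending token.
class ErrorReportingSketch {
    // Stand-in for the provided ParserError class; its real API may differ.
    static class SketchParserError extends RuntimeException {
        SketchParserError(String what, int line, int column) {
            super("line " + line + ", column " + column + ": " + what);
        }
    }

    // Pretends to parse: fails at a made-up position when the input is invalid.
    static void parse(boolean inputIsValid) {
        if (!inputIsValid) {
            throw new SketchParserError("unexpected token", 3, 7);
        }
    }

    // Mirrors what a driver might do: run the parser, report the first error.
    static String run(boolean inputIsValid) {
        try {
            parse(inputIsValid);
            return "ok";
        } catch (SketchParserError e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(false));  // prints "line 3, column 7: unexpected token"
    }
}
```

Because the exception propagates out of however many parse methods are active, the parser stops at the first error without any extra bookkeeping.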

To get you started, I'm providing a starting point, consisting of compiled versions of my Scanner, Token, and ScannerError classes, along with a skeleton for your parser in Parser.java (including some helper methods, such as consumeToken( ), that you will find useful), a ParserError class that you will likely find useful for reporting errors, and the complete source code for your Driver class. Here's a link to the starting point:

A little documentation for the compiled classes that I've provided will be necessary, so here goes...


Deliverables

You must submit your two Microsoft Word or PDF documents -- one containing your grammar and another containing your FIRST, FOLLOW, and FIRST+ sets. Additionally, you are to submit all of the .java files that comprise your program, including those that have been provided. You should not submit the .class files provided to you, nor should you submit any compiled versions of your own code or other files generated by your development environment.

Follow this link for a discussion of how to submit your assignment. Remember that we do not accept paper submissions of your assignments, nor do we accept them via email under any circumstances.


Limitations

In case it isn't obvious from the rest of the write-up, I expect you to hand-code your parser. Use of automated tools to build your parser is strictly forbidden, and will result in an automatic score of 0 on this assignment.