ICS 142 Winter 2004
Assignment #3

Due date and time: Monday, February 23, 11:59pm


Introduction

The tasks of scanning and parsing are largely centered around what a program in a language is permitted to look like, often called the language's syntax. Of course, just because a program is syntactically correct doesn't make it a legal program. The semantics of a programming language are rules that describe the meaning of syntactically-legal programs. Additionally, semantic rules may render many syntactically-legal programs erroneous, for a variety of reasons. For example, the following Java program is syntactically legal:

    public class A
    {
        public static void main(String[] args)
        {
            int i = 0;
            System.out.println(j);
        }
    }

A parser will accept this program as a legal program according to the rules of Java syntax. But, of course, we know that this program is not a legal Java program, since it attempts to print the value of an undeclared variable j. It can be shown that context-free grammars cannot be used to express rules such as one that requires a variable to be declared before use. Rather than complicate parsing by using a more powerful formalism than context-free grammars, it makes more sense to let parsers do what they do best -- check for syntactic correctness -- and defer the checking of semantic rules to a later stage of compilation, using more traditional programming techniques.

In this assignment, you will implement a semantic analyzer for the Monkie2004 language, which will check for violations of semantic rules, such as in the example above. To facilitate this analysis, you will first augment a provided parser with actions that build an intermediate data structure called an abstract syntax tree. Thanks to Java's object-oriented features, the code to perform these tasks will be remarkably concise, if designed well; a substantial portion of the design is provided to you in the starting point to help you stay on the right track.


CUP: an LALR parser generator for Java

CUP is an LALR parser generator. It takes as input a script that describes a parser, which is centrally built around a grammar for some language. Its output is two Java classes that, together, implement an LALR parser for the language. An LALR parser is a shift-reduce parser whose tables are generated using techniques similar to those we discussed in lecture for LR(1) parsing tables, but significantly optimized (and, as it turns out, slightly less powerful).

The general structure of a CUP script

A CUP script is broken informally into four parts: import and package directives, customization code, a listing of symbols and their associated types, and a grammar annotated with actions. An example follows:

// Import and package directives
import java_cup.runtime.*;


// Customization code
action code
{:
    public void output(Integer dohCount)
    {
        System.out.println("The DOH! count is " + dohCount.intValue());
    }
:}


// Symbol listing
terminal              DOH;
nonterminal           Goal;
nonterminal Integer   Homer;

start with Goal;   // specification of the unique start symbol


// The grammar, with annotated actions.

Goal ::= Homer:h
            {: output(h); :}
;

Homer ::= DOH Homer:h
            {: RESULT = new Integer(1 + h.intValue()); :}
        | DOH
            {: RESULT = new Integer(1); :}
;

When combined with a scanner that returns DOH tokens, this CUP script generates a parser that counts the number of DOH tokens in the input program and outputs that total to System.out.

How does a CUP-generated parser work?

CUP generates a shift-reduce parser, very similar to the ones we described in class. The parser will maintain a stack of parser states and symbols, then make a sequence of decisions to either shift or reduce, in each case based only on the next token of input. If all you wanted was an indication of whether the sentence was legal or not, that would be all you would need: if the input was accepted by the parser, you would see no output at all; if the input was not accepted by the parser, you would see a simple message such as "Parse error" and the parser would quit as soon as the error was detected.

However, in a complete compiler, a parser's job is not just to determine acceptance or rejection of the input. It also must do additional work to seed future stages of compilation with a boiled-down version of the input program. This means that additional calculations, such as building an abstract syntax tree or maintaining a collection of all declared variables and their types, must be done simultaneously. At first, this might seem to complicate matters tremendously: How can we nicely split up the work of parsing from these additional calculations? In a shift-reduce parser, there's a sensible time to do this kind of work: whenever a reduction occurs! At reduction time, two things are true:

  1. The parser has just completely recognized, say, a VariableDeclaration, but hasn't yet recognized any of the code that follows it.
  2. All of the relevant information about the VariableDeclaration, such as the names of the identifiers before and after the colon, is available on the parser stack.

Given these two facts, a reduction to VariableDeclaration could be accompanied by the creation of an object that stores information about the variable declaration (the identifiers for the variable's name and type). After this action is taken, parsing should continue normally.

To support these kinds of actions, every symbol on the parser stack is accompanied by a value represented by a Java object. Different types of symbols are accompanied by different types of objects; some are not accompanied by objects at all. The symbol listing in the CUP script specifies which symbols are accompanied by which types. In the example above, each Homer symbol is accompanied by an Integer object, while other symbols are not accompanied by any object.

Here's a step-by-step example of the above CUP parser working on the input string doh doh doh. For the sake of clarity, I've left the parser states out of the example. Each occurrence of Homer in the parser stack is accompanied by an Integer object, which is shown in the example with a / (i.e. if a Homer is accompanied by the Integer 3, it is shown on the stack as Homer/3). The bottom of the stack is represented by [[.

    Parser Action                 Stack                  Remaining Input
                                  [[                     doh doh doh
    shift                         [[ doh                 doh doh
    shift                         [[ doh doh             doh
    shift                         [[ doh doh doh
    reduce Homer → doh            [[ doh doh Homer/1
    reduce Homer → doh Homer      [[ doh Homer/2
    reduce Homer → doh Homer      [[ Homer/3
    reduce Goal → Homer           [[ Goal
      (3 is printed to System.out)

The effect of storing objects with every symbol on the parser stack and executing actions on each reduction is that each rule in the grammar acts, in some ways, like a function in a programming language. When a rule is reduced, information accompanying the symbols on the right-hand side is synthesized and the result is stored with the symbol on the left-hand side. When an input program is completely parsed, all of this information will have been synthesized into one object associated with the start symbol. This turns out to be a very convenient way to build intermediate representations of a program, such as abstract syntax trees, which I'll discuss a bit later in this write-up.
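The trace above can be replayed in ordinary Java, which may make the "rules act like functions" idea more concrete. This is only a sketch, and it is not the code that CUP generates: the class HomerTrace, its Entry type, and the countDohs method are names invented for this illustration, and the parser states are left out just as they are in the trace. It shifts every doh, then applies the two Homer reductions, carrying the Integer value along with each Homer on the stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration of the trace; not the parser CUP would generate.
class HomerTrace {
    // One stack entry: a grammar symbol plus the value that accompanies it
    // (null for symbols, such as DOH, that carry no object).
    static final class Entry {
        final String symbol;
        final Integer value;

        Entry(String symbol, Integer value) {
            this.symbol = symbol;
            this.value = value;
        }
    }

    // Returns the value attached to the final Homer, i.e. the number of dohs.
    static int countDohs(List<String> tokens) {
        if (tokens.isEmpty()) {
            throw new IllegalArgumentException("Parse error");
        }
        Deque<Entry> stack = new ArrayDeque<Entry>();

        // Shift every token onto the stack.
        for (String token : tokens) {
            if (!token.equals("doh")) {
                throw new IllegalArgumentException("Parse error");
            }
            stack.push(new Entry("DOH", null));
        }

        // reduce Homer ::= DOH            {: RESULT = 1 :}
        stack.pop();
        stack.push(new Entry("Homer", 1));

        // reduce Homer ::= DOH Homer:h    {: RESULT = 1 + h :}
        while (stack.size() > 1) {
            Entry h = stack.pop();   // the Homer on the right-hand side
            stack.pop();             // the DOH in front of it
            stack.push(new Entry("Homer", 1 + h.value));
        }

        // reduce Goal ::= Homer:h; the caller plays the role of output(h).
        return stack.pop().value;
    }
}
```

Calling HomerTrace.countDohs with the three tokens doh doh doh walks through the same stack contents as the table, ending with Homer/3.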

Accessing the objects stored in the symbols on the stack requires giving names to them. For example, consider this fragment from the above example:

Homer ::= DOH Homer:h
            {: RESULT = new Integer(1 + h.intValue()); :}

When the rule Homer → doh Homer is used in a reduction, we want our action to add 1 to the value associated with the Homer on the right-hand side, then store the result into the Homer on the left-hand side. The object associated with the symbol on the left-hand side of a rule is always accessible via a variable called RESULT. Objects associated with symbols on the right-hand side of the rule can only be accessed if we give them names. In this example, we've given the Homer on the right-hand side the name h, by writing this: Homer:h. In the action for this rule, we can then refer to the object associated with the Homer on the right-hand side by simply using the name h.

But, wait, I want to see a complete example in action!

No problem! Below is a link to a complete version of the Homer example, including a JFlex scanner, a CUP parser, a driver program, and a sample input file.

To run the example, unzip the archive into a folder, start up a Command Prompt, change into the folder where you unzipped the archive, then run the following commands:

jflex homer.flex
cup homer.cup
javac *.java
java Driver inputfile

Using CUP in the ICS labs

CUP is already installed on the Windows workstations in the ICS labs. CUP is primarily driven via the command line, so to use it, start by bringing up a Command Prompt window. On the lab machines, CUP is not configured to work by default. Each time you start a Command Prompt, you'll need to load its settings by executing these two commands:

cd \opt\ics142
start142

(Note: On a few of the machines in the lab, the start142.bat file might be called start.bat instead. This is an installation bug that hasn't been eradicated fully, since all the machines aren't always on when it's time to update them. In that case, execute this command instead:

.\start

The dot and the backslash are necessary because start is actually a Windows shell command.)


Installing CUP at home

The prepared installation that I provided as part of Assignment #1 contained both JFlex and CUP. Assuming you installed that when you were working on Assignment #1, you should already be ready to use CUP. If not, check out the Assignment #1 write-up and follow the installation instructions therein.


Compiling a CUP script

You can compile a CUP script (e.g. blah.cup) like this:

java java_cup.Main blah.cup

This is cumbersome to type and also gives lousy names (that start with lowercase letters!) to the classes it generates. I prefer (and, in the case of this assignment, require) calling the parser class Parser and the symbol class Tokens, which requires a much longer command:

java java_cup.Main -parser Parser -symbols Tokens blah.cup

Since that's way too cumbersome to type, I've provided a batch file called cup.bat in my starting point (and in the Homer example), which will allow you to use this command to compile a CUP script instead:

cup blah.cup

This will create two Java source files (e.g. Parser.java and Tokens.java). Don't forget to compile these .java files before executing your program!

(Unix/Linux users: You can write a shell script or use alias to accomplish the same goal as my Windows batch file.)


Changes to the Monkie2004 language for this assignment

The Monkie2004 language is much the same as it was in the last assignment, with a few operators added to the rules for expressions. (These are the operators I took out in order to keep the grammar simpler for your hand-coded parser. Now that we're using CUP to generate our parser for us, adding these operators does not overburden you with complexity, but makes Monkie2004 much more expressive.) No changes have been made to the existing operators, or any other part of the language.

The newly-added operators are:

The precedence and associativity of these new operators can be inferred from the (unambiguous) grammar in the CUP script provided in the starting point.


Static semantic rules of the Monkie2004 language

In addition to being lexically and syntactically correct, a Monkie2004 program must follow all of these static semantic rules in order to be considered valid. A program that violates any of these rules is considered invalid.


Wow, I've enjoyed reading all of this, but what am I supposed to do??

Your program will perform a complete static analysis of an input program. I've provided a working scanner and a working parser (with no actions in it), so you can assume that lexical and syntactic analysis is already working. Your task is to augment the parser to build an intermediate form called an abstract syntax tree, which will be passed to a semantic analyzer that you will write. When you're finished with the assignment, you'll have a program that's capable of finding any compile-time errors in a Monkie2004 program. I don't know about you, but I find that exceedingly cool!


Part 1: Augmenting the parser to build an abstract syntax tree (AST) (30 points)

What's an AST?

An abstract syntax tree (AST) is not a tree in the same sense as a binary search tree (i.e. all of the nodes are of the same type). It is, however, a tree-like structure that contains all of the relevant information in an input program. For example, consider this brief Monkie2004 program:

    var global: integer;

    procedure program()
    [
        global <- read_integer();
        print_string("You just typed in the number ");
        print_integer(global);
        print_endline();
    ]

An abstract syntax tree might be constructed as a list of definitions:

Before object-oriented languages like Java existed, building an abstract syntax tree was a difficult proposition, since different "nodes" of the tree represent different constructs in the input program (e.g. assignment statements, function calls, etc.) and, thus, needed to be stored in different kinds of structures. But using Java, building an abstract syntax tree is relatively straightforward, thanks to inheritance and polymorphism.

The entire abstract syntax tree can be stored in a list structure, such as ArrayList. Each object in the ArrayList would be some kind of Definition object, of which there are two subclasses, VariableDeclaration and SubprogramDeclaration. (SubprogramDeclaration objects would contain information about either a procedure or a function; of course, something in the object would have to tell you whether it represented a procedure or a function. One way to do that would be to store a String variable containing the identifier for the return type for functions, and containing null or the empty string for procedures.)

Continuing this line of thinking, you'd probably also want to have two other superclasses: Statement and Expression. Statement would have subclasses such as AssignmentStatement, BlockStatement, and IfStatement. Expression would have subclasses such as OrExpression and ConcatenationExpression. As examples, an AssignmentStatement object might contain a String for the identifier on the left and an Expression object for the expression on the right. A ConcatenationExpression might contain two other Expression objects, one each for the left and right operands.

Since it might be useful in error reporting for all of these nodes to keep track of the beginning of their location in the input program, you might even want to have Definition, Statement, and Expression extend from a superclass like Construct, where Constructs have two integers, line and column, stored in them. This is what I did in my solution, and I've provided classes like Construct, Definition, Statement, and Expression in the starting point.
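To make the class hierarchy described above a little more concrete, here is a minimal sketch of how a few of these node classes might fit together. The class names Construct, Definition, Statement, Expression, VariableDeclaration, AssignmentStatement, and ConcatenationExpression come from this write-up, but every field, constructor, and accessor shown is an assumption, and VariableReference is a made-up leaf node added only so the example has something to concatenate. The classes in the starting point may look quite different.

```java
// Hypothetical sketch: names follow the write-up, but the fields,
// constructors, and accessors are assumptions, not the provided code.
abstract class Construct {
    private final int line;
    private final int column;

    protected Construct(int line, int column) {
        this.line = line;
        this.column = column;
    }

    int getLine() { return line; }
    int getColumn() { return column; }
}

abstract class Definition extends Construct {
    protected Definition(int line, int column) { super(line, column); }
}

abstract class Statement extends Construct {
    protected Statement(int line, int column) { super(line, column); }
}

abstract class Expression extends Construct {
    protected Expression(int line, int column) { super(line, column); }
}

// A variable's name and the identifier naming its type.
class VariableDeclaration extends Definition {
    final String name;
    final String typeName;

    VariableDeclaration(String name, String typeName, int line, int column) {
        super(line, column);
        this.name = name;
        this.typeName = typeName;
    }
}

// Assumed leaf node for a use of a variable; not named in the write-up.
class VariableReference extends Expression {
    final String name;

    VariableReference(String name, int line, int column) {
        super(line, column);
        this.name = name;
    }
}

// Two sub-expressions, one each for the left and right operands.
class ConcatenationExpression extends Expression {
    final Expression left;
    final Expression right;

    ConcatenationExpression(Expression left, Expression right, int line, int column) {
        super(line, column);
        this.left = left;
        this.right = right;
    }
}

// The identifier on the left and the expression on the right.
class AssignmentStatement extends Statement {
    final String target;
    final Expression source;

    AssignmentStatement(String target, Expression source, int line, int column) {
        super(line, column);
        this.target = target;
        this.source = source;
    }
}
```

Because every node is ultimately a Construct, code that reports an error can ask any node for its line and column without caring what kind of node it is.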

Building an AST in your CUP parser

Because an AST takes on a structure that's a simplified version of the structure of the parse tree, building an AST in your CUP parser is relatively straightforward. Each nonterminal symbol in the grammar should carry with it an AST node of some type; some will carry specific AST node types, such as IfStatement or WhileLoop, while others will carry less specific node types, such as Expression, Definition, or ArrayList. Each parser action does one of two things:

Here are some CUP code fragments that, together, create an AST node for a global variable declaration. Your solution may differ greatly from this, depending on how you design your AST node classes, but this should give you the general idea of the approach you should use.

    // In this case, there's nothing to be added to the AST node, since
    // VariableDeclaration's action (or something that led to it) presumably
    // created an AST node, say a VariableDeclaration object.

    Definition ::=
        VariableDeclaration:vd
            {: RESULT = vd; :}

    // ...

    // There isn't enough information to create an AST node for the
    // variable declaration here, since we would need to store the identifiers
    // for the variable's name and type.  So we'll have the TypeDeclaration
    // actions create the object, and just use this action to pass it up the
    // parse tree.

    VariableDeclaration ::=
        VAR TypeDeclaration:td SEMICOLON
            {: RESULT = td; :}

    // ...
    
    // Here, we create and pass up a VariableDeclaration object.  In my
    // solution, a VariableDeclaration object contained four fields:
    // the name of the variable, the name of the variable's type, and
    // the line and column of the beginning of the construct (which can
    // be discovered by asking the first identifier for its line and
    // column, which is supplied by the scanner).

    TypeDeclaration ::=
        IDENTIFIER:name COLON IDENTIFIER:typeName
            {: RESULT = new VariableDeclaration(
                   name.getValue(), typeName.getValue(),
                   name.getLine(), name.getColumn()); :}

Your task

Your task in Part 1 is twofold:

At the conclusion of your work on Part 1, your program will not output anything, but it will be ready for you to add your code for Part 2.

A couple of things you should know. The terminal symbols have already been given object types in the provided CUP script. These must not be changed, since the scanner is the one who creates and provides those objects. There are three kinds of objects that are associated with terminal symbols:

On the other hand, you'll need to specify the object types for the nonterminals, according to your AST design.


Part 2: Performing semantic checking on the AST (70 points)

Once you've built your AST, it's time to write your semantic analyzer. The semantic analyzer's job is to traverse the AST and call an analyze( ) method on each of the "nodes" therein. As we discussed in Part 1, there are three main types of nodes in the AST: Definition objects, Statement objects, and Expression objects.

In order to do this analysis, you need an additional supporting data structure called a symbol table. The symbol table keeps track of the declarations of all of the identifiers at a given point in the analysis of the input program, i.e. all variables, types, and subprograms. There are two main operations that you can perform on the symbol table: declare and lookup, where declare adds a new declaration to the symbol table, and lookup finds an existing one.

I've provided a complete, efficient implementation of a SymbolTable class in the starting point. To keep things very clear, since there are different kinds of declarations in a Monkie2004 program, I've separated the declare operation into three methods: declareVariable( ), declareSubprogram( ), and declareType( ) (the latter of which I doubt you'll use, since Monkie2004 does not allow type declarations). Similarly, there are three kinds of lookups: lookupSubprogram( ), lookupType( ), and lookupVariable( ), along with a fourth, lookupLocalVariable( ), that's useful at times. In addition, there are two methods, enterScope( ) and exitScope( ), that you'll need to call whenever you enter or exit a scope, respectively. Take a look at the comments in the SymbolTable.java file that I provided if you want to know more about this class. The symbol table is passed to all of the analyze( ) methods as a parameter, so that it will always be available during analysis.
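If the interplay between scopes and lookups is hard to picture, the following sketch may help. It is not the provided SymbolTable class: ScopedVariables is a hypothetical, variables-only stand-in, the method signatures are guesses (types are represented as plain Strings), and it assumes that a declaration in an inner scope may shadow one in an outer scope. Read the comments in SymbolTable.java for how the real class behaves.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, variables-only stand-in for a scoped symbol table.
// Signatures and shadowing behavior are assumptions for illustration.
class ScopedVariables {
    // Innermost scope is at the head of the deque.
    private final Deque<Map<String, String>> scopes =
        new ArrayDeque<Map<String, String>>();

    ScopedVariables() {
        enterScope();   // the global scope
    }

    void enterScope() {
        scopes.push(new HashMap<String, String>());
    }

    void exitScope() {
        scopes.pop();
    }

    // Returns false if the name is already declared in the current scope.
    boolean declareVariable(String name, String typeName) {
        return scopes.peek().putIfAbsent(name, typeName) == null;
    }

    // Looks only in the current (innermost) scope.
    String lookupLocalVariable(String name) {
        return scopes.peek().get(name);
    }

    // Looks in every scope, innermost first; returns null if not found.
    String lookupVariable(String name) {
        for (Map<String, String> scope : scopes) {
            String typeName = scope.get(name);
            if (typeName != null) {
                return typeName;
            }
        }
        return null;
    }
}
```

In this sketch, a local-only lookup is what lets you detect a duplicate declaration in the same scope, while the full lookup is what you'd use to resolve a variable that appears in an expression.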

With the symbol table already done, implementing your semantic analyzer will be a matter of writing the code for all of the analyze( ) methods that you left blank (or as return null;) in Part 1. For each analyze( ) method, you want to do two things:

It is important to point out that semantic errors should be reported by calling the reportSemanticError( ) method inherited from the Construct class. (See the comments in the provided code for more details.) You should not throw an exception when a semantic error is found, since your program needs to find as many semantic errors as it can. This is discussed in the next section.
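The difference between reporting an error and throwing an exception can be sketched as follows. SemanticErrorLog is a hypothetical helper written for this illustration; it is not the provided reportSemanticError( ) method, whose signature and behavior are documented in the starting point. The message layout simply imitates the sample output shown in the error reporting section of this write-up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: records each error and lets analysis carry on,
// instead of aborting at the first problem.
class SemanticErrorLog {
    private final List<String> messages = new ArrayList<String>();

    // Records one error at the given position; never throws.
    void report(int line, int column, String description) {
        messages.add("Semantic error @ " + line + ":" + column + " - " + description);
    }

    int count() {
        return messages.size();
    }

    // All recorded messages, one per line, followed by a summary line.
    String render() {
        StringBuilder out = new StringBuilder();
        for (String message : messages) {
            out.append(message).append('\n');
        }
        out.append(messages.size()).append(" semantic error(s) found");
        return out.toString();
    }
}
```

Since report( ) returns normally, the analyze( ) method that called it can go on to check the rest of its node and its children, which is what allows several errors to be found in a single run.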


Error reporting and error recovery requirements

As provided, your program will automatically abort with an error message if a lexical error is detected by the scanner. Similarly, the CUP script is already configured to abort the program with an error message if a syntax error is detected by the parser. Both of these behaviors are acceptable and should be left as-is. You are neither required nor permitted to add error recovery into your parser. (Keep things simple. You have plenty to do as it is. :) )

Once a program is determined to be syntactically legal -- and an abstract syntax tree has been built -- your semantic analyzer is required to find as many errors as it can in the input program. It may not simply abort when it finds the first error.

Here is an example Monkie2004 program, with a number of semantic errors in it. (As a challenge, you might first want to find all the errors yourself, before reading on to see the sample output.)

var globalInteger: integer;

function factorial(n: integer): integer
[
    if n == "0" then
    [
        Result <- "1";
    ]
    else
    [
        Resul <- n * factorial("n - 1");
    ]
]

procedure program()
[
    var x: integer;
    x <- 0;

    print_string("factorial 1..10");
    
    while x do
    [
        print_string("factorial(");
        print_integer(x);
        print_strin(") is ");

        var fac: hoorah;
        fac <- factorial(x);

        print_integer(fac);
        print_endline();
        x <- x + 1;
    ]
]

Given this sample input program, my semantic analyzer gives the following output:

Semantic error @ 5:8 - incompatible operands to '=='
Semantic error @ 7:9 - source and target types of assignment do not match
Semantic error @ 11:9 - 'Resul' is not declared as a variable
Semantic error @ 11:22 - argument 1 to function 'factorial' is wrong type
Semantic error @ 22:5 - while loop condition must be boolean expression
Semantic error @ 26:9 - procedure 'print_strin' has not been declared
Semantic error @ 28:13 - 'hoorah' is not a type
7 semantic error(s) found

Your semantic analyzer is not required to print exactly these error messages, though it should find (at least) all of these problems for this input program. Your error messages must be understandable and must state what the error is, but they are not required to be exactly the same as mine.

Obviously, the preceding example does not contain every possible semantic error that can arise in a Monkie2004 program. Your program is required to find all of the errors implied by the static semantic rules listed previously in the write-up.

Ideally, your analyzer will not print too many spurious error messages. For example, consider the following Monkie2004 code fragment:

    var i: integer; var j: integer; var k: integer;
    var b: boolean; var c: boolean; var d: boolean;
    var s: string; var t: string; var v: string;

    -- assume that those variables are initialized somehow

    -- this expression could appear in lots of places
    (j + k) implies (d and then (v & "alex"))

There are a couple of problems with the expression:

Your analyzer should report those two errors, but not additional errors, such as the right operand of the implies expression being incorrect. The easiest way to ensure that your analyzer won't report too many error messages is this:
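One common technique for keeping errors from cascading, sketched here under assumptions rather than taken from the provided code, is to have each operator's analysis always answer with the type that operator is supposed to produce, even when one of its operands was wrong. The classes Expr, Operand, and BinaryExpr are hypothetical, types are plain Strings, and the operator typing used in the usage note below (+ on integers, & on strings, and then and implies on booleans) is inferred from the example rather than quoted from the language rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mini-AST for illustration only.
abstract class Expr {
    // Returns this expression's type name, appending any errors to 'errors'.
    abstract String analyze(List<String> errors);
}

// A variable or literal whose type is already known.
class Operand extends Expr {
    private final String type;

    Operand(String type) {
        this.type = type;
    }

    String analyze(List<String> errors) {
        return type;
    }
}

// A binary operator that expects both operands to have 'operandType'
// and that produces a value of 'resultType'.
class BinaryExpr extends Expr {
    private final String op;
    private final String operandType;
    private final String resultType;
    private final Expr left;
    private final Expr right;

    BinaryExpr(String op, String operandType, String resultType, Expr left, Expr right) {
        this.op = op;
        this.operandType = operandType;
        this.resultType = resultType;
        this.left = left;
        this.right = right;
    }

    String analyze(List<String> errors) {
        check("left", left.analyze(errors), errors);
        check("right", right.analyze(errors), errors);
        // Answer with the operator's own result type even after an error,
        // so the enclosing expression does not complain a second time.
        return resultType;
    }

    private void check(String side, String actual, List<String> errors) {
        if (!operandType.equals(actual)) {
            errors.add(side + " operand of '" + op + "' must be " + operandType);
        }
    }
}
```

Building the expression (j + k) implies (d and then (v & "alex")) out of these classes and analyzing it yields exactly two messages: one for the left operand of implies and one for the right operand of and then. The and then node still answers boolean, so implies has no complaint about its right operand.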


Starting point

A great deal of code has been provided in order to get you started. It is available in a Zip archive.


Deliverables

Place your completed CUP script and all of the .java files that comprise your program (including any that we gave you) into a Zip archive, then submit that Zip archive. You need not include the .java files created by CUP (Parser.java and Tokens.java), but we won't penalize you if you do. However, you should be aware that we'll be regenerating these ourselves during the grading process, to be sure that they really did come from your CUP script. Please don't include other files, such as .class files, in your Zip archive.

Follow this link for a discussion of how to submit your assignment. Remember that we do not accept paper submissions of your assignments, nor do we accept them via email under any circumstances.

In order to keep the grading process relatively simple, we require that you keep your program designed in such a way that it can be compiled and executed with the following set of commands:

    cup monkie.cup
    javac *.java
    java Driver inputfile.m

A word of advice

I know I always say this, but I can't stress it enough on this assignment: START EARLY!!! While the amount of code will not be as much as it sounds (my complete solution is approximately 2000 lines of code, but I'm providing you with a good percentage of it), there is quite a bit of complexity to get your mind around in order to implement this program. Starting early and getting your questions answered as they come up will be paramount. If you wait until the last day or two, then wonder what an abstract syntax tree is, you're almost certainly going to bomb this assignment.

On the other hand, by starting early and understanding the concepts one at a time, you'll likely find that, once you figure out the key concepts, the code can be written relatively easily. The assignment will be challenging, but tractable, and very satisfying in the end!

Good luck!


Limitations

You may not make changes to the Monkie2004 grammar that has been given to you, except for writing the actions and adding names to the symbols when you need to refer to their associated values. For example, for this rule in the given grammar:

    VariableDeclaration ::=
        VAR TypeDeclaration SEMICOLON
            {:   :}
    ;

...you may change it to look like this:

    VariableDeclaration ::=
        VAR TypeDeclaration:typeDec SEMICOLON
            {:  // action code goes here  :}
    ;

...but you may not restructure or rewrite the rules themselves in any way.

Naturally, you're required to add object types to the nonterminal symbols in the given CUP script, such as changing this:

    nonterminal                Program;

...to this:

    nonterminal ArrayList      Program;

Other changes to the CUP script are not permitted.