ICS 142 Winter 2004, Assignment #3

Due date and time: Monday, February 23, 11:59pm

Introduction

The tasks of scanning and parsing are largely centered around what a program in a language is permitted to look like, often called the language's syntax. Of course, just because a program is syntactically correct doesn't make it a legal program. The semantics of a programming language are rules that describe the meaning of syntactically-legal programs. Additionally, semantic rules may render many syntactically-legal programs as erroneous, for a variety of reasons. For example, the following Java program is syntactically legal:

    public class A
    {
        public static void main(String[] args)
        {
            int i = 0;
            System.out.println(j);
        }
    }

A parser will accept this program as a legal program according to the rules of Java syntax. But, of course, we know that this program is not a legal Java program, since it attempts to print the value of an undeclared variable j. It can be shown that context-free grammars cannot be used to express rules such as one that requires a variable to be declared before use. Rather than complicate parsing by using a more powerful formalism than context-free grammars, it makes more sense to let parsers do what they do best -- check for syntactic correctness -- and defer the checking of semantic rules to a later stage of compilation, using more traditional programming techniques.

In this assignment, you will implement a semantic analyzer for the Monkie2004 language, which will check for violations of semantic rules, such as in the example above. To facilitate this analysis, you will first augment a provided parser with actions that build an intermediate data structure called an abstract syntax tree. Thanks to Java's object-oriented features, the code to perform these tasks will be remarkably concise, if designed well; a substantial portion of the design is provided to you in the starting point to help you stay on the right track.

CUP: an LALR parser generator for Java

CUP is an LALR parser generator. It takes as input a script that describes a parser, which is centrally built around a grammar for some language. Its output is two Java classes that, together, implement an LALR parser for the language. An LALR parser is a shift-reduce parser that uses similar, but significantly optimized (and, as it turns out, slightly less powerful) techniques to those we discussed in lecture for generating the LR(1) parsing tables it needs.

The general structure of a CUP script

A CUP script is broken informally into four parts: import and package directives, customization code, a listing of symbols and their associated types, and a grammar annotated with actions. An example follows:

// Import and package directives
import java_cup.runtime.*;


// Customization code
action code
{:
    public void output(Integer dohCount)
    {
        System.out.println("The DOH! count is " + dohCount.intValue());
    }
:}


// Symbol listing
terminal              DOH;
nonterminal           Goal;
nonterminal Integer   Homer;

start with Goal;   // specification of the unique start symbol


// The grammar, with annotated actions.

Goal ::= Homer:h
            {: output(h); :}
;

Homer ::= DOH Homer:h
            {: RESULT = new Integer(1 + h.intValue()); :}
        | DOH
            {: RESULT = new Integer(1); :}
;

When combined with a scanner that returns DOH tokens, this CUP script generates a parser that counts the number of DOH tokens in the input program and outputs that total to System.out.

How does a CUP-generated parser work?

CUP generates a shift-reduce parser, very similar to the ones we described in class. The parser will maintain a stack of parser states and symbols, then make a sequence of decisions to either shift or reduce, in each case based only on the current parser state and the next token of input. If all you wanted was an indication of whether the sentence was legal or not, that would be all you would need: if the input was accepted by the parser, you would see no output at all; if the input was not accepted by the parser, you would see a simple message such as "Parse error" and the parser would quit as soon as the error was detected.

However, in a complete compiler, a parser's job is not just to determine acceptance or rejection of the input. It also must do additional work to seed future stages of compilation with a boiled-down version of the input program. This means that additional calculations, such as building an abstract syntax tree or maintaining a collection of all declared variables and their types, must be done simultaneously. At first, this might seem to complicate matters tremendously: How can we nicely split up the work of parsing from these additional calculations? In a shift-reduce parser, there's a sensible time to do this kind of work: whenever a reduction occurs! At reduction time, two things are true:

The parser has just completely recognized, say, a VariableDeclaration, but hasn't yet recognized any of the code that follows it.
All of the relevant information about the VariableDeclaration, such as the names of the identifiers before and after the colon, is available on the parser stack.

Given these two facts, a reduction to VariableDeclaration could be accompanied by the creation of an object that stores information about the variable declaration (the identifiers for the variable's name and type). After this action is taken, parsing should continue normally.

To support these kinds of actions, every symbol on the parser stack is accompanied by a value represented by a Java object. Different types of symbols are accompanied by different types of objects; some are not accompanied by objects at all. The symbol listing in the CUP script specifies which symbols are accompanied by which types. In the example above, each Homer symbol is accompanied by an Integer object, while other symbols are not accompanied by any object.

Here's a step-by-step example of the above CUP parser working on the input string doh doh doh. For the sake of clarity, I've left the parser states out of the example. Each occurrence of Homer in the parser stack is accompanied by an Integer object, which is shown in the example with a / (i.e. if a Homer is accompanied by the Integer 3, it is shown on the stack as Homer/3). The bottom of the stack is represented by [[.

    Parser Action                                      Stack                 Remaining Input
                                                       [[                    doh doh doh
    shift                                              [[ doh                doh doh
    shift                                              [[ doh doh            doh
    shift                                              [[ doh doh doh
    reduce Homer → doh                                 [[ doh doh Homer/1
    reduce Homer → doh Homer                           [[ doh Homer/2
    reduce Homer → doh Homer                           [[ Homer/3
    reduce Goal → Homer (3 is printed to System.out)   [[ Goal

The effect of storing objects with every symbol on the parser stack and executing actions on each reduction is that each rule in the grammar acts, in some ways, like a function in a programming language. When a rule is reduced, information accompanying the symbols on the right-hand side is synthesized and the result is stored with the symbol on the left-hand side. When an input program is completely parsed, all of this information will have been synthesized into one object associated with the start symbol. This turns out to be a very convenient way to build intermediate representations of a program, such as abstract syntax trees, which I'll discuss a bit later in this write-up.
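To make the "rules act like functions" idea concrete, here is a small Java simulation of the trace above. It is only an illustration and is not the code that CUP generates: the HomerTrace class, its Entry helper, and the parse method are names I made up for this sketch, and the parser states are left out, just as they are in the table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; a CUP-generated parser is table-driven
// and looks nothing like this.
class HomerTrace
{
    // One stack entry: a grammar symbol plus the object that accompanies it
    // (null when the symbol carries no object, as with DOH).
    static final class Entry
    {
        final String symbol;
        final Integer value;

        Entry(String symbol, Integer value)
        {
            this.symbol = symbol;
            this.value = value;
        }
    }

    // Returns the Integer that Goal's action would hand to output( ) for an
    // input made up of dohCount DOH tokens.
    static int parse(int dohCount)
    {
        if (dohCount < 1)
        {
            throw new IllegalArgumentException("the grammar requires at least one DOH");
        }

        List<Entry> stack = new ArrayList<>();

        // Shift every DOH token onto the stack.
        for (int i = 0; i < dohCount; i++)
        {
            stack.add(new Entry("DOH", null));
        }

        // Reduce Homer ::= DOH, whose action is RESULT = 1.
        stack.remove(stack.size() - 1);
        stack.add(new Entry("Homer", 1));

        // Reduce Homer ::= DOH Homer:h, whose action is RESULT = 1 + h,
        // until only a single Homer remains on the stack.
        while (stack.size() > 1)
        {
            Entry h = stack.remove(stack.size() - 1);   // Homer:h
            stack.remove(stack.size() - 1);             // DOH
            stack.add(new Entry("Homer", 1 + h.value));
        }

        // Reduce Goal ::= Homer:h; the real action would call output(h).
        return stack.get(0).value;
    }
}
```

Calling HomerTrace.parse(3) walks through the same shifts and reductions as the table, with each reduction computing its RESULT from the values on the right-hand side.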

Accessing the objects stored in the symbols on the stack requires giving names to them. For example, consider this fragment from the above example:

Homer ::= DOH Homer:h
            {: RESULT = new Integer(1 + h.intValue()); :}

When the rule Homer → doh Homer is used in a reduction, we want our action to add 1 to the value associated with the Homer on the right-hand side, then store the result into the Homer on the left-hand side. The object associated with the symbol on the left-hand side of a rule is always accessible via a variable called RESULT. Objects associated with symbols on the right-hand side of the rule can only be accessed if we give them names. In this example, we've given the Homer on the right-hand side the name h, by writing this: Homer:h. In the action for this rule, we can then refer to the object associated with the Homer on the right-hand side by simply using the name h.

But, wait, I want to see a complete example in action!

No problem! Below is a link to a complete version of the Homer example, including a JFlex scanner, a CUP parser, a driver program, and a sample input file.

HomerExample.zip

To run the example, unzip the archive into a folder, start up a Command Prompt, change into the folder where you unzipped the archive, then run the following commands:

jflex homer.flex
cup homer.cup
javac *.java
java Driver inputfile

Using CUP in the ICS labs

CUP is already installed on the Windows workstations in the ICS labs. CUP is primarily driven via the command line, so to use it, start by bringing up a Command Prompt window. On the lab machines, CUP is not configured to work by default. Each time you start a Command Prompt, you'll need to load its settings by executing these two commands:

cd \opt\ics142
start142

(Note: On a few of the machines in the lab, the start142.bat file might be called start.bat instead. This is an installation bug that hasn't been eradicated fully, since all the machines aren't always on when it's time to update them. In that case, execute this command instead:

.\start

The dot and the backslash are necessary because start is actually a Windows shell command.)

Installing CUP at home

The prepared installation that I provided as part of Assignment #1 contained both JFlex and CUP. Assuming you installed that when you were working on Assignment #1, you should already be ready to use CUP. If not, check out the Assignment #1 write-up and follow the installation instructions therein.

Compiling a CUP script

You can compile a CUP script (e.g. blah.cup) like this:

java java_cup.Main blah.cup

This is cumbersome to type and also gives lousy names (that start with lowercase letters!) to the classes it generates. I prefer (and, in the case of this assignment, require) calling the parser class Parser and the symbol class Tokens, which requires a much longer command:

java java_cup.Main -parser Parser -symbols Tokens blah.cup

Since that's way too cumbersome to type, I've provided a batch file called cup.bat in my starting point (and in the Homer example), which will allow you to use this command to compile a CUP script instead:

cup blah.cup

This will create two Java source files (e.g. Parser.java and Tokens.java). Don't forget to compile these .java files before executing your program!

(Unix/Linux users: You can write a shell script or use alias to accomplish the same goal as my Windows batch file.)

Changes to the Monkie2004 language for this assignment

The Monkie2004 language is much the same as it was in the last assignment, with a few operators added to the rules for expressions. (These are the operators I took out in order to keep the grammar simpler for your hand-coded parser. Now that we're using CUP to generate our parser for us, adding these operators does not overburden you with complexity, but makes Monkie2004 much more expressive.) No changes have been made to the existing operators, or any other part of the language.

The newly-added operators are:

The logical (boolean) operators and, or, not, and xor have been added. These operators do pretty much what you'd expect. It should be pointed out that and, or, and xor are not intended to be short-circuited (meaning that both operands will always be evaluated).
Short-circuited versions of and and or, called and then and or else, have been added, as well. (Short-circuiting does not affect your work on this assignment at all, but it will likely affect future assignments.)
One more logical operator, implies, has been added. It takes two boolean operands, evaluating to true if either:
- ...its left operand is false.
- ...its left operand is true and its right operand is true.
Like and then and or else, implies is short-circuited.
The relational operators <= and >= have been added back into the language, with the behavior you would expect.

The precedence and associativity of these new operators can be inferred from the (unambiguous) grammar in the CUP script provided in the starting point.
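If the definition of implies given above seems odd, it may help to see it written in Java. This is just a sketch of the intended run-time meaning (which you do not need to implement in this assignment); the ImpliesSketch class is my own invention, and the right operand is passed as a BooleanSupplier only so that the short-circuiting is visible.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the meaning of Monkie2004's implies operator.
class ImpliesSketch
{
    // True if the left operand is false, or if both operands are true.
    // The right operand is evaluated only when the left operand is true,
    // which is what makes the operator short-circuited.
    static boolean implies(boolean left, BooleanSupplier right)
    {
        return !left || right.getAsBoolean();
    }
}
```

For example, ImpliesSketch.implies(false, () -> expensiveCheck()) is true without ever calling expensiveCheck (a hypothetical method), while ImpliesSketch.implies(true, () -> false) is false.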

Static semantic rules of the Monkie2004 language

In addition to being lexically and syntactically correct, a Monkie2004 program must follow all of these static semantic rules in order to be considered valid. A program that violates any of these rules is considered invalid.

All variables, procedures, and functions must be in scope wherever they are used. The following rules can be used to determine whether a variable, procedure, or function is in scope:
- The scope of a local variable begins with its declaration and continues to the end of the block statement it is declared in.
- The scope of a global variable begins with its declaration and continues to the end of the program.
- The scope of the formal parameters to a procedure or function is the entire block statement that makes up the body of the procedure or function.
- The scope of a procedure or function begins with its declaration and continues to the end of the program. Notably, this allows recursion, since a procedure is considered in scope at the point just before its body.
- Once declared, the identifier for a procedure or function may not be redeclared as a variable, procedure, or function. Any such redeclaration is considered to be an error.
- Once declared in some scope (including the global scope), a variable may not be redeclared in that same scope as a variable, procedure, or function. Any such redeclaration is considered to be an error.
- If a variable is declared at some scoping level (i.e. globally or in some block statement), it may be redeclared at a deeper scoping level (i.e. a block statement within the scope in which it was originally declared). Oddly, this rule applies to formal parameters, which may be redeclared within the body of a procedure or function, since the formal parameters are declared outside of the block statement that makes up the body of the procedure or function.
- When a variable is used, its declaration is found using static scoping rules, meaning that the local scope is searched first, its enclosing scope is searched next, and so on. If none of the enclosing scopes (including the global scope) contains a declaration for the variable, the use of the variable is considered to be an error.
In every function, there is a variable called Result that is implicitly declared, whose type represents the return value of the function. In order to return a value from a function, assign a value to Result, then let the function end normally (i.e. fall through the bottom of it).
- Result is in scope from the beginning to the end of the function. However, like formal parameters, it is legal to redeclare it within the function, which hides the declaration of Result (and makes it impossible to return a value).
There are three data types in Monkie2004: integer, string, and boolean. These identifiers are reserved, meaning that no variable, procedure, or function may be declared with these names. Any such declaration is considered to be an error.
There are seven pre-defined procedures and functions in Monkie2004, with the following signatures:
- function read_string(): string
- procedure print_string(s: string)
- function read_integer(): integer
- procedure print_integer(i: integer)
- function read_boolean(): boolean
- procedure print_boolean(b: boolean)
- procedure print_endline()
Not surprisingly, these pre-defined procedures and functions are used for console input and output. Their exact run-time semantics are not important for this assignment, though they're intended to do pretty much what you'd expect.
Every expression (or subexpression) in Monkie2004 is considered to have a type (integer, string, or boolean).
In an assignment statement...
- ...the variable on the left-hand side must be declared.
- ...the type of the expression on the right-hand side must be the same as the type of the variable on the left-hand side.
In an if statement, the type of the conditional expression must be boolean.
In a while loop, the type of the conditional expression must be boolean.
In a call statement, such as this:
```
    foo(i + 1);
```
...the identifier (in this case, foo) must be declared as a procedure. If the identifier has been declared as a function or variable, or if the identifier has not been declared at all, the call statement is considered to be an error.
In a call made in an expression, e.g.:
```
    x <- i + bar(j);
```
...the identifier (in this case, bar) must be declared as a function. If the identifier has been declared as a procedure or variable, or if the identifier has not been declared at all, the call is considered to be an error.
In a call to a procedure or function...
- ...the number of actual parameters must match the number of formal parameters.
- ...the type of each actual parameter must match the type of each formal parameter. Parameters are matched in the order listed, such that the type of the first actual parameter must match the type of the first formal parameter, the second must match the type of the second, and so on.
The type of a function call expression is the same as the return type of the function that has been called.
The type of an integer literal in an expression is integer.
The type of a string literal in an expression is string.
The type of the constants true and false in an expression is boolean.
The type of an identifier in an expression is the type of the variable it refers to. If the variable has not been declared, the use of the identifier in the expression is considered to be an error.
The operand to the ~ operator must be an integer. The type of a ~ expression is integer.
The operand to the not operator must be boolean. The type of a not expression is boolean.
The operands to the +, -, *, and / operators must both be integers. The types of these expressions are integer.
The operands to the & operator must both be strings. The type of a & expression is string.
The operands to the == and /= operators must have the same type, though that type can be integer, string, or boolean. The types of these expressions are boolean.
The operands to the <, <=, >, and >= operators must have the same type, and that type must be either integer or string. The types of these expressions are boolean.
The operands to the and, and then, or, or else, xor, and implies operators must both be boolean. The types of these expressions are boolean.
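The rules for the binary operators are regular enough that they can be summarized in a single lookup. The following is a minimal sketch, assuming the three types are modeled as an enum; SketchType and OperatorRules are hypothetical names, not classes from the starting point (which supplies its own Type class), and your design may well spread these checks across your Expression subclasses instead.

```java
import java.util.Optional;

// Hypothetical stand-in for the three Monkie2004 data types.
enum SketchType { INTEGER, STRING, BOOLEAN }

class OperatorRules
{
    // Returns the type of "left op right" when the operand types are legal
    // for op, or an empty Optional when they violate the rules above.
    static Optional<SketchType> binaryResult(String op, SketchType left, SketchType right)
    {
        switch (op)
        {
            case "+": case "-": case "*": case "/":
                return require(left == SketchType.INTEGER && right == SketchType.INTEGER,
                               SketchType.INTEGER);

            case "&":
                return require(left == SketchType.STRING && right == SketchType.STRING,
                               SketchType.STRING);

            case "==": case "/=":
                // Any type is fine, as long as both operands share it.
                return require(left == right, SketchType.BOOLEAN);

            case "<": case "<=": case ">": case ">=":
                // Same type on both sides, but booleans cannot be ordered.
                return require(left == right && left != SketchType.BOOLEAN,
                               SketchType.BOOLEAN);

            case "and": case "and then": case "or": case "or else":
            case "xor": case "implies":
                return require(left == SketchType.BOOLEAN && right == SketchType.BOOLEAN,
                               SketchType.BOOLEAN);

            default:
                throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    private static Optional<SketchType> require(boolean legal, SketchType result)
    {
        return legal ? Optional.of(result) : Optional.empty();
    }
}
```

Note that the result type of each operator is fixed by the rules, regardless of whether the operands were legal; only the legality check depends on the operand types.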

Wow, I've enjoyed reading all of this, but what am I supposed to do??

Your program will perform a complete static analysis of an input program. I've provided a working scanner and a working parser (with no actions in it), so you can assume that lexical and syntactic analysis is already working. Your task is to augment the parser to build an intermediate form called an abstract syntax tree, which will be passed to a semantic analyzer that you will write. When you're finished with the assignment, you'll have a program that's capable of finding any compile-time errors in a Monkie2004 program. I don't know about you, but I find that exceedingly cool!

Part 1: Augmenting the parser to build an abstract syntax tree (AST) (30 points)

What's an AST?

An abstract syntax tree (AST) is not a tree in the same sense as a binary search tree, in which all of the nodes are of the same type. It is, however, a tree-like structure that contains all of the relevant information in an input program. For example, consider this brief Monkie2004 program:

    var global: integer;

    procedure program()
    [
        global <- read_integer();
        print_string("You just typed in the number ");
        print_integer(global);
        print_endline();
    ]

An abstract syntax tree might be constructed as a list of definitions:

The first definition would be a variable declaration, declaring the variable global to be of type integer.
The second definition would be a procedure, declaring a procedure called program that contains a(n empty) list of parameters, and a block statement.
- The block statement would be a list of four statements: an assignment statement, and three procedure calls, with the relevant information about each stored in the "nodes."

Before object-oriented languages like Java existed, building an abstract syntax tree was a difficult proposition, since different "nodes" of the tree represent different constructs in the input program (e.g. assignment statements, function calls, etc.) and, thus, needed to be stored in different kinds of structures. But using Java, building an abstract syntax tree is relatively straightforward, thanks to inheritance and polymorphism.

The entire abstract syntax tree can be stored in a list structure, such as ArrayList. Each object in the ArrayList would be some kind of Definition object, of which there are two subclasses, VariableDeclaration and SubprogramDeclaration. (SubprogramDeclaration objects would contain information about either a procedure or a function; of course, something in the object would have to tell you whether it represented a procedure or a function. One way to do that would be to store a String variable containing the identifier for the return type for functions, and containing null or the empty string for procedures.)
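As a rough sketch of the idea in the previous paragraph, the definitions might look something like this. Everything here is a simplified stand-in that I've wrapped in a hypothetical DefinitionSketch class: the Definition class provided in the starting point has more in it, and the fields and the isFunction( ) helper are assumptions for illustration, not a required design.

```java
// Hypothetical sketch; the real Definition class is provided in the
// starting point and is not reproduced here.
class DefinitionSketch
{
    static abstract class Definition
    {
    }

    static class VariableDeclaration extends Definition
    {
        final String name;
        final String typeName;

        VariableDeclaration(String name, String typeName)
        {
            this.name = name;
            this.typeName = typeName;
        }
    }

    static class SubprogramDeclaration extends Definition
    {
        final String name;
        final String returnTypeName;   // null for a procedure

        SubprogramDeclaration(String name, String returnTypeName)
        {
            this.name = name;
            this.returnTypeName = returnTypeName;
        }

        // A subprogram with a return type is a function; one without is
        // a procedure.
        boolean isFunction()
        {
            return returnTypeName != null;
        }
    }
}
```

With classes like these, the sample program above would become a list holding a VariableDeclaration for global followed by a SubprogramDeclaration for program whose return type is null.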

Continuing this line of thinking, you'd probably also want to have two other superclasses: Statement and Expression. Statement would have subclasses such as AssignmentStatement, BlockStatement, and IfStatement. Expression would have subclasses such as OrExpression and ConcatenationExpression. As examples, an AssignmentStatement object might contain a String for the identifier on the left and an Expression object for the expression on the right. A ConcatenationExpression might contain two other Expression objects, one each for the left and right operands.

Since it might be useful in error reporting for all of these nodes to keep track of the beginning of their location in the input program, you might even want to have Definition, Statement, and Expression extend from a superclass like Construct, where Constructs have two integers, line and column, stored in them. This is what I did in my solution, and I've provided classes like Construct, Definition, Statement, and Expression in the starting point.
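Here is one way the statement and expression side of the hierarchy might be laid out. Treat it as a sketch under assumptions: the nested Construct, Statement, and Expression classes are simplified stand-ins for the ones in the starting point (which you should use as given), StringLiteral is a hypothetical node type, and the field names are my own choices.

```java
// Hypothetical sketch; wrapped in AstSketch so these stand-ins are not
// mistaken for the provided Construct, Statement, and Expression classes.
class AstSketch
{
    // Every node remembers where its construct begins in the input.
    static abstract class Construct
    {
        private final int line;
        private final int column;

        Construct(int line, int column)
        {
            this.line = line;
            this.column = column;
        }

        int getLine()   { return line; }
        int getColumn() { return column; }
    }

    static abstract class Statement extends Construct
    {
        Statement(int line, int column) { super(line, column); }
    }

    static abstract class Expression extends Construct
    {
        Expression(int line, int column) { super(line, column); }
    }

    static class StringLiteral extends Expression
    {
        final String value;

        StringLiteral(String value, int line, int column)
        {
            super(line, column);
            this.value = value;
        }
    }

    // A ConcatenationExpression holds one Expression per operand.
    static class ConcatenationExpression extends Expression
    {
        final Expression left;
        final Expression right;

        ConcatenationExpression(Expression left, Expression right, int line, int column)
        {
            super(line, column);
            this.left = left;
            this.right = right;
        }
    }

    // An AssignmentStatement holds the identifier on the left and the
    // Expression on the right.
    static class AssignmentStatement extends Statement
    {
        final String target;
        final Expression source;

        AssignmentStatement(String target, Expression source, int line, int column)
        {
            super(line, column);
            this.target = target;
            this.source = source;
        }
    }
}
```

Because each operand is declared as an Expression rather than as a specific subclass, nodes nest to any depth without any node needing to know what kind of expression it contains.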

Building an AST in your CUP parser

Because an AST takes on a structure that's a simplified version of the structure of the parse tree, building an AST in your CUP parser is relatively straightforward. Each nonterminal symbol in the grammar should carry with it an AST node of some type; some will carry specific AST node types, such as IfStatement or WhileLoop, while others will carry less specific node types, such as Expression, Definition, or ArrayList. Each parser action does one of two things:

...uses the information on the right-hand side of its rule to construct a new AST node of the appropriate type (or, in some cases, add to a list structure such as an ArrayList).
...passes the AST node from the right-hand side of its rule directly to the left-hand side with nothing done to it.

Here are some CUP code fragments that, together, create an AST node for a global variable declaration. Your solution may differ greatly from this, depending on how you design your AST node classes, but this should give you the general idea of the approach you should use.

    // In this case, there's nothing to be added to the AST node, since
    // VariableDeclaration's action (or something that led to it) presumably
    // created an AST node, say a VariableDeclaration object.

    Definition ::=
        VariableDeclaration:vd
            {: RESULT = vd; :}

    // ...

    // There isn't enough information to create an AST node for the
    // variable declaration here, since we would need to store the identifiers
    // for the variable's name and type.  So we'll have the TypeDeclaration
    // actions create the object, and just use this action to pass it up the
    // parse tree.

    VariableDeclaration ::=
        VAR TypeDeclaration:td SEMICOLON
            {: RESULT = td; :}

    // ...
    
    // Here, we create and pass up a VariableDeclaration object.  In my
    // solution, a VariableDeclaration object contained four fields:
    // the name of the variable, the name of the variable's type, and
    // the line and column of the beginning of the construct (which can
    // be discovered by asking the first identifier for its line and
    // column, which is supplied by the scanner).

    TypeDeclaration ::=
        IDENTIFIER:name COLON IDENTIFIER:typeName
            {: RESULT = new VariableDeclaration(
                   name.getValue(), typeName.getValue(),
                   name.getLine(), name.getColumn()); :}

Your task

Your task in Part 1 is twofold:

Design and write the classes for your AST. For each construct in the language, decide what information you'll need to store about it. Reuse other classes as much as you can. For example, the BlockStatement class would extend from Statement, and would likely contain a collection (say, an ArrayList) of other Statement objects. Use the Construct, Definition, Statement, and Expression classes that I gave you as a start, and do not modify them. The analyze( ) method that is inherited from Definition and Statement can simply be left blank; the analyze( ) method that is inherited from Expression (which returns a Type object) should just return null at this point. (You'll write these methods in Part 2.)
Add actions to the provided CUP grammar that build the AST. Place the node representing the entire AST into the Program nonterminal. I've provided a driver program that assumes that the Program nonterminal will be given an ArrayList of Definition objects; you should obey this convention.

At the conclusion of your work on Part 1, your program will not output anything, but it will be ready for you to add your code for Part 2.

A couple of things you should know. The terminal symbols have already been given object types in the provided CUP script. These must not be changed, since the scanner is the one who creates and provides those objects. There are three kinds of objects that are associated with terminal symbols:

TokenInfo. An object that contains two fields, a line and a column, with corresponding accessor methods: getLine( ) and getColumn( ).
IntTokenInfo (extends TokenInfo). An object that, in addition to the line and column from TokenInfo, provides an additional integer value. This type is used for INTEGER_LITERAL tokens, so that the scanner can pass along the token's integer value. You can access the integer value of an IntTokenInfo object by calling getValue( ) on it.
StringTokenInfo (extends TokenInfo). An object that, in addition to the line and column from TokenInfo, provides an additional String value. This type is used for STRING_LITERAL and IDENTIFIER tokens, so that the scanner can pass along the String value associated with these tokens. You can access the String value of a StringTokenInfo object by calling getValue( ) on it.

On the other hand, you'll need to specify the object types for the nonterminals, according to your AST design.

Part 2: Performing semantic checking on the AST (70 points)

Once you've built your AST, it's time to write your semantic analyzer. The semantic analyzer's job is to traverse the AST and call an analyze( ) method on each of the "nodes" therein. As we discussed in Part 1, there are three main types of nodes in the AST: Definition objects, Statement objects, and Expression objects.

In order to do this analysis, you need an additional supporting data structure called a symbol table. The symbol table keeps track of the declarations of all of the identifiers at a given point in the analysis of the input program, i.e. all variables, types, and subprograms. There are two main operations that you can perform on the symbol table: declare and lookup, where declare adds a new declaration to the symbol table, and lookup finds an existing one.

I've provided a complete, efficient implementation of a SymbolTable class in the starting point. To keep things very clear, since there are different kinds of declarations in a Monkie2004 program, I've separated the declare operation into three methods: declareVariable( ), declareSubprogram( ), and declareType( ) (the latter of which I doubt you'll use, since Monkie2004 does not allow type declarations). Similarly, there are three kinds of lookups: lookupSubprogram( ), lookupType( ), and lookupVariable( ), along with a fourth, lookupLocalVariable( ), that's useful at times. In addition, there are two methods, enterScope( ) and exitScope( ), that you'll need to call whenever you enter or exit a scope, respectively. Take a look at the comments in the SymbolTable.java file that I provided if you want to know more about this class. The symbol table is passed to all of the analyze( ) methods as a parameter, so that it will always be available during analysis.
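If you're curious how a scoped symbol table can work, here is a bare-bones sketch that tracks variables only. It is not the provided SymbolTable class: the ScopeSketch name, the method signatures, and the use of Strings for types are all assumptions for illustration, so read the comments in SymbolTable.java for the real interface.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a scoped symbol table (variables only).
class ScopeSketch
{
    // The innermost scope sits at the head of the deque; the global scope
    // is created up front and is never removed.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    ScopeSketch()
    {
        scopes.push(new HashMap<>());
    }

    void enterScope()
    {
        scopes.push(new HashMap<>());
    }

    void exitScope()
    {
        if (scopes.size() == 1)
        {
            throw new IllegalStateException("cannot exit the global scope");
        }

        scopes.pop();
    }

    // Declares a variable in the innermost scope. Returns false, leaving
    // the table unchanged, if that scope already declares the name.
    boolean declareVariable(String name, String typeName)
    {
        return scopes.peek().putIfAbsent(name, typeName) == null;
    }

    // Looks in the innermost scope only; returns null if not found.
    String lookupLocalVariable(String name)
    {
        return scopes.peek().get(name);
    }

    // Searches from the innermost scope outward, as static scoping
    // requires; returns null if no enclosing scope declares the name.
    String lookupVariable(String name)
    {
        for (Map<String, String> scope : scopes)
        {
            String typeName = scope.get(name);

            if (typeName != null)
            {
                return typeName;
            }
        }

        return null;
    }
}
```

The difference between the two lookups is what lets you tell a legal redeclaration at a deeper scoping level apart from an illegal redeclaration in the same scope.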

With the symbol table already done, implementing your semantic analyzer will be a matter of writing the code for all of the analyze( ) methods that you left blank (or as return null;) in Part 1. For each analyze( ) method, you want to do two things:

Perform all semantic checks to make sure that the node represents a legal operation in Monkie2004. Naturally, how you do this checking depends on the kind of operation. For a local variable declaration, you'd check that the variable had not already been declared in this scope. For a call statement, you'd check that the right number and types of actual parameters were passed, and that the call is to a procedure and not a function.
For some nodes, such as global and local variable declarations, or procedure and function definitions, you'll want to add a declaration to the symbol table, once the declaration has been deemed legal.
In the case of an expression, you need to do one more thing: analyze( ) should return the type of the expression. This is necessary because performing analysis on nested expressions, such as (a + b) * (c + d), is done from the bottom-up; the types of the inner expressions are used as part of the analysis of the outer expressions.
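To illustrate the bottom-up flow of types, here is a deliberately tiny sketch that handles only variables and arithmetic operators. The names (BottomUpSketch, Expr, Variable, Arithmetic) are hypothetical, a plain Map stands in for the symbol table, and a List of Strings stands in for error reporting. Returning null for an undeclared variable, and staying quiet about an operand whose type is null, are assumptions of mine rather than anything the provided code requires.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of bottom-up type analysis on expressions.
class BottomUpSketch
{
    enum Type { INTEGER, STRING, BOOLEAN }

    interface Expr
    {
        // Returns the expression's type, or null if it is unknown.
        Type analyze(Map<String, Type> variables, List<String> errors);
    }

    static class Variable implements Expr
    {
        final String name;

        Variable(String name)
        {
            this.name = name;
        }

        public Type analyze(Map<String, Type> variables, List<String> errors)
        {
            Type type = variables.get(name);

            if (type == null)
            {
                errors.add("'" + name + "' is not declared as a variable");
            }

            return type;
        }
    }

    static class Arithmetic implements Expr
    {
        final char op;
        final Expr left;
        final Expr right;

        Arithmetic(char op, Expr left, Expr right)
        {
            this.op = op;
            this.left = left;
            this.right = right;
        }

        public Type analyze(Map<String, Type> variables, List<String> errors)
        {
            // Analyze the operands first; their types feed this node's check.
            Type l = left.analyze(variables, errors);
            Type r = right.analyze(variables, errors);

            // A null type means an error was already reported further down.
            if ((l != null && l != Type.INTEGER) || (r != null && r != Type.INTEGER))
            {
                errors.add("operands to '" + op + "' must both be integers");
            }

            // The type of an arithmetic expression is always integer.
            return Type.INTEGER;
        }
    }
}
```

For (a + b) * (c + d), the outer Arithmetic node calls analyze( ) on each of its two inner nodes, which in turn ask the four Variable nodes for their types; the outer node makes its own check only once both inner results have come back.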

It is important to point out that semantic errors should be reported by calling the reportSemanticError( ) method inherited from the Construct class. (See the comments in the provided code for more details.) You should not throw an exception when a semantic error is found, since your program needs to find as many semantic errors as it can. This is discussed in the next section.

Error reporting and error recovery requirements

As provided, your program will automatically abort with an error message if a lexical error is detected by the scanner. Similarly, the CUP script is already configured to abort the program with an error message if a syntax error is detected by the parser. Both of these behaviors are acceptable and should be left as-is. You are neither required nor permitted to add error recovery into your parser. (Keep things simple. You have plenty to do as it is. :) )

Once a program is determined to be syntactically legal -- and an abstract syntax tree has been built -- your semantic analyzer is required to find as many errors as it can in the input program. It may not simply abort when it finds the first error.

Here is an example Monkie2004 program, with a number of semantic errors in it. (As a challenge, you might first want to find all the errors yourself, before reading on to see the sample output.)

var globalInteger: integer;

function factorial(n: integer): integer
[
    if n == "0" then
    [
        Result <- "1";
    ]
    else
    [
        Resul <- n * factorial("n - 1");
    ]
]

procedure program()
[
    var x: integer;
    x <- 0;

    print_string("factorial 1..10");
    
    while x do
    [
        print_string("factorial(");
        print_integer(x);
        print_strin(") is ");

        var fac: hoorah;
        fac <- factorial(x);

        print_integer(fac);
        print_endline();
        x <- x + 1;
    ]
]

Given this sample input program, my semantic analyzer gives the following output:

Semantic error @ 5:8 - incompatible operands to '=='
Semantic error @ 7:9 - source and target types of assignment do not match
Semantic error @ 11:9 - 'Resul' is not declared as a variable
Semantic error @ 11:22 - argument 1 to function 'factorial' is wrong type
Semantic error @ 22:5 - while loop condition must be boolean expression
Semantic error @ 26:9 - procedure 'print_strin' has not been declared
Semantic error @ 28:13 - 'hoorah' is not a type
7 semantic error(s) found

Your semantic analyzer is not required to print exactly these error messages, though it should find (at least) all of these problems for this input program. Each of your error messages must be understandable and must state what the error is.

Obviously, the preceding example does not contain every possible semantic error that can arise in a Monkie2004 program. Your program is required to find all of the errors implied by the static semantic rules listed previously in the write-up.

Ideally, your analyzer will not print too many spurious error messages. For example, consider the following Monkie2004 code fragment:

    var i: integer; var j: integer; var k: integer;
    var b: boolean; var c: boolean; var d: boolean;
    var s: string; var t: string; var v: string;

    -- assume that those variables are initialized somehow

    -- this expression could appear in lots of places
    (j + k) implies (d and then (v & "alex"))

There are a couple of problems with the expression:

The left operand to the implies expression is (j + k), which evaluates to an integer.
The right operand to the and then expression is (v & "alex"), which evaluates to a string.

Your analyzer should report those two errors, but not additional errors, such as the right operand of the implies expression being incorrect. The easiest way to ensure that your analyzer won't report too many error messages is this:

When you have an expression involving an operator, type-check the operands, reporting any errors you find. Then assume that the overall expression's type is whatever it's supposed to be (e.g. the result of 3 & 8, though erroneous, will be considered to be of type string).
When you have an expression whose type you can't determine, such as an undeclared variable, consider its type to be no type, most easily implemented in Java as a null Type reference. In all of your type-checking code, never report errors caused by an expression's Type being no type. (Of course, some care is required if you choose this approach, since you might easily cause NullPointerExceptions by forgetting to check for null. You might instead use some other approach, such as a special "no type" object.)
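
The two rules above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: Ty, NoTypeChecker, and their methods are hypothetical names rather than part of the provided code, a null Ty plays the role of "no type", and only the four operators from the example are handled, with implies and and then assumed to produce a boolean.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; the provided Type class is not shown here.
enum Ty { INTEGER, BOOLEAN, STRING }

class NoTypeChecker {
    final Map<String, Ty> variables = new HashMap<String, Ty>();
    final List<String> errors = new ArrayList<String>();

    // Type of a variable reference; null stands for "no type".
    Ty variable(String name) {
        Ty type = variables.get(name);
        if (type == null) {
            errors.add("'" + name + "' is not declared as a variable");
        }
        return type;
    }

    // Type of a binary expression whose operand types are already known.
    Ty binary(String op, Ty left, Ty right) {
        Ty expected;
        if (op.equals("+")) {
            expected = Ty.INTEGER;
        } else if (op.equals("&")) {
            expected = Ty.STRING;
        } else {
            expected = Ty.BOOLEAN; // "implies" and "and then"
        }
        checkOperand(op, "left", left, expected);
        checkOperand(op, "right", right, expected);
        // For these four operators the result type equals the required operand
        // type, so that is what the expression is supposed to be, errors or not.
        return expected;
    }

    private void checkOperand(String op, String side, Ty actual, Ty expected) {
        // A null type means the problem was already reported where it arose.
        if (actual != null && actual != expected) {
            errors.add(side + " operand to '" + op + "' must be " + expected + ", not " + actual);
        }
    }
}

class NoTypeDemo {
    public static void main(String[] args) {
        NoTypeChecker c = new NoTypeChecker();
        c.variables.put("j", Ty.INTEGER);
        c.variables.put("k", Ty.INTEGER);
        c.variables.put("d", Ty.BOOLEAN);
        c.variables.put("v", Ty.STRING);

        // (j + k) implies (d and then (v & "alex")); the literal is passed as its type.
        Ty type = c.binary("implies",
            c.binary("+", c.variable("j"), c.variable("k")),
            c.binary("and then", c.variable("d"),
                c.binary("&", c.variable("v"), Ty.STRING)));

        for (String error : c.errors) {
            System.out.println(error);
        }
        System.out.println("overall type: " + type);
    }
}
```

Because the and then expression is assumed to be boolean even though its right operand is wrong, the implies expression has nothing to complain about on its right side, and only the two expected errors are recorded. An undeclared variable is handled the same way: its lookup reports one error and returns null, and checkOperand then stays quiet about it.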

Starting point

A great deal of code has been provided in order to get you started. It is available in a Zip archive.

StartingPoint.zip

Deliverables

Place your completed CUP script and all of the .java files that comprise your program (including any that we gave you) into a Zip archive, then submit that Zip archive. You need not include the .java files created by CUP (Parser.java and Tokens.java), but we won't penalize you if you do. However, you should be aware that we'll be regenerating these ourselves during the grading process, to be sure that they really did come from your CUP script. Please don't include other files, such as .class files, in your Zip archive.

Follow this link for a discussion of how to submit your assignment. Remember that we do not accept paper submissions of your assignments, nor do we accept them via email under any circumstances.

In order to keep the grading process relatively simple, we require that you keep your program designed in such a way that it can be compiled and executed with the following set of commands:

    cup monkie.cup
    javac *.java
    java Driver inputfile.m

A word of advice

I know I always say this, but I can't stress it enough on this assignment: START EARLY!!! While the amount of code will not be as much as it sounds (my complete solution is approximately 2000 lines of code, but I'm providing you with a good percentage of it), there is quite a bit of complexity to get your mind around in order to implement this program. Starting early and getting your questions answered as they come up will be paramount. If you wait until the last day or two, then wonder what an abstract syntax tree is, you're almost certainly going to bomb this assignment.

On the other hand, by starting early and understanding the concepts one at a time, you'll likely find that, once you figure out the key concepts, the code can be written relatively easily. The assignment will be challenging, but tractable, and very satisfying in the end!

Good luck!

Limitations

You may not make changes to the Monkie2004 grammar that has been given to you, except for writing the actions and adding names to the symbols when you need to refer to their associated values. For example, for this rule in the given grammar:

    VariableDeclaration ::=
        VAR TypeDeclaration SEMICOLON
            {:   :}
    ;

...you may change it to look like this:

    VariableDeclaration ::=
        VAR TypeDeclaration:typeDec SEMICOLON
            {:  // action code goes here  :}
    ;

...but you may not restructure or rewrite the rules themselves in any way.

Naturally, you're required to add object types to the nonterminal symbols in the given CUP script, such as changing this:

    nonterminal                Program;

...to this:

    nonterminal ArrayList      Program;

Other changes to the CUP script are not permitted.