ICS 142 Winter 2004, Assignment #5

Introduction

Before we can write programs in any programming language, we first must decide how to map the problem into the abstractions provided by the programming language we intend to use. In the case of a language like Java, that means we have to make object-oriented design decisions, taking a data-centric viewpoint that decides what kind of objects will comprise the system and how these objects will interact. Similarly, before the back end of a compiler can begin generating intermediate code or target code that is equivalent to some source program, it's necessary to map the abstractions provided by the source language into the (probably lower-level) abstractions provided by the intermediate language or target language.

For a lower-level intermediate code such as ILOC (which is presented in the textbook and was discussed in lecture), some of the first decisions that will need to be made center around the use of memory, though there are a variety of other decisions that will need to be made as well. If we were compiling Java and decided to use ILOC as an intermediate language, we'd have to consider many issues, such as:

How to represent objects in memory.
Where to put static variables and how to lay them out.
How to manage heap-allocated variables (including both allocation and garbage collection).
How to manage the calling of methods (including parameter passing, return values, and the saving and restoring of the caller's state).
How to implement dynamic binding of method calls (i.e. polymorphism).
If and how to represent and manage multiple threads.

Fortunately, for a language such as Monkie2004, the number of decisions that need to be made is much smaller. This assignment will explore a few of them: the placement of variables in memory and their subsequent use in expressions and assignment statements.

Syntax-directed analysis

In the previous two assignments, the output of the parser was an abstract syntax tree that represented all the meaningful information in the source program. The code that built up the abstract syntax tree was embedded into the grammar, with actions included in the CUP script that built nodes and passed them up the parse tree. After building the abstract syntax tree, we performed semantic checking on it in Assignment #3, then interpreted it in Assignment #4.

It should be noted that, for the work that was done in Assignment #3, there was another strategy that would also have worked. The analysis itself could have been embedded into the actions in the CUP script, rather than building the entire AST first, then performing analysis on it. (The primary reasons for having you build the AST were to keep your analysis separate from your parser while you were still learning the details of how to use CUP, and also to seed your work on the interpreter in Assignment #4.) For example, assuming that there was a global symbol table called st available to all the actions in the CUP script, the following action could have been embedded into the rule for an addition expression, assuming that a Type object was associated with each Expression. (I've simplified things somewhat for the purposes of the example.)

  Expr5 ::=
      Expr5:e5 ADDITION_OP Expr6:e6
          {:
              if (e5 != st.lookupType("integer") || e6 != st.lookupType("integer"))
              {
                  reportSemanticError("both operands in addition must be integers");
                  RESULT = null;
              }
              else
              {
                  RESULT = st.lookupType("integer");
              }
          :}
  ;

Similarly, the rest of the rules in the grammar could have contained actions that performed semantic checking while the program was being parsed. At the conclusion of parsing the program, then, semantic checking could be complete.

Syntax-directed analysis is the performing of analysis on a source program as it's being parsed. While we didn't use that strategy for our semantic analyzer, there are other analyses that we might perform on the program during parsing. In order to set you up for this, however, I first need to introduce you to a feature of CUP that we haven't discussed before: embedding an action in the middle of a grammar rule, as opposed to placing it at the end of one.

We've discussed in class and in previous assignment write-ups how to embed actions at the end of a grammar rule in a CUP script. Actions may also be embedded in the middle of rules. For example, consider the following brief CUP script (with irrelevant parts left out):

    Goal ::=
        Happies {: System.out.println("No more happies!"); :}
        Monkies {: System.out.println("No more monkies!"); :}
    ;

    Happies ::=
        Happies HAPPY {: System.out.println("Happy!"); :}
    |   HAPPY         {: System.out.println("Happy!"); :}
    ;
    
    Monkies ::=
        Monkies MONKIE {: System.out.println("Monkie!"); :}
    |   MONKIE         {: System.out.println("Monkie!"); :}
    ;

Notice that there is only one rule for Goal: Happies followed by Monkies. The action after Happies is within the rule. (You can discern this from the script, since there is no '|' character between Happies and Monkies.) So, the grammar accepts any input file with one or more HAPPY tokens followed by one or more MONKIE tokens.

Actions in the middle of a CUP rule are executed after the preceding portion of the rule has been matched, but before the rest of the rule has been matched. They are equivalent to placing a dummy nonterminal symbol into the middle of the rule, along with the addition of an epsilon rule for the dummy nonterminal symbol; in other words, the Goal rule in the example above is equivalent to these two rules:

    Goal ::=
        Happies Dummy Monkies {: System.out.println("No more monkies!"); :}
    ;
    
    Dummy ::=
        /* epsilon */   {: System.out.println("No more happies!"); :}
    ;

However, the original version, with the action in the midst of the rule, more clearly indicates the intent, which is to execute an action after the Happies, but before the Monkies. All in all, the effect of the example is to do the following:

Print out Happy! for every HAPPY token in the input.
Print out No more happies! after all the HAPPY tokens.
Print out Monkie! for every MONKIE token in the input.
Print out No more monkies! after all the MONKIE tokens.

This is a useful technique for performing some forms of syntax-directed analysis. For example, if we were implementing the semantic checker using a syntax-directed technique, we might want an action embedded within the Procedure rule, after the ParameterList had been matched, but before the BlockStatement. This action would declare the parameters into the SymbolTable before proceeding to analyze the BlockStatement. For some of the work you'll be doing in this assignment, you may find this technique to be of great benefit.

The program

Your program will take a Monkie2004 program as input. Its output will be an indication of a few things:

For each procedure and function, the layout of its activation record (AR). Details will be shown below, but AR's will be laid out similarly to how they were described in lecture, with local variables at the top, parameters at the bottom, and some other relevant information in between.
The layout of global variables into static memory. Rather than report an absolute address for each global variable, we'll assume that there is one contiguous block of memory that will be used for all global variables. Global variable addresses, then, will be reported as offsets into that block.
For each use of a variable in an assignment statement or an expression, information that would allow calculation of an access path to it. This will be described in more detail below.

You will be required to perform your analysis while parsing the program, with actions embedded within your CUP script. You will, of course, need to use auxiliary data structures to store relevant information, such as activation record layouts. But you may not build an abstract syntax tree and then analyze it, as you did in the previous two assignments. Part of what I'd like you to get out of this assignment is experience doing syntax-directed analysis.

Changes to the Monkie2004 language for this assignment

In order to introduce a couple of wrinkles into this assignment and iron out another, two changes have been introduced into the Monkie2004 language for this assignment.

ref parameters have been dropped entirely from the language. The corresponding rule has been removed from the grammar, and all parameters are assumed to be passed by value.
Procedure and function declarations may now be nested, meaning that they may occur as statements in the language. It is assumed that static scoping will be used to resolve references to non-local variables. The rules for calling nested procedures and functions are not relevant to this assignment, but you may assume that standard rules from other programming languages such as Pascal apply.

Memory layout requirements

Data widths

We'll operate under the assumption that all data must be laid out on four-byte boundaries. To accommodate this assumption, we'll make the following rules about the widths of data in Monkie2004:

Integer and boolean variables will occupy four bytes.
Strings, of course, are more complicated, since the lengths of strings may vary widely. We'll say that string variables occupy eight bytes in either global memory or activation records, operating under a couple of assumptions:
- A string variable will require four bytes for its length and four bytes for a pointer to its heap-allocated contents.
- It is assumed that any change to a string (e.g. concatenation, assignment) or copying of a string (e.g. pass by value) will cause an automatic heap allocation of the appropriate size, as well as any necessary deallocation. This behavior will not be modeled in this program.

Global memory

All of the global variables in the program will be allocated into one area of memory called the global area. Global variables should be laid out in the global area in the order seen, with the first one at offset 0, and subsequent ones at higher offsets. For example, if the following three global variables are declared in an input program:

var i: integer;
var s: string;
var b: boolean;

...then they would be laid out as follows:

i would appear at offset 0 of the global area
s would appear at offset 4 of the global area
b would appear at offset 12 of the global area

...and the total size of the global area would be 16 bytes.

Since the lifetime of global variables is the entire duration of the program's execution, no overlaying is ever done to save memory in the global area.
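
As a rough illustration of the rule above, here is one way the bookkeeping for the global area might be sketched in Java. The class GlobalArea and its methods are hypothetical names of my own choosing (they are not part of the provided starting point), and the sketch uses features of current Java versions. It simply hands out the next free offset and advances by the width of the declared type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; GlobalArea is a hypothetical helper class.
final class GlobalArea {
    // Insertion-ordered, so variables can be reported in the order seen.
    private final Map<String, Integer> offsets = new LinkedHashMap<>();
    private int size = 0; // total bytes allocated so far

    // Widths taken from the "Data widths" rules: 4 bytes for integer and
    // boolean, 8 bytes for string (length plus pointer).
    static int widthOf(String type) {
        switch (type) {
            case "integer":
            case "boolean":
                return 4;
            case "string":
                return 8;
            default:
                throw new IllegalArgumentException("unknown type: " + type);
        }
    }

    // Places the variable at the next free offset and returns that offset.
    int declare(String name, String type) {
        int offset = size;
        offsets.put(name, offset);
        size += widthOf(type);
        return offset;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        GlobalArea g = new GlobalArea();
        g.declare("i", "integer");
        g.declare("s", "string");
        g.declare("b", "boolean");
        System.out.println(g.offsets + ", size = " + g.size());
        // prints {i=0, s=4, b=12}, size = 16
    }
}
```

Because nothing is ever overlaid in the global area, a single running total is all the state this needs.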

Activation records

Each subprogram, including nested subprograms, has its own activation record. Activation records contain local variables, parameters, and three additional values: a pointer to the caller's AR, a static link, and a return address. Functions have a fourth additional value: a pointer to the return value. Each of these additional values occupies four bytes.

Activation records are assumed to all be stack-allocated, with the stack growing from higher to lower addresses. It is assumed that, during an activation, the current AR pointer will point to the slot in the current activation record that holds the pointer to the caller's AR (offset 0). Local variables will appear above that slot, at negative offsets; parameters will appear below the return address (or, for a function, below the return value pointer), at positive offsets. The order of the local variables and parameters will be considered important, and can be determined from the example below.

Consider the following Monkie2004 function:

    function foo(s: string, i: integer, b: boolean): integer
    [
        var ii: integer;
        var ss: string;
        var bb: boolean;
        
        -- ...
    ]

The layout for foo's AR is:

local variable bb at offset -16
local variable ss at offset -12
local variable ii at offset -4
caller's AR pointer at offset 0
static link at offset 4
return address at offset 8
return value pointer at offset 12
parameter s at offset 16
parameter i at offset 24
parameter b at offset 28

...and the size of foo's AR is 48 bytes.
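
The arithmetic behind foo's layout can be sketched as follows. This is only an illustration under my own assumptions: ArLayout, addParameter, and addLocal are hypothetical names, the sketch ignores overlaying of locals in nested blocks, and its main method prints entries in the order they were added rather than in the order your report should use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; ArLayout is a hypothetical helper class.
final class ArLayout {
    private final Map<String, Integer> offsets = new LinkedHashMap<>();
    private int nextParamOffset;       // next free positive offset
    private int lowestLocalOffset = 0; // most negative offset handed out so far

    ArLayout(boolean isFunction) {
        offsets.put("caller's ARP", 0);
        offsets.put("static link", 4);
        offsets.put("return address", 8);
        if (isFunction) {
            offsets.put("return value pointer", 12);
        }
        // Parameters start just past the three (or four) fixed 4-byte slots.
        nextParamOffset = isFunction ? 16 : 12;
    }

    // Assumes a legal program: anything that is not a string is 4 bytes wide.
    private static int widthOf(String type) {
        return type.equals("string") ? 8 : 4;
    }

    // Parameters grow toward higher offsets, in declaration order.
    int addParameter(String name, String type) {
        int offset = nextParamOffset;
        offsets.put("parameter '" + name + "'", offset);
        nextParamOffset += widthOf(type);
        return offset;
    }

    // Locals grow toward lower (more negative) offsets, in declaration order.
    int addLocal(String name, String type) {
        lowestLocalOffset -= widthOf(type);
        offsets.put("local variable '" + name + "'", lowestLocalOffset);
        return lowestLocalOffset;
    }

    int size() {
        return nextParamOffset - lowestLocalOffset;
    }

    public static void main(String[] args) {
        ArLayout foo = new ArLayout(true);
        foo.addParameter("s", "string");
        foo.addParameter("i", "integer");
        foo.addParameter("b", "boolean");
        foo.addLocal("ii", "integer");
        foo.addLocal("ss", "string");
        foo.addLocal("bb", "boolean");
        foo.offsets.forEach((what, offset) ->
                System.out.println("offset " + offset + " - " + what));
        System.out.println("size = " + foo.size()); // size = 48
    }
}
```

The size falls out as the distance between the end of the last parameter (offset 32 for foo) and the lowest local (offset -16).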

Overlaying local variables in activation records

Activation records should be minimally sized, meaning that memory within them should be reused whenever possible. The easiest way to ensure that the minimum amount of memory is used is to lay out variables that are guaranteed never to live simultaneously at the same offsets of the AR. Given Monkie2004's block structure, and the rule that the lifetime of local variables within a block statement is only within that block statement, this reuse is fairly straightforward to achieve. Consider the following example:

    procedure bar()
    [
        var i: integer;
        var j: integer;
        
        [
            var k: integer;
            
            -- ...
        ]
        
        [
            var m: integer;
            var n: integer;
            
            -- ...
            
            [
                var p: integer;
                
                -- ...
            ]
            
            [
                var q: integer;
                var r: integer;
                
                -- ...
            ]
        ]
        
        -- ...
    ]

The layout of the local variables in the activation record for bar should be as follows:

r at offset -24
p and q at offset -20
n at offset -16
k and m at offset -12
j at offset -8
i at offset -4

This kind of layout can be achieved programmatically using a technique similar to the scoped symbol tables you've used in the previous two assignments.
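
One possible shape for such a technique, offered as a sketch rather than as the required design: remember the current offset when a block statement is entered, and restore it when the block is exited, while tracking the lowest offset ever reached. LocalAllocator and its method names are hypothetical, and the sketch only computes offsets; it does not record which variables share a slot.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only; LocalAllocator is a hypothetical helper class.
final class LocalAllocator {
    // Offsets saved at each block entry, innermost block on top.
    private final Deque<Integer> savedOffsets = new ArrayDeque<>();
    private int current = 0; // offset of the most recently placed live local
    private int lowest = 0;  // lowest offset ever used in this AR

    void enterBlock() {
        savedOffsets.push(current);
    }

    // Leaving a block frees its locals, so later blocks can reuse the space.
    void exitBlock() {
        current = savedOffsets.pop();
    }

    // Places a local of the given width below the live locals.
    int declare(int width) {
        current -= width;
        lowest = Math.min(lowest, current);
        return current;
    }

    // Bytes needed for locals, accounting for reuse.
    int localsSize() {
        return -lowest;
    }

    public static void main(String[] args) {
        LocalAllocator a = new LocalAllocator();
        a.enterBlock();                            // body of bar
        System.out.println("i " + a.declare(4));   // i -4
        System.out.println("j " + a.declare(4));   // j -8
        a.enterBlock();
        System.out.println("k " + a.declare(4));   // k -12
        a.exitBlock();
        a.enterBlock();
        System.out.println("m " + a.declare(4));   // m -12
        System.out.println("n " + a.declare(4));   // n -16
        a.enterBlock();
        System.out.println("p " + a.declare(4));   // p -20
        a.exitBlock();
        a.enterBlock();
        System.out.println("q " + a.declare(4));   // q -20
        System.out.println("r " + a.declare(4));   // r -24
        a.exitBlock();
        a.exitBlock();
        a.exitBlock();
        System.out.println("locals size " + a.localsSize()); // locals size 24
    }
}
```

In a syntax-directed setting, the calls to enterBlock and exitBlock would plausibly sit in actions at the start and end of the block statement rule, which is one place an action in the middle of a rule comes in handy.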

Static-distance coordinates

Recall that in statically-scoped languages, when accessing non-local variables, the proper way to find them does not involve searching through the call stack looking for the first declaration of a variable with the desired name. (There's another name for this approach: it's called dynamic scoping.) In a statically-scoped language, uses of non-local variables are resolved based on static properties of the program: specifically, its lexical structure. Uses of non-local variables are resolved by finding the syntactically closest declaration for that variable. So, in the following example:

    var i: integer;

    procedure program()
    [
        var i: integer;
        
        procedure foo()
        [
            i <- i + 1;
        ]
        
        procedure bar()
        [
            var i: integer;
            foo();
        ]
        
        bar();
    ]

...the assignment to i in foo should assign the i that is declared in program, not the one declared in bar (foo's caller).

To implement this behavior at run-time, activation records need to store two links to other AR's: one to the caller's AR (often called the dynamic link) and another to the AR of the most recent activation of the lexically-enclosing procedure (often called the static link). In other words, while foo is executing, its dynamic link will point to bar's AR, while its static link will point to the AR for the most recent activation of program. Details of how static links are maintained at run-time are not relevant to this assignment.

Assuming that static links are present in every AR, accessing a non-local variable is a relatively straightforward process. When foo assigns to i, it is known that the static link in foo's AR will always point to the AR for the most recent activation of program. So, finding the address of the appropriate i is a matter of doing two things:

Placing foo's static link into a register.
Adding the offset of i in program's AR to it.

To easily summarize this process, we can say that every use of a variable, in either an assignment or an expression, can be characterized by a static-distance coordinate. The static-distance coordinate (d, o) is an ordered pair containing the distance d (i.e. the number of static links that must be followed to get to a variable) and the offset o (i.e. the offset of that variable in its AR). So, in the example above, the uses of i in foo have a static-distance coordinate of (1, -4), assuming that the offset of i in program's AR is -4.

The static-distance coordinate for a local variable is (0, o), where o is the offset of the variable in the current AR.
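
To make the run-time meaning of a coordinate concrete, here is a small simulation of how (d, o) could be turned into an address. It is a sketch under my own assumptions: Frame and addressOf are hypothetical names, and the AR pointer values 1000, 984, and 968 are made up for the example (this assignment does not ask you to model the run-time stack).

```java
// Illustrative sketch only; StaticChain and Frame are hypothetical names.
final class StaticChain {
    // A run-time activation, reduced to the two fields needed here.
    static final class Frame {
        final int arp;          // value of the AR pointer for this activation
        final Frame staticLink; // most recent activation of the enclosing subprogram

        Frame(int arp, Frame staticLink) {
            this.arp = arp;
            this.staticLink = staticLink;
        }
    }

    // Follows d static links from the current activation, then adds offset o.
    static int addressOf(Frame current, int d, int o) {
        Frame frame = current;
        for (int hop = 0; hop < d; hop++) {
            frame = frame.staticLink;
        }
        return frame.arp + o;
    }

    public static void main(String[] args) {
        // Made-up AR pointer values; the stack grows toward lower addresses.
        Frame program = new Frame(1000, null);
        Frame bar = new Frame(984, program); // called by program
        Frame foo = new Frame(968, program); // called by bar, nested in program
        System.out.println(addressOf(foo, 1, -4)); // 996: program's i
        System.out.println(addressOf(bar, 0, -4)); // 980: bar's own i
    }
}
```

Note that foo's static link refers to program's activation even though bar is its caller, which is exactly the distinction between the static link and the dynamic link.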

It should be pointed out that static-distance coordinates do not apply to global variables. Any access to a global variable can be resolved using an address known at compile time, i.e. an offset into the global area. So there's no need to follow static links in order to access global variables; they can be accessed much more quickly by using their statically-determined address.
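
At compile time, a coordinate could be computed by searching scopes from the innermost outward and counting how many subprogram boundaries are crossed along the way. The following is a minimal sketch of that idea, not a prescribed design: ScopeResolver and all of its methods are hypothetical names, and offsets are passed in directly rather than computed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; ScopeResolver is a hypothetical helper class.
final class ScopeResolver {
    private static final class Scope {
        final boolean startsSubprogram; // true for a procedure/function scope
        final Map<String, Integer> offsets = new HashMap<>();

        Scope(boolean startsSubprogram) {
            this.startsSubprogram = startsSubprogram;
        }
    }

    private final Map<String, Integer> globals = new HashMap<>();
    private final Deque<Scope> scopes = new ArrayDeque<>(); // innermost first

    void declareGlobal(String name, int offset) { globals.put(name, offset); }
    void enterSubprogram() { scopes.push(new Scope(true)); }
    void enterBlock() { scopes.push(new Scope(false)); }
    void exitScope() { scopes.pop(); }
    void declare(String name, int offset) { scopes.peek().offsets.put(name, offset); }

    // Describes where a use of the named variable would be found.
    String describe(String name) {
        int distance = 0;
        for (Scope scope : scopes) { // iterates from the innermost scope outward
            Integer offset = scope.offsets.get(name);
            if (offset != null) {
                return "static-distance coordinate (" + distance + ", " + offset + ")";
            }
            if (scope.startsSubprogram) {
                distance++; // leaving this subprogram costs one static link
            }
        }
        Integer globalOffset = globals.get(name);
        if (globalOffset == null) {
            throw new IllegalArgumentException("undeclared variable: " + name);
        }
        return "global area offset " + globalOffset;
    }

    public static void main(String[] args) {
        ScopeResolver r = new ScopeResolver();
        r.declareGlobal("i", 0);
        r.enterSubprogram();                  // program
        r.declare("i", -4);
        r.enterSubprogram();                  // foo
        System.out.println(r.describe("i"));  // static-distance coordinate (1, -4)
        r.exitScope();
        r.exitScope();
        System.out.println(r.describe("i"));  // global area offset 0
    }
}
```

Only subprogram scopes increase the distance; nested block statements share the AR of the subprogram that contains them, so crossing one of those costs nothing.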

Sample input and output

As stated earlier, your program will calculate and display three kinds of information while parsing an input program:

The layout and size of the activation record for each procedure and function.
The layout and size of the global area.
The static-distance coordinate or global area offset of every assignment to, or use of, every variable.

Since I'm asking you to calculate and display the information while parsing the program, certain limitations on the order of your output are implied. For example, you won't be able to report the layout and size of an AR until after you've finished parsing its procedure or function. You won't be able to report the layout and size of the global area until you've finished parsing the input program. Static-distance coordinates, on the other hand, may be reported immediately, since all variables are declared before they are used.

With these facts in mind, here is an example Monkie2004 program and a sample of what your output should look like:

Sample input

var calls: integer;

procedure program()
[
    function factorial(n: integer): integer
    [
        calls <- calls + 1;
        
        if n == 0 then
        [
            Result <- 1;
        ]
        else
        [
            Result <- n * factorial(n - 1);
        ]
    ]
    
    calls <- calls + 1;
    
    var i: integer;
    i <- 1;

    var j: integer;
    j <- 0;
    
    while i < 10 do
    [
        print_integer(factorial(i));
        print_endline();
    ]
    
    print_integer(calls);
    print_endline();
]

Sample output

procedure program
[
    function factorial
    [
        assignment to 'calls' @ global area offset 0
        use of 'calls' @ global area offset 0
        use of 'n' @ static-distance coordinate (0, 16)
        assignment to 'Result' @ static-distance coordinate (0, 12)
        assignment to 'Result' @ static-distance coordinate (0, 12)
        use of 'n' @ static-distance coordinate (0, 16)
        use of 'n' @ static-distance coordinate (0, 16)
    ]
    factorial AR layout (size = 20)
    [
        offset 0 - caller's ARP
        offset 4 - static link
        offset 8 - return address
        offset 12 - return value pointer
        offset 16 - parameter 'n'
    ]
    assignment to 'calls' @ global area offset 0
    use of 'calls' @ global area offset 0
    assignment to 'i' @ static-distance coordinate (0, -4)
    assignment to 'j' @ static-distance coordinate (0, -8)
    use of 'i' @ static-distance coordinate (0, -4)
    use of 'i' @ static-distance coordinate (0, -4)
    use of 'calls' @ global area offset 0
]
program AR layout (size = 20)
[
    offset -8 - local variable 'j'
    offset -4 - local variable 'i'
    offset 0 - caller's ARP
    offset 4 - static link
    offset 8 - return address
]
global area layout (size = 4)
[
    offset 0 - 'calls'
]

Sample input #2

Here's a second example, presented because it shows an example of overlaying of variables in an activation record, accompanied by sample output that shows how your output should reflect it:

procedure program()
[
    var i: integer;
    i <- read_integer();
    
    if i /= 0 then
    [
        var j: boolean;
        j <- read_boolean();
        
        print_string("the input was ");
        print_boolean(j);
        print_endline();
    ]
    else
    [
        var k: integer;
        k <- read_integer();
        
        print_string("the input was ");
        print_integer(k);
        print_endline();
    ]
]

Sample output #2

procedure program
[
    assignment to 'i' @ static-distance coordinate (0, -4)
    use of 'i' @ static-distance coordinate (0, -4)
    assignment to 'j' @ static-distance coordinate (0, -8)
    use of 'j' @ static-distance coordinate (0, -8)
    assignment to 'k' @ static-distance coordinate (0, -8)
    use of 'k' @ static-distance coordinate (0, -8)
]
program AR layout (size = 20)
[
    offset -8 - overlay
    [
        local variable 'j'
        local variable 'k'
    ]
    offset -4 - local variable 'i'
    offset 0 - caller's ARP
    offset 4 - static link
    offset 8 - return address
]
global area layout (size = 0)
[
]

What about erroneous Monkie2004 programs?

You may assume that only legal Monkie2004 programs will be used as input to your program. Bear in mind that a couple of changes have been made to the language, as described above, so the notion of a "legal Monkie2004 program" has changed to include those with nested subprograms and to exclude those with pass-by-reference parameters.

Starting point

Because it's neither necessary nor acceptable to use an AST to solve this problem, the starting point is not your code from the previous two assignments. I'm providing a starting point, which consists of only a complete scanner, a CUP script (monkie.cup) with all of the actions removed from it, and a Driver class that sets things up and runs the program for you. As usual, the starting point is provided as a Zip archive:

Zip archive

Deliverables

Place your completed CUP script and all of the .java files that comprise your program into a Zip archive, then submit that Zip archive. You need not include the .java files created by CUP (Parser.java and Tokens.java), but we won't penalize you if you do. However, you should be aware that we'll be regenerating these ourselves during the grading process, to be sure that they really did come from your CUP script. Please don't include other files, such as .class files, in your Zip archive.

Follow this link for a discussion of how to submit your assignment. Remember that we do not accept paper submissions of your assignments, nor do we accept them via email under any circumstances.

In order to keep the grading process relatively simple, we require that you keep your program designed in such a way that it can be compiled and executed with the following set of commands:

    cup monkie.cup
    javac *.java
    java Driver inputfile.m