CompSci 141 / CSE 141 / Informatics 101 Spring 2013, Project #1

Introduction

As you begin to formulate an understanding of programming languages this quarter, one of the first things you'll need to learn is how languages are described. This is a vital skill not just in a course such as this one, but going forward, as well; whether you're learning a new language, becoming more proficient in a language with which you're already familiar, or designing a new language, you must understand how programming languages are described, so that you can read or write a clear description of the language expressed in a way that is familiar to others.

The description of a programming language is generally broken into two main parts, each of which answers an important question about the language.

Syntax. What do constructs in the language look like?
Semantics. What do constructs in the language do?

There is broad agreement amongst programming language designers about the appropriate tools for describing syntax: BNF (Backus-Naur Form) or context-free grammars are often used. (These two notations are conceptually identical, so I'll refer to them collectively as grammars, as I did in lecture.) Semantics, on the other hand, have proven to be much more difficult to describe formally; while there are formal methods for describing semantics, there is not one method that is a de facto standard, and semantics are most commonly explained in a natural language such as English. The downside of using a natural language is that its inherent imprecision can leave room in the specification for different interpretations, so different implementations of the language may behave differently because of the assumptions made by each implementer. The alternative, however, is to write complete, correct formal semantics, which is difficult enough that it is rarely done in practice.

As we discussed in class, even a formal notation like grammars can lead to imprecision if not used carefully. Since language processors — compilers and interpreters represent the two extremes, with various kinds of hybrid approaches in between — use grammars not only to determine whether programs are syntactically correct, but also to infer at least some aspects of their meaning, it is important to construct grammars carefully. In particular, the problem of ambiguity is best avoided whenever possible. Recall that an ambiguous grammar is one that, for at least one sentence in the language of the grammar, allows more than one parse tree or leftmost derivation to be constructed. Since the meaning of a program is partly inferred from the structure of a parse tree or derivation, it is worthwhile to ensure that the grammar is unambiguous. After all, a program with multiple possible parse trees is a program with multiple possible meanings; we'd like it to be the case that any given program in a programming language has only one meaning.
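
To make the ambiguity problem concrete, here is a minimal Python sketch. It is not part of the project, and the grammar and the names parse_trees and evaluate are assumptions chosen purely for illustration. It enumerates every parse tree that the ambiguous grammar expr ::= expr - expr | number allows for the sentence 9 - 3 - 2, then evaluates each tree to show that the two trees carry two different meanings.

```python
def parse_trees(tokens):
    """Return every parse tree for tokens under the ambiguous grammar
           expr ::= expr - expr | number
    A tree is either an int or a tuple (left_tree, '-', right_tree).
    The input is assumed to alternate number, '-', number, ..."""
    if len(tokens) == 1:
        return [tokens[0]]
    trees = []
    # Any '-' in the sentence may serve as the root of a parse tree.
    for i in range(1, len(tokens), 2):
        for left in parse_trees(tokens[:i]):
            for right in parse_trees(tokens[i + 1:]):
                trees.append((left, '-', right))
    return trees


def evaluate(tree):
    """Compute the value implied by the shape of a parse tree."""
    if isinstance(tree, int):
        return tree
    left, _, right = tree
    return evaluate(left) - evaluate(right)


trees = parse_trees([9, '-', 3, '-', 2])
values = sorted(evaluate(tree) for tree in trees)
print(len(trees))   # 2 parse trees for one sentence
print(values)       # [4, 8]
```

One tree groups the sentence as (9 - 3) - 2 and the other as 9 - (3 - 2), which is exactly the kind of double meaning an unambiguous grammar rules out.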

This project will deepen your understanding of grammars by asking you to construct a complete grammar for a relatively simple programming language. Along the way, you'll explore how to avoid ambiguity, and learn more about precedence and associativity rules, including how to use grammars to specify them.

The FUNCyMonkie language

A FUNCyMonkie program is made up of one or more functions. Each function consists of a name, a list of one or more parameters, an equals sign, and one expression. A function's name and its parameters are identifiers, which are sequences of one or more upper- and/or lowercase letters that begin with a lowercase letter. When a function is called, values are passed into each of the parameters, then the expression is evaluated in terms of those values. The result of the expression becomes the result of the function.

Here are some examples of functions. (These look like assignment statements from Java, but they're actually more akin to method declarations.)

    -- The function inc takes one parameter, x, and returns the value x + 1.
    inc x = x + 1

    -- The function h takes four parameters -- x, y, z, and w -- and returns
    -- one of two results, depending on whether the result of calling f and
    -- passing it x and y is true or false.  If it's true, the result of
    -- adding z and w is returned; otherwise, w is returned.  (Notice how
    -- no parentheses or commas are used in the call to f; we'll come back
    -- to this later.)
    h x y z w = if f x y
                then z + w
                else w
                endif

    -- The function square takes one parameter n and squares it, returning
    -- the result.
    square n = n ^ 2

Comments are denoted by two dashes; after two dashes, the remainder of a line is considered to be a comment.

The offside rule

One difference that you'll likely notice between FUNCyMonkie syntax and the syntax of languages like Java or C++ is that there are no explicit characters that separate one function from the next; no semicolons, curly braces, or other characters are used to denote program structure. You may wonder, then, how a FUNCyMonkie language processor would be able to tell when one function ends and another begins.

The answer to this problem lies in a rule that is called the offside rule. If the first character in a function definition appears at a particular horizontal position on a line, the next line whose first character appears at the same position or an earlier position is considered to be a new function. As an example, consider this unpleasant-looking, but nonetheless syntactically correct, layout of the inc and square functions from the previous code example:

    inc x =           -- beginning of the inc function
       x +            -- the "inc" function continues
      1               -- still in the "inc" function
    
   square n = n ^ 2   -- a new function!

The square function is distinguished from the inc function because it begins on a line whose first character is one position earlier than the first character on the line that begins inc. By way of contrast, consider this layout of the same code:

    inc x = x + 1          -- beginning of the inc function
       square n = n ^ 2    -- a syntax error, since this is interpreted to be part of the "inc" function

(This may seem like a strange way to design the syntax of a programming language, but it has its advantages, and there are well-known programming languages, such as Python and Haskell, that use some form of this rule as part of their syntaxes.)
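
The offside rule can be sketched in a few lines of Python. This is an illustration only, not a reference implementation: the function name split_functions is hypothetical, and the sketch assumes that indentation uses spaces only and that blank or comment-only lines play no part in the rule, neither of which is specified above.

```python
def split_functions(source):
    """Group the lines of a program into function definitions
    by applying the offside rule to each line's starting column."""
    functions = []
    start_col = None  # column where the current function began
    for line in source.splitlines():
        code = line.split("--", 1)[0].rstrip()  # drop any comment
        if not code.strip():
            continue  # assumption: blank/comment-only lines are ignored
        col = len(code) - len(code.lstrip(" "))
        if start_col is None or col <= start_col:
            # Same or earlier column: a new function starts here.
            functions.append([code.strip()])
            start_col = col
        else:
            # Further right: the current function continues.
            functions[-1].append(code.strip())
    return [" ".join(parts) for parts in functions]


layout = "\n".join([
    '    inc x =           -- beginning of the inc function',
    '       x +            -- the "inc" function continues',
    '      1               -- still in the "inc" function',
    '',
    '   square n = n ^ 2   -- a new function!',
])
print(split_functions(layout))
# ['inc x = x + 1', 'square n = n ^ 2']

bad_layout = "\n".join([
    '    inc x = x + 1',
    '       square n = n ^ 2',
])
print(split_functions(bad_layout))
# ['inc x = x + 1 square n = n ^ 2']
```

In the second layout the indented line is folded into inc, so the whole text is treated as a single (syntactically invalid) function body.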

Executing a FUNCyMonkie program

You won't be writing or executing FUNCyMonkie programs in this assignment, but for a little more background understanding of the language, you should know how they're run.

FUNCyMonkie programs are executed in an interactive, interpreted way, with the interpreter accepting a sequence of expressions from the user, evaluating each of them after it is entered, then printing the result. An example session with a FUNCyMonkie interpreter might look like this, with =>> representing the prompt that the interpreter prints to ask the user to enter an expression. (Note that the example below is not a FUNCyMonkie program; it's an example session of a user interacting with a FUNCyMonkie interpreter.)

    =>>  inc 3
    4
    
    =>>  square 9.5
    90.25

Expressions in FUNCyMonkie

There are two kinds of expressions in FUNCyMonkie: simple expressions and compound expressions.

A simple expression is one of the following:

An integer literal, such as 5 or 13. All integer literals in FUNCyMonkie are non-negative.
A real literal, such as 5.5 or 456.789. Like integer literals, all real literals are also non-negative.
A boolean literal. There are two possible values for a boolean literal: True and False.
An identifier, such as monkie or alex.

A compound expression is one of the following:

A relational expression, which compares the values of two expressions for equality and/or ordering. There are six relational operators: ==, /=, <, <=, >, and >=. Examples:

    3 == 4                   -- result: False
    True /= False            -- result: True
    x < y                    -- result: True if x is less than y, False otherwise

An arithmetic expression, which combines the values of two expressions using one of these arithmetic operators: +, -, *, /, ^ (exponentiation). Examples:
```
    3 + 4                    -- result: 7
    7 / 2                    -- result: 3.5
    2 ^ 3                    -- result: 8
```
A conditional expression, which evaluates a test expression and then evaluates one of two other expressions based on the value of the test. Examples:
```
    if x < y then 4 else 5 endif            -- result: if x is less than y, result is 4, otherwise 5
    if 3 == 4 then x + y else 7 / 2 endif   -- result: 3.5
```
Any arbitrary expression can be used as the test, placed in the then clause, or placed in the else clause. Both the then and else clauses are mandatory, as is the endif that terminates the expression.

A function application, which calls a function, passing the result of one or more expressions as arguments. Syntactically, function applications in FUNCyMonkie are a bit different from the method calls or function calls you might be accustomed to from other languages. The function name, an identifier, comes first, followed by its arguments, separated only by whitespace. There are no mandatory parentheses, commas, or other "punctuation." Examples:
```
    f x                      -- calls the function f and passes the value of x as an argument
    g 3 4                    -- calls the function g and passes 3 and 4 as arguments
```
Simple expressions can be passed as arguments as in the examples above. Compound expressions can only be passed as arguments if they are surrounded by parentheses. Examples:
```
    f (3 + 4)                -- calls the function f and passes 7 as its argument
    f 3 + 4                  -- calls the function f and passes 3 as its argument...
                             -- ...4 is then added to the result of the function
```
The latter example demonstrates that function application binds more tightly than other operators. In other words, the expression is assumed to be, primarily, an addition operation, with f 3 being its left operand and 4 being its right.

Compound expressions can be combined together, with precedence and associativity rules used to determine their meaning. Parentheses can be used to override the precedence and associativity rules. Examples:

    3 + 4 * 2                -- * has higher precedence than +, so result is 11
    (3 + 4) * 2              -- result is 14
    9 - 3 - 2                -- the - operator is left-associative, so result is 4
    9 - (3 - 2)              -- result is 8

The precedence and associativity rules of the operators are summarized in the following table, with operators listed on the top row having the highest precedence, operators listed on the second row having the next-highest precedence, and so on.

Operators	Associativity
^	right-associative
* /	left-associative
+ -	left-associative
== /= < <= > >=	non-associative

Function application has a higher "precedence" than any of these.
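
To see how the table's rules play out, here is a small Python sketch that evaluates arithmetic expressions using precedence climbing. Note that this is a parsing technique swapped in for illustration, not the BNF grammar this project asks for, and it covers only the arithmetic operators and parentheses. The names OPERATORS, APPLY, tokenize, and evaluate are assumptions, and well-formed input is assumed throughout.

```python
import operator
import re

# Precedence level and associativity, taken from the table above.
# Relational operators and function application are left out for brevity.
OPERATORS = {
    "^": (3, "right"),
    "*": (2, "left"), "/": (2, "left"),
    "+": (1, "left"), "-": (1, "left"),
}
APPLY = {
    "^": operator.pow, "*": operator.mul, "/": operator.truediv,
    "+": operator.add, "-": operator.sub,
}


def tokenize(text):
    """Split text into number, operator, and parenthesis tokens."""
    return re.findall(r"\d+\.\d+|\d+|[-+*/^()]", text)


def evaluate(text):
    """Evaluate a well-formed arithmetic expression."""
    tokens = tokenize(text)
    pos = 0

    def primary():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = expression(1)  # parentheses restart at the lowest level
            pos += 1               # skip the closing ")"
            return value
        return float(tok) if "." in tok else int(tok)

    def expression(min_prec):
        nonlocal pos
        left = primary()
        while pos < len(tokens) and tokens[pos] in OPERATORS:
            op = tokens[pos]
            prec, assoc = OPERATORS[op]
            if prec < min_prec:
                break
            pos += 1
            # Left-associative: the right operand may only contain
            # tighter-binding operators. Right-associative: it may also
            # contain operators of the same level.
            right = expression(prec + 1 if assoc == "left" else prec)
            left = APPLY[op](left, right)
        return left

    return expression(1)


print(evaluate("3 + 4 * 2"))    # 11
print(evaluate("(3 + 4) * 2"))  # 14
print(evaluate("9 - 3 - 2"))    # 4
print(evaluate("9 - (3 - 2)"))  # 8
print(evaluate("2 ^ 3 ^ 2"))    # 512, since ^ is right-associative
```

The only difference between left- and right-associative operators in this sketch is the minimum precedence passed to the recursive call; a grammar expresses the same distinction through the shape of its rules.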

Writing a grammar for FUNCyMonkie

Design and write a grammar that accepts complete FUNCyMonkie programs, as described above. The grammar must be unambiguous and must be specified in the BNF-like style discussed in lecture. (You may not use the Extended BNF shortcuts described in the textbook.)

Your grammar must respect precedence and associativity, meaning, for example, that operators with higher precedences should be forced to appear lower in a parse tree than operators with lower precedences.

You may assume the presence of a "scanner" that takes an input file and turns it into a sequence of tokens, which would then need to be matched against your grammar. The alphabet of your grammar — the set of possible tokens — should include the following special tokens, in addition to the literal ones such as parentheses and the operators:

IntLiteral, which represents an integer literal.
RealLiteral, which represents a real literal.
Identifier, which represents an identifier.
True and False, which represent the boolean literals.
Offside, which indicates the occurrence of an "offside" condition, as described above.

As an example, consider this FUNCyMonkie program:

    inc x = x + 1
    circleArea radius = radius ^ 2 * 3.14

It would be scanned and turned into the following token sequence:

Identifier, Identifier, =, Identifier, +, IntLiteral, Offside, Identifier, Identifier, =, Identifier, ^, IntLiteral, *, RealLiteral

Note, in particular, the presence of the Offside token separating the tokens that make up the inc function from those comprising the circleArea function.
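
You are not asked to write the scanner, but a rough Python sketch may help you picture what it hands to your grammar. Everything here is an assumption made for illustration: the names scan and classify, the treatment of if, then, else, and endif as literal tokens, the use of spaces only for indentation, and the choice to silently skip characters the pattern does not recognize.

```python
import re

# Assumption: these words are emitted as literal tokens of their own.
KEYWORDS = {"if", "then", "else", "endif", "True", "False"}
TOKEN = re.compile(r"\d+\.\d+|\d+|[A-Za-z]+|==|/=|<=|>=|[-+*/^<>=()]")


def classify(text):
    """Map a piece of source text to the token the grammar would see."""
    if text[0].isdigit():
        return "RealLiteral" if "." in text else "IntLiteral"
    if text in KEYWORDS:
        return text
    if text.isalpha():
        if text[0].islower():
            return "Identifier"
        raise ValueError("identifiers must begin with a lowercase letter: " + text)
    return text  # operators and parentheses stand for themselves


def scan(source):
    """Turn program text into a list of tokens, adding Offside markers."""
    tokens = []
    start_col = None  # column where the current function began
    for line in source.splitlines():
        code = line.split("--", 1)[0]  # drop any comment
        if not code.strip():
            continue
        col = len(code) - len(code.lstrip(" "))
        if start_col is None:
            start_col = col
        elif col <= start_col:
            tokens.append("Offside")  # a new function begins on this line
            start_col = col
        # Note: findall skips any character the pattern does not match.
        tokens.extend(classify(text) for text in TOKEN.findall(code))
    return tokens


program = "\n".join([
    "    inc x = x + 1",
    "    circleArea radius = radius ^ 2 * 3.14",
])
print(", ".join(scan(program)))
# Identifier, Identifier, =, Identifier, +, IntLiteral, Offside, Identifier, Identifier, =, Identifier, ^, IntLiteral, *, RealLiteral
```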

Be sure that it's clear in your grammar which symbols are terminal symbols and which are non-terminal symbols. Use boldface for terminal symbols and italics for non-terminal symbols, or some other format if you prefer. Please specify at the top of your document what format you've chosen.

How to test your grammar

As you work on your grammar, you may wonder how you can be sure whether your work is correct. I suggest working on your grammar in stages, implementing one language feature at a time, and developing test cases, as you would when you write code. Try out your test cases often, assessing what parse tree would result (or, in cases that aren't legal in FUNCyMonkie, checking that building a parse tree is not possible). If you get results other than what you were expecting, you've got a problem in your grammar; if not, move on to another feature and implement it.