ICS 142 Winter 2004, Assignment #1

Due date and time: Monday, January 26, 11:59pm

Introduction

The first step of compilation is scanning (sometimes called lexical analysis). As we discussed in class, the objective of scanning is to take the source program, which is in reality a stream of characters, and turn it into a stream of lexemes, or words, instead. This raises the level of abstraction of the source program at a very low cost, since scanning can be done very efficiently. The time spent on scanning pays off many times over, as the next step of compilation, parsing, has a much higher cost per symbol than scanning does. The fewer symbols a parser deals with, the less time and memory is used, so a compiler is much more efficient if its parser deals with lexemes rather than characters.

In this assignment, you'll write a scanner for a hypothetical language called Monkie2004. You'll be using a Java-based scanner generator called JFlex for the majority of your solution to this assignment.

JFlex: a scanner generator for Java

JFlex is a scanner generator, which takes as input a script that uses regular expressions to describe a set of lexeme patterns, then generates a Java class implementing a scanner that recognizes those patterns. Associated with each pattern is an action, which is a chunk of Java code; each time the scanner recognizes a pattern, the associated action is executed. JFlex generates an efficient scanner, saving you the trouble of designing and implementing a DFA by hand or, worse yet, hand-coding the scanner from scratch.

Understanding JFlex requires a rough understanding of the class that it will generate, as well as an awareness of how various settings can be modified so that the class will behave as you'd like it to. The following sections seek to give you this understanding.

The general structure of a JFlex script

JFlex scripts follow this general structure:

User code

%%

Options and declarations

%%

Lexical rules

The three sections of the script are separated by %% appearing by itself on a line.

The user code section contains code that appears outside of the scanner class; most typically, this will contain import or package declarations, but could include an additional class if you'd like. (Personally, I prefer putting additional classes into separate .java files. Remember that all of the code related to your scanner doesn't have to go into the JFlex script. The purpose of the JFlex script is just to generate the class that represents the scanner; I look at it as a replacement for the one Java class that is generated from it.) The code in this section will appear above the generated scanner class in the .java file generated by JFlex.

The options and declarations section contains three things:

The specification of various options, such as the name of the scanner class, the return type and name of its "next token" method, etc. A full list of options is listed in the JFlex user manual. You shouldn't need any options other than the ones I've specified in the example below.
A declaration of additional code that is placed into the scanner class. If you need to add fields or helper methods into the scanner class, this is where you put them.
Declarations of regular expression macros, which establish shorthands for commonly-appearing subexpressions, or complex subexpressions that will make more sense if given a name.

Here's a brief example of an options and declarations section:

%class MyScanner        // sets the name of the scanner class to MyScanner
%public                 // makes the scanner class public
%type Token             // sets the return type of the scanner method to Token
%function getNextToken  // sets the name of the scanner method to getNextToken

%{
    // This code is copied into the scanner class verbatim.  Use it to add
    // additional fields and methods to the scanner class, if necessary.

    private int anExtraField;
    
    public int aHelperMethod()
    {
        return anExtraField;
    }
%}

// These are macros for regular expressions, which can be used in the next
// section for building bigger regular expressions.  Why you would want to
// use these is basically the same reason why you want to have named constants
// in any program: for readability and writability!

LineTerminator = \r | \n | \r\n
Whitespace = {LineTerminator} | [ \t\f]

The example above obviously isn't precisely what you want for this assignment, but gives you an idea of what this section is used for.

The lexical rules section is the heart of the scanner. In this section, you specify a regular expression that describes a type of lexeme that you'd like to recognize, along with an action that should be executed whenever a matching lexeme is found. Most of the time, you'll want your actions to contain a return statement in them; that statement will return a value from the scanner method. (In the case of patterns that indicate text that should be ignored, such as whitespace and comments, you may want the action to be empty instead.) It stands to reason, then, that the type of value you return in your action should be compatible with the return type you specified in the options section.

Here are a few examples of lexical rules:

"public"               { return new Token(Token.PUBLIC, yytext()); }
"protected"            { return new Token(Token.PROTECTED, yytext()); }
"private"              { return new Token(Token.PRIVATE, yytext()); }
[a-z]+                 { return new Token(Token.IDENTIFIER, yytext()); }

In the examples above, the action associated with each rule specifies the return of a Token, which is a hypothetical object that represents one token in a programming language. One way to design a scanner is to have it return token/lexeme pairs, which is what this example does; each Token is constructed from a token type (which is assumed to be an int constant defined in the Token class, such as Token.IDENTIFIER) and the lexeme that was recognized. The recognized lexeme can be accessed by calling the yytext() method, as above.

Regular expressions, JFlex-style

The syntax for regular expressions used by JFlex is very similar to the syntax used by many other tools that support regular expressions, such as lex, flex, and grep on Unix and Linux, as well as the scripting language Perl, and also the regular expression package in the Java library. A complete specification can be found in the JFlex user manual, but here's a quick list of the ones you'll probably find to be most useful:

A single character expression, such as x, matches that character.
A character class specifies a range of characters in roughly the way you'd expect. For example, [a-z] specifies all lowercase letters, while [a-zA-Z0-9] specifies all uppercase letters, lowercase letters, and digits.
Character classes can be negated by starting them with a ^ character, but you should be careful with this feature. The character class [^a-x] specifies all characters in the entire character set that are not 'a'..'x', not just 'y' and 'z'.
A string expression, such as "Alex", matches exactly that string.
The alternation ("or") operation is specified by the | character, just as we discussed in class. So, the expression a | b matches either an "a" or a "b".
The concatenation operation is specified by juxtaposing two expressions, just as we discussed in class. Thus, the expression (a | b) [a-z] matches a string that begins with "a" or a "b", followed by any lowercase letter.
The Kleene closure (or "star") operation is specified by the * character, just as we discussed in class. For example, the expression [a-z][a-z]* specifies strings of lowercase letters with at least one letter in them.
As a shorthand for "one or more", the + operator can be used. The expression [a-z]+ is equivalent to [a-z][a-z]*.
As a shorthand for "zero or one", the ? operator can be used. The expression [a-z][a-z]? is equivalent to [a-z] | [a-z][a-z].
Macros are used by surrounding them with curly braces. So, the expression {Whitespace}+ would match one or more characters of whitespace, assuming an appropriately-defined macro called Whitespace.
The . character (appearing outside of a string) matches any character in the character set except a newline. So, the expression a.* matches all strings that begin with a lowercase "a".

There are additional operators, but none that will matter too much for this assignment.

What JFlex does with a script

Given an input script, JFlex generates one corresponding .java file that is primarily made up of a class that implements your scanner. By default, the class has the name Lexer, though you can change its name to anything you want by specifying a value for the %class option in your script. I also advise making the class public, by specifying the %public option as well.

The heart of the scanner class is a method that returns the next token from the input stream. You can control the specification of this method, including its name and return type, by specifying options. Here, again, is an example of the relevant options:

    %type MonkieToken
    %function getNextToken

...which will give the "next token" method the following signature:

    public MonkieToken getNextToken() throws java.io.IOException

You can also control other aspects of this method's signature, including what exceptions it throws and what it does when the end of the input has been reached. If you're interested in more details about that, refer to the JFlex user manual.

The primary reason why you need this kind of control over the signature of the scanning method is that a scanner is just a small piece of a compiler. Its central job is to return the next token in the input whenever asked. The next stage of compilation, the parser, will continually ask the scanner for the next token as it does its work. Making sure that the scanner and parser can integrate together correctly is vitally important, especially if you're using an automated parser generator tool (as we'll be doing later in the quarter).

The scanning method will work its way through the input from wherever it left off previously, looking for an occurrence of one of the patterns that you specified. It will always match the longest occurrence of one of your patterns that it can find. For example, if you've specified these two patterns:

    for        // the "for" keyword
    [a-z]+     // one or more lowercase letters

...if the remaining input begins with format followed by whitespace, JFlex will match format to the second pattern, even though it could have matched for to the first pattern (with mat left over). The reason for this is simple: it's how we expect our programming languages to behave. For example, if we wrote the following statement in a Java program:

    returnx;

...we shouldn't be the slightest bit surprised when we get a compiler error that says something like "Hey, you never declared a variable called returnx!" We would be crazy to expect that the compiler would discover that this statement could really be interpreted to mean "return x". (Some older programming languages, most notably Fortran, treated whitespace as optional in this way, but such rules are now considered to cause more problems than they solve.)

The order in which you specify your patterns also turns out to be important. Some strings will match more than one pattern. With the two patterns above, for and [a-z]+, the string for actually matches both of them. JFlex resolves such a conflict by always choosing the first pattern you specified. So, if you specified your patterns in this order:

    [a-z]+
    for

...the string for would be considered an identifier, rather than the for keyword! Exercise caution when you're ordering your patterns.

Believe it or not, much of the scanning theory that we've discussed in class (and appears in Chapter 2 of your textbook) is embodied within JFlex. A quick tour through the JFlex source code reveals classes like RegExp, DFA, and NFA, and implementations of algorithms for translating a regular expression to an NFA and an NFA to a DFA, including methods that do necessary jobs like adding states and transitions, as well as calculating ε-closures of NFA states.

Creating and using a scanner object

JFlex creates a scanner class with two constructors, one that takes a java.io.Reader as a parameter, and another that takes a java.io.InputStream as a parameter. So, it's easy to create a scanner that reads its input from either a file or System.in by calling the appropriate constructor. Here are two examples of creating a scanner object, assuming a scanner class called MyScanner:

    // creates a scanner that reads its input from a file called "filename"
    MyScanner s1 = new MyScanner(new FileReader("filename"));

    // creates a scanner that reads its input from System.in
    MyScanner s2 = new MyScanner(System.in);

Once your scanner object has been created, you can repeatedly call its scanning method until the end of the input is reached. While there are a few ways to configure your scanner to handle the end of the input, I suggest leaving the default behavior of returning null (assuming that your scanning method returns an object type, such as the class Token that I used in my example) when there is no more input.

I want to know more!

Of course, there's plenty more documentation available about JFlex, if you'd like to read through it. The complete user manual can be found at this link. A copy of it (in a file called manual.html) is also present in the prepared installation that I've provided. The manual also references some academic papers and reference books, if you're really interested in the subject and want more depth.

It can also be educational (and fun, if you like this sort of thing) to look through the code that JFlex generates for your script. I won't claim to have gone through it with a fine-tooth comb, but I have paged through it to get its general flavor, and I've discovered that it's surprisingly clean, makes basically good sense, and even includes comments. For example, I noticed that it uses techniques for table compression like the ones described in section 2.5 of your textbook, to reduce the memory requirement. Good stuff. Looking through it also helps you to understand where the snippets of Java code from the JFlex script actually go, which can come in handy when you're writing your JFlex script.

Using JFlex in the ICS labs

JFlex is already installed on the Windows workstations in the ICS labs. JFlex is primarily driven via the command line, so to use it, start by bringing up a Command Prompt window. On the lab machines, JFlex is not configured to work by default. Each time you start a Command Prompt, you'll need to load its settings by executing these two commands:

cd \opt\ics142
start142

Once this has been done, you can compile a JFlex script (e.g. blah.flex) with the command jflex blah.flex. This will create a Java source file (e.g. Lexer.java). Don't forget to compile this .java file before executing your program!

(Note: On a few of the machines in the lab, the start142.bat file might be called start.bat instead. This is an installation bug that hasn't been eradicated fully, since all the machines aren't always on when it's time to update them. In that case, execute this command instead:

.\start

The dot and the backslash are necessary because start is actually a Windows shell command.)

Installing JFlex at home

In case you'd prefer to do some or all of your work at home, I've prepared an installation of JFlex and CUP (a parser generator tool that we'll be using later in the quarter) for Windows in an easy-to-use Zip archive that's basically the same as the one provided in lab, with a few minor differences. Assuming that you've already installed Java on your system, installation will be a snap!

Expand the Zip archive to whatever folder you're comfortable with. In this folder will be a batch file called start142.bat that (temporarily) sets up JFlex (and CUP) for your use. You'll need to make one change to it before you can use it, though; open the file in your text editor and set the JFLEX_CUP_HOME environment variable to the folder where you extracted the archive. (There's more information about this within the start142.bat file.)

There's one more caveat that I should mention about my prepared installation. It assumes the presence of an environment variable called JAVA_HOME, which contains the path to the base folder in which you installed the Java SDK (e.g. c:\j2sdk1.4.2_03). If you don't have this environment variable already set up on your system, you should set it up globally; many existing Java programs and tools use it for installation and configuration. If you don't remember how to change the value of environment variables, refer to this document about setting up and installing Java.

Once you've got the installation finished, bring up a Command Prompt, change into the folder where you extracted the archive and type the command start142. You should now be ready to use JFlex! Be aware that, just like in the labs, you'll need to run the start142 command every time you bring up a new Command Prompt, as its settings won't "stick."

After the settings are loaded, you can compile a JFlex script (e.g. blah.flex) with the command jflex blah.flex. This will create a Java source file (e.g. Lexer.java). Don't forget to compile this .java file before executing your program!

Of course, if you'd prefer to install JFlex yourself, it's not especially difficult to do if you have some experience installing Java-based programs and you follow the instructions in the documentation. If you'd like to download the normal JFlex distribution, it can be obtained free of charge from jflex.de. Stick to the current "stable" version 1.3.5. Users of Linux, Mac OS X, and other operating systems: you're unfortunately on your own, as I have nothing but Windows machines at home and in my office.

Great, but what am I supposed to do with all this?

Your assignment is to use JFlex to write a scanner for a hypothetical programming language called Monkie2004. A scanner isn't normally a stand-alone component; it is generally integrated into a program that processes input text for some greater purpose, such as compilation or interpretation, or even the finding and replacing of text. For the time being, however, I'd like your scanner to stand alone, meaning that it will need to be surrounded by a driver program that creates a scanner object, breaks an input file (whose name is specified as a command-line parameter) into token/lexeme pairs, and prints the token/lexeme pairs to System.out as it goes along. Only lexemes that matter should be printed, not those which are evident from the token type. This roughly means that you should print the lexeme associated with tokens like identifiers and string literals, but not keywords or operators.

You might be tempted to place the printing of this output into the scanner itself by embedding it into the action associated with each lexical rule. I'm requiring that your scanner return the token/lexeme pair back to the caller (the driver program) and that the caller does this printing. The reason for this is clear from a software engineering perspective: a properly-designed scanner should be functional in many contexts, meaning that it should break the input into its tokens and lexemes, then return the token/lexeme pairs to the caller for further processing, be it parsing, printing, or whatever else. If we were to continue with this Monkie2004 example by building a parser in a subsequent assignment, you would be able to use your scanner with your parser without any modification.

To make it easier to grade the assignments, I have a couple of requirements about naming:

Your JFlex script must appear in a file called monkie.flex.
Your driver program must have its main() method in a public class called Driver, appearing in a file called Driver.java.
You may include as many other .java files as you'd like.
None of your files should use the package directive.

The effect of these requirements is that we'll be able to compile and run your program using the following commands:

    jflex monkie.flex
    javac *.java
    java Driver inputfile.monkie

...where inputfile.monkie is the name of a file containing a Monkie2004 program.

The Monkie2004 language

The Monkie2004 language is a simple imperative-style language with two basic data types (integers and strings), functions with zero or more arguments and one return type, procedures with zero or more arguments and no return type, reasonably straightforward mathematical, relational, and logical operations, and a few kinds of imperative-style statements (e.g. assignment statements, an "if" statement, and one kind of loop). Of course, to build a scanner, it's not necessary to know the complete specification of a programming language (though, in some cases, it helps), so I will stick to a brief example program and a lexical specification (i.e. what each kind of token looks like, without regard for how these tokens are put together to write a Monkie2004 program).

An example program

Here's an example Monkie2004 program, deliberately written with somewhat more verbosity than is necessary, in order to demonstrate more language features:

-- the square function returns the square of some integer
function square(n: integer): integer
[
    if n == 0 then
        Result <- 0        -- the Result "variable" holds the function's
                           -- return value
    else
    [
        var x: integer;    -- declares a variable x
        x <- n * n;
        Result <- x
    ]
]


-- the program procedure is where the program begins its execution,
-- somewhat akin to the main() method in Java
procedure program()
[
    var a, b: integer;
    a <- square(3);
    b <- square(a);
    call print(a);
    var s: string;
    s <- " squared is ";
    call print(s);
    call print(b)
]

Lexical specification of Monkie2004

A Monkie2004 program is made up of a sequence of lexemes, each of which can be classified as being of one particular token type. Unless otherwise specified, the lexeme that should be returned by your scanner is the actual chunk of source text, such as 35 or <-. The types of tokens, and the rules that govern their appearance, are as follows:

Identifiers must begin with an uppercase or lowercase letter, followed (optionally) by any number of uppercase letters, lowercase letters, digits, or underscores ('_'). An identifier may not end with an underscore, nor may it contain two or more consecutive underscores.
An integer literal is a sequence of digits, optionally preceded by a '-' character (indicating a negative number).
A string literal is a sequence of characters within double quotes. Unlike in languages like Java, there are no special escape sequences such as \n. There are a couple of rules that need to be specified about string literals, however:
- An end of line may not occur within a string literal. This is always an error. (You need not handle this error in your scanner, but you may if you wish.)
- If you want to put a double quote character into a string literal, you can do so by putting two double quotes in a row. For example, the string literal "My favorite song is ""Where the Streets Have No Name"" today" is a legal Monkie2004 string literal, which represents the following string of text: My favorite song is "Where the Streets Have No Name" today.
It should be noted that the contents of a string literal should be returned as its lexeme. So, in the example above, the lexeme should be My favorite song is "Where the Streets Have No Name" today, without the beginning and ending double quotes, and with the pairs of double quotes within compressed to one each.
The following keywords are reserved in Monkie2004. Each should be recognized as a separate token type:
- and, call, do, else, function, if, implies, not, or, procedure, ref, then, var, while, xor
The following are characters or character sequences with special meaning, such as operators and parentheses. Each should be recognized as a separate token type:
- Left and right parentheses: ( )
- Left and right brackets: [ ]
- Arithmetic operators: + - * /
- Relational operators: < <= > >= == /=
- The assignment opearator: <-
- String concatenation: &
- Semicolon: ;
- Colon: :
- Comma: ,
Comments begin with two dashes and continue until the end of the line on which they appear. Comments should be ignored by the scanner, meaning that they don't constitute a lexeme.
Whitespace in a Monkie2004 program should be ignored; it does not constitute a lexeme.

Sample input and output

Sample input:

-- the semi-infamous "hello 9" program
procedure program()
[
    var a: integer;
    a <- 3 * 3;
    call print("hello ");   -- notice how comments don't show up in the output
    call print(a)
]

Sample output:

PROCEDURE
IDENTIFIER (program)
LEFT_PAREN
RIGHT_PAREN
LEFT_BRACKET
VAR
IDENTIFIER (a)
COLON
IDENTIFIER (integer)
SEMICOLON
IDENTIFIER (a)
ASSIGNMENT_OP
INTEGER_LITERAL (3)
MULTIPLICATION_OP
INTEGER_LITERAL (3)
SEMICOLON
CALL
IDENTIFIER (print)
LEFT_PAREN
STRING_LITERAL (hello )
RIGHT_PAREN
SEMICOLON
CALL
IDENTIFIER (print)
LEFT_PAREN
IDENTIFIER (a)
RIGHT_PAREN
RIGHT_BRACKET

Your program's output need not be exactly like mine, but it should convey the same message; on each line, it should show one token and its associated lexeme.

Naturally, it's necessary for your program to work with more than just the sample input. We'll be testing it with several input programs that, together, test all of the lexical specification as described in the previous section. You need not provide error handling; you may assume that the input file consists of a legal Monkie2004 program.

Deliverables

Submit your JFlex script and the .java files that comprise your driver program. You need not submit the .java file created by JFlex; we'll be regenerating it ourselves during the grading process anyway. Remember that the name of your JFlex script must be monkie.flex, and that your driver program's main() method must appear in a file called Driver.java!

Follow this link for a discussion of how to submit your assignment. Remember that we do not accept paper submissions of your assignments, nor do we accept them via email under any circumstances.

Limitations

There are a few JFlex features that I would prefer that you didn't use in your assignment. If you don't know what the features on this list are, there's no need for you to find out about them unless you're curious.

You may not use yybegin, yystate, or lexical states in your set of lexical rules.
You may not use yypushback.
You may not use lookahead (i.e. trailing context).
You may not use the pre-existing shorthand character classes, such as [:letter:], [:uppercase:], or [:digit:].
You may use line, column, and character counting if you wish, though it does not serve any purpose in this assignment.