(Last modified Thu Jun 05 23:53 2008)

home

Software Formalisms
Grammars and Languages

A grammar consists of terminals, nonterminals, and rules.  Each grammar has a nonterminal that is its start symbol.  A grammar defines a language (a set of strings). 

Each grammar corresponds to a language, the set of strings it generates or recognizes.  A grammar can be used to generate the strings of its language (going from its start symbol to strings of nonterminals), or to parse strings of its language (going from strings of nonterminals to grammatical structures of those strings). 

There are several kinds of grammars, among which are: 

The example will be the context-free language of propositional logic.  Here is its grammar: 

formula::=T "True" is a formula
 |F "False" is a formula
 |variable A variable is a formula
 |negation The negation of a formula is a formula
 |conjunction The conjunction of two formulae is a formula
 |disjunction The disjunction of two formulae is a formula
variable::=A The letter 'A' is a variable
 |B The letter 'B' is a variable
 |C The letter 'C' is a variable
negation::=¬formula Negate a formula by putting ¬ before it
conjunction::=formulaformula Conjoin two formulae by putting ∧ between them
disjunction::=formulaformula Disjoin two formulae by putting ∨ between them
parenthesization::=(formula) Parenthesize a formula by putting () around it

Its start symbol is formula

Terminal

The nonterminals of a grammar are the symbols that are the alphabet of strings in the language the grammar defines. 

Example:  the characters '(', ')', '¬', '∧', '&or', and the letters A through C. 

Nonterminal

A nonterminal represents a grammatical category in the language. 

Example:  the symbols variable, negation, conjunction, disjunction, parenthesization, and formula

Rule

A rule shows how to replace a nonterminal and (eventually) produce a string in the language. 

Looking at it from the opposite direction, a rule shows how to parse a string in the language and identify its grammatical form. 

Context-free rules

A (context-free) rule consists of a nonterminal (the left-hand side or LHS), followed by "::=", followed by the string of terminals and nonterminals (the right-hand side or RHS). 

Sometimes "→" is used instead of "::=", especially when the rules are being used as productions. 

If there are two or more rules for the same LHS, we often list them successively with '|' for the second or later rules instead of "LHS ::=". 

Example:  Two rules

formula::=T
 |F (Note the | for the second rule for the same LHS)

Using these as productions, we start with the nonterminal formula (the LHS) and replace it first by the first rule's RHS to get "T", and then by the second rule's RHS to get "F".  This gives us a finite set { "T", "F" }. 

Example:  Also consider the rule

formula::=¬formula

This could go on forever, and we've hardly even used rule 2 yet.  These three rules give us an infinite set of formulae { "T", "F", "¬T", "¬F", "¬¬T", "¬¬F", ... "¬¬¬¬F", ... , "¬¬¬¬¬¬¬¬¬T", ... }

AST for !A&B

Figure 1.  Abstract syntax tree for ¬A∧B

We can also use a grammar's rules to parse a string and identify its grammatical structure.  Using the full set of rules at the top of the page, we can parse "¬A∧B" thus: 

  1. "¬A∧B" is the conjunction of "¬A" and "B". 
  2. "¬A" is the negation of "A". 
  3. "A" is a variable. 
  4. "B" is a variable. 

We can't go any further because "A" and "B" are terminals (neither of them is not the LHS of any rule).  "A" terminates its branch of the parse tree (thus "terminal"), and "B" terminates its branch.  The structure is shown graphically in Figure 1 as an abstract syntax tree.  The leaves of the tree are the terminals, and the interior nodes are the syntactic categories. 

Note that we apply the weaker-binding rule first (conjunction), and the stronger-binding rule later (negation). 

AST for !(A&B)

Figure 2.  Abstract syntax tree for ¬(A∧B)

Similarly, we can parse "¬(A∧B)", a similar string with a different grammatical structure: 

  1. "¬(A∧B)" is the negation of "(A∧B)". 
  2. "(A∧B)" is the parenthesization of "A∧B". 
  3. "A∧B" is the conjunction of "A" and "B". 
  4. "A" is a variable. 
  5. "B" is a variable. 

We can of course produce ¬(A∧B) by beginning with the start symbol formula and applying the rules as productions:

Types of grammars

There are four principal classes of formal grammars:  regular, context-free, context-sensitive, and unrestricted.  Each is more powerful than the preceding kinds.  We usually use "grammar" to mean "context-free grammar". 

Regular grammars

The rules of a regular grammar are restricted to contain a single nonterminal as the LHS and one of the following on the RHS:

The regular grammars are equivalent to the regular expressions, and there is an algorithm to translate any regular grammar into the equivalent regular expression (and vice versa). 

Context-free grammars

The rules of a context-free grammar are described above.  Each rule has a single nonterminal as its LHS and a (possibly empty) string of nonterminals and terminals as its RHS. 

Context-sensitive grammars

The rules of a context-sensitive grammar show how a nonterminal is expanded in a specific context.  Each rule has a single nonterminal N surrounded by context α and β as its LHS, so that the LHS is αNβ, and a non-empty string γ of nonterminals and terminals in the same context as its RHS, so that the RHS is αγβ. 

Most definitions of CSGs allow a production whose RHS is the empty string so that the context-free languages are a subset of the context-sensitive languages (see below). 

Unrestricted grammars

Both the LHSs and RHSs of unrestricted grammars may contain any string of nonterminals and terminals. 

The Chomsky hierarchy, and equivalence of grammars and automata

Chomsky hierarchy

Figure 3.  The Chomsky hierarchy

The Chomsky hierarchy is the containment hierarchy of the four classes of languages defined by these types of grammars.  It relates:

Figure 3 shows the containment hierarchy of these sets of languages. 

Each grammar of the four language classes is equivalent to an automaton, and an algorithm exists to convert a grammar to an automaton (and vice versa): 

Grammar typeAutomaton type
Regular grammar Finite state machine
Context-free grammar Pushdown automaton
Context-sensitive grammar Linear bounded automaton
Unurestricted grammar Turing machine

References

John E. Hopcroft and Jeffrey D. Ullman.  Introduction to automata theory, languages, and computation.  Addison-Wesley, 1979.  (This is the classic text.) 

John C. Martin, Introduction to languages and the theory of computation.  McGraw-Hill, 1997.  (This book is more recent and more accessible.) 

Share-Alike Made with jEdit Valid CSS! Valid HTML 4.01! UC Irvine Thomas A. Alspaugh
Assistant Professor, Informatics Dept.
School of Information and Computer Sciences