Regular Expressions

Introduction:

Regular Expressions are patterns that we can specify (as strings) and use to
search for and replace text in other strings (often the sequence of strings that
makes up the lines in a file). Python (as well as many other languages) includes
a module to perform various operations on regular expressions. In these
lectures, we will cover the form of regular expressions (in the first lecture),
and then what functions/methods can take regular expressions as arguments, and
how to use the results of matching regular expressions against text, including
the concept of capturing groups (in the second lecture).

In the first lecture we will discuss the components of regular expression
patterns themselves. We will discuss each component individually, and the ways
to combine these components into more complicated regular expressions (just as
we studied the syntax of a few simple control structures in Python, which we can
combine into more complicated control structures). We will cover many, but not
all, of the patterns usable in regular expressions. There are entire books
written on regular expressions: on Amazon I found over 15 books with "Regular
Expressions" in their title.

In the second lecture we will examine various functions and methods, from a
Python module, which take regular expressions as arguments. Typically they
match a regular expression pattern string against some text string, and return
information about whether or not the match succeeded, and what parts of the
pattern matched which parts of the text.

Python's module for doing these operations is named re. There is a special
tester module that accompanies these lectures, which you can download and run
to experiment with regular expressions and learn how they match text strings.
We will also look at an online resource for learning and testing regular
expressions, which is simpler to use.

For more complete information about regular expressions, see Section 6.2 of the
Python Standard Library.

------------------------------------------------------------------------------

				Lecture 1


General Rule for Matching Regular Expressions Patterns against Text
  (both are represented as strings):

  Regular expression patterns match as many characters as possible in the
  text; this is called a greedy match.
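
  As a minimal sketch of greedy matching (it uses re.match, which the second
  lecture covers, and the text 'aXbYb' is just an illustrative choice), the .*
  below consumes as much as it can while still letting the final b match:

```python
import re

# .* is greedy: it grabs as many characters as it can while still
# letting the rest of the pattern (the final b) match.
m = re.match(r'a.*b', 'aXbYb')
greedy_text = m.group(0)   # text matched by the whole pattern
print(greedy_text)         # aXbYb (not the shorter aXb)
```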

  There are ways to specify patterns that match the fewest number of characters
  possible: although we will mention these, we will not discuss nor use those
  "non-greedy" specifications.


Matching:
  Characters generally match themselves, except for the following...

Metacharacters
 .	Matches any single character (except newline: \n)
 []	Matches one character specified inside square brackets []; e.g., [aeiou]
 -      Matches one character in range inside []: e.g., [0-9] matches any digit
        To match any letter (upper/lower case) or digit we write [A-Za-z0-9]
 [^]	Matches any one character NOT specified inside [^]; e.g., [^aeiou]
        We can use ranges here too, to specify the characters NOT to match

Anchors (these don't match characters, but match positions in a string)
 ^	matches beginning of line (except when used in [^])
 $	matches end of line (except when used in [] or [^])

To start, in all of the patterns that we experiment with/use, we will specify
^ as the first character and $ as the last. This will require a full match of 
the pattern and text, from the first to last character in both (much like how
we discussed matching EBNF rules to symbols). Later, in a few cases, we will
deviate from this convention.
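
Here is a small sketch of these metacharacters with both anchors written
explicitly. The helper name full_match is our own (hypothetical), and re.match
is described in the second lecture.

```python
import re

def full_match(pattern, text):
    """Return True if pattern (written as ^...$) matches all of text."""
    return re.match(pattern, text) is not None

vowel      = full_match(r'^[aeiou]$', 'e')      # e is one of the listed vowels
not_vowel  = full_match(r'^[^aeiou]$', 'e')     # e is excluded by [^aeiou]
in_range   = full_match(r'^[A-Za-z0-9]$', '7')  # 7 is in the range 0-9
any_one    = full_match(r'^.$', 'x')            # . matches any one character
too_long   = full_match(r'^.$', 'xy')           # $ requires the text to end

print(vowel, not_vowel, in_range, any_one, too_long)
# True False True True False
```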

Patterns: Assume R, Ra, Rb represent regular expression patterns (like the ones
          above, and also including the ones built below), here are the more
          complicated patterns that we can build from these.

 RaRb	Matches a sequence (one after the other) of Ra followed by Rb

 Ra|Rb	Matches either alternative Ra or Rb (just like | in EBNF)

 R?	Matches regular expression R 0/1 time: R is optional (like [] in EBNF)

 R*	Matches regular expression R 0 or more times (like {} in EBNF)

 R+	Matches regular expression R 1 or more times (note */+ difference)

 R{m}	Matches regular expression R exactly m times: e.g., R{5} = RRRRR

 R{m,n}	Matches regular expression R at least m and at most n times:
          R{3,5} = RRR|RRRR|RRRRR = RRRR?R?
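
The repetition operators above can be spot-checked with a short sketch. The
helper name matches and the sample texts are our own choices, not part of the
course materials.

```python
import re

def matches(pattern, text):
    """Return True if pattern matches text starting at its first character."""
    return re.match(pattern, text) is not None

# ? makes b optional, * allows any number of b's, + requires at least one
optional = [t for t in ['ac', 'abc', 'abbc'] if matches(r'^ab?c$', t)]
star     = [t for t in ['ac', 'abc', 'abbc'] if matches(r'^ab*c$', t)]
plus     = [t for t in ['ac', 'abc', 'abbc'] if matches(r'^ab+c$', t)]
print(optional)   # ['ac', 'abc']
print(star)       # ['ac', 'abc', 'abbc']
print(plus)       # ['abc', 'abbc']

# a{3,5} and aaaa?a? accept exactly the same strings of a's
texts = ['a' * n for n in range(8)]
lengths = [len(t) for t in texts if matches(r'^a{3,5}$', t)]
same = all(matches(r'^a{3,5}$', t) == matches(r'^aaaa?a?$', t) for t in texts)
print(lengths, same)   # [3, 4, 5] True
```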

----- Side-note
Included only in this side-note, but NOT USED in this lecture/course, are the
patterns R??, R*?, R+?, and R{m,n}?. The postfix ? means match as FEW characters
as possible (not the most: so not greedy). We will NOT USE or need to understand
these patterns.
-----

Parenthesized Patterns

Parentheses are used for 
   1) Grouping (as they are in arithmetic expressions/formulas)
   2) Remembering the text matching sub-patterns (called a "Capturing Group").
      These capturing groups are mostly relevant when we examine the functions
        in the Python re module.

By placing sub-pattern R in parentheses, the text matching R will be remembered
(either by its number, starting at 1, or its name, if named: see ?P below) in a
"capturing group" which we mostly refer to just as a "group". We can use these
groups when extracting information from the matched text, when we call re
functions.

 (R)	     Matches R and delimits a group (1, 2, ...) (remembers/captures
               matched text in a group). The rules below show how capturing
               groups are numbered: they are simple.

 (?P<name>R) Matches R and remembers/captures matched text in a group
               using name for the group (it is still numbered as well); see
               the groupdict method below (2nd lecture) for uses of "name".
             An example is (?P<id>[A-Za-z0-9]+) which matches any sequence of
               upper and lower-case letters and digits with the capture group
               named id.

 (?:R)       Matches R but does not remember/capture matched text in a capturing
               group.
             So here (?:R) is used only for grouping regular expression R, not
               specifying a capturing group. So, ?: is useful when we want only
               important/capturing groups to be numbered (and not others). It
               takes extra space to specify.
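
A brief sketch of named and non-capturing groups follows. It relies on the
match-object methods group and groups from the second lecture, and the sample
texts are made up for illustration.

```python
import re

# A named group is still numbered, so both lookups return the same text
m = re.match(r'^(?P<id>[A-Za-z0-9]+)$', 'abc123')
by_number = m.group(1)
by_name = m.group('id')
print(by_number, by_name)   # abc123 abc123

# (?:ab) groups without capturing, so (\d+) is the only numbered group
m2 = re.match(r'^(?:ab)+(\d+)$', 'ababab42')
captured = m2.groups()
print(captured)             # ('42',)
```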

-----Side-note
Included only in this side-note, but NOT USED in this lecture/course

 (?P=name)   Matches remembered text with name (for back-referencing which text)

 (?=R)	     Matches R, doesn't remember matched text or consume text matched
             For example (?=abc)(.*) matches abcxyz with group 1 'abcxyz';
               it doesn't match abxy because this text doesn't start with abc

 (?!R)       Matches anything but R and does not consume input needed for match
               (hint: != means "not equal", ?!R means "not matching R")
-----

Computing the group number of a sub-pattern for extracting its matching text:

  Each new left parenthesis starts a new group (unless it is "(?:...)"). Some
  groups are sequential (one after another); some groups are nested (one inside
  another). Here is an example

                         ( (  ) (  ) ( ( (  ) ) ) )
  Where groups start/end 1 2  2 3  3 4 5 6  6 5 4 1     
  Group 1                (------------------------)
  Group 2                  (--)
  Group 3                       (--)
  Group 4                            (----------)
  Group 5                              (------)
  Group 6                                (--)

Capturing Group 0 (in Python) is considered the entire regular expression, even
when it is not in any parentheses. So, in the pattern "^a(b(c.d)+)?(e)$"
  Capturing Group 0 is a(b(c.d)+)?(e)
  Capturing Group 1 is (b(c.d)+)
  Capturing Group 2 is (c.d)
  Capturing Group 3 is (e)

If we match this pattern against the string "abc1dc2de" then 
  Capturing Group 0 is "abc1dc2de": the entire string matches
  Capturing Group 1 is "bc1dc2d": everything in between a and (e)
  Capturing Group 2 is "c2d", the last characters matched in this repetition
  Capturing Group 3 is "e", from the last Capturing Group

Note that "ae" matches too, 
  Capturing Group 0 is "ae": the entire string matches
  No Capturing Group 1 (or 2 inside 1): ? ignores Capturing Group 1
  Capturing Group 3 is "e", from the last Capturing Group

The Regular Expressions website we use displays Group 0 as "Full Match".
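
The two matches described above can be reproduced with the following sketch
(the variable names are ours; group and groups are covered in the second
lecture).

```python
import re

pattern = r'^a(b(c.d)+)?(e)$'

m_long = re.match(pattern, 'abc1dc2de')
long_groups = (m_long.group(0), m_long.group(1), m_long.group(2), m_long.group(3))
print(long_groups)    # ('abc1dc2de', 'bc1dc2d', 'c2d', 'e')

# With 'ae' the optional group 1 (and group 2 inside it) is skipped
m_short = re.match(pattern, 'ae')
short_groups = m_short.groups()
print(short_groups)   # (None, None, 'e')
```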

----------
Notes on Repetition and Option and Capture Groups:

There is a difference between the regular expression ([ab]+) and ([ab])+ in
terms of what their capture groups capture.

  re.match('([ab]+)', 'aabba') has capture group 1 'aabba' 

  re.match('([ab])+', 'aabba') has capture group 1 'a' 

In the first example, + is INSIDE the group/parentheses so the group contains
all the characters matched in the repetition. In the second the + is OUTSIDE the
group/parentheses so the group contains only the character matched LAST in the
repetition. Both patterns match the same symbols, but yield different capture
groups.  The * operator is treated similarly.

Also, remember for (a?) the capture group is either 'a' or ''; and for (a)?
the capture group is either 'a' or None, meaning the capture group was
discarded so there is no captured text, not even ''; while for (a*) the capture
group is an empty string, a string with one a, a string with two a's, etc.
----------
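
A short sketch of the notes above: it compares a repetition or option written
inside a group with one written outside it. The variable names and the text 'b'
used for the optional cases are our own choices.

```python
import re

# + inside the group: the group holds everything the repetition matched
inside = re.match(r'([ab]+)', 'aabba').group(1)
# + outside the group: the group holds only the last character matched
outside = re.match(r'([ab])+', 'aabba').group(1)
print(inside, outside)            # aabba a

# ? inside the group gives '', ? outside the group gives None
opt_inside = re.match(r'(a?)', 'b').group(1)
opt_outside = re.match(r'(a)?', 'b').group(1)
print(repr(opt_inside), opt_outside)   # '' None
```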

Context
  1)  - matches itself if not in [] between two characters: e.g., [-+] vs. [a-z]
  2) Special characters (except \) are treated literally in []: e.g., [.]
       matches only ".". To match the \ special character write [\\]. Writing
       [a-z] matches all lower-case letters; [a\-z] matches a, -, or z.
  3) Generally, if interpreting a character makes no sense one way, try to find
       another way to interpret it that fits the context

Escape Characters with Special Meanings
Symbol  Meaning
 \	Used before .|[]-?*+{}()^$\ (and others) to specify a special character
        The meaning of \. is the same as [.] (see rule 2 above)

 \t	tab
 \n	newline
 \r	carriage return
 \f	formfeed
 \v	vertical tab

 \d	[0-9]			Digit
 \D	[^0-9]			non-Digit
 \s	[ \t\n\r\f\v]		White space
 \S	[^ \t\n\r\f\v]		non-White space
 \w	[a-zA-Z0-9_]		alphanumeric(or underscore): used in identifiers
 \W	[^a-zA-Z0-9_]		non alphanumeric

Interesting Equivalences
 a+         == aa*
 a(b|c|d)e  == a[bcd]e      only if b, c, and d are single characters
 R{0,1}     == R?           0 or 1 times means the same as optional


Hints on Using | (a low-precedence operator for Regular Expressions)

In Python, we know that writing a*b+c*d performs * before +: we say * has higher
precedence than +, so it is performed earlier. We could be explicit and write
this as (a*b)+(c*d). If we wanted to add BEFORE multiply, we must use
parentheses: a*(b+c)*d. Think of REs as having sequence as an operation (it is
implicit, with no operator written between the regular expressions in the
sequence).

The sequence precedence is lower than the precedence of all postfix operators
(like ?, *, +, and {}): e.g., ab* has the same meaning as a(b*). The | operator
has the lowest precedence of all (even lower than implicit sequencing). So
writing ab|cd is the equivalent of writing  (ab)|(cd), which matches either ab
or cd only: the pattern does the sequencing BEFORE the | operator.

Now, given this understanding, look at what ^a|b$ means. By the above, it means
the same as (^a)|(b$). (Type ^a|b$ into the online tool and read its
Explanation.)
Note that the ^ anchor applies only to a, and the $ anchor applies only to b
(see its parenthesized equivalent). So ^a|b$, which is equivalent to (^a)|(b$),
will match (using the online tool)

  1) any text starting with an a: a or aa or aaab or abcda ($ is not part of it)
  2) any text ending   with a  b: b or cb or ccbb or abcdb (^ is not part of it)

To avoid confusion, I strongly recommend ALWAYS WRITING ALL THE ALTERNATIVES IN
A REGULAR EXPRESSION AS A GROUP IN PARENTHESES: ^(a|b)$ to ensure that the |
applies only to the alternatives inside the ()s. This regular expression is a
sequence of 3 regular expressions: the ^ anchor, followed by a choice of a or
b, and finally followed by the $ anchor.

So sometimes we use () to ensure that our regular expression is correct, and in
the process, we automatically introduce a new numbered (capturing) group. Of
course we can write this as ^(?:a|b)$ to ensure that no new Capturing Group
number is created by our grouping.
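
The difference between the two patterns can be seen in a short sketch. It uses
re.search because that is what the online tool uses (see the second lecture);
the sample texts are our own.

```python
import re

loose = r'^a|b$'      # means (^a)|(b$)
strict = r'^(a|b)$'   # exactly one character: a or b

texts = ['a', 'b', 'aaab', 'ccbb', 'cabd', 'xyz']
loose_hits = [t for t in texts if re.search(loose, t)]
strict_hits = [t for t in texts if re.search(strict, t)]

print(loose_hits)    # ['a', 'b', 'aaab', 'ccbb']
print(strict_hits)   # ['a', 'b']
```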



Problems:

Write the smallest pattern that matches the required characters. Check your
patterns with the Regular Expression Tester (see the Sample Programs link) to
ensure they match correct exemplars and don't match incorrect ones. Note that
for a match, group #0 should include all the required characters.

1. Write a regular expression pattern that matches the strings Jul 4, July 4,
   Jul 4th, July 4th, Jul fourth, July fourth, Jul Fourth, and July Fourth.
   Hint: my RE pattern was 24 characters.

2. Write a regular expression pattern that matches strings representing times on
   a 12 hour clock. An example time is  5:09am or 11:23pm. Allow only times that
   are legal (not 1:73pm nor 13:02pm)
   Hint: my RE pattern was 32 characters.

3. Write a regular expression pattern that matches strings representing phone
   numbers of the following form.

   Normal: a three digit exchange, followed by a dash, followed by a four digit
           number: e.g., 555-1212

   Long Distance: a 1, followed by a dash, followed by a three digit area code
           enclosed in parentheses, followed by a three digit exchange,
           followed by a dash, followed by a four digit number: e.g.,
           1-(800)555-1212

   Interoffice: a single digit followed by a dash followed by a four digit
            number: e.g., 8-2404.

   Hint: my RE pattern was 30 characters; note that you must use \( and \) to
   match parentheses.

4. Write a regular expression pattern that matches strings representing simple
   integers: optional + or - signs followed by one or more digits.
   Hint: my RE pattern was 8 characters.

5. Write a regular expression pattern that matches strings representing
   normalized integers (each number is either an unsigned 0 or is unsigned or
   signed and starts with a non-0 digit) with commas in only the correct
   positions
   Hint: my RE pattern was 30 characters.

6. Write a regular expression pattern that matches strings representing float
   values. They are unsigned or signed (but not normalized: see 5) and any
   number of digits before or after a decimal point (but there must be at least
   one digit either before or after a decimal point: e.g., just . is not
   allowed) followed by an optional e or E followed by an unsigned or signed
   integer (again not normalized).
   Hint: my RE pattern was 36 characters.

7. Write a regular expression pattern that matches strings representing trains.
   A single letter stands for each kind of car in a train: Engine, Caboose,
   Boxcar, Passenger car, and Dining car. There are four rules specifying how
   to form trains.
     1. One or more Engines appear at the front; one Caboose at the end.
        All other cars must come between the last Engine and the Caboose.
     2. Boxcars always come in adjacent pairs: BB, BBBB, etc.
     3. There cannot be more than four Passenger cars in a series.
     4. Each series of Passenger cars must be followed by a Dining car, and
        Dining cars can appear only after a Passenger car.
   These cars cannot appear anywhere other than these locations. Here are
   some legal and illegal exemplars.

     EC              Legal  : the smallest train
     EEEPPDBBPDBBBBC Legal  : simple train showing all the cars
     EEBB            Illegal: no caboose (everything else OK)
     EBBBC           Illegal: three boxcars in a row
     EEPPPPPDBBC     Illegal: more than four passenger cars in a row
     EEPPBBC         Illegal: no dining car after passenger cars
     EEBBDC          Illegal: dining car after box car
   Hint: my RE pattern was 16 characters.

------------------------------------------------------------------------------

				Lecture 2

Generally, the functions discussed in this lecture operate on a regular
expression pattern (specified by a string) and text (also specified by a
string). These functions produce information (capture groups: see parenthesized
patterns above) related to attempting to match the pattern and text: which parts
of the text matched which parts of the pattern.

We can use the compile function to compile a pattern (producing a regex), and 
then call methods on that regex directly, as an object to perform the same
operations as the functions, but more efficiently if the pattern is to be used
repeatedly (since the pattern is compiled into the regex once, not in each
function call).

We will omit discussing/using the [,flags] option in this discussion, but see
section 6.2 of the Python Library Documentation for a discussion of A/ASCII,
DEBUG, I/IGNORECASE, L/LOCALE, M/MULTILINE, S/DOTALL, and X/VERBOSE.

----------

A (not too) Simple but Illustrative Example
  (the details of HOW this works follow in later sections):

In this example (representative of what we do with regular expressions) we 
  1) define a regular expression
  2) call a re function (match) on it and some text
  3) check whether the pattern matches the text
  4) do something with (print) the captured groups (by number or name)

A 2nd version is shown further down, using the re.compile function and calling
.match on the object it returns.

phone = r'^(?:\((\d{3})\))?(\d{3})[-.](\d{4})$'   # assumes import re was done
m = re.match(phone,'(949)824-2704')
assert m is not None, 'No match'
print(m.groups())
area, exchange, number = [int(i) if i is not None else None for i in m.group(1,2,3)]
print(area, exchange, number)

1) Here, phone is a  pattern anchored at both ends (by ^ and $ respectively).

(a) r'...' is a raw string; we typically use them to specify patterns

    It starts with ^(?:\((\d{3})\))?
    controlling an optional area code. The ?: means that the parentheses are not
    used to create a group, but are used with the ? (postfix option) symbol.

    Inside it is \((\d{3})\): a left parenthesis \(, group 1 which consists of
    any 3 digits,  and a right parenthesis \).

(b) Next is (\d{3}) group 2, which consists of any 3 digits. 

(c) Next is [-.] that is one symbol, either a - or . (not in a group).

(d) Next is (\d{4}) group 3, which consists of any 4 digits. 

2) Calling the re.match function matches the pattern against some text; it
returns a match object that is bound to m.

3) If m is None, there is no match (the assert raises an AssertionError
exception).

4) Prints the groups, converts every non-None string from groups 1, 2, and 3
into an int, and prints the converted values.

Try also replacing line 2 by 

  m = re.match(phone,'824-2704')		# area is None
  m = re.match(phone,'(949)824:2704')		# : instead of - or .; no match
  m = re.match(phone,'(94)824-2704')		# only 2 in area code; no match

Also, we can replace the first two lines by the following equivalent lines

phone_pat = re.compile(r'^(?:\((\d{3})\))?(\d{3})[-.](\d{4})$')
m = phone_pat.match('(949)824-2704')

In this example, because match is called only once, there is no speed 
improvement by compiling the pattern. But if we called match on multiple text
strings (e.g., every string read from a file), then using compile would be more
efficient. The compiling feature is discussed below.

----------

Regular Expression (re) functions:
  called like re.match(...) the module name prefaces the function

Returns a regex (compiled pattern) object (see calling methods on regex below)
  compile   (pattern [,flags])		Creates compiled pattern object

Returns a match object (or None if there is no match), with groups 0,1,...
  match	    (pattern, text [,flags])	Matches start at the text's beginning
  search    (pattern, text [,flags])	Matches can start anywhere in the text

---The online regular expression web page uses SEARCH to do its matching, which
---is why we wrote our patterns as ^....$, to ensure matching started at the
---beginning and ended at the end of each line.

  re.match ("(a+)b","aaab") matches; re.match ("(a+)b","xaaab") doesn't match
  re.search("(a+)b","aaab") matches; re.search("(a+)b","xaaab") matches
    by writing the patterns like ^...$, these functions produce the same results
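
A runnable version of the comparison above is sketched here (the variable names
are ours).

```python
import re

pattern = r'(a+)b'
match_plain = re.match(pattern, 'aaab') is not None     # match starts at index 0
match_prefix = re.match(pattern, 'xaaab') is not None   # x blocks a match at index 0
found = re.search(pattern, 'xaaab')                     # search skips past the x
print(match_plain, match_prefix, found.group(1))        # True False aaa

# Anchored with ^...$, match and search agree
anchored = r'^(a+)b$'
anchored_match = re.match(anchored, 'xaaab')
anchored_search = re.search(anchored, 'xaaab')
print(anchored_match, anchored_search)                  # None None
```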

Returns a list of strings: much like calling text.split(...)
  split     (pattern, text [,maxsplit, flags]) like the text.split(...) method,
               but using a regular expression pattern to determine how to split
               the text: re.split('[.-]', 'a.b-c')  returns ['a','b','c'],
               splitting on either . (a period; inside [] . does not mean "any
               character") or -. The standard string split method,
               text.split(...), can't split on EITHER character; note that
               'a.b-c'.split('.-') splits only on '.-' (a . followed by a -),
               so in this case it fails to split anywhere, since '.-' is not
               anywhere in the text at all. Note we can also write
               re.split(r'(?:\.|-)', 'a.b-c') which also returns ['a','b','c']
               (we must write \. here because . would mean any character).
            If the pattern has groups (this one uses ?: so doesn't), then the
              text matching each group is included in the resulting list too:
              use ?: to avoid these groups.
            So re.split(';+'  ,'abc;d;;e') returns ['abc', 'd', 'e'] and
               re.split('(;+)','abc;d;;e') returns ['abc', ';', 'd', ';;', 'e'] 
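
The split examples above, collected into one runnable sketch. The last call,
which uses maxsplit on a made-up text, is our own addition.

```python
import re

either = re.split(r'[.-]', 'a.b-c')           # split on . or -
plain = 'a.b-c'.split('.-')                   # str.split needs the exact text '.-'
grouped = re.split(r'(?:\.|-)', 'a.b-c')      # same split, written with |
no_group = re.split(r';+', 'abc;d;;e')        # separators are dropped
with_group = re.split(r'(;+)', 'abc;d;;e')    # separators are kept (they are a group)
limited = re.split(r';', 'a;b;c;d', maxsplit=2)   # at most 2 splits

print(either)       # ['a', 'b', 'c']
print(plain)        # ['a.b-c']
print(grouped)      # ['a', 'b', 'c']
print(no_group)     # ['abc', 'd', 'e']
print(with_group)   # ['abc', ';', 'd', ';;', 'e']
print(limited)      # ['a', 'b', 'c;d']
```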

Returns a string
  sub       (pattern, repl, text [,count, flags])
             If there is a match between pattern and text, build a string that
             either
               a) replaces each match of pattern by repl, which is a string
                   that can refer to matched groups via \# (e.g. \1) or \g<#>
                   (e.g., \g<1>), or \g<name> (where name comes from ?P<name>)
               b) replaces each match of pattern by the result returned by
                  CALLING repl, which is a function that is passed a match
                  object as its argument
             If there is no match, then it just returns the text parameter's
               value, unchanged
            re.sub('(a+)','{as}','aabcaaadaf') returns     {as}bc{as}d{as}f
            re.sub('(a+)',r'(\g<1>)','aabcaaadaf') returns (aa)bc(aaa)d(a)f
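
Both forms of repl (a string and a function) are sketched below. The function
double, the group names first and last, and the sample texts are hypothetical
choices made for this illustration.

```python
import re

def double(m):
    """repl function: receives a match object, returns the replacement text."""
    return str(int(m.group(0)) * 2)

# b) repl is a function called once per match
doubled = re.sub(r'\d+', double, 'buy 3 apples and 12 pears')
print(doubled)    # buy 6 apples and 24 pears

# a) repl is a string that refers to named groups with \g<name>
swapped = re.sub(r'(?P<first>\w+) (?P<last>\w+)', r'\g<last>, \g<first>',
                 'Ada Lovelace')
print(swapped)    # Lovelace, Ada

# count limits how many matches are replaced
limited = re.sub(r'a+', '{as}', 'aabcaaadaf', count=1)
print(limited)    # {as}bcaaadaf
```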

----- Side-note
Included only in this side-note, but NOT USED in this lecture/course, are the
following functions.

Returns a list of string/of tuples of string (the groups), specifying matches
  findall   (pattern, text [,flags])	Matches can start anywhere in the text;
              the next attempted match starts where the previous match
              terminates.
            If the pattern has one group, the list contains the text matching
              that group; if it has several groups, the list contains tuples
              of the text matching each group: use ?: to avoid these groups
            re.findall('a*b','abaabcbdabc') returns ['ab', 'aab', 'b', 'ab']
            re.findall('((a*)(b))','abaabcbdabc') returns 
              [('ab','a','b'), ('aab','aa','b'), ('b','','b'), ('ab','a','b')]

Returns an iterator of match objects, one for each findall match (ignore it)
  finditer  (pattern, text [,flags])	Returns iterator equivalent of findall

  subn      same as sub but returns a tuple: (new string, number of subs made)
  escape    (string): string with its special pattern characters back-slashed
-----

In findall and sub/subn, only non-overlapping matches are found/replaced:
in text aaaa there are two non-overlapping occurrences of the pattern aa:
starting at indexes 0 and 2 (not at index 1, which overlaps with the previous
match in indexes 0-1).
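
A quick sketch of this non-overlapping behavior (the five-a text is an extra
example of ours; finditer is used only to show where the matches start):

```python
import re

pairs = re.findall(r'aa', 'aaaa')
starts = [m.start() for m in re.finditer(r'aa', 'aaaa')]
print(pairs, starts)     # ['aa', 'aa'] [0, 2]

# With five a's, the final a is left over: it cannot pair with an a already used
leftover = re.sub(r'aa', 'X', 'aaaaa')
print(leftover)          # XXa
```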

------------------------------------------------------------------------------

regex (compiled pattern) object methods (see the compile function above, which
produces regexes) are called on the object produced by c = re.compile(p). It is
then efficient to call c.match(...) many times. Calling re.match(p,...) many
times with the same pattern must re-compile the pattern (or at least look it up
in the re module's internal cache of recently used patterns) each time re.match
is called; whereas c = re.compile(p) compiles the pattern ONCE and c.match(...)
just has to match it against text each time that it is called.

Using this feature allows us to compile a pattern and reuse it for all the
operations above: re.match(p,s) is just like re.compile(p).match(s); if we
are doing MANY MATCHES WITH THE SAME PATTERN, it is most efficient to compile
the pattern once and then use the compiled pattern with the match method below
many times (as illustrated above). We often compile patterns when using them to
match against all the lines in a file.

pos/endpos are options that specify where in text the match starts and ends
(from pos to endpos-1). pos defaults to 0 (the beginning of the text) and
endpos defaults to the length of the text (so endpos-1 is the index of its last
character).

Each of the re functions above has an equivalent method using a compiled
pattern to call the method, but omitting the pattern from its argument list.

  match    (text [,pos][,endpos])	  See match above, with pos/endpos
  search   (text [,pos][,endpos])	  See search above, with pos/endpos
  findall  (text [,pos][,endpos])	  See findall above, with pos/endpos
  finditer (text [,pos][,endpos])	  See finditer above, with pos/endpos
  split    (text [,maxsplit])		  See split above (no pos/endpos)
  sub	   (repl, text [,count])	  See sub above (no pos/endpos)
  subn	   (repl, text [,count])	  See subn above (no pos/endpos)
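
A small sketch of pos and endpos with a compiled pattern follows. The pattern
\d+ and the text 'ab123cd' are our own example; the pattern is deliberately not
anchored.

```python
import re

digits = re.compile(r'\d+')

at_start = digits.match('ab123cd')               # no digit at index 0
from_2 = digits.match('ab123cd', 2).group()      # start matching at index 2
up_to_4 = digits.match('ab123cd', 2, 4).group()  # examine only indexes 2-3
where = digits.search('ab123cd').span()          # search finds the digits itself

print(at_start, from_2, up_to_4, where)   # None 123 12 (2, 5)
```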

So, for example, instead of writing

  for line in open_file:
    ...re.match(pattern_string,line)

which implicitly compiles the same pattern_string during each loop iteration
(whenever re.match executes) we can write

  for line in open_file:
    pattern = re.compile(pattern_string)
    ...pattern.match(line)

which explicitly compiles the pattern_string during each loop iteration and
uses the compiled version (instead of the function re.match) to call "match"
(just two ways of doing the same thing). NOW...if we know the pattern_string
stays the same, we can factor the call to re.compile out before the loop
executes and write it as

  pattern = re.compile(pattern_string)
  for line in open_file:
    ...pattern.match(line)

which explicitly compiles the pattern_string ONCE, before the loop executes,
and calls "match" on it during each loop iteration. This is the most efficient
way to test the same pattern against every line in a file. See the grep.py
module in the remethods download that accompanies this lecture for code that
calls re.compile.
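
As a self-contained sketch of the compile-once idiom (this is not the grep.py
code; the function name matching_lines and the list standing in for an open
file are assumptions made for illustration):

```python
import re

def matching_lines(pattern_string, lines):
    """Return the lines that match pattern_string from their first character."""
    pattern = re.compile(pattern_string)   # compiled ONCE, before the loop
    return [line for line in lines if pattern.match(line)]

# A list of strings stands in for the lines of an open file
sample = ['555-1212', 'hello', '8-2404', '1234-567']
normal_numbers = matching_lines(r'^\d{3}-\d{4}$', sample)
print(normal_numbers)   # ['555-1212']
```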

------------------------------------------------------------------------------

Match objects and (Capture) Groups

Match objects record information about which parts of a pattern match the text.
Each group (referred to by either its number or an optional name) can be used
as an argument to a function that specifies information about the start, end,
span, and characters in the matching text.

Calling match/search produces None or a match object
Calling findall produces a list of strings (if there are no groups or just one
  group) or a list of tuples of strings (if there are several groups, with the
  tuple index representing each group #); the list is empty if nothing matches
Calling finditer produces an iterator of match objects (not used in the course)

Each group is indexed by a number or name (a name only when the group was
  delimited by (?P<name>)); group 0 IS ALL THE CHARACTERS IN THE MATCH, groups
  1-n are for delimited matches inside. For example, in the pattern (a)(b(c)(d))
  the a is in group 1, the b is in group 2 (group 2 includes groups 3 and 4),
  c is in group 3, and d is in group 4: groups are numbered by the order in
  which we reach their OPENING parenthesis. That is why group 2 includes all of
  b(c)(d).

  Note that if a parenthesized expression looks like (?:...) it is NOT numbered
  as a group. So in (a)(?:b(c)(d)) the a is in group 1, the b is in NO group,
  c is in group 2, and d is in group 3.

  If a group is followed by a ? and the pattern in the group is skipped, its
  group will be None. In the result of re.match('a(b)?c','ac') group 1 will be
  None. If the group itself is not optional, but the text inside the group is,
  the group will show as matching an empty string. So the result of
  re.match('a(b?)c','ac') group 1 will be ''. The same is true for a repetition
  that matches 0 times. Compare re.match('a(b)*c','ac') group 1 and
  re.match('a(b*)c','ac') group 1.

  If a group matches multiple times (e.g., a(.)*c ), only its last match is
  available, so for axyzc group 1, the (.) group, is bound to the character z.
  If we wrote this as a(.*)c the (.*) group is bound to the characters xyz. If
  we wrote it as a((.)*)c group 1 is xyz and group 2 is just z.

Printing the .groups() method called on a match object prints a tuple of the
matching characters for each group 1-n (not group #0)

We can look at each resulting group by its number (including group #0), using
any of the following methods that operate on match objects

    group(g)		text of group with specified name or number
    group(g1,g2, ... )  tuple of text of groups with specified name or number
    groups()		tuple of text of all groups but #0 (can iterate over tuple)
    groupdict()		text of all named groups as dict (see ?P<name> for keys)
    start([group])	starting index of group (or entire matched string)
    end([group])	index just past the end of group (or entire matched string)
    span([group])	tuple: (start([group]), end([group]))
  
Try doing some matches and calling .groups() on the result.
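
For instance, here is a sketch that calls each of these methods on one match
object. The pattern, its group names month and day, and the text 'July 4' are
our own illustrative choices.

```python
import re

m = re.match(r'^(?P<month>[A-Z][a-z]+) (?P<day>\d+)$', 'July 4')

whole = m.group(0)                # the entire match
pair = m.group('month', 'day')    # several groups at once, by name
all_groups = m.groups()           # every group except #0
named = m.groupdict()             # named groups as a dict
day_start, day_end = m.start('day'), m.end('day')
month_span = m.span(1)            # groups can also be given by number

print(whole)                 # July 4
print(pair, all_groups)      # ('July', '4') ('July', '4')
print(named)                 # {'month': 'July', 'day': '4'}
print(day_start, day_end)    # 5 6
print(month_span)            # (0, 4)
```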

Unzip remethods.zip and examine the phonecall.py and readingtest.py modules
for examples of Python programs that use regular expressions (and groups) to
perform useful computations.



Loose Ends:

1) Raw Strings

When writing regular expression pattern strings as arguments in Python it is
best to use raw strings: they are written in the form r'...' or r"...". These
should be used because of an issue dealing with using the backslash character in
patterns, which is sometimes necessary. For example, in normal strings when
you write '\n' Python turns that into a 1 character string with the newline
character: len('\n') is 1. But with raw strings, writing r'\n' specifies a
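string with a backslash followed by an n: len(r'\n') is 2. Normally this isn't
a big issue because writing '\d' or '\*' in normal strings doesn't generate an
escape character, since there is no escape character for d or *, so len('\d')
and len('\*') are both 2. Recent versions of Python do, however, issue a warning
for such unrecognized escapes in normal strings, which is another reason to
write patterns as raw strings.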
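
The length differences can be checked directly. In this sketch the normal
strings spell the backslash as \\ so that no unrecognized escape appears in
them.

```python
newline_len = len('\n')      # one character: the newline itself
raw_len = len(r'\n')         # two characters: a backslash and an n
same_text = ('\\d' == r'\d') # '\\d' and r'\d' are the same 2-character string

print(newline_len, raw_len, same_text)   # 1 2 True
```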


2) **d in function/method calls (where d is a dict)

If we call a function we can specify **d as one or more of its arguments. For
each **d, Python uses all its keys as parameter names and all its values as
the arguments for these parameter names. For example

  f(**{'b':2, 'a':1, 'c':3} ) is translated by Python into f(b=2,a=1,c=3)

Note that this is useful in regular expressions if we use the (?P<name> ...)
option and then the groupdict() method for the match it produces.

There is also a version that works the other way. Suppose we have a function
whose header is

  def f(x,y,**kargs):  # Any name works; **kwargs is the common convention

if we call it by f(1,2,a=3,b=4,c=5) then

  x     is bound to 1
  y     is bound to 2
  kargs is bound to a dictionary {'b':4, 'a':3, 'c':5}

See the argument/parameter matching rules from the review lecture for a
complete description of what happens.

So (in reverse of the order explained above) ** as a parameter creates a
dictionary of "extra" named-arguments supplied when the function is called, and
** as an argument supplies lots of named-arguments to the function call. We
will cover this information again when we examine inheritance.

The parse_phone_named method (in phonecall.py) uses this language feature.
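
As a sketch of how ** and groupdict() fit together (this is not the code in
phonecall.py; the function describe and the group names area, exchange, and
number are hypothetical):

```python
import re

def describe(area, exchange, number):
    """Parameter names deliberately match the group names in the pattern."""
    return f'({area}) {exchange}-{number}'

m = re.match(r'^\((?P<area>\d{3})\)(?P<exchange>\d{3})-(?P<number>\d{4})$',
             '(949)824-2704')
# groupdict() supplies area=..., exchange=..., number=... as named arguments
described = describe(**m.groupdict())
print(described)    # (949) 824-2704

def f(x, y, **kargs):
    return x, y, kargs

# Extra named arguments are collected into the kargs dictionary
collected = f(1, 2, a=3, b=4, c=5)
print(collected)    # (1, 2, {'a': 3, 'b': 4, 'c': 5})
```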


3) Translation of a Regular Expression Pattern into a NDFA

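How do the functions/methods in re compile a regular expression string and match
it against a text string? One classic approach translates every regular
expression into a non-deterministic finite automaton (see Programming Assignment
#1, part 4), and then matches against the text (ibid) to see if the match
succeeds (reaches the special last state). (Python's re module itself uses a
backtracking matcher rather than this construction, but the NDFA view is a good
way to understand what a pattern means.)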

The general algorithm (known as Thompson's Algorithm) is a bit beyond the scope
of this course and uses a concept we haven't discussed (epsilon -aka empty-
transitions), but you can look up the details if you are interested. Here is
an example for the regular expression pattern ((a*|b)cd)+. It produces an NDFA
described by

start;a;1;a;2;b;2;c;3
1;a;1;a;2
2;c;3
3;d;start;d;last
last

This pattern matches a text string by starting in state 'start' and exhausting
all the characters and having 'last' in its possible states at the end.
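
The NDFA above can be simulated in a few lines. This is only a sketch: the
dictionary encoding and the function name accepts are our own (they are not
taken from the programming assignment), and the comparison with re covers just
a handful of sample texts.

```python
import re

# Each state maps an input character to the set of possible next states
ndfa = {
    'start': {'a': {'1', '2'}, 'b': {'2'}, 'c': {'3'}},
    '1':     {'a': {'1', '2'}},
    '2':     {'c': {'3'}},
    '3':     {'d': {'start', 'last'}},
    'last':  {},
}

def accepts(ndfa, text):
    """Track every state the NDFA could be in; accept if 'last' is possible."""
    states = {'start'}
    for ch in text:
        states = {nxt for s in states for nxt in ndfa[s].get(ch, set())}
    return 'last' in states

samples = ['cd', 'acd', 'aacd', 'bcd', 'bcdcd', 'abcd', 'c', '']
accepted = [t for t in samples if accepts(ndfa, t)]
agrees = all(accepts(ndfa, t) == (re.match(r'^((a*|b)cd)+$', t) is not None)
             for t in samples)
print(accepted)   # ['cd', 'acd', 'aacd', 'bcd', 'bcdcd']
print(agrees)     # True
```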






Problems:

Write functions using regular expression patterns

8. Write a function named contract that takes a string as a parameter. It
substitutes the word 'goal' to replace any occurrences of variants of this word
written with any number of o's (e.g., 'gooooal') in its argument. So calling
contract('It is a goooooal! A gooal.') returns 'It is a goal! A goal.'.

9. Write a function named grep that takes a regular expression pattern string
and a file name as parameters. It returns a list of 3-tuples consisting of the
file-name, line number, and line of the file, for each line whose text matches
the pattern. Hint: Using enumerate and a comprehension, this is a 3 line
function, but you can use explicit looping in a longer function.

10. Write a function named name_convert that takes two file names as
parameters. It reads the first file (which should be a Python program) and
writes each line into the second file, but with identifiers originally written
in camel notation converted to underscore notation: e.g. aCamelName converts to
a_camel_name. Camel identifiers start with a lower-case letter followed by
upper/lower-case letters and digits: each upper-case letter is preceded by an
underscore and turned into a lower-case letter.