                        Regular Expressions


Introduction:

Regular Expressions are patterns that we can specify (as strings) and use to
search for and replace text in other strings (often the sequence of strings
that makes up the lines in a file). Python (as well as many other languages)
includes a module to perform various operations on regular expressions. In
these lectures, we will cover the form of regular expressions (in the first
lecture), and then which functions/methods can take regular expressions as
arguments, and how to use the results of matching regular expressions against
text, including the concept of capturing groups (in the second lecture).

In the first lecture we will discuss the components of regular expression
patterns themselves. We will discuss each component individually, and the ways
to combine these components into more complicated regular expressions (just as
we studied the syntax of a few simple control structures in Python, which we
can combine into more complicated control structures). We will cover many, but
not all, of the patterns usable in regular expressions: there are entire books
written on regular expressions: on Amazon I found over 15 books with "Regular
Expressions" in their title.

In the second lecture we will examine various functions and methods, from a
Python module, which take regular expressions as arguments. Typically they
match a regular expression pattern string against some text string, and return
information about whether or not the match succeeded, and what parts of the
pattern matched which parts of the text. Python's module for doing these
operations is named re.

There is a special tester module that accompanies these lectures, which you
can download and run to experiment with regular expressions and learn how they
match text strings. We will also look at an online resource for learning and
testing regular expressions, which is simpler to use.

For more complete information about regular expressions, see Section 6.2 of
the Python Standard Library.

------------------------------------------------------------------------------

Lecture 1

General Rule for Matching Regular Expression Patterns against Text (both are
represented as strings):

Regular expression patterns match as many characters as possible in the text;
this is called a greedy match. There are ways to specify patterns that match
as few characters as possible: although we will mention these, we will neither
discuss nor use those "non-greedy" specifications.

Matching:

Characters generally match themselves, except for the following...

Metacharacters

.       Matches any single character (except newline: \n)
[]      Matches one character specified inside square brackets []; e.g., [aeiou]
-       Matches one character in range inside []: e.g., [0-9] matches any digit
          To match any letter (upper/lower case) or digit we write [A-Za-z0-9]
[^]     Matches any one character NOT specified inside [^]; e.g., [^aeiou]
          We can use ranges here too, to specify the characters NOT to match

Anchors (these don't match characters, but match positions in a string)

^       Matches beginning of line (except when used in [^])
$       Matches end of line (except when used in [] or [^])

To start, in all of the patterns that we experiment with/use, we will specify
^ as the first character and $ as the last. This will require a full match of
the pattern and text, from the first to last character in both (much like how
we discussed matching EBNF rules to symbols). Later, in a few cases, we will
deviate from this convention.

Patterns:

Assume R, Ra, Rb represent regular expression patterns (like the ones above,
and also including the ones built below); here are the more complicated
patterns that we can build from these.

RaRb    Matches a sequence (one after the other) of Ra followed by Rb
Ra|Rb   Matches either alternative Ra or Rb (just like | in EBNF)
R?      Matches regular expression R 0 or 1 times: R is optional (like [] in EBNF)
R*      Matches regular expression R 0 or more times (like {} in EBNF)
R+      Matches regular expression R 1 or more times (note */+ difference)
R{m}    Matches regular expression R exactly m times: e.g., R{5} = RRRRR
R{m,n}  Matches regular expression R at least m and at most n times:
          e.g., R{3,5} = RRR|RRRR|RRRRR = RRRR?R?

----- Side-note
Included only in this side-note, but NOT USED in this lecture/course, are the
patterns R??, R*?, R+?, and R{m,n}?. The postfix ? means match as FEW
characters as possible (not the most: so not greedy). We will NOT USE or need
to understand these patterns.
-----

Parenthesized Patterns

Parentheses are used for
  1) Grouping (as they are in arithmetic expressions/formulas)
  2) Remembering the text matching sub-patterns (called a "Capturing Group").

These capturing groups are mostly relevant when we examine the functions in
the Python re module. By placing sub-pattern R in parentheses, the text
matching R will be remembered (either by its number, starting at 1, or its
name, if named: see ?P<name> below) in a "capturing group", which we mostly
refer to just as a "group". We can use these groups when extracting
information from the matched text, when we call re functions.

(R)         Matches R and delimits a group (1, 2, ...) (remembers/captures
            matched text in a group). The rules below show how capturing
            groups are numbered: they are simple.

(?P<name>R) Matches R and remembers/captures matched text in a group using
            name for the group (it is still numbered as well); see the
            groupdict method below (2nd lecture) for uses of "name". An
            example is (?P<id>[A-Za-z0-9]+) which matches any sequence of
            upper and lower-case letters and digits, with the capture group
            named id.

(?:R)       Matches R but does not remember/capture matched text in a
            capturing group. So here (?:R) is used only for grouping regular
            expression R, not specifying a capturing group.
            So, ?: is useful when we want only important/capturing groups to
            be numbered (and not others). It takes extra space to specify.

----- Side-note
Included only in this side-note, but NOT USED in this lecture/course

(?P=name)   Matches remembered text with name (for back-referencing which text)
(?=R)       Matches R, doesn't remember matched text or consume text matched
              For example (?=abc)(.*) matches abcxyz with group 1 'abcxyz'; it
              doesn't match abxy because this text doesn't start with abc
(?!R)       Matches anything but R and does not consume input needed for match
              (hint: != means "not equal", ?!R means "not matching R")
-----

Computing the group number of a sub-pattern for extracting its matching text:

Each new left parenthesis starts a new group (unless it is "(?:...)"). Some
groups are sequential (one after another); some groups are nested (one inside
another). Here is an example

                          (  (  )  (  )  (  (  (  )  )  )  )
Where groups start/end    1  2  2  3  3  4  5  6  6  5  4  1

Group 1                   (--------------------------------)
Group 2                      (--)
Group 3                            (--)
Group 4                                  (--------------)
Group 5                                     (--------)
Group 6                                        (--)

Capturing Group 0 (in Python) is considered the entire regular expression,
even when it is not in any parentheses. So, in the pattern "^a(b(c.d)+)?(e)$"

  Capturing Group 0 is a(b(c.d)+)?(e)
  Capturing Group 1 is (b(c.d)+)
  Capturing Group 2 is (c.d)
  Capturing Group 3 is (e)

If we match this pattern against the string "abc1dc2de" then

  Capturing Group 0 is "abc1dc2de": the entire string matches
  Capturing Group 1 is "bc1dc2d": everything in between a and (e)
  Capturing Group 2 is "c2d": the last characters matched in this repetition
  Capturing Group 3 is "e": from the last Capturing Group

Note that "ae" matches too:

  Capturing Group 0 is "ae": the entire string matches
  Capturing Groups 1 and 2 (which is inside 1) are None: the ? skips
    Capturing Group 1
  Capturing Group 3 is "e": from the last Capturing Group

The Regular Expressions website we use displays Group 0 as "Full Match".

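The group numbering above can be checked directly in Python. This is only a
small sketch: it uses re.match and the groups method, which are not covered
until the second lecture, and the variable names are made up for illustration.

```python
import re

# The pattern from the notes; groups are numbered by their opening parenthesis.
pattern = r'^a(b(c.d)+)?(e)$'

m = re.match(pattern, 'abc1dc2de')
full_match = m.group(0)      # Group 0: the entire matched text
numbered = m.groups()        # Groups 1, 2, 3 as a tuple

m2 = re.match(pattern, 'ae')
skipped = m2.groups()        # Groups 1 and 2 are skipped by the ?

print(full_match)            # abc1dc2de
print(numbered)              # ('bc1dc2d', 'c2d', 'e')
print(skipped)               # (None, None, 'e')
```

Note that group 2 holds only "c2d", the text from the LAST repetition of
(c.d)+, and that skipped groups show up as None rather than as ''.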
----------
Notes on Repetition and Option and Capture Groups:

There is a difference between the regular expressions ([ab]+) and ([ab])+ in
terms of what their capture groups capture.

  re.match('([ab]+)', 'aabba') has capture group 1 'aabba'
  re.match('([ab])+', 'aabba') has capture group 1 'a'

In the first example, + is INSIDE the group/parentheses so the group contains
all the characters matched in the repetition. In the second the + is OUTSIDE
the group/parentheses so the group contains only the character matched LAST
in the repetition. Both patterns match the same characters, but yield
different capture groups. The * operator is treated similarly.

Also, remember that for (a?) the capture group is either 'a' or ''; and for
(a)? the capture group is either 'a' or None, meaning the capture group was
discarded so there is no captured text, not even ''; while for (a*) the
capture group is an empty string, a string with one a, a string with two as,
etc.
----------

Context

1) - matches itself if not in [] between two characters: e.g., [-+] vs. [a-z]

2) Special characters (except \) are treated literally in []: e.g., [.]
   matches only ".". To match the \ special character write [\\].
   Writing [a-z] matches all lower-case characters; [a\-z] matches a, -, or z.

3) Generally, if interpreting a character makes no sense one way, try to find
   another way to interpret it that fits the context

Escape Characters with Special Meanings

Symbol  Meaning
\       Used before .|[]-?*+{}()^$\ (and others) to match that special
          character literally
          The meaning of \. is the same as [.] (see rule 2 above)
\t      tab
\n      newline
\r      carriage return
\f      formfeed
\v      vertical tab
\d      [0-9]            Digit
\D      [^0-9]           non-Digit
\s      [ \t\n\r\f\v]    White space
\S      [^ \t\n\r\f\v]   non-White space
\w      [a-zA-Z0-9_]     alphanumeric (or underscore): used in identifiers
\W      [^a-zA-Z0-9_]    non alphanumeric

(These bracket equivalents are exact for ASCII text; with ordinary str
patterns Python 3 also lets \d, \s, and \w match non-ASCII digits, spaces,
and letters unless the A/ASCII flag is used.)

Interesting Equivalences

a+         == aa*
a(b|c|d)e  == a[bcd]e   only if b, c, and d are single characters
R{0,1}     == R?        0 or 1 times means the same as optional

Hints on Using | (a low-precedence operator for Regular Expressions)

In Python, we know that writing a*b+c*d performs * before +: we say * has
higher precedence than +, so it is performed earlier. We could be explicit
and write this as (a*b)+(c*d). If we wanted to add BEFORE multiplying, we
must use parentheses: a*(b+c)*d.

Think of REs as having sequence as an operation (it is implicit, with no
operator written between the regular expressions in the sequence). The
sequence precedence is lower than the precedence of all postfix operators
(like ?, *, +, and {}): e.g., ab* has the same meaning as a(b*). The |
operator has the lowest precedence of all (even lower than implicit
sequencing). So writing ab|cd is the equivalent of writing (ab)|(cd), which
matches either ab or cd only: the pattern does the sequencing BEFORE the |
operator.

Now, given this understanding, look at what ^a|b$ means. By the above, it
means the same as (^a)|(b$). Type ^a|b$ into the online tool and read its
Explanation. Note that the ^ anchor applies only to a, and the $ anchor
applies only to b (see its parenthesized equivalent).

So ^a|b$, which is equivalent to (^a)|(b$), will match (using the online tool)

  1) any text starting with an a: a or aa or aaab or abcda ($ is not part of it)
  2) any text ending with a b: b or cb or ccbb or abcdb (^ is not part of it)

To avoid confusion, I strongly recommend ALWAYS WRITING ALL THE ALTERNATIVES
IN A REGULAR EXPRESSION AS A GROUP IN PARENTHESES: ^(a|b)$ to ensure that the
| applies only to the alternatives inside the ()s. This regular expression is
a sequence of 3 regular expressions: the ^ anchor, followed by a choice of a
or b, and finally followed by the $ anchor.

So sometimes we use () to ensure that our regular expression is correct, and
in the process, we automatically introduce a new numbered (capturing) group.
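
The difference between ^a|b$ and ^(a|b)$ can be seen with a short experiment.
This is a sketch only: the sample texts and variable names are made up, and it
uses re.search (discussed in the second lecture), which is how the online tool
does its matching.

```python
import re

texts = ['a', 'abcda', 'ccbb', 'b', 'xay']

# ^a|b$ means (^a)|(b$): the text starts with a, OR the text ends with b.
loose = [t for t in texts if re.search(r'^a|b$', t)]

# ^(a|b)$ anchors both alternatives: the whole text must be exactly a or b.
strict = [t for t in texts if re.search(r'^(a|b)$', t)]

# The parentheses in ^(a|b)$ also create capturing group 1.
group_1 = re.search(r'^(a|b)$', 'b').group(1)

print(loose)     # ['a', 'abcda', 'ccbb', 'b']
print(strict)    # ['a', 'b']
print(group_1)   # b
```
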
Of course we can write ^(a|b)$ as ^(?:a|b)$ to ensure that no new Capturing
Group number is created by our grouping.

Problems:

Write the smallest pattern that matches the required characters. Check your
patterns with the Regular Expression Tester (see the Sample Programs link) to
ensure they match correct exemplars and don't match incorrect ones. Note that
for a match, group #0 should include all the required characters.

1. Write a regular expression pattern that matches the strings Jul 4, July 4,
   Jul 4th, July 4th, Jul fourth, July fourth, Jul Fourth, and July Fourth.
   Hint: my RE pattern was 24 characters.

2. Write a regular expression pattern that matches strings representing times
   on a 12 hour clock. An example time is 5:09am or 11:23pm. Allow only times
   that are legal (not 1:73pm nor 13:02pm).
   Hint: my RE pattern was 32 characters.

3. Write a regular expression pattern that matches strings representing phone
   numbers of the following form.

     Normal:        a three digit exchange, followed by a dash, followed by a
                    four digit number: e.g., 555-1212
     Long Distance: a 1, followed by a dash, followed by a three digit area
                    code enclosed in parentheses, followed by a three digit
                    exchange, followed by a dash, followed by a four digit
                    number: e.g., 1-(800)555-1212
     Interoffice:   a single digit followed by a dash followed by a four digit
                    number: e.g., 8-2404.

   Hint: my RE pattern was 30 characters; note that you must use \( and \) to
   match parentheses.

4. Write a regular expression pattern that matches strings representing simple
   integers: optional + or - signs followed by one or more digits.
   Hint: my RE pattern was 8 characters.

5. Write a regular expression pattern that matches strings representing
   normalized integers (each number is either an unsigned 0 or is unsigned or
   signed and starts with a non-0 digit) with commas in only the correct
   positions.
   Hint: my RE pattern was 30 characters.

6. Write a regular expression pattern that matches strings representing float
   values. They are unsigned or signed (but not normalized: see 5) and have
   any number of digits before or after a decimal point (but there must be at
   least one digit either before or after a decimal point: e.g., just . is
   not allowed) followed by an optional e or E followed by an unsigned or
   signed integer (again not normalized).
   Hint: my RE pattern was 36 characters.

7. Write a regular expression pattern that matches strings representing
   trains. A single letter stands for each kind of car in a train: Engine,
   Caboose, Boxcar, Passenger car, and Dining car. There are four rules
   specifying how to form trains.

     1. One or more Engines appear at the front; one Caboose at the end. All
        other cars must come between the last Engine and the Caboose.
     2. Boxcars always come in adjacent pairs: BB, BBBB, etc.
     3. There cannot be more than four Passenger cars in a series.
     4. Each series of Passenger cars must be followed by a Dining car, and
        Dining cars can appear only after a Passenger car. These cars cannot
        appear anywhere other than these locations.

   Here are some legal and illegal exemplars.

     EC               Legal:   the smallest train
     EEEPPDBBPDBBBBC  Legal:   simple train showing all the cars
     EEBB             Illegal: no caboose (everything else OK)
     EBBBC            Illegal: three boxcars in a row
     EEPPPPPDBBC      Illegal: more than four passenger cars in a row
     EEPPBBC          Illegal: no dining car after passenger cars
     EEBBDC           Illegal: dining car after box car

   Hint: my RE pattern was 16 characters.

------------------------------------------------------------------------------

Lecture 2

Generally, the functions discussed in this lecture operate on a regular
expression pattern (specified by a string) and text (also specified by a
string). These functions produce information (capture groups: see
parenthesized patterns above) from attempting to match the pattern against
the text: which parts of the text matched which parts of the pattern.
We can use the compile function to compile a pattern (producing a regex), and
then call methods on that regex object directly to perform the same
operations as the functions, but more efficiently if the pattern is to be
used repeatedly (since the pattern is compiled into the regex once, not in
each function call).

We will omit discussing/using the [,flags] option in this discussion, but see
section 6.2 of the Python Library Documentation for a discussion of A/ASCII,
DEBUG, I/IGNORECASE, L/LOCALE, M/MULTILINE, S/DOTALL, and X/VERBOSE.

----------
A (not too) Simple but Illustrative Example (the details of HOW this works
follow in later sections):

In this example (representative of what we do with regular expressions) we

  1) define a regular expression
  2) call a re function (match) on it and some text
  3) check whether the pattern matches the text
  4) do something with (print) the captured groups (by number or name)

A 2nd version is shown further down, using the re.compile function and
calling .match on the object it returns. The code assumes that the re module
has already been imported (by import re).

  phone = r'^(?:\((\d{3})\))?(\d{3})[-.](\d{4})$'
  m = re.match(phone,'(949)824-2704')
  assert m is not None, 'No match'
  print(m.groups())
  area, exchange, number = [int(i) if i is not None else None for i in m.group(1,2,3)]
  print(area, exchange, number)

1) Here, phone is a pattern anchored at both ends (by ^ and $ respectively).
   r'...' is a raw string; we typically use them to specify patterns.

   (a) It starts with ^(?:\((\d{3})\))? controlling an optional area code.
       The ?: means that the outer parentheses are not used to create a
       group, but are used with the ? (postfix option) symbol. Inside it is
       \((\d{3})\): a left parenthesis \(, group 1 which consists of any 3
       digits, and a right parenthesis \).

   (b) Next is (\d{3}) group 2, which consists of any 3 digits.

   (c) Next is [-.] that is one symbol, either a - or . (not in a group).

   (d) Next is (\d{4}) group 3, which consists of any 4 digits.

2) Calling the re.match function matches the pattern against some text; it
   returns a match object that is bound to m.

3) If m is None, there is no match (the assert raises an AssertionError
   exception).

4) Prints the tuple of groups 1-3, then converts every non-None string from
   groups 1, 2, and 3 into an int.

5) Prints the converted groups.

Try also replacing line 2 by

  m = re.match(phone,'824-2704')       # area is None
  m = re.match(phone,'(949)824:2704')  # : instead of - or .; no match
  m = re.match(phone,'(94)824-2704')   # only 2 in area code; no match

Also, we can replace the first two lines by the following equivalent lines

  phone_pat = re.compile(r'^(?:\((\d{3})\))?(\d{3})[-.](\d{4})$')
  m = phone_pat.match('(949)824-2704')

In this example, because match is called only once, there is no speed
improvement by compiling the pattern. But if we called match on multiple text
strings (e.g., every string read from a file), then using compile would be
more efficient. The compiling feature is discussed below.
----------

Regular Expression (re) functions: called like re.match(...); the module name
prefaces the function

Returns a regex (compiled pattern) object (see calling methods on regex below)
  compile (pattern [,flags])            Creates compiled pattern object

Returns None (no match) or a match object, which records the groups (0,1,...)
  match  (pattern, text [,flags])       Matches start at the text's beginning
  search (pattern, text [,flags])       Matches can start anywhere in the text

  ---The online regular expression web page uses SEARCH to do its matching,
  ---which is why we wrote our patterns as ^....$, to ensure matching started
  ---at the beginning and ended at the end of each line.

  re.match ("(a+)b","aaab") matches;  re.match ("(a+)b","xaaab") doesn't match
  re.search("(a+)b","aaab") matches;  re.search("(a+)b","xaaab") matches
  By writing the patterns like ^...$, these functions produce the same results.

Returns a list of strings: much like calling text.split(...)
  split (pattern, text [,maxsplit, flags])
     like the text.split(...) method, but using a regular expression pattern
     to determine how to split the text: re.split('[.-]', 'a.b-c') returns
     ['a','b','c'], splitting on either . (a period; . here does not mean
     "any character") or -. The standard string split method,
     text.split(...), can't split on EITHER character; note that
     'a.b-c'.split(".-") splits only on '.-' (a . followed by a -), so in
     this case it fails to split anywhere, since '.-' is not anywhere in the
     text at all.

     Note we can also write re.split(r'(?:\.|-)', 'a.b-c') which also returns
     ['a','b','c']. (Note we must write \. here because . would mean any
     character.)

     If the pattern has groups (this one uses ?: so doesn't), then the text
     matching each group is included in the resulting list too: use ?: to
     avoid these groups. So
       re.split(';+'  ,'abc;d;;e') returns ['abc', 'd', 'e'] and
       re.split('(;+)','abc;d;;e') returns ['abc', ';', 'd', ';;', 'e']

Returns a string
  sub (pattern, repl, text [,count, flags])
     If there is a match between pattern and text, build a string that either
       a) replaces the text matching pattern by repl, which is a string that
          can refer to matched groups via \# (e.g., \1) or \g<#> (e.g.,
          \g<1>), or \g<name> (where name comes from ?P<name>), or
       b) replaces the text matching pattern by the result returned by
          CALLING repl, which is a function that is passed a match object as
          its argument.
     If there is no match, then it just returns the text parameter's value,
     unchanged.
       re.sub('(a+)','{as}','aabcaaadaf')      returns {as}bc{as}d{as}f
       re.sub('(a+)',r'(\g<1>)','aabcaaadaf')  returns (aa)bc(aaa)d(a)f

----- Side-note
Included only in this side-note, but NOT USED in this lecture/course, are the
following functions.

Returns a list of string/of tuples of string (the groups), specifying matches
  findall (pattern, text [,flags])
     Matches can start anywhere in the text; the next attempted match starts
     at the character just after the end of the previous match.
     If the pattern has groups, then the resulting list contains the strings
     matching the groups (in a tuple, if there is more than one group)
     rather than the whole match: use ?: to avoid these groups
       re.findall('a*b','abaabcbdabc') returns ['ab', 'aab', 'b', 'ab']
       re.findall('((a*)(b))','abaabcbdabc') returns
         [('ab','a','b'), ('aab','aa','b'), ('b','','b'), ('ab','a','b')]

Returns an iterable over the matches found by findall (ignore this one)
  finditer (pattern, text [,flags])     Iterable equivalent of findall; it
                                        produces match objects

  subn    same as sub but returns a tuple: (new string, number of subs made)
  escape (string): returns the string with its special characters backslashed
-----

In findall and sub/subn, only non-overlapping matches are found/replaced: in
the text aaaa there are two non-overlapping occurrences of the pattern aa,
starting at index 0 and index 2 (not at index 1, which overlaps with the
previous match in indexes 0-1).

------------------------------------------------------------------------------

regex (compiled pattern) object methods (see the compile function above,
which produces regexes) are called like c = re.compile(p). It is then
efficient to call c.match(...) many times.

Calling re.match(p,...) many times with the same pattern re-compiles and
matches the pattern each time re.match is called; whereas c = re.compile(p)
compiles the pattern ONCE and c.match(...) just has to match it against text
each time that it is called. (The re module does cache some recently compiled
patterns, so the actual cost of repeated re.match calls is smaller than this
description suggests; compiling once is still the clearest way to reuse a
pattern.)

Using this feature allows us to compile a pattern and reuse it for all the
operations above: re.match(p,s) is just like re.compile(p).match(s); if we
are doing MANY MATCHES WITH THE SAME PATTERN, it is most efficient to compile
the pattern once and then use the compiled pattern with the match method
below many times (as illustrated above). We often compile patterns when using
them to match against all the lines in a file.

pos/endpos are options that specify where in text the match starts and ends
(from pos to endpos-1).
pos defaults to 0 (the beginning of the text) and endpos defaults to the
length of the text (so endpos-1 is its last character).

Each of the re functions above has an equivalent method using a compiled
pattern to call the method, but omitting the pattern from its argument list.
Only match, search, findall, and finditer accept pos/endpos.

  match    (text [,pos][,endpos])   See match    above, with pos/endpos
  search   (text [,pos][,endpos])   See search   above, with pos/endpos
  findall  (text [,pos][,endpos])   See findall  above, with pos/endpos
  finditer (text [,pos][,endpos])   See finditer above, with pos/endpos
  split    (text [,maxsplit])       See split    above
  sub      (repl, text [,count])    See sub      above
  subn     (repl, text [,count])    See subn     above

So, for example, instead of writing

  for line in open_file:
      ...re.match(pattern_string,line)

which implicitly compiles the same pattern_string during each loop iteration
(whenever re.match executes) we can write

  for line in open_file:
      pattern = re.compile(pattern_string)
      ...pattern.match(line)

which explicitly compiles the pattern_string during each loop iteration and
uses the compiled version (instead of the function re.match) to call "match"
(just two ways of doing the same thing). NOW...if we know the pattern_string
stays the same, we can factor the call to re.compile out before the loop
executes and write it as

  pattern = re.compile(pattern_string)
  for line in open_file:
      ...pattern.match(line)

which explicitly compiles the pattern_string ONCE, before the loop executes,
and calls "match" on it during each loop iteration. This is the most
efficient way to test the same pattern against every line in a file.

See the grep.py module in the remethods download that accompanies this
lecture for code that calls re.compile.

------------------------------------------------------------------------------

Match objects and (Capture) Groups

Match objects record information about which parts of a pattern match the
text.
Each group (referred to by either its number or an optional name) can be used
as an argument to a method that specifies information about the start, end,
span, and characters in the matching text.

Calling match/search produces None or a match object
Calling findall produces a list (empty if nothing matches): a list of strings
  (if there are no groups) or a list of tuples of strings (if there are
  groups, with each tuple index representing a group #)
Calling finditer produces an iterable of match objects (not used in the
  course)

Each group is indexed by a number or name (a name only when the group was
delimited by (?P<name>...)); group 0 IS ALL THE CHARACTERS IN THE MATCH,
groups 1-n are for delimited matches inside. For example, in the pattern
(a)(b(c)(d)) the a is in group 1, the b is in group 2 (group 2 includes
groups 3 and 4), c is in group 3, and d is in group 4: groups are numbered by
the order in which we reach their OPENING parenthesis. That is why group 2
includes all of b(c)(d).

Note that if a parenthesized expression looks like (?:...) it is NOT numbered
as a group. So in (a)(?:b(c)(d)) the a is in group 1, the b is in NO group,
c is in group 2, and d is in group 3.

If a group is followed by a ? and the pattern in the group is skipped, its
group will be None. In the result of re.match('a(b)?c','ac') group 1 will be
None. If the group itself is not optional, but the text inside the group is,
the group will show as matching an empty string. So in the result of
re.match('a(b?)c','ac') group 1 will be ''. The same is true for a repetition
that matches 0 times. Compare re.match('a(b)*c','ac') group 1 and
re.match('a(b*)c','ac') group 1.

If a group matches multiple times (e.g., a(.)*c ), only its last match is
available, so for axyzc group 1, the (.) group, is bound to the character z.
If we wrote this as a(.*)c the (.*) group is bound to the characters xyz. If
we wrote it as a((.)*)c group 1 is xyz and group 2 is just z.

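The cases described above (a skipped group, an empty group, and a repeated
group) can be compared side by side. This is a minimal sketch using the
patterns from the text; the variable names are made up for illustration.

```python
import re

# An optional GROUP that is skipped is None...
skipped_opt = re.match(r'a(b)?c', 'ac').group(1)
# ...but a group whose CONTENTS are optional matches the empty string.
empty_opt = re.match(r'a(b?)c', 'ac').group(1)

# The same distinction holds for a repetition that matches 0 times.
skipped_rep = re.match(r'a(b)*c', 'ac').group(1)
empty_rep = re.match(r'a(b*)c', 'ac').group(1)

# A group that matches several times remembers only its LAST match.
last_only = re.match(r'a(.)*c', 'axyzc').group(1)
whole_run = re.match(r'a(.*)c', 'axyzc').group(1)
nested = re.match(r'a((.)*)c', 'axyzc').groups()

print(skipped_opt, repr(empty_opt))   # None ''
print(skipped_rep, repr(empty_rep))   # None ''
print(last_only, whole_run, nested)   # z xyz ('xyz', 'z')
```
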
Printing the result of calling the .groups() method on a match object prints
a tuple of the matching characters for each group 1-n (not group #0).

We can look at each resulting group by its number (including group #0), using
any of the following methods that operate on match objects

  group(g)             text of group with specified name or number
  group(g1,g2, ... )   tuple of text of groups with specified name or number
  groups()             tuple of text of all groups but #0 (can iterate over
                       tuple)
  groupdict()          text of all named groups as dict (see ?P<name> for
                       keys)
  start([group])       starting index of group (or entire matched string)
  end([group])         ending index of group (or entire matched string)
  span([group])        tuple: (start([group]), end([group]))

Try doing some matches and calling .groups() on the result.

Unzip remethods.zip and examine the phonecall.py and readingtest.py modules
for examples of Python programs that use regular expressions (and groups) to
perform useful computations.

Loose Ends:

1) Raw Strings

When writing regular expression pattern strings as arguments in Python it is
best to use raw strings: they are written in the form r'...' or r"...".
These should be used because of an issue dealing with using the backslash
character in patterns, which is sometimes necessary.

For example, in normal strings when you write '\n' Python turns that into a
1 character string with the newline character: len('\n') is 1. But with raw
strings, writing r'\n' specifies a string with a backslash followed by an n:
len(r'\n') is 2. Normally this isn't a big issue because writing '\d' or
'\*' in normal strings doesn't generate an escape character, since there is
no escape character for d or *, so len('\d') and len('\*') are both 2.
(Newer versions of Python issue a warning for such unrecognized escapes in
normal strings, which is another reason to use raw strings.)

2) **d in function/method calls (where d is a dict)

If we call a function we can specify **d as one or more of its arguments.
For each **d, Python uses all its keys as parameter names and all its values
as the arguments for these parameter names.
For example

  f(**{'b':2, 'a':1, 'c':3} )

is translated by Python into

  f(b=2,a=1,c=3)

Note that this is useful in regular expressions if we use the
(?P<name> ...) option and then the groupdict() method for the match it
produces.

There is also a version that works the other way. Suppose we have a function
whose header is

  def f(x,y,**kargs): # The typical name is **kargs

if we call it by f(1,2,a=3,b=4,c=5) then

  x is bound to 1
  y is bound to 2
  kargs is bound to a dictionary {'b':4, 'a':3, 'c':5}

See the argument/parameter matching rules from the review lecture for a
complete description of what happens.

So (in reverse of the order explained above) ** as a parameter creates a
dictionary of "extra" named-arguments supplied when the function is called,
and ** as an argument supplies lots of named-arguments to the function call.
We will cover this information again when we examine inheritance.

The parse_phone_named method (in phonecall.py) uses this language feature.

3) Translation of a Regular Expression Pattern into a NDFA

How do the functions/methods in re compile a regular expression string and
match it against a text string? The module translates every regular
expression into a non-deterministic finite automaton (see Programming
Assignment #1, part 4), and then matches it against the text (ibid) to see if
the match succeeds (reaches the special last state).

The general algorithm (known as Thompson's Algorithm) is a bit beyond the
scope of this course and uses a concept we haven't discussed (epsilon, aka
empty, transitions), but you can look up the details if you are interested.

Here is an example for the regular expression pattern ((a*|b)cd)+. It
produces an NDFA described by

  start;a;1;a;2;b;2;c;3
  1;a;1;a;2
  2;c;3
  3;d;start;d;last
  last

This pattern matches a text string by starting in state 'start' and
exhausting all the characters and having 'last' in its possible states at
the end.

Problems: Write functions using regular expression patterns

8. Write a function named contract that takes a string as a parameter. It
   substitutes the word 'goal' for any occurrences of variants of this word
   written with any number of o's (e.g., 'gooooal') in its argument. So
   calling contract('It is a goooooal! A gooal.') returns
   'It is a goal! A goal.'.

9. Write a function named grep that takes a regular expression pattern string
   and a file name as parameters. It returns a list of 3-tuples consisting of
   the file-name, line number, and line of the file, for each line whose text
   matches the pattern.
   Hint: Using enumerate and a comprehension, this is a 3 line function, but
   you can use explicit looping in a longer function.

10. Write a function named name_convert that takes two file names as
    parameters. It reads the first file (which should be a Python program)
    and writes each line into the second file, but with identifiers
    originally written in camel notation converted to underscore notation:
    e.g., aCamelName converts to a_camel_name. Camel identifiers start with a
    lower-case letter followed by upper/lower-case letters and digits: each
    upper-case letter is preceded by an underscore and turned into a
    lower-case letter.