Introduction:

Regular expressions are general patterns that we can specify to search for and
replace text in strings. Python includes a module, named re, to perform various
operations with regular expressions. In this lecture we will cover the form of
regular expressions, which functions can use regular expressions, and how to
use the results of matching regular expressions against text (match groups).

------------------------------------------------------------------------------

General
  Matches the largest number of characters possible (called a greedy algorithm)
  Characters generally match themselves, except for the following...

Patterns/Metacharacters
.                     Matches any single character (except newline, by default)
RaRb                  Matches a sequence of Ra followed by Rb
Ra|Rb                 Matches either alternative Ra or Rb
[]                    Matches one character specified in []
[^]                   Matches one character that is NOT specified in [] after ^;
                      e.g., [^aeiouy]
-                     In [], matches any character in range: e.g., a-z matches
                      all lower-case letters
R?                    Matches regular expression R 0 or 1 times: R is optional
R*                    Matches regular expression R 0 or more times
R+                    Matches regular expression R 1 or more times
R{m}                  Matches regular expression R exactly m times
R{m,n}                Matches regular expression R at least m and at most n times
R??,R*?,R+?,R{m,n}?   The postfix ? means match as few characters as possible
                      (not greedy)

Parentheses are used for grouping, but also to specify remembered subpatterns
  By placing subpattern R in parentheses, the substring matching R will be
  remembered (either by its number, starting at 1, or its name, if named) in a
  group, for use later in the pattern or when substituting for the pattern.

(R)                   Matches R and delimits a group (1...)
                      (remembering matched substring)
(?:R)                 Matches R but does not delimit a group (not remembered)
(?P<name>R)           Matches R and delimits a group (remembering matched
                      substring) using name for the group (it is still
                      numbered as well); see (?P=name) and the groupdict
                      method below
(?P=name)             Matches the remembered group (substring) named name, for
                      backreferencing
(?=R)                 Lookahead: succeeds if R matches at this position, but
                      consumes no input and does not delimit a group
(?!R)                 Negative lookahead: succeeds if R does NOT match at this
                      position, and consumes no input

Anchors (these don't match characters)
^                     beginning of the text, or of each line with the MULTILINE
                      flag (when not used in [])
$                     end of the text, or of each line with the MULTILINE flag

Context
-                     When not in [], or in [] but not between two characters,
                      - means itself
Most special characters are treated as themselves in []
(Generally, if interpreting a character makes no sense one way, try to find
 another way to interpret it that fits the context)

Escapes
\                     Used before .|[]-?*+{}()^$\ (and other special
                      characters) to match them literally
\#                    Backreferencing group # (numbered from 1, 2, ...)
\t                    tab
\n                    newline
\r                    carriage return
\f                    formfeed
\v                    vertical tab
\d   [0-9]            digit
\D   [^0-9]           non digit
\s   [ \t\n\r\f\v]    white space
\S   [^ \t\n\r\f\v]   non white space
\w   [a-zA-Z0-9_]     alphanumeric (or underscore)
\W   [^a-zA-Z0-9_]    non alphanumeric

------------------------------------------------------------------------------

re methods: call like re.match(...)

Returns a regex (compiled pattern) object (see how to call regex methods below)
compile  (pattern [,flags])            creates compiled pattern object

Returns a match object, holding the groups (0,1,...), or None if no match
match    (pattern, string [,flags])    Matches must start at the beginning
search   (pattern, string [,flags])    Matches can start anywhere

Returns a list of strings/of tuples of strings (the groups), specifying matches
findall  (pattern, string [,flags])    Matches can start anywhere

Returns an iterator of match objects, one per match that findall would find
(ignore this one)
finditer (pattern, string [,flags])    returns iterable equivalent of findall

Returns a list of strings: much like calling string.split("...")
split    (pattern, string [,maxsplit, flags])
    like string.split(), but using pattern: re.split(r"\.|-", "a.b-c") returns
    ['a','b','c'], which string.split(...) can't do (with ... = ".-" it splits
    only on . followed by -). The . must be escaped in the pattern, because an
    unescaped . matches any character.
    If the pattern has groups, then the string matching each group is included
    in the resulting list too

Returns a string
sub      (pattern, repl, string [,count, flags])
    in string, replace pattern by repl, which may be a string that refers to
    matched groups via \# (e.g., \1), \g<#> (e.g., \g<1>), or \g<name> (where
    name comes from ?P<name>), or a function that is passed a match object;
    if there is no match, then returns string unchanged
subn     same as sub but returns a tuple: (new string, number of subs)
escape   (string)
    returns string with special (non-alphanumeric) characters backslashed

In findall and sub/subn, only non-overlapping matches are found/replaced: in
text aaaa there are two non-overlapping occurrences of the pattern aa, starting
at indexes 0 and 2 (not at index 1, which overlaps with the previous match in
indexes 0-1)

------------------------------------------------------------------------------

regex (compiled pattern) object methods (see the compile method above); call
like c = re.compile(p), then many times call c.match(...)

This feature allows us to compile a pattern and reuse it for all the
operations above: re.match(p,s) is just like re.compile(p).match(s); if we are
doing many matches with the same pattern, compile the pattern once and use it
with the match method below many times.

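The compile-once, reuse-many-times idea can be sketched as follows. This is only
an illustration: the pattern, the sample strings, and the variable names are
made up for the example and do not come from the lecture.

```python
import re

# Hypothetical pattern and data, chosen only for illustration.
word = re.compile(r'[a-z]+')
lines = ['abc 123', '456 def', '789']

# search on the compiled object behaves like re.search(pattern, string)
compiled_hits = [bool(word.search(line)) for line in lines]
direct_hits = [bool(re.search(r'[a-z]+', line)) for line in lines]

# match must start at the beginning of the string; search may start anywhere
starts_with_word = [bool(word.match(line)) for line in lines]

# only non-overlapping matches are found: aa occurs in aaaa at indexes 0 and 2
aa_starts = [m.start() for m in re.compile('aa').finditer('aaaa')]

# pos tells a compiled-pattern method where in the string to begin looking
from_pos = word.search('abc def', 3).group(0)

# an unescaped . matches any character, so escape it to split on a real period
parts = re.split(r'\.|-', 'a.b-c')
```

Here compiled_hits and direct_hits come out the same, which is the point of the
equivalence between re.search(p,s) and re.compile(p).search(s).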
match    (string [,pos][,endpos])      See match above, with pos/endpos
search   (string [,pos][,endpos])      See search above, with pos/endpos
findall  (string [,pos][,endpos])      See findall above, with pos/endpos
finditer (string [,pos][,endpos])      See finditer above, with pos/endpos
split    (string [,maxsplit])          See split above (no pos/endpos)
sub      (repl, string [,count])       See sub above (no pos/endpos)
subn     (repl, string [,count])       See subn above (no pos/endpos)

------------------------------------------------------------------------------

Match objects and Groups

Calling match/search produces None or a match object (which holds the groups)
Calling findall produces a list of strings (if the pattern has zero or one
  group) or a list of tuples of strings (if it has more than one group); the
  list is empty if nothing matches
Calling finditer produces an iterator of match objects (ignore this one)

Each group is indexed by a number or name (when the group was delimited by
(?P<name>)); group 0 is all the characters in the match, groups 1-n are for
delimited matches inside. For example, in the pattern (a)(b(c)) the a is in
group 1, the bc is in group 2, and the c is in group 3: groups are numbered in
the order in which we reach their opening parentheses.

If a group matches multiple times (e.g., a(.)+c), only its last match is
available, so axyzc has group 1, the (.), bound to z.

Printing the match object itself shows only its span and the entire matched
text; calling groups() produces a tuple of the matching characters for each
group 1-n, not group #0.

We can look at each resulting group by its number (including group #0), using
any of the following match object methods

group(g)              text of group with specified name or number
group(g1,g2, ...)     tuple of text of groups with specified names or numbers
groups()              tuple of text of all groups (can iterate over tuple)
groupdict()           text of all named groups as dict (see ?P<name>)
start([group])        starting index of group (or entire matched string)
end([group])          ending index of group (or entire matched string): one
                      past its last character
span([group])         tuple: (start([group]), end([group]))

Interesting Equivalences

a(b|c)d         == a[bc]d
R{0,1}          == R?
[ \t]*          != ( *|\t*) but does == ( |\t)*

For more complete information, see Section 6.2 of the Python Standard Library.

(Loose Ends)

Raw Strings

When writing Python code that specifies string literals as arguments to the
regular expression methods, it is best to use raw strings: they are written in
the form r'...'. The issue is the backslash, which is sometimes necessary in a
pattern. For example, in regular strings when you write '\n' Python turns that
into a 1 character string with the newline character: len('\n') is 1. But with
raw strings, writing r'\n' specifies a string with a backslash followed by an
n: len(r'\n') is 2.

**d (where d is a dict)

If we call a function we can specify **d as one or more of its arguments. For
each **d, Python uses all its keys as parameter names and all its values as
the arguments for these parameter names. For example

  f(**{'b':2, 'a':1, 'c':3})

is translated into

  f(b=2,a=1,c=3)

Note that this is useful in regular expressions if we use the (?P<name> ...)
option and then the groupdict() method for the match it produces.

There is also a version that works the other way. Suppose we have a function
whose header is

  def f(x,y,**kargs):  # The typical name is **kargs

if we call it by

  f(1,2,a=3,b=4,c=5)

then
  x     is bound to 1
  y     is bound to 2
  kargs is bound to a dictionary {'b':4, 'a':3, 'c':5}

See the argument/parameter matching rules for a complete description of what
happens.

So (in reverse of the order explained above) ** as a parameter creates a
dictionary of "extra" named-arguments supplied when the function is called, and
** as an argument supplies lots of named-arguments to the function call.

We will cover this information again when we examine inheritance.
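
The pieces in this section (raw strings, named groups, groupdict, and ** in both
directions) can be combined in a small sketch. The date pattern, the describe
function, and the sample text are hypothetical, invented only to illustrate the
mechanics.

```python
import re

# Hypothetical date pattern; the raw string lets \d reach re with its backslash.
date = re.compile(r'(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4})')

def describe(month, day, **kargs):
    # month and day are bound by name; any other named-arguments land in kargs
    return f"month={month} day={day} extras={sorted(kargs)}"

m = date.match('10/31/2024')
fields = m.groupdict()        # named groups as a dict: name -> matched text
summary = describe(**fields)  # like describe(month='10', day='31', year='2024')

# \g<name> in the replacement string refers back to a named group
iso = date.sub(r'\g<year>-\g<month>-\g<day>', 'due 10/31/2024')

# raw versus regular string literals
lengths = (len('\n'), len(r'\n'))
```

Because describe names only month and day as parameters, the year key from
groupdict() ends up in kargs.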