String matching -- Knuth-Morris-Pratt

(see AHU pp.330-332 or CLR pp.1002-1011)

Given:
    pattern P = p1...pm
    text T = t1...tn

Find:   the first place where P occurs as a substring of T, that is, the minimum i such that ti...ti+m-1 = P
[equivalent to the PL/I function INDEX]

The obvious search algorithm tries to match P starting at the beginning of T and, when the match fails, moves P one position to the right and restarts.   This algorithm has a worst-case time complexity of O(mn).   We will show an O(m + n) algorithm.
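The obvious algorithm can be sketched as follows. This is only an illustration: the name naive_index is hypothetical, and returning 0 when P does not occur is an assumption made here so that positions can stay 1-based as in these notes.

```python
def naive_index(pattern: str, text: str) -> int:
    """Return the smallest 1-based i with t_i ... t_{i+m-1} = P, or 0 if none.

    Worst case O(mn): each of the n-m+1 alignments may compare up to m symbols.
    """
    m, n = len(pattern), len(text)
    for start in range(n - m + 1):          # 0-based alignment of P against T
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:                          # all m symbols matched
            return start + 1                # convert to 1-based position
    return 0                                # assumption: 0 signals "not found"
```

A pattern such as "aaab" searched in a long run of a's followed by one b forces nearly m comparisons at every alignment, which is where the O(mn) bound comes from.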

We iterate over k = 1...n, finding, for each k, the longest suffix of T[1:k] that is a prefix of P.   Let the length of that longest suffix be L(k) = j.

Let's say we have already found L(1)...L(k).

To find L(k+1):

If tk+1 = pj+1 then L(k+1) = j+1.
Otherwise...   we will use a failure function,
f(j) = largest q < j such that P[1:q] is a suffix of P[1:j].
Since P[1:j] matches the end of T[1:k], so does P[1:q].   We know that pj+1 ≠ tk+1, but perhaps pq+1 = tk+1.
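To make the definition of f concrete, here is a minimal sketch that evaluates it directly, by trying every q < j from largest to smallest. It is deliberately slow and is not the method developed in these notes; the name failure_by_definition is hypothetical.

```python
from typing import List


def failure_by_definition(pattern: str) -> List[int]:
    """Return f as a list with f[j] for j = 1..m (index 0 is unused).

    f[j] = largest q < j such that P[1:q] is a suffix of P[1:j].
    Brute force straight from the definition, for illustration only.
    """
    m = len(pattern)
    f = [0] * (m + 1)
    for j in range(1, m + 1):
        for q in range(j - 1, 0, -1):               # try the largest q first
            if pattern[:q] == pattern[j - q:j]:     # P[1:q] == last q symbols of P[1:j]
                f[j] = q
                break
    return f


print(failure_by_definition("abaabab")[1:])   # [0, 0, 1, 1, 2, 3, 2]
```

For P = abaabab, f(6) = 3 because aba is both a prefix and a suffix of abaaba, and no longer proper prefix has that property.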

Define f^(1)(j) = f(j), and f^(i)(j) = f( f^(i-1)(j) ) for i > 1, so that f^(i) denotes f applied i times.

(Otherwise...  )   If tk+1 ≠ pj+1 then apply f to j repeatedly until we find a prefix of P that matches the end of T[1:k] and whose next character matches tk+1.

That is, find the minimum i such that
either 1. f^(i)(j) = u and tk+1 = pu+1 → L(k+1) = u+1
  or     2. f^(i)(j) = 0 and tk+1 ≠ p1 → L(k+1) = 0

To compute f

f(1) = 0 by definition.   Assume that we have f(1)...f(j).   Let f(j) = q.

To compute f(j+1):

If pj+1 = pq+1 then f(j+1) = f(j) + 1 = q+1 because p1...pq pq+1 = pj-q+1...pj pj+1.

If pj+1 ≠ pq+1 then find the minimum i such that
either 1. f^(i)(j) = u and pj+1 = pu+1 → f(j+1) = u+1
  or     2. f^(i)(j) = 0 and pj+1 ≠ p1 → f(j+1) = 0

   Compute failure function f for P = p1...pm
      f[1] := 0
      q := 0
      for j := 2 to m do
         while q > 0 and pq+1 ≠ pj do
1:          q := f[q]
         if pq+1 = pj then
2:          q := q+1
         /** else q = 0 **/
         f[j] := q

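q is increased only in statement 2, which executes at most once per iteration of the for loop, so q is incremented (by one each time) fewer than m times in total.   Every execution of statement 1 decreases q by at least one, since f[q] < q, and q never drops below 0.   Therefore statement 1 executes fewer than m times in total.   Thus, the time complexity of this algorithm is O(m).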
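The pseudocode above translates almost line for line into Python. The sketch below assumes the pattern is a Python string and keeps the 1-based indexing of the notes by leaving index 0 of the list unused; the name failure_function is hypothetical.

```python
from typing import List


def failure_function(pattern: str) -> List[int]:
    """Return f as a list with f[j] for j = 1..m (index 0 is unused), in O(m) time."""
    m = len(pattern)
    f = [0] * (m + 1)                  # f[1] = 0 by definition
    q = 0
    for j in range(2, m + 1):
        # pattern[q] is p_{q+1} and pattern[j - 1] is p_j (Python is 0-based)
        while q > 0 and pattern[q] != pattern[j - 1]:
            q = f[q]                   # statement 1: fall back to a shorter prefix
        if pattern[q] == pattern[j - 1]:
            q += 1                     # statement 2: extend the matched prefix
        f[j] = q                       # q is 0 here if no prefix could be extended
    return f


print(failure_function("abaabab")[1:])   # [0, 0, 1, 1, 2, 3, 2]
```

At j = 7 the loop compares p4 = a with p7 = b, falls back from q = 3 to f[3] = 1, and then extends to q = 2, which is the one place in this example where statement 1 executes after a multi-step match.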

   Compute match function L for P[1:m] as substring of T[1:n]
      compute failure function f for P
      L[0] := 0
      j := 0
      for k := 1 to n do
         while j > 0 and pj+1 ≠ tk do
            j := f[j]
         if pj+1 = tk then
            j := j+1
         /** else j = 0 **/
         L[k] := j
         if j = m then  /** full match **/
            j := f[j]

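By a time analysis similar to the one above, j is incremented at most n times and every execution of j := f[j] decreases j by at least one, so the matching loop takes O(n) time.   Adding the O(m) cost of computing f, the time complexity of this algorithm is O(m + n).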
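Putting the two pieces together, the matching algorithm might look like this in Python. It is a sketch under a few assumptions that are not in the pseudocode: the function names are hypothetical, an empty pattern is rejected, and the 1-based starting positions of all full matches are collected alongside the array L.

```python
from typing import List, Tuple


def failure_function(pattern: str) -> List[int]:
    """f[j] for j = 1..m (index 0 unused), computed as in the pseudocode above."""
    m = len(pattern)
    f = [0] * (m + 1)
    q = 0
    for j in range(2, m + 1):
        while q > 0 and pattern[q] != pattern[j - 1]:
            q = f[q]
        if pattern[q] == pattern[j - 1]:
            q += 1
        f[j] = q
    return f


def kmp_match(pattern: str, text: str) -> Tuple[List[int], List[int]]:
    """Return (L, starts): L[k] for k = 0..n, and 1-based starts of full matches."""
    m, n = len(pattern), len(text)
    if m == 0:
        raise ValueError("pattern must be non-empty")   # assumption of this sketch
    f = failure_function(pattern)
    L = [0] * (n + 1)                  # L[0] = 0
    starts = []
    j = 0
    for k in range(1, n + 1):
        # pattern[j] is p_{j+1} and text[k - 1] is t_k (Python is 0-based)
        while j > 0 and pattern[j] != text[k - 1]:
            j = f[j]
        if pattern[j] == text[k - 1]:
            j += 1
        L[k] = j
        if j == m:                     # full match ending at position k
            starts.append(k - m + 1)
            j = f[j]                   # keep j < m so p_{j+1} exists next round
    return L, starts


print(kmp_match("aba", "ababa"))   # ([0, 1, 2, 3, 2, 3], [1, 3])
```

The first element of starts, when there is one, is the minimum i asked for in the problem statement. Resetting j to f[j] after a full match is what allows overlapping occurrences, such as the two copies of aba in ababa, to be found.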


Dan Hirschberg
Last modified: Aug 23, 2007