Given:
pattern P = p1...pm
text T = t1...tn
Find: the first place where P occurs as a substring of T, that is,
the minimum i such that
ti...ti+m-1 = P
[equivalent to the PL/I function INDEX]
The obvious search algorithm tries to match P starting at the beginning of T and, when the match fails, moves P one position to the right and restarts. This algorithm has a worst-case time complexity of O(mn). We will show an O(m + n) algorithm.
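For comparison, the obvious algorithm can be sketched in Python as follows. This is only an illustration: the function name naive_index and the convention of returning 0 when P does not occur are assumptions, not part of the notes. The result is 1-based, like the index i above.

```python
def naive_index(pattern, text):
    """Return the smallest 1-based i with t_i...t_{i+m-1} = P, or 0 if none."""
    m, n = len(pattern), len(text)
    for i in range(n - m + 1):          # 0-based candidate start in T
        j = 0
        # Compare P against T character by character from this start.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                      # all m characters matched
            return i + 1                # convert to the notes' 1-based index
        # Otherwise shift P one position and restart from scratch.
    return 0
```

The inner loop can run up to m steps for each of roughly n starting positions, which is where the O(mn) worst case comes from.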
We iterate over k, finding for each k the longest suffix of T[1:k] that is a prefix of P. Let the length of that longest suffix be L(k) = j.
Let's say we have already found L(1)...L(k).
To find L(k+1):
If tk+1 = pj+1
then L(k+1) = j+1.
Otherwise... we will use a failure function,
f(j) = largest q < j such that
P[1:q] is a suffix of P[1:j].
Since P[1:j] matches the end of T[1:k], so will P[1:q].
We know pj+1 ≠ tk+1, but perhaps pq+1 = tk+1.
Define f^(1)(j) = f(j), and f^(i)(j) = f( f^(i-1)(j) ) for i > 1.
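To make the definition concrete, here is a small Python sketch that computes f(j) directly from its definition (by brute force, not efficiently) and lists the iterates f^(1)(j), f^(2)(j), ... down to 0. The names failure_by_definition and iterates are hypothetical helpers introduced only for this illustration.

```python
def failure_by_definition(p, j):
    """Largest q < j such that P[1:q] is a suffix of P[1:j] (1-based, inclusive)."""
    prefix = p[:j]                      # P[1:j] in the notes' notation
    for q in range(j - 1, -1, -1):      # try the largest candidate q first
        if prefix.endswith(p[:q]):      # is P[1:q] a suffix of P[1:j]?
            return q
    return 0

def iterates(p, j):
    """Return [f(j), f(f(j)), ...], stopping once the value 0 is reached."""
    chain = []
    while j > 0:
        j = failure_by_definition(p, j)
        chain.append(j)
    return chain
```

For P = ababa and j = 5 the chain is 3, 1, 0: the prefixes aba, a and the empty string are exactly the prefixes of P that are also proper suffixes of ababa.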
(Otherwise... ) If tk+1 ≠ pj+1 then apply f to j repeatedly until we find a prefix of P that matches the end of T[1:k] and that also matches with tk+1.
That is, find the minimum i such that
either 1. f^(i)(j) = u and tk+1 = pu+1
          → L(k+1) = u+1
or     2. f^(i)(j) = 0 and tk+1 ≠ p1
          → L(k+1) = 0
To compute f(j+1):
Let q = f(j). If pj+1 = pq+1 then f(j+1) = f(j) + 1 = q+1, because p1...pq pq+1 = pj-q+1...pj pj+1.
If pj+1 ≠ pq+1 then
find the minimum i such that
either 1. f^(i)(j) = u and pj+1 = pu+1
          → f(j+1) = u+1
or     2. f^(i)(j) = 0 and pj+1 ≠ p1
          → f(j+1) = 0
Compute failure function f for P = p1...pm
    f[1] := 0
    q := 0
    for j := 2 to m do
        while q > 0 and pq+1 ≠ pj do
1:          q := f[q]
        if pq+1 = pj then
2:          q := q+1
        /** else q = 0 **/
        f[j] := q
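q is increased only in statement 2, by one each time, and statement 2 is executed at most once per iteration of the for loop, so fewer than m times in total. Statement 1 strictly decreases q (because f[q] < q) and q never drops below 0, so over the whole run statement 1 can be executed at most as many times as statement 2. Thus, the time complexity of this algorithm is O(m).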
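The pseudocode above translates almost line for line into Python. The sketch below assumes P is given as a Python string (0-based), so the notes' pq+1 becomes p[q] and pj becomes p[j-1]; the returned list is padded so that f[j] keeps the notes' 1-based meaning. The name failure_function is chosen here for illustration.

```python
def failure_function(p):
    """Return f as a list of length m+1; f[j] is defined for 1 <= j <= m."""
    m = len(p)
    f = [0] * (m + 1)                   # f[0] is unused padding, f[1] = 0
    q = 0
    for j in range(2, m + 1):
        # Statement 1: fall back to shorter prefixes while the next
        # character p_{q+1} does not match p_j.
        while q > 0 and p[q] != p[j - 1]:
            q = f[q]
        # Statement 2: extend the matched prefix by one character.
        if p[q] == p[j - 1]:
            q += 1
        f[j] = q                        # q is 0 here if nothing matched
    return f
```

For P = ababaca this gives f(1), ..., f(7) = 0, 0, 1, 2, 3, 0, 1.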
Compute match function L for P[1:m] as substring of T[1:n]
    compute failure function f for P
    L[0] := 0
    j := 0
    for k := 1 to n do
        while j > 0 and pj+1 ≠ tk do
            j := f[j]
        if pj+1 = tk then
            j := j+1
        /** else j = 0 **/
        L[k] := j
        if j = m then /** full match **/
            j := f[j]
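Using a time analysis similar to the one above (j is incremented at most n times, so it is decreased at most n times), the scan of T takes O(n) time. Together with the O(m) computation of f, the time complexity of this algorithm is O(m + n).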
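Putting the two procedures together, a minimal Python sketch of the whole search might look like this. The names match_lengths and kmp_index are hypothetical, the 0 return value for "not found" is an assumed convention, and the code assumes m >= 1 (an empty pattern is rejected explicitly).

```python
def match_lengths(p, t):
    """Return L as a list of length n+1, where L[k] is the length of the
    longest suffix of T[1:k] that is a prefix of P (1-based k)."""
    m, n = len(p), len(t)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # Failure function f, as in the first procedure of the notes.
    f = [0] * (m + 1)
    q = 0
    for j in range(2, m + 1):
        while q > 0 and p[q] != p[j - 1]:
            q = f[q]
        if p[q] == p[j - 1]:
            q += 1
        f[j] = q
    # Scan T once, maintaining j = L(k).
    L = [0] * (n + 1)
    j = 0
    for k in range(1, n + 1):
        while j > 0 and p[j] != t[k - 1]:   # p[j] is the notes' p_{j+1}
            j = f[j]
        if p[j] == t[k - 1]:
            j += 1
        L[k] = j
        if j == m:                          # full match ending at position k
            j = f[j]                        # fall back so p[j] stays in range
    return L

def kmp_index(p, t):
    """Smallest 1-based i with t_i...t_{i+m-1} = P, or 0 if P does not occur."""
    m = len(p)
    L = match_lengths(p, t)
    for k in range(1, len(t) + 1):
        if L[k] == m:
            return k - m + 1                # the match ends at k, so it starts here
    return 0
```

For example, with P = aba and T = ababa the L values for k = 1, ..., 5 are 1, 2, 3, 2, 3, so the first full match ends at k = 3 and starts at i = 1. Keeping the whole L array mirrors the notes; a search that only needs the first occurrence could return as soon as j reaches m.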