## The Knuth-Morris-Pratt string-matching algorithm

(Lecture notes written by Vašek Chvátal)

Given a string of characters `P[1]P[2] ... P[m]`, called the pattern, and a string of characters `T[1]T[2] ... T[n]`, called the text, we are required to return either the first occurence of the pattern in the text (that is, the smallest `k` such that `T[k]=P[1]`, ```T[k+1]=P[2], ... , T[k+m-1]=P[m]```) or a failure message if the pattern does not occur in the text. One naive way of accomplishing this task goes as follows. We first align the pattern with the text so that `P[1]` is aligned with some `T[k]` (initially `k=1`); then we compare `P[i]` with `T[k-1+i]` for `i=1,2, ... ,m`; as soon as a mismatch is found, we increment `k` by `1` (which means sliding the pattern to the right by one space) and iterate. Here is a formal description (with `j=k-1+i`):
````i = 1`, `j = 1`
while `j <= n`
do    if   `P[i] = T[j]`
then `i = i+1`, `j = j+1`
else `j = j+2-i`, `i = 1`
end
if `i=m+1` then return `j+1-i` end
end
return a failure message
```
In the worst case, this algorithm makes as many as `m(n+1-m)` character comparisons (consider the pattern AA ... AAB and the text AA ... AAB); improvements come from the observation that, when a mismatch is found, we may often slide the pattern to the right by more than one space and skip unnecessary character comparisons. Three examples follow.

If the pattern is ABACABAB ... and we discover the mismatch between `P[8] ` and `T[j]`, then we may slide the pattern over by four spaces and make `P[4]` vs. `T[j]` the next comparison. If the pattern is ABACABAC ... and we discover the mismatch between `P[8]` and ``` T[j]```, then we may slide the pattern over by six spaces and make `P[2]` vs. `T[j]` our next comparison. If the pattern is ABACABA ... and we discover the mismatch between `P[7]` and ` T[j]`, then we may slide the pattern over by six spaces and make `P[1]` vs. `T[j+1]` our next comparison.

In general, when we discover a mismatch between `P[i]` and ` T[j]`, we want to slide the pattern over to the next position where it could possibly occur within the text. In this new position, `T[j]` is aligned with some `P[t]` such that `t < i`; the pattern has a chance of matching the text here only if

`P[1]P[2] ... P[t]` is a suffix of `T[1]T[2] ... T[j]`;
this can be the case only if
`P[1]P[2] ... P[t-1]` is a suffix of `P[1]P[2] ... P[i-1]`
(which is a suffix of `T[1]T[2] ... T[j-1]`)
and `P[t]` differs from `P[i]`;
since we do not want to miss any occurence of the pattern in the text, we choose the largest `t` (corresponding to the smallest shift) with this property and we make `P[t]` vs. `T[j]` our next comparison. If there is no such `t` then we slide the pattern past `T[j]` to align `P[1]` with `T[j+1]` and we make `P[1]` vs. `T[j+1]` our next comparison.

To summarize our findings, let us define `next[i]` as the largest `t` such that

`P[1]P[2] ... P[t-1]` is a suffix of `P[1]P[2] ... P[i-1]`
and `P[t]` differs from `P[i]`,
setting `next[i]=0` if no such `t` exists. Now the improved algorithm can be specified as
````i = 1`, `j = 1`
while `j <= n`
do    if   `P[i] = T[j]`
then `i = i+1`, `j = j+1`
else `i = next[i]`
if `i=0` then `i = 1, j = j+1` end
end
if `i=m+1` then return `j+1-i` end
end
return a failure message
```
or more compactly as
````i = 1`, `j = 1`
while `j <= n`
do    while `i > 0` and `P[i]` differs from `T[j]`
do    `i = next[i]`
end
`i = i+1`, `j = j+1`
if `i=m+1` then return `j+1-i` end
end
return a failure message
```
Note that `2j-i` increases after each comparison of characters; since `2j-i=1` at the beginning and `2j-i <= 2n+1` at all times, it follows that the algorithm makes at most `2n` character comparisons, and so its running time is at most proportional to `n`.

The condition `j <= n` tested in the (outer) while loop can be replaced by the more stringent ```j <= n-(m-i)```: if `P[i]` is aligned with `T[j]` and `j > n-(m-i)`, then `P[m]` is positioned beyond `T[n]`, the end of the text. With this modification, the algorithm may return a failure message having read neither the entire text nor the entire pattern (consider the pattern AB ... and the text AA ... AAA). However, in practice (when ```m << n```), it may be better to avoid testing this while condition altogether by appending a copy of the pattern to the text (that is, by setting `T[n+i]=P[i]` for all ```i=1,2, ... , m```): then the algorithm always finds a copy of the pattern in the text and it returns a failure message if and only if this copy is the ``sentinel'' ```T[n+1]T[n+2] ... T[n+m]```.

There remains the problem of computing the function `next[.]`; one efficient way of doing that goes as follows. We first set `next[1]=0` (which is always the correct value) and then compute the values of `next[j]` with `j=2,3, ... m` one by one in `m-1` iterations. The iteration in which `next[j]` is computed begins with two copies of the pattern aligned in such a way that

`P[1]P[2] ... P[i-1]` is a suffix of `P[1]P[2] ... P[j-1]`
with `i` less than `j` and as large as possible subject to these constraints (which can make `i` as small as 1); this iteration consists of first computing the value of `next[j]` and then preparing for the next iteration by decrementing `i` (if necessary and by the smallest necessary amount) to
make `P[1]P[2] ... P[i]` a suffix of `P[1]P[2] ... P[j]`.
The algorithm can be specified as
````next[1] = 0`
`i = 1`, `j = 2`
while `j <= m`
do    if   `P[i] = P[j]`
then next[j]= next[i]
else next[j] = i
end
while `i > 0` and `P[i]` differs from `P[j]`
do    `i = next[i]`
end
`i = i+1`, `j = j+1`
end
```
or, equivalently but with no pair of characters in the pattern compared more than once, as
````next[1] = 0`
`i = 1`, `j = 2`
while `j <= m`
do    if   `P[i] = P[j]`
then next[j]= next[i]
else `next[j] = i`, `i = next[i]`
while `i > 0` and `P[i]` differs from `P[j]`
do    `i = next[i]`
end
end
`i = i+1`, `j = j+1`
end
```
Again, `2j-i` increases after each comparison of characters; since `2j-i=3` at the beginning and `2j-i <= 2m+1` at all times, it follows that the algorithm makes at most `2m-2` character comparisons, and so its running time is at most proportional to `m`.

WARNING

We have defined `next[i]` as the largest `t` such that

`P[1]P[2] ... P[t-1]` is a suffix of `P[1]P[2] ... P[i-1]`
and `P[t]` differs from `P[i]`
(setting `next[i]=0` if no such `t` exists). Some textbooks drop the requirement that `P[t]` differ from `P[i]` in this definition, which leads to a simplification in computing the function `next[.]`: the instructions
```      if   `P[i] = P[j]`
then next[j]= next[i]
else next[j] = i
end
```
in the first version is replaced by `next[j] = i`. The modified algorithm remains correct; like our version, it makes at most `2m-1` character comparisons in computing the function `next[.]` and at most `2n-1` character comparisons in scanning the text. However, our version is more efficient: for example, if the pattern is ABCABCACAB then our values of `next[1], next[2], ... ,next[10]` are
`0,1,1,0,1,1,0,5,0,1`,
but the modified version has `next[1], next[2], ... ,next[10]` equal to
`0,1,1,1,2,3,4,5,1,2`.

These notes are based on D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt, "Fast pattern matching in strings", SIAM J. Computing 6 (1977), 323--350.