Given a string of characters

`P[1]P[2] ... P[m]`

,
called the `T[1]T[2] ... T[n]`

, called the `k`

such that
`T[k]=P[1]`

, ```
T[k+1]=P[2], ... ,
T[k+m-1]=P[m]
```

) or a failure message if the pattern does not
occur in the text. One naive way of accomplishing this task goes as
follows. We first align the pattern with the text so that
`P[1]`

is aligned with some `T[k]`

(initially
`k=1`

); then we compare `P[i]`

with
`T[k-1+i]`

for `i=1,2, ... ,m`

; as soon as a
mismatch is found, we increment `k`

by `1`

(which means sliding the pattern to the right by one space) and
iterate. Here is a formal description (with `j=k-1+i`

):
In the worst case, this algorithm makes as many as`i = 1`

,`j = 1`

while`j <= n`

doif`P[i] = T[j]`

then`i = i+1`

,`j = j+1`

else`j = j+2-i`

,`i = 1`

endif`i=m+1`

thenreturn`j+1-i`

endendreturn a failure message

`m(n+1-m)`

character comparisons (consider the pattern AA ... AAB and the text
AA ... AAB); improvements come from the observation that, when a
mismatch is found, we may often slide the pattern to the right by
more than one space and skip unnecessary character comparisons. Three
examples follow.
If the pattern is ABACABAB ... and we discover the mismatch between
`P[8] `

and `T[j]`

, then we may slide the
pattern over by four spaces and make `P[4]`

vs. `T[j]`

the next comparison. If the pattern is ABACABAC
... and we discover the mismatch between `P[8]`

and ```
T[j]
```

, then we may slide the pattern over by six spaces and make
`P[2]`

vs. `T[j]`

our next comparison. If the
pattern is ABACABA ... and we discover the mismatch between
`P[7]`

and ` T[j]`

, then we may slide the
pattern over by six spaces and make `P[1]`

vs. `T[j+1]`

our next comparison.

In general, when we discover a mismatch between `P[i]`

and
` T[j]`

, we want to slide the pattern over to the next
position where it could possibly occur within the text. In this new
position, `T[j]`

is aligned with some `P[t]`

such that `t < i`

; the pattern has a chance of matching
the text here only if

this can be the case only if`P[1]P[2] ... P[t]`

is a suffix of`T[1]T[2] ... T[j]`

;

since we do not want to miss any occurence of the pattern in the text, we choose the largest`P[1]P[2] ... P[t-1]`

is a suffix of`P[1]P[2] ... P[i-1]`

(which is a suffix of`T[1]T[2] ... T[j-1]`

)

and`P[t]`

differs from`P[i]`

;

`t`

(corresponding to
the smallest shift) with this property and we make `P[t]`

vs. `T[j]`

our next comparison. If there is no such
`t`

then we slide the pattern past `T[j]`

to
align `P[1]`

with `T[j+1]`

and we make
`P[1]`

vs. `T[j+1]`

our next comparison.
To summarize our findings, let us define `next[i]`

as
the largest `t`

such that

setting`P[1]P[2] ... P[t-1]`

is a suffix of`P[1]P[2] ... P[i-1]`

and`P[t]`

differs from`P[i]`

,

`next[i]=0`

if no such `t`

exists. Now the improved algorithm can be specified as
or more compactly as`i = 1`

,`j = 1`

while`j <= n`

doif`P[i] = T[j]`

then`i = i+1`

,`j = j+1`

else`i = next[i]`

if`i=0`

then`i = 1, j = j+1`

endendif`i=m+1`

thenreturn`j+1-i`

endendreturn a failure message

Note that`i = 1`

,`j = 1`

while`j <= n`

dowhile`i > 0`

and`P[i]`

differs from`T[j]`

do`i = next[i]`

end`i = i+1`

,`j = j+1`

if`i=m+1`

thenreturn`j+1-i`

endendreturn a failure message

`2j-i`

increases after each comparison of characters; since
`2j-i=1`

at the beginning and `2j-i <= 2n+1`

at all times, it
follows that the algorithm makes at most `2n`

character comparisons,
and so its running time is at most proportional to `n`

.
The condition `j <= n`

tested in the (outer)
**while** loop can be replaced by the more stringent ```
j <=
n-(m-i)
```

: if `P[i]`

is aligned with `T[j]`

and `j > n-(m-i)`

, then `P[m]`

is positioned
beyond `T[n]`

, the end of the text. With this modification,
the algorithm may return a failure message having read neither the
entire text nor the entire pattern (consider the pattern AB ... and
the text AA ... AAA). However, in practice (when ```
m <<
n
```

), it may be better to avoid testing this **while**
condition altogether by appending a copy of the pattern to the text
(that is, by setting `T[n+i]=P[i]`

for all ```
i=1,2,
... , m
```

): then the algorithm * always* finds a copy of
the pattern in the text and it returns a failure message if and only
if this copy is the ``sentinel'' ```
T[n+1]T[n+2] ...
T[n+m]
```

.

There remains the problem of computing the function `next[.]`

; one
efficient way of doing that goes as follows. We first set
`next[1]=0`

(which is always the correct value) and then compute the
values of `next[j]`

with `j=2,3, ... m`

one by one in `m-1`

iterations. The iteration in which `next[j]`

is computed begins with
two copies of the pattern aligned in such a way that

with`P[1]P[2] ... P[i-1]`

is a suffix of`P[1]P[2] ... P[j-1]`

`i`

less than `j`

and as large as possible
subject to these constraints (which can make `i`

as small
as 1); this iteration consists of first computing the value of
`next[j]`

and then preparing for the next iteration by
decrementing `i`

(if necessary and by the smallest
necessary amount) to
makeThe algorithm can be specified as`P[1]P[2] ... P[i]`

a suffix of`P[1]P[2] ... P[j]`

.

or, equivalently but with no pair of characters in the pattern compared more than once, as`next[1] = 0`

`i = 1`

,`j = 2`

while`j <= m`

doif`P[i] = P[j]`

thennext[j]= next[i]elsenext[j] = iendwhile`i > 0`

and`P[i]`

differs from`P[j]`

do`i = next[i]`

end`i = i+1`

,`j = j+1`

end

Again,`next[1] = 0`

`i = 1`

,`j = 2`

while`j <= m`

doif`P[i] = P[j]`

thennext[j]= next[i]else`next[j] = i`

,`i = next[i]`

while`i > 0`

and`P[i]`

differs from`P[j]`

do`i = next[i]`

endend`i = i+1`

,`j = j+1`

end

`2j-i`

increases after each comparison of characters; since
`2j-i=3`

at the beginning and `2j-i <= 2m+1`

at all times, it
follows that the algorithm makes at most `2m-2`

character comparisons,
and so its running time is at most proportional to `m`

.

We have defined `next[i]`

as the largest `t`

such that

(setting`P[1]P[2] ... P[t-1]`

is a suffix of`P[1]P[2] ... P[i-1]`

and`P[t]`

differs from`P[i]`

`next[i]=0`

if no such `t`

exists). Some textbooks drop the requirement that `P[t]`

differ from `P[i]`

in this definition, which leads to a
simplification in computing the function `next[.]`

: the
instructions
in the first version is replaced byif`P[i] = P[j]`

thennext[j]= next[i]elsenext[j] = iend

`next[j] = i`

. The modified
algorithm remains correct; like our version, it makes at most `2m-1`

character comparisons in computing the function `next[.]`

and at most
`2n-1`

character comparisons in scanning the text. However, our
version is more efficient: for example, if the pattern is ABCABCACAB
then our values of `next[1], next[2], ... ,next[10]`

are
`0,1,1,0,1,1,0,5,0,1`

,
`next[1], next[2], ... ,next[10]`

equal to
`0,1,1,1,2,3,4,5,1,2`

.

These notes are based on
D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt,
"Fast pattern matching in strings",
* SIAM J. Computing*
**6** (1977), 323--350.

Back to the table of contents of Vašek Chvátal's course notes