P[1]P[2] ... P[m]
,
called the pattern, and a string of characters
T[1]T[2] ... T[n]
, called the text, we are
required to return either the first occurence of the pattern in the
text (that is, the smallest k
such that
T[k]=P[1]
, T[k+1]=P[2], ... ,
T[k+m-1]=P[m]
) or a failure message if the pattern does not
occur in the text. One naive way of accomplishing this task goes as
follows. We first align the pattern with the text so that
P[1]
is aligned with some T[k]
(initially
k=1
); then we compare P[i]
with
T[k-1+i]
for i=1,2, ... ,m
; as soon as a
mismatch is found, we increment k
by 1
(which means sliding the pattern to the right by one space) and
iterate. Here is a formal description (with j=k-1+i
):
In the worst case, this algorithm makes as many asi = 1
,j = 1
whilej <= n
do ifP[i] = T[j]
theni = i+1
,j = j+1
elsej = j+2-i
,i = 1
end ifi=m+1
then returnj+1-i
end end return a failure message
m(n+1-m)
character comparisons (consider the pattern AA ... AAB and the text
AA ... AAB); improvements come from the observation that, when a
mismatch is found, we may often slide the pattern to the right by
more than one space and skip unnecessary character comparisons. Three
examples follow.
If the pattern is ABACABAB ... and we discover the mismatch between
P[8]
and T[j]
, then we may slide the
pattern over by four spaces and make P[4]
vs. T[j]
the next comparison. If the pattern is ABACABAC
... and we discover the mismatch between P[8]
and
T[j]
, then we may slide the pattern over by six spaces and make
P[2]
vs. T[j]
our next comparison. If the
pattern is ABACABA ... and we discover the mismatch between
P[7]
and T[j]
, then we may slide the
pattern over by six spaces and make P[1]
vs. T[j+1]
our next comparison.
In general, when we discover a mismatch between P[i]
and
T[j]
, we want to slide the pattern over to the next
position where it could possibly occur within the text. In this new
position, T[j]
is aligned with some P[t]
such that t < i
; the pattern has a chance of matching
the text here only if
this can be the case only ifP[1]P[2] ... P[t]
is a suffix ofT[1]T[2] ... T[j]
;
since we do not want to miss any occurence of the pattern in the text, we choose the largestP[1]P[2] ... P[t-1]
is a suffix ofP[1]P[2] ... P[i-1]
(which is a suffix ofT[1]T[2] ... T[j-1]
)
andP[t]
differs fromP[i]
;
t
(corresponding to
the smallest shift) with this property and we make P[t]
vs. T[j]
our next comparison. If there is no such
t
then we slide the pattern past T[j]
to
align P[1]
with T[j+1]
and we make
P[1]
vs. T[j+1]
our next comparison.
To summarize our findings, let us define next[i]
as
the largest t
such that
settingP[1]P[2] ... P[t-1]
is a suffix ofP[1]P[2] ... P[i-1]
andP[t]
differs fromP[i]
,
next[i]=0
if no such t
exists. Now the improved algorithm can be specified as
or more compactly asi = 1
,j = 1
whilej <= n
do ifP[i] = T[j]
theni = i+1
,j = j+1
elsei = next[i]
ifi=0
theni = 1, j = j+1
end end ifi=m+1
then returnj+1-i
end end return a failure message
Note thati = 1
,j = 1
whilej <= n
do whilei > 0
andP[i]
differs fromT[j]
doi = next[i]
endi = i+1
,j = j+1
ifi=m+1
then returnj+1-i
end end return a failure message
2j-i
increases after each comparison of characters; since
2j-i=1
at the beginning and 2j-i <= 2n+1
at all times, it
follows that the algorithm makes at most 2n
character comparisons,
and so its running time is at most proportional to n
.
The condition j <= n
tested in the (outer)
while loop can be replaced by the more stringent j <=
n-(m-i)
: if P[i]
is aligned with T[j]
and j > n-(m-i)
, then P[m]
is positioned
beyond T[n]
, the end of the text. With this modification,
the algorithm may return a failure message having read neither the
entire text nor the entire pattern (consider the pattern AB ... and
the text AA ... AAA). However, in practice (when m <<
n
), it may be better to avoid testing this while
condition altogether by appending a copy of the pattern to the text
(that is, by setting T[n+i]=P[i]
for all i=1,2,
... , m
): then the algorithm always finds a copy of
the pattern in the text and it returns a failure message if and only
if this copy is the ``sentinel'' T[n+1]T[n+2] ...
T[n+m]
.
There remains the problem of computing the function next[.]
; one
efficient way of doing that goes as follows. We first set
next[1]=0
(which is always the correct value) and then compute the
values of next[j]
with j=2,3, ... m
one by one in m-1
iterations. The iteration in which next[j]
is computed begins with
two copies of the pattern aligned in such a way that
withP[1]P[2] ... P[i-1]
is a suffix ofP[1]P[2] ... P[j-1]
i
less than j
and as large as possible
subject to these constraints (which can make i
as small
as 1); this iteration consists of first computing the value of
next[j]
and then preparing for the next iteration by
decrementing i
(if necessary and by the smallest
necessary amount) to
makeThe algorithm can be specified asP[1]P[2] ... P[i]
a suffix ofP[1]P[2] ... P[j]
.
or, equivalently but with no pair of characters in the pattern compared more than once, asnext[1] = 0
i = 1
,j = 2
whilej <= m
do ifP[i] = P[j]
then next[j]= next[i] else next[j] = i end whilei > 0
andP[i]
differs fromP[j]
doi = next[i]
endi = i+1
,j = j+1
end
Again,next[1] = 0
i = 1
,j = 2
whilej <= m
do ifP[i] = P[j]
then next[j]= next[i] elsenext[j] = i
,i = next[i]
whilei > 0
andP[i]
differs fromP[j]
doi = next[i]
end endi = i+1
,j = j+1
end
2j-i
increases after each comparison of characters; since
2j-i=3
at the beginning and 2j-i <= 2m+1
at all times, it
follows that the algorithm makes at most 2m-2
character comparisons,
and so its running time is at most proportional to m
.
We have defined next[i]
as the largest t
such that
(settingP[1]P[2] ... P[t-1]
is a suffix ofP[1]P[2] ... P[i-1]
andP[t]
differs fromP[i]
next[i]=0
if no such t
exists). Some textbooks drop the requirement that P[t]
differ from P[i]
in this definition, which leads to a
simplification in computing the function next[.]
: the
instructions
if P[i] = P[j]
then next[j]= next[i]
else next[j] = i
end
in the first version is replaced by next[j] = i
. The modified
algorithm remains correct; like our version, it makes at most 2m-1
character comparisons in computing the function next[.]
and at most
2n-1
character comparisons in scanning the text. However, our
version is more efficient: for example, if the pattern is ABCABCACAB
then our values of next[1], next[2], ... ,next[10]
are
0,1,1,0,1,1,0,5,0,1
,
next[1], next[2], ... ,next[10]
equal to
0,1,1,1,2,3,4,5,1,2
.
These notes are based on D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt, "Fast pattern matching in strings", SIAM J. Computing 6 (1977), 323--350.