The Knuth-Morris-Pratt string-matching algorithm

(Lecture notes written by Vašek Chvátal)

Given a string of characters P[1]P[2] ... P[m], called the pattern, and a string of characters T[1]T[2] ... T[n], called the text, we are required to return either the first occurence of the pattern in the text (that is, the smallest k such that T[k]=P[1],

T[k+1]=P[2], ...  ,
T[k+m-1]=P[m]

) or a failure message if the pattern does not occur in the text. One naive way of accomplishing this task goes as follows. We first align the pattern with the text so that P[1] is aligned with some T[k] (initially k=1); then we compare P[i] with T[k-1+i] for i=1,2, ... ,m; as soon as a mismatch is found, we increment k by 1 (which means sliding the pattern to the right by one space) and iterate. Here is a formal description (with j=k-1+i):

i = 1, j = 1 
while j <= n 
do    if   P[i] = T[j] 
      then i = i+1, j = j+1
      else j = j+2-i, i = 1
      end
      if i=m+1 then return j+1-i end 
end
return a failure message

In the worst case, this algorithm makes as many as m(n+1-m) character comparisons (consider the pattern AA ... AAB and the text AA ... AAB); improvements come from the observation that, when a mismatch is found, we may often slide the pattern to the right by more than one space and skip unnecessary character comparisons. Three examples follow.

If the pattern is ABACABAB ... and we discover the mismatch between P[8] and T[j], then we may slide the pattern over by four spaces and make P[4] vs. T[j] the next comparison. If the pattern is ABACABAC ... and we discover the mismatch between P[8] and T[j], then we may slide the pattern over by six spaces and make P[2] vs. T[j] our next comparison. If the pattern is ABACABA ... and we discover the mismatch between P[7] and T[j], then we may slide the pattern over by six spaces and make P[1] vs. T[j+1] our next comparison.

In general, when we discover a mismatch between P[i] and T[j], we want to slide the pattern over to the next position where it could possibly occur within the text. In this new position, T[j] is aligned with some P[t] such that t < i; the pattern has a chance of matching the text here only if

P[1]P[2] ... P[t] is a suffix of T[1]T[2] ... T[j];

this can be the case only if

P[1]P[2] ... P[t-1] is a suffix of P[1]P[2] ... P[i-1]
(which is a suffix of T[1]T[2] ... T[j-1])
and P[t] differs from P[i];

since we do not want to miss any occurence of the pattern in the text, we choose the largest t (corresponding to the smallest shift) with this property and we make P[t] vs. T[j] our next comparison. If there is no such t then we slide the pattern past T[j] to align P[1] with T[j+1] and we make P[1] vs. T[j+1] our next comparison.

To summarize our findings, let us define next[i] as the largest t such that

P[1]P[2] ... P[t-1] is a suffix of P[1]P[2] ... P[i-1]
and P[t] differs from P[i],

setting next[i]=0 if no such t exists. Now the improved algorithm can be specified as

i = 1, j = 1 
while j <= n 
do    if   P[i] = T[j] 
      then i = i+1, j = j+1
      else i = next[i]
           if i=0 then i = 1, j = j+1 end
      end
      if i=m+1 then return j+1-i end 
end
return a failure message

or more compactly as

i = 1, j = 1 
while j <= n 
do    while i > 0 and P[i] differs from T[j]
      do    i = next[i] 
      end
      i = i+1, j = j+1
      if i=m+1 then return j+1-i end 
end
return a failure message

Note that 2j-i increases after each comparison of characters; since 2j-i=1 at the beginning and 2j-i <= 2n+1 at all times, it follows that the algorithm makes at most 2n character comparisons, and so its running time is at most proportional to n.

The condition j <= n tested in the (outer) while loop can be replaced by the more stringent j <= n-(m-i): if P[i] is aligned with T[j] and j > n-(m-i), then P[m] is positioned beyond T[n], the end of the text. With this modification, the algorithm may return a failure message having read neither the entire text nor the entire pattern (consider the pattern AB ... and the text AA ... AAA). However, in practice (when m << n), it may be better to avoid testing this while condition altogether by appending a copy of the pattern to the text (that is, by setting T[n+i]=P[i] for all i=1,2, ... , m): then the algorithm always finds a copy of the pattern in the text and it returns a failure message if and only if this copy is the ``sentinel'' T[n+1]T[n+2] ... T[n+m].

There remains the problem of computing the function next[.]; one efficient way of doing that goes as follows. We first set next[1]=0 (which is always the correct value) and then compute the values of next[j] with j=2,3, ... m one by one in m-1 iterations. The iteration in which next[j] is computed begins with two copies of the pattern aligned in such a way that

P[1]P[2] ... P[i-1] is a suffix of P[1]P[2] ... P[j-1]

with i less than j and as large as possible subject to these constraints (which can make i as small as 1); this iteration consists of first computing the value of next[j] and then preparing for the next iteration by decrementing i (if necessary and by the smallest necessary amount) to

make P[1]P[2] ... P[i] a suffix of P[1]P[2] ... P[j].

The algorithm can be specified as

next[1] = 0 
i = 1, j = 2 
while j <= m 
do    if   P[i] = P[j] 
      then next[j]= next[i]
      else next[j] = i
      end
      while i > 0 and P[i] differs from P[j]
      do    i = next[i] 
      end
      i = i+1, j = j+1
end

or, equivalently but with no pair of characters in the pattern compared more than once, as

next[1] = 0 
i = 1, j = 2 
while j <= m 
do    if   P[i] = P[j] 
      then next[j]= next[i]
      else next[j] = i, i = next[i]
           while i > 0 and P[i] differs from P[j]
           do    i = next[i] 
           end
      end
      i = i+1, j = j+1
end

Again, 2j-i increases after each comparison of characters; since 2j-i=3 at the beginning and 2j-i <= 2m+1 at all times, it follows that the algorithm makes at most 2m-2 character comparisons, and so its running time is at most proportional to m.

WARNING

We have defined next[i] as the largest t such that

P[1]P[2] ... P[t-1] is a suffix of P[1]P[2] ... P[i-1]
and P[t] differs from P[i]

(setting next[i]=0 if no such t exists). Some textbooks drop the requirement that P[t] differ from P[i] in this definition, which leads to a simplification in computing the function next[.]: the instructions

      if   P[i] = P[j] 
      then next[j]= next[i]
      else next[j] = i
      end

in the first version is replaced by next[j] = i. The modified algorithm remains correct; like our version, it makes at most 2m-1 character comparisons in computing the function next[.] and at most 2n-1 character comparisons in scanning the text. However, our version is more efficient: for example, if the pattern is ABCABCACAB then our values of next[1], next[2], ... ,next[10] are 0,1,1,0,1,1,0,5,0,1, but the modified version has next[1], next[2], ... ,next[10] equal to 0,1,1,1,2,3,4,5,1,2.

These notes are based on D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt, "Fast pattern matching in strings", SIAM J. Computing 6 (1977), 323--350.

Back to the table of contents of Vašek Chvátal's course notes