A More General Problem

The pre-computation of KMP finds the longest suffix of a sub-pattern P[0..j] (where j is [0, m-1], m is the length of the pattern) that is also the prefix of this sub-pattern.

A more general problem is to find the longest suffix of a pattern X[0..lx-1], which is also the prefix of a pattern Y[0..ly-1]. If we have solutions for this more general problem, then we simply set both X and Y to P[0..j], and compute for all P[0..j], where j is in the range of [0, m-1]. So we start by solving the more general problem.

Tow Key Observations

The key to solve the problem is to notice that if there’s a pattern Z1 which is suffix of pattern X and prefix of pattern Y, but not the longest one, then Z1 is also a suffix of the longest one Zmax. For example, let X be “abefdef  “, Y be “efdefg”, then “ef” is suffix of X and prefix of Y, but not the longest one. The longest one Zmax should be “efdef”. Then “ef” is a suffix of “efdef”.

In other words, if we have found the longest suffix of X and prefix of Y “efdef”, we let “efdef” be X, and repeat the finding process for new X “efdef” and Y “efdefg”, then we find “ef” (note that “efdef” itself doesn’t count). The process and go on and on, and it’s actually recursive.

The description above can be summarized as below,

If we can find the longest suffix of X and prefix of Y, using method sufpre(X, Y), we can find all suffix (longest, 2nd longest, 3nd longest, in this order…) of X and prefix of Y by applying the same method recursively X = sufpre(X, Y), until X is empty.

In pseudocode, the above can be expressed as,

while (x ! =empty) {
   x = sufpre(x, y);
   record x
}

Another important observation is that to find the longest suffix of X[0..lx-1] and prefix of Y[0..ly-1],  we can find the same for X[0..lx-2] and Y[0..ly-1] first, say it’s Y[0..a]. And if the last character of X matches with the character at a+1 of Y. The longest suffix of X[0..lx-1] and prefix of Y[0..ly-1] is simply Y[0..a+1].

For example, let X be “abcbcf”, Y be “bcbcfg. First we find that the longest suffix of X[0..lx-2] “abcbc” and prefix of Y[0..ly-1] “bcbcfg” is “bcbc”. Then the last character of X is “f”, which matches with the a+1 character of Y, which is also “f”. We say that the longest suffix of X[0..lx-1] and prefix of Y[0..ly-1] is Y[0..a+1], which is “bcbcf”.

If the last character of X doesn’t match with the a+1 character of Y, we use the first observation to find a shorter suffix of X[0..lx-2] and Y[0..ly-1], say Y[0..b], where b < a. Then try adding the last character of X again by compare last character of X with b+1 character of Y. The process repeats until we found next longest suffix that can produce a valid match with the adding of last character of X or until there’s no more suffix of X and prefix of Y.

For example, we change the X to “abcbcb”, and let Y still be “bcbcfg”. First we find the longest suffix of X[0..lx-2] “abcbc” and prefix of Y[0..ly-1] “bcbcfg” is “bcbc” (Y[0..a]). Then we add the last character of X, let X be “abcbcb”, and the last character “b” doesn’t match with the a+1 character of Y “f”. Then we find the shorter suffix of X[0..lx-2] and Y[0..ly-1] is “bc” (Y[0..b]). And the last character of X “b” matches with the b+1 character of Y “b”. So the longest suffix of X and prefix of y is “bcb”.

The description above can be summarized as below,

The longest suffix of X[0..lx-1] and prefix Y[0..ly-1], say Zmax, can be found based on the longest suffix of X[0..lx-2] and prefix Y[0..ly-1], say Za=Y[0..a].

  • If X[lx-1] is same as Y[a+1], then Zmax = Y[0..a+1].
  • Otherwise, the next longest suffix of X[0..lx-2] and prefix Y[0..ly-1], say Za’ = Y[0..a’] (a’ < a) is checked. The process repeats until Za’ is empty.

The description above can be expressed as pseudocode below,

Z = sufpre(X[0..lx-2], Y[0..ly-1])                       //Z = Y[0..a]
while (X[lx-1] != Y[a+1]) {
   if (Z == “”) break;                                         //a = -1;
   else Z = sufpre(Z, Y[0..ly-1])
}
return Y[0..a+1];

The method is recursive. However, dynamic programming can be applied here to to remember the sufpre value for all X[0..j], where j is [0..lx-2]. Then the computation of sufpre for X[0..lx-1] doesn’t require recursive calls.

Now we’re ready to implement the pre-computation for KMP.

C Implementation

Follow the pseudo code above and replace the recursive call with dynamic programming, we can have the code below,

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

 

int main(int argc, char **argv) {

    char *pattern = argv[1];

    int plen = strlen(pattern);

    int i;

    int *pre;

    pre = (int*) malloc(sizeof(int)*plen);

    pre[0] = 0;

    for (i = 1; i < plen; ++i) {

        pre[i] = pre[i-1];

        while (pattern[i] != pattern[pre[i]]) {

            if (pre[i] == 0) {

                pre[i] = -1;

                break;

            } else {

                pre[i] = pre[pre[i]-1];                //pre[i] is the length, pre[i]-1 is the right index to use

            }

        }

        pre[i] = pre[i] + 1;

    }

    for (i = 0; i < plen; ++i) {

        printf("%d:%d\n", i, pre[i]);

    }

    return 0;

}

Note that this implementation accepts an input string and calculate the length of longest suffix of its substring which is also the prefix of the same substring.

I also found another implementation from reference 5 and modified it to do KMP pre-computation only. The code is as below,

#include<stdio.h>

#include<string.h>

 

void kmp_table(char *ptr_i, int word_length);

int T[50];

 

int main(int argc, char **argv)

{

char* ptr_i = argv[1];

kmp_table(ptr_i,strlen(ptr_i));

}

 

 

void kmp_table(char *ptr_i,int word_length)

{

 int i,j;

  i=1;

  j=0;

  T[0]=0;

  for(;i<word_length; )

  {

        if(*(ptr_i+i) == *(ptr_i+j))

           {

            T[i]=j+1;

            i=i+1;

            j=j+1;

           }

 

        else if(j>0)

            {

             j=T[j-1];

            }

 

        else

            {

             T[i]=0;

             i=i+1;

            }

 }

 for (i = 0; i < word_length; ++i) {

    printf("%d:%d\n", i, T[i]);

}

 

}

Save my implementation to prekmp.c and modified reference 5 implementation to kk.c and compile them.

gcc -o prekmp prekmp.c
gcc -o kk kk.c

Verification
If you’re using Linux, I also have this batch file for run the two programs and compare the results of the two. 

#!/bin/sh

rm pre1.txt

rm pre2.txt

input="ababab"

./prekmp $input >> pre1.txt

./kk $input >> pre2.txt

 

input="ababcabc"

./prekmp $input >> pre1.txt

./kk $input >> pre2.txt

 

input="abcdefgh"

./prekmp $input >> pre1.txt

./kk $input >> pre2.txt

 

input="aaaaaaaaa"

./prekmp $input >> pre1.txt

./kk $input >> pre2.txt

 

diff pre1.txt pre2.txt

You can add as many test cases as you want. As long as the diff command at last line doesn’t detect different between two output files, it means the two programs produces same results.

I’ve written a separate post here for KMP.

References:
1. Knuth-Morris-Pratt String Matching: http://www.ics.uci.edu/~eppstein/161/960227.html
2. Introduction to Algorithms, 3rd Edition.
3. Wikipedia, Knuth-Morris-Pratt algorithm: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
4. The Knuth-Morris-Pratt Algorithm in my own words: http://jboxer.com/blog/2009/12/13/the-knuth-morris-pratt-algorithm-in-my-own-words/
5. KMP C++ Implementation: http://www.oneous.com/Tutorial-Content.php?id=24

 

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Set your Twitter account name in your settings to use the TwitterBar Section.