KMP (Knuth-Morris-Pratt) Algorithm Pre-computation
A More General Problem
The pre-computation of KMP finds the longest suffix of a sub-pattern P[0..j] (where j is [0, m-1], m is the length of the pattern) that is also the prefix of this sub-pattern.
A more general problem is to find the longest suffix of a pattern X[0..lx-1], which is also the prefix of a pattern Y[0..ly-1]. If we have solutions for this more general problem, then we simply set both X and Y to P[0..j], and compute for all P[0..j], where j is in the range of [0, m-1]. So we start by solving the more general problem.
Tow Key Observations
The key to solve the problem is to notice that if there’s a pattern Z1 which is suffix of pattern X and prefix of pattern Y, but not the longest one, then Z1 is also a suffix of the longest one Zmax. For example, let X be “abefdef “, Y be “efdefg”, then “ef” is suffix of X and prefix of Y, but not the longest one. The longest one Zmax should be “efdef”. Then “ef” is a suffix of “efdef”.
In other words, if we have found the longest suffix of X and prefix of Y “efdef”, we let “efdef” be X, and repeat the finding process for new X “efdef” and Y “efdefg”, then we find “ef” (note that “efdef” itself doesn’t count). The process and go on and on, and it’s actually recursive.
The description above can be summarized as below,
If we can find the longest suffix of X and prefix of Y, using method sufpre(X, Y), we can find all suffix (longest, 2nd longest, 3nd longest, in this order…) of X and prefix of Y by applying the same method recursively X = sufpre(X, Y), until X is empty.
In pseudocode, the above can be expressed as,
while (x ! =empty) {
x = sufpre(x, y);
record x
}
Another important observation is that to find the longest suffix of X[0..lx-1] and prefix of Y[0..ly-1], we can find the same for X[0..lx-2] and Y[0..ly-1] first, say it’s Y[0..a]. And if the last character of X matches with the character at a+1 of Y. The longest suffix of X[0..lx-1] and prefix of Y[0..ly-1] is simply Y[0..a+1].
For example, let X be “abcbcf”, Y be “bcbcfg. First we find that the longest suffix of X[0..lx-2] “abcbc” and prefix of Y[0..ly-1] “bcbcfg” is “bcbc”. Then the last character of X is “f”, which matches with the a+1 character of Y, which is also “f”. We say that the longest suffix of X[0..lx-1] and prefix of Y[0..ly-1] is Y[0..a+1], which is “bcbcf”.
If the last character of X doesn’t match with the a+1 character of Y, we use the first observation to find a shorter suffix of X[0..lx-2] and Y[0..ly-1], say Y[0..b], where b < a. Then try adding the last character of X again by compare last character of X with b+1 character of Y. The process repeats until we found next longest suffix that can produce a valid match with the adding of last character of X or until there’s no more suffix of X and prefix of Y.
For example, we change the X to “abcbcb”, and let Y still be “bcbcfg”. First we find the longest suffix of X[0..lx-2] “abcbc” and prefix of Y[0..ly-1] “bcbcfg” is “bcbc” (Y[0..a]). Then we add the last character of X, let X be “abcbcb”, and the last character “b” doesn’t match with the a+1 character of Y “f”. Then we find the shorter suffix of X[0..lx-2] and Y[0..ly-1] is “bc” (Y[0..b]). And the last character of X “b” matches with the b+1 character of Y “b”. So the longest suffix of X and prefix of y is “bcb”.
The description above can be summarized as below,
The longest suffix of X[0..lx-1] and prefix Y[0..ly-1], say Zmax, can be found based on the longest suffix of X[0..lx-2] and prefix Y[0..ly-1], say Za=Y[0..a].
- If X[lx-1] is same as Y[a+1], then Zmax = Y[0..a+1].
- Otherwise, the next longest suffix of X[0..lx-2] and prefix Y[0..ly-1], say Za’ = Y[0..a’] (a’ < a) is checked. The process repeats until Za’ is empty.
The description above can be expressed as pseudocode below,
Z = sufpre(X[0..lx-2], Y[0..ly-1]) //Z = Y[0..a]
while (X[lx-1] != Y[a+1]) {
if (Z == “”) break; //a = -1;
else Z = sufpre(Z, Y[0..ly-1])
}
return Y[0..a+1];
The method is recursive. However, dynamic programming can be applied here to to remember the sufpre value for all X[0..j], where j is [0..lx-2]. Then the computation of sufpre for X[0..lx-1] doesn’t require recursive calls.
Now we’re ready to implement the pre-computation for KMP.
C Implementation
Follow the pseudo code above and replace the recursive call with dynamic programming, we can have the code below,
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char **argv) {
char *pattern = argv[1];
int plen = strlen(pattern);
int i;
int *pre;
pre = (int*) malloc(sizeof(int)*plen);
pre[0] = 0;
for (i = 1; i < plen; ++i) {
pre[i] = pre[i-1];
while (pattern[i] != pattern[pre[i]]) {
if (pre[i] == 0) {
pre[i] = -1;
break;
} else {
pre[i] = pre[pre[i]-1]; //pre[i] is the length, pre[i]-1 is the right index to use
}
}
pre[i] = pre[i] + 1;
}
for (i = 0; i < plen; ++i) {
printf("%d:%d\n", i, pre[i]);
}
return 0;
}
Note that this implementation accepts an input string and calculate the length of longest suffix of its substring which is also the prefix of the same substring.
I also found another implementation from reference 5 and modified it to do KMP pre-computation only. The code is as below,
#include<stdio.h>
#include<string.h>
void kmp_table(char *ptr_i, int word_length);
int T[50];
int main(int argc, char **argv)
{
char* ptr_i = argv[1];
kmp_table(ptr_i,strlen(ptr_i));
}
void kmp_table(char *ptr_i,int word_length)
{
int i,j;
i=1;
j=0;
T[0]=0;
for(;i<word_length; )
{
if(*(ptr_i+i) == *(ptr_i+j))
{
T[i]=j+1;
i=i+1;
j=j+1;
}
else if(j>0)
{
j=T[j-1];
}
else
{
T[i]=0;
i=i+1;
}
}
for (i = 0; i < word_length; ++i) {
printf("%d:%d\n", i, T[i]);
}
}
Save my implementation to prekmp.c and modified reference 5 implementation to kk.c and compile them.
gcc -o prekmp prekmp.c
gcc -o kk kk.c
Verification
If you’re using Linux, I also have this batch file for run the two programs and compare the results of the two.
#!/bin/sh
rm pre1.txt
rm pre2.txt
input="ababab"
./prekmp $input >> pre1.txt
./kk $input >> pre2.txt
input="ababcabc"
./prekmp $input >> pre1.txt
./kk $input >> pre2.txt
input="abcdefgh"
./prekmp $input >> pre1.txt
./kk $input >> pre2.txt
input="aaaaaaaaa"
./prekmp $input >> pre1.txt
./kk $input >> pre2.txt
diff pre1.txt pre2.txt
You can add as many test cases as you want. As long as the diff command at last line doesn’t detect different between two output files, it means the two programs produces same results.
I’ve written a separate post here for KMP.
References:
1. Knuth-Morris-Pratt String Matching: http://www.ics.uci.edu/~eppstein/161/960227.html
2. Introduction to Algorithms, 3rd Edition.
3. Wikipedia, Knuth-Morris-Pratt algorithm: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
4. The Knuth-Morris-Pratt Algorithm in my own words: http://jboxer.com/blog/2009/12/13/the-knuth-morris-pratt-algorithm-in-my-own-words/
5. KMP C++ Implementation: http://www.oneous.com/Tutorial-Content.php?id=24
Leave a Reply Cancel reply
40% Discount on My Book — Android NDK Cookbook
Android NDK Cookbook ebook 40% discount with promotion code MREANC40 at Packt Publishing The promotion code is valid until 15th June.Categories
- Android Apps (18)
- Android Audio Editor (1)
- TS 2 (3)
- Video Converter Android (8)
- Video2Gif (1)
- Android Tutorial (27)
- Android Dev Tools (1)
- API illustrated (8)
- Multimedia API (3)
- ffmpeg on Android (4)
- NDK (6)
- UI (6)
- Animation (2)
- Code Snippet (2)
- Coding Beyond Technique (18)
- a word, a world (4)
- Bug Rectified (4)
- Programming Habit (1)
- Software as a Career (1)
- Software as User Experience (1)
- Compilers and Related (2)
- ELF (2)
- Computer Languages (31)
- C/C++ (13)
- Java (9)
- JavaScript (2)
- PHP (1)
- Python (8)
- Data Structure & Algorithms (29)
- Bits (1)
- Data Structure (5)
- Integers (10)
- BigInteger (1)
- Prime (4)
- Search (3)
- Sorting (5)
- Strings (5)
- Database (1)
- SQLite (1)
- Digital Signal Processing (33)
- Distributed Systems (17)
- Apache Cassandra (6)
- Apache Hadoop (8)
- Apache Avro (3)
- Apache Nutch (3)
- Apache Solr (1)
- Linux Study Notes (40)
- crontab (1)
- Linux Kernel Programming (8)
- Linux Programming (12)
- IPC (2)
- Linux Network Programming (5)
- Linux Signals (2)
- Linux Shell Scripting (1)
- ssh (3)
- Machinery (30)
- misc (1)
- My Ideas (1)
- My Project (3)
- Mobile Caching (1)
- Selective Decoding (2)
- My Publication (1)
- My Readings (1)
- Networking (15)
- Program for Performance (8)
- Uncategorized (1)
- Virtual Machine (2)
- Web Dev (8)
- web components (3)
- Android Apps (18)
Recent Comments
Archives
- May 2013 (2)
- April 2013 (1)
- March 2013 (4)
- December 2012 (2)
- November 2012 (6)
- October 2012 (6)
- September 2012 (3)
- August 2012 (13)
- July 2012 (15)
- June 2012 (3)
- May 2012 (8)
- April 2012 (4)
- March 2012 (13)
- February 2012 (19)
- January 2012 (9)
- December 2011 (11)
- November 2011 (12)
- October 2011 (4)
- September 2011 (12)
- August 2011 (16)
- July 2011 (15)
- June 2011 (6)
- May 2011 (10)
- April 2011 (13)
- March 2011 (20)
- February 2011 (4)
- November 2010 (2)
- May 2010 (1)
- April 2010 (1)
- February 2010 (1)




