Knuth-Morris-Pratt Algorithm (KMP)
I’ve learned Knuth-Morris-Pratt Algorithm (KMP) several times. Several weeks after I learned it, if I am asked to solve the string/pattern matching problem, I only know there’s a KMP algorithm that does it efficiently at the time complexity of O(n+m). How to do it? I need to read it again. So I know I’ll need to write a post for it.
0. The Algorithm
Knuth-Morris-Pratt algorithm is named after the three computer scientists who discovered the algorithm. It is an efficient algorithm to solve pattern matching problems. In this post, we’ll just use string matching to illustrate.
Let’s say you have a string “ababaca” of length M, and you want to find whether the string appears in a text “bacbababaabcbab” of length N.
0.1 The Naive Method
It’s easy to come out a naive method with two nested loops. We simply start search at each position of the text, and see if the next M characters match the pattern. If there’s a mismatch, we simply move to next position.
You can refer to 1.1 for C implementation of the naive method.
0.2 KMP
The KMP algorithm makes use of the information in a matching sub-pattern of a mismatch. This is illustrated as below,
| Pattern | a | b | a | b | a | c | a | |
| Text | bacb | a | b | a | b | a | a | abcbab |
When the second mismatch in the example above happens, we have matched the first 5 chars and reached the 10th char of the text. We know the first 5 chars of the pattern is “ababa” and the last 6 chars we scanned in the text is “ababaX”, where X is not “a”. KMP made use of this information to move the pattern.
| Pattern | a | b | a | b | a | c | a | |
| Text | bacbab | a | b | a | a* | abcbab |
It can be observed that after the move, the suffix “aba” of the previous matched sub-pattern “ababa” matches the prefix of the same sub-pattern (Note that “a” also satisfies the condition, but “aba” is the longest). And the next matching search can start at “a*” position.
In fact, KMP preprocess the pattern to get the length of the longest prefix in the sub-pattern that matches a suffix in the same sub-pattern. The number of steps to jump is then decided according to the following
No. of steps to jump = max(1, length of matched sub-pattern in mismatch – preprocessed
value corresponding to the last matched position)
For a detailed discussion of the preprocessing, please refer here.
In our example, the preprocessing will have the following values,
| Position | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| Pattern | a | b | a | b | a | c | a |
| Pre-process values | 0 | 0 | 1 | 2 | 3 | 0 | 1 |
| No. of Steps to jump | max(0,1) | max(0,1) | max(2-0, 1) | max(3-1, 1) | max(4-2, 1) | max(5-3, 1) | max(6-0, 1) |
If we use the pre-computed table above to search for the string “ababaca” in text “bcaababacababaca”, it will be like below,
ab
acaabababaca Jump step = 1
a
acaabababaca Jump step = 1
ab
acaabababaca Jump step = 1
ababac
acaabababaca Jump step = 1
xxababaca
acaabababaca Jump step = 2
=> a match is found
1. C Code
1.1 C Implementation for Naive String Matching
#include <stdio.h>
#include <string.h>
int main(void) {
char *pattern = "ababcab";
char *text = "cabdabadcdefababcabef";
int i = 0, j = 0;
int m, n;
m = strlen(pattern);
n = strlen(text);
printf("%d:%d\n", m, n);
for (; i < n - m + 1; ++i) {
for (j = 0; j < m; ++j) {
if (pattern[j] != text[i + j]) {
break;
}
}
if (j == m) {
//found a match
printf("there's a match starting at %d: %s\n", i, &(text[i]));
break;
}
}
if (i == n - m + 1) {
printf("no match\n");
}
return 0;
}
Compile the code using
gcc -o naive naive.c
1.2 C Implementation for KMP
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void precompute(char *pattern, int *pre) {
int plen = strlen(pattern);
int i;
memset(pre, 0, sizeof(int)*plen);
for (i = 1; i < plen; ++i) {
pre[i] = pre[i-1];
while (pattern[i] != pattern[pre[i]]) {
if (pre[i] == 0) {
pre[i] = -1;
break;
} else {
pre[i] = pre[pre[i]-1];
}
}
pre[i] = pre[i] + 1;
}
}
int main(int argc, char **argv) {
char *pattern = argv[1];
char *text = argv[2];
int *pre;
int m = strlen(pattern);
int n = strlen(text);
int i, j, overlap;
pre = (int*) malloc((m+1)*sizeof(int));
precompute(pattern, pre);
for (i = 0, j = 0; i < n; ) {
while (text[i+j] == pattern[j]) {
++j;
if (j == m) {
printf("found a match at pos: %d\n", i);
break;
}
}
if (j > 0) {
i += (j - pre[j-1] > 1) ? (j - pre[j-1]):1;
j = pre[j-1];
} else {
++i;
}
}
return 0;
}
Compile the code using
gcc -o kmp kmp.c
2. Sample Run
You can use the following scripts to execute naive implementation and kmp implementation side by side,
#!/bin/sh
pattern="dsfads"
text="sfdkjfldskjalfkdsadsfadsdsfadsfdaklfdjsla;fjdsafkjfdlsajfoiewjfksafdasfdsa"
./kmp $pattern $text
./naive $pattern $text
pattern="fdsfadsa"
text="fdasklfjdlsajfdsfdsfadsdsafdsaffdsfadsafdsakljfdlsakjflkdsajflkdsajflkdsjalkfjdslkajfkd;lsajfklds;jalkfjd;sa"
./kmp $pattern $text
./naive $pattern $text
pattern="ababab"
text="ababababababfdasjklabababafdkslajabababafdafabababafdsafababab"
./kmp $pattern $text
./naive $pattern $text
pattern="ababaca"
text="abababacaabacaabababaabcacafslkfjdslaabacacababacaacaads"
./kmp $pattern $text
./naive $pattern $text
pattern="ababacaab"
text="abababacaabacaabababaabcacafslkfjdslaabacacababacaacaads"
./kmp $pattern $text
./naive $pattern $text
pattern="ABCDABD"
text="ABCABCDABABCDABCDABDE"
./kmp $pattern $text
./naive $pattern $text
pattern="ABCDABD"
text="ABccDAFDSA"
./kmp $pattern $text
./naive $pattern $text
3. Naïve VS. KMP
It can be proved that Naive implementation has the worst case O(mn), while KMP has the worst case O(m+n). Here we add a counting variable to each implementation and run a test case to demonstrate that it’s indeed the case and our KMP implementation runs in O(m+n) time.
The modified code for naive implementation is as below,
#include <stdio.h>
#include <string.h>
int main(int argc, char **argv) {
char *pattern = argv[1];
char *text = argv[2];
int i = 0, j = 0;
int m, n;
int cmpcnt = 0;
m = strlen(pattern);
n = strlen(text);
//printf("%d:%d\n", m, n);
for (; i < n - m + 1; ++i) {
for (j = 0; j < m; ++j) {
++cmpcnt;
// printf("%d:%d:%d\n", i, j, cmpcnt);
if (pattern[j] != text[i + j]) {
break;
}
}
if (j == m) {
//found a match
printf("naive: there's a match starting at %d: %s\n", i, &(text[i]));
//break;
}
}
if (i == n - m + 1) {
//printf("no match\n");
}
printf ("naive (m = %d; n = %d): No. of comparisons performed: %d\n", m , n, cmpcnt);
return 0;
}
The modified KMP implementation is as below,
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int cmpcnt = 0;
void precompute(char *pattern, int *pre) {
int plen = strlen(pattern);
int i;
memset(pre, 0, sizeof(int)*plen);
for (i = 1; i < plen; ++i) {
pre[i] = pre[i-1];
++cmpcnt;
while (pattern[i] != pattern[pre[i]]) {
if (pre[i] == 0) {
pre[i] = -1;
break;
} else {
pre[i] = pre[pre[i]-1];
}
++cmpcnt;
}
pre[i] = pre[i] + 1;
}
}
int main(int argc, char **argv) {
char *pattern = argv[1];
char *text = argv[2];
int *pre;
int m = strlen(pattern);
int n = strlen(text);
int i, j, overlap;
pre = (int*) malloc((m+1)*sizeof(int));
precompute(pattern, pre);
for (i = 0, j = 0; i < n; ) {
++cmpcnt;
while (text[i+j] == pattern[j]) {
++j;
if (j == m) {
printf("found a match at pos: %d\n", i);
break;
}
++cmpcnt;
}
if (j > 0) {
i += (j - pre[j-1] > 1) ? (j - pre[j-1]):1;
j = pre[j-1];
} else {
++i;
}
}
printf("kmp (m=%d; n=%d): total No. of comparisons performed: %d\n", m, n, cmpcnt);
return 0;
}
The test script is as below,
#!/bin/sh
pattern="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"
text="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"
./kmp $pattern $text
./naive $pattern $text
Now the results,
found a match at pos: 786
kmp (m=34; n=820): total No. of comparisons performed: 1671
naive: there’s a match starting at 786: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
naive (m = 34; n = 820): No. of comparisons performed: 26758
This post doesn’t discuss much of the pre-processing of KMP, I’ve written a separate post for it. You can find it here.
References:
1. Introduction to Algorithms, 3rd Edition.
2. Wikipedia, Knuth-Morris-Pratt algorithm: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
3. The Knuth-Morris-Pratt Algorithm in my own words: http://jboxer.com/blog/2009/12/13/the-knuth-morris-pratt-algorithm-in-my-own-words/
Leave a Reply Cancel reply
40% Discount on My Book — Android NDK Cookbook
Android NDK Cookbook ebook 40% discount with promotion code MREANC40 at Packt Publishing The promotion code is valid until 15th June.Categories
- Android Apps (18)
- Android Audio Editor (1)
- TS 2 (3)
- Video Converter Android (8)
- Video2Gif (1)
- Android Tutorial (26)
- Android Dev Tools (1)
- API illustrated (8)
- Multimedia API (3)
- ffmpeg on Android (4)
- NDK (6)
- UI (5)
- Animation (1)
- Code Snippet (2)
- Coding Beyond Technique (18)
- a word, a world (4)
- Bug Rectified (4)
- Programming Habit (1)
- Software as a Career (1)
- Software as User Experience (1)
- Compilers and Related (2)
- ELF (2)
- Computer Languages (31)
- C/C++ (13)
- Java (9)
- JavaScript (2)
- PHP (1)
- Python (8)
- Data Structure & Algorithms (29)
- Bits (1)
- Data Structure (5)
- Integers (10)
- BigInteger (1)
- Prime (4)
- Search (3)
- Sorting (5)
- Strings (5)
- Database (1)
- SQLite (1)
- Digital Signal Processing (33)
- Distributed Systems (17)
- Apache Cassandra (6)
- Apache Hadoop (8)
- Apache Avro (3)
- Apache Nutch (3)
- Apache Solr (1)
- Linux Study Notes (40)
- crontab (1)
- Linux Kernel Programming (8)
- Linux Programming (12)
- IPC (2)
- Linux Network Programming (5)
- Linux Signals (2)
- Linux Shell Scripting (1)
- ssh (3)
- Machinery (30)
- misc (1)
- My Ideas (1)
- My Project (3)
- Mobile Caching (1)
- Selective Decoding (2)
- My Publication (1)
- My Readings (1)
- Networking (15)
- Program for Performance (8)
- Uncategorized (1)
- Virtual Machine (2)
- Web Dev (8)
- web components (3)
- Android Apps (18)
Recent Comments
Archives
- May 2013 (1)
- April 2013 (1)
- March 2013 (4)
- December 2012 (2)
- November 2012 (6)
- October 2012 (6)
- September 2012 (3)
- August 2012 (13)
- July 2012 (15)
- June 2012 (3)
- May 2012 (8)
- April 2012 (4)
- March 2012 (13)
- February 2012 (19)
- January 2012 (9)
- December 2011 (11)
- November 2011 (12)
- October 2011 (4)
- September 2011 (12)
- August 2011 (16)
- July 2011 (15)
- June 2011 (6)
- May 2011 (10)
- April 2011 (13)
- March 2011 (20)
- February 2011 (4)
- November 2010 (2)
- May 2010 (1)
- April 2010 (1)
- February 2010 (1)




