I’ve learned Knuth-Morris-Pratt Algorithm (KMP) several times. Several weeks after I learned it, if I am asked to solve the string/pattern matching problem, I only know there’s a KMP algorithm that does it efficiently at the time complexity of O(n+m). How to do it? I need to read it again. So I know I’ll need to write a post for it.

**0. The Algorithm**

Knuth-Morris-Pratt algorithm is named after the three computer scientists who discovered the algorithm. It is an efficient algorithm to solve pattern matching problems. In this post, we’ll just use string matching to illustrate.

Let’s say you have a string “ababaca” of length M, and you want to find whether the string appears in a text “bacbababaabcbab” of length N.

**0.1 The Naive Method**

It’s easy to come out a naive method with two nested loops. We simply start search at each position of the text, and see if the next M characters match the pattern. If there’s a mismatch, we simply move to next position.

You can refer to 1.1 for C implementation of the naive method.

**0.2 KMP**

The KMP algorithm makes use of the information in a matching sub-pattern of a mismatch. This is illustrated as below,

Pattern | a |
b |
a |
b |
a |
c | a | |

Text | bacb | a |
b |
a |
b |
a |
a | abcbab |

When the second mismatch in the example above happens, we have matched the first 5 chars and reached the 10th char of the text. We know the first 5 chars of the pattern is “ababa” and the last 6 chars we scanned in the text is “ababaX”, where X is not “a”. KMP made use of this information to move the pattern.

Pattern | a |
b |
a |
b |
a |
c | a | |

Text | bacbab |
a |
b |
a |
a* | abcbab |

It can be observed that after the move, the suffix “aba” of the previous matched sub-pattern “ababa” matches the prefix of the same sub-pattern (Note that “a” also satisfies the condition, but “aba” is the longest). And the next matching search can start at “a*” position.

In fact, KMP preprocess the pattern to get the **length of the longest prefix in the sub-pattern that matches a suffix in the same sub-pattern.** The number of steps to jump is then decided according to the following

No. of steps to jump = max(1, length of matched sub-pattern in mismatch – preprocessed

value corresponding to the last matched position)

For a detailed discussion of the preprocessing, please refer here.

In our example, the preprocessing will have the following values,

Position | 0 | 1 | 2 | 3 | 4 | 5 | 6 |

Pattern | a | b | a | b | a | c | a |

Pre-process values | 0 | 0 | 1 | 2 | 3 | 0 | 1 |

No. of Steps to jump | max(0,1) | max(0,1) | max(2-0, 1) | max(3-1, 1) | max(4-2, 1) | max(5-3, 1) | max(6-0, 1) |

If we use the pre-computed table above to search for the string “ababaca” in text “bcaababacababaca”, it will be like below,

ab

acaabababaca Jump step = 1

a

acaabababaca Jump step = 1

ab

acaabababaca Jump step = 1

ababac

acaabababaca Jump step = 1

xxababaca

acaabababaca Jump step = 2

=> a match is found

**1. C Code **

**1.1 C Implementation for Naive String Matching**

`#include <stdio.h>`

`#include <string.h>`

int main(void) {

char *pattern = "ababcab";

char *text = "cabdabadcdefababcabef";

` int i = 0, j = 0;`

` int m, n;`

m = strlen(pattern);

n = strlen(text);

` printf("%d:%dn", m, n);`

` for (; i < n - m + 1; ++i) {`

` for (j = 0; j < m; ++j) {`

` if (pattern[j] != text[i + j]) {`

` break;`

}

}

` if (j == m) {`

` //found a match`

` printf("there's a match starting at %d: %sn", i, &(text[i]));`

` break;`

}

}

` if (i == n - m + 1) {`

` printf("no matchn");`

}

` return 0;`

}

Compile the code using

gcc -o naive naive.c

**1.2 C Implementation for KMP**

`#include <stdio.h>`

`#include <stdlib.h>`

`#include <string.h>`

void precompute(char *pattern, int *pre) {

` int plen = strlen(pattern);`

` int i;`

memset(pre, 0, sizeof(int)*plen);

` for (i = 1; i < plen; ++i) {`

pre[i] = pre[i-1];

` while (pattern[i] != pattern[pre[i]]) {`

` if (pre[i] == 0) {`

pre[i] = -1;

` break;`

` } else {`

pre[i] = pre[pre[i]-1];

}

}

pre[i] = pre[i] + 1;

}

}

int main(int argc, char **argv) {

` char *pattern = argv[1];`

` char *text = argv[2];`

` int *pre;`

` int m = strlen(pattern);`

` int n = strlen(text);`

` int i, j, overlap;`

pre = (int*) malloc((m+1)*sizeof(int));

precompute(pattern, pre);

` for (i = 0, j = 0; i < n; ) {`

` while (text[i+j] == pattern[j]) {`

++j;

` if (j == m) {`

` printf("found a match at pos: %dn", i);`

` break;`

}

}

` if (j > 0) {`

i += (j - pre[j-1] > 1) ? (j - pre[j-1]):1;

j = pre[j-1];

` } else {`

++i;

}

}

` return 0;`

}

Compile the code using

gcc -o kmp kmp.c

**2. Sample Run**You can use the following scripts to execute naive implementation and kmp implementation side by side,

#!/bin/sh

`pattern="dsfads"`

`text="sfdkjfldskjalfkdsadsfadsdsfadsfdaklfdjsla;fjdsafkjfdlsajfoiewjfksafdasfdsa"`

./kmp $pattern $text

./naive $pattern $text

`pattern="fdsfadsa"`

`text="fdasklfjdlsajfdsfdsfadsdsafdsaffdsfadsafdsakljfdlsakjflkdsajflkdsajflkdsjalkfjdslkajfkd;lsajfklds;jalkfjd;sa"`

./kmp $pattern $text

./naive $pattern $text

`pattern="ababab"`

`text="ababababababfdasjklabababafdkslajabababafdafabababafdsafababab"`

./kmp $pattern $text

./naive $pattern $text

`pattern="ababaca"`

`text="abababacaabacaabababaabcacafslkfjdslaabacacababacaacaads"`

./kmp $pattern $text

./naive $pattern $text

`pattern="ababacaab"`

`text="abababacaabacaabababaabcacafslkfjdslaabacacababacaacaads"`

./kmp $pattern $text

./naive $pattern $text

`pattern="ABCDABD"`

`text="ABCABCDABABCDABCDABDE"`

./kmp $pattern $text

./naive $pattern $text

`pattern="ABCDABD"`

`text="ABccDAFDSA"`

./kmp $pattern $text

./naive $pattern $text

**3. Naïve VS. KMP**

It can be proved that Naive implementation has the worst case O(mn), while KMP has the worst case O(m+n). Here we add a counting variable to each implementation and run a test case to demonstrate that it’s indeed the case and our KMP implementation runs in O(m+n) time.

The modified code for naive implementation is as below,

`#include <stdio.h>`

`#include <string.h>`

int main(int argc, char **argv) {

` char *pattern = argv[1];`

` char *text = argv[2];`

` int i = 0, j = 0;`

` int m, n;`

` int cmpcnt = 0;`

m = strlen(pattern);

n = strlen(text);

` //printf("%d:%dn", m, n);`

` for (; i < n - m + 1; ++i) {`

` for (j = 0; j < m; ++j) {`

++cmpcnt;

`// printf("%d:%d:%dn", i, j, cmpcnt);`

` if (pattern[j] != text[i + j]) {`

` break;`

}

}

` if (j == m) {`

` //found a match`

` printf("naive: there's a match starting at %d: %sn", i, &(text[i]));`

` //break;`

}

}

` if (i == n - m + 1) {`

` //printf("no matchn");`

}

` printf ("naive (m = %d; n = %d): No. of comparisons performed: %dn", m , n, cmpcnt);`

` return 0;`

}

The modified KMP implementation is as below,

`#include <stdio.h>`

`#include <stdlib.h>`

`#include <string.h>`

`int cmpcnt = 0;`

void precompute(char *pattern, int *pre) {

` int plen = strlen(pattern);`

` int i;`

memset(pre, 0, sizeof(int)*plen);

` for (i = 1; i < plen; ++i) {`

pre[i] = pre[i-1];

++cmpcnt;

` while (pattern[i] != pattern[pre[i]]) {`

` if (pre[i] == 0) {`

pre[i] = -1;

` break;`

` } else {`

pre[i] = pre[pre[i]-1];

}

++cmpcnt;

}

pre[i] = pre[i] + 1;

}

}

int main(int argc, char **argv) {

` char *pattern = argv[1];`

` char *text = argv[2];`

` int *pre;`

` int m = strlen(pattern);`

` int n = strlen(text);`

` int i, j, overlap;`

pre = (int*) malloc((m+1)*sizeof(int));

precompute(pattern, pre);

` for (i = 0, j = 0; i < n; ) {`

++cmpcnt;

` while (text[i+j] == pattern[j]) {`

++j;

` if (j == m) {`

` printf("found a match at pos: %dn", i);`

` break;`

}

++cmpcnt;

}

` if (j > 0) {`

i += (j - pre[j-1] > 1) ? (j - pre[j-1]):1;

j = pre[j-1];

` } else {`

++i;

}

}

` printf("kmp (m=%d; n=%d): total No. of comparisons performed: %dn", m, n, cmpcnt);`

` return 0;`

}

The test script is as below,

#!/bin/sh

`pattern="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"`

`text="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"`

./kmp $pattern $text

./naive $pattern $text

Now the results,

found a match at pos: 786

kmp (m=34; n=820): total No. of comparisons performed: 1671

naive: there’s a match starting at 786: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab

naive (m = 34; n = 820): No. of comparisons performed: 26758

This post doesn’t discuss much of the pre-processing of KMP, I’ve written a separate post for it. You can find it here.

**References:**

1. Introduction to Algorithms, 3rd Edition.

2. Wikipedia, Knuth-Morris-Pratt algorithm: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

3. The Knuth-Morris-Pratt Algorithm in my own words: http://jboxer.com/blog/2009/12/13/the-knuth-morris-pratt-algorithm-in-my-own-words/