I’ve learned Knuth-Morris-Pratt Algorithm (KMP) several times. Several weeks after I learned it, if I am asked to solve the string/pattern matching problem, I only know there’s a KMP algorithm that does it efficiently at the time complexity of O(n+m). How to do it? I need to read it again. So I know I’ll need to write a post for it.

0. The Algorithm

Knuth-Morris-Pratt algorithm is named after the three computer scientists who discovered the algorithm. It is an efficient algorithm to solve pattern matching problems. In this post, we’ll just use string matching to illustrate.

Let’s say you have a string “ababaca” of length M, and you want to find whether the string appears in a text “bacbababaabcbab” of length N.

0.1 The Naive Method

It’s easy to come out a naive method with two nested loops. We simply start search at each position of the text, and see if the next M characters match the pattern. If there’s a mismatch, we simply move to next position.

You can refer to 1.1 for C implementation of the naive method.

0.2 KMP

The KMP algorithm makes use of the information in a matching sub-pattern of a mismatch. This is illustrated as below,

Pattern   a b a b a c a
Text bacb a b a b a a abcbab

When the second mismatch in the example above happens, we have matched the first 5 chars and reached the 10th char of the text. We know the first 5 chars of the pattern is “ababa” and the last 6 chars we scanned in the text is “ababaX”, where X is not “a”. KMP made use of this information to move the pattern.

Pattern   a b a b a c a
Text bacbab a b a a*     abcbab

It can be observed that after the move, the suffix “aba” of the previous matched sub-pattern “ababa” matches the prefix of the same sub-pattern (Note that “a” also satisfies the condition, but “aba” is the longest). And the next matching search can start at “a*” position. 

In fact, KMP preprocess the pattern to get the length of the longest prefix in the sub-pattern that matches a suffix in the same sub-pattern. The number of steps to jump is then decided according to the following

No. of steps to jump = max(1, length of matched sub-pattern in mismatch – preprocessed
value corresponding to the last matched position)

For a detailed discussion of the preprocessing, please refer here.

In our example, the preprocessing will have the following values,

Position 0 1 2 3 4 5 6
Pattern a b a b a c a
Pre-process values 0 0 1 2 3 0 1
No. of Steps to jump max(0,1) max(0,1) max(2-0, 1) max(3-1, 1) max(4-2, 1) max(5-3, 1) max(6-0, 1)

If we use the pre-computed table above to search for the string “ababaca” in text “bcaababacababaca”, it will be like below,
ab                           
acaabababaca                Jump step = 1
a
acaabababaca                Jump step = 1
   ab                                                
acaabababaca                Jump step = 1
     ababac                   
acaabababaca                Jump step = 1
     xxababaca
acaabababaca                Jump step = 2
=> a match is found

1. C Code

1.1 C Implementation for Naive String Matching

#include <stdio.h>

#include <string.h>

 

int main(void) {

    char *pattern = "ababcab";

    char *text = "cabdabadcdefababcabef";

    int i = 0, j = 0;

    int m, n;

 

    m = strlen(pattern);

    n = strlen(text);

    printf("%d:%d\n", m, n);

 

    for (; i < n - m + 1; ++i) {

        for (j = 0; j < m; ++j) {

            if (pattern[j] != text[i + j]) {

                break;

            }

        }

        if (j == m) {

            //found a match

            printf("there's a match starting at %d: %s\n", i, &(text[i]));

            break;

        }

    }

    if (i == n - m + 1) {

        printf("no match\n");

    }

    return 0;

}

Compile the code using

gcc -o naive naive.c

1.2 C Implementation for KMP

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

 

void precompute(char *pattern, int *pre) {

    int plen = strlen(pattern);

    int i;

    memset(pre, 0, sizeof(int)*plen);

    for (i = 1; i < plen; ++i) {

        pre[i] = pre[i-1];

        while (pattern[i] != pattern[pre[i]]) {

            if (pre[i] == 0) {

                pre[i] = -1;

                break;

            } else {

                pre[i] = pre[pre[i]-1];

            }

        }

        pre[i] = pre[i] + 1;

    }

}

 

int main(int argc, char **argv) {

    char *pattern = argv[1];

    char *text = argv[2];

    int *pre;

    int m = strlen(pattern);

    int n = strlen(text);

    int i, j, overlap;

    pre = (int*) malloc((m+1)*sizeof(int));

    precompute(pattern, pre);

    for (i = 0, j = 0; i < n; ) {

        while (text[i+j] == pattern[j]) {

            ++j;

            if (j == m) {

               printf("found a match at pos: %d\n", i);

               break;

            }

        }

        if (j > 0) {

            i += (j - pre[j-1] > 1) ? (j - pre[j-1]):1;

            j = pre[j-1];

        } else {

            ++i;

        }

    }

    return 0;

}

Compile the code using

gcc -o kmp kmp.c

2. Sample Run
You can use the following scripts to execute naive implementation and kmp implementation side by side,

#!/bin/sh

pattern="dsfads"

text="sfdkjfldskjalfkdsadsfadsdsfadsfdaklfdjsla;fjdsafkjfdlsajfoiewjfksafdasfdsa"

./kmp $pattern $text

./naive $pattern $text

 

pattern="fdsfadsa"

text="fdasklfjdlsajfdsfdsfadsdsafdsaffdsfadsafdsakljfdlsakjflkdsajflkdsajflkdsjalkfjdslkajfkd;lsajfklds;jalkfjd;sa"

./kmp $pattern $text

./naive $pattern $text

 

pattern="ababab"

text="ababababababfdasjklabababafdkslajabababafdafabababafdsafababab"

./kmp $pattern $text

./naive $pattern $text

 

pattern="ababaca"

text="abababacaabacaabababaabcacafslkfjdslaabacacababacaacaads"

./kmp $pattern $text

./naive $pattern $text

 

pattern="ababacaab"

text="abababacaabacaabababaabcacafslkfjdslaabacacababacaacaads"

./kmp $pattern $text

./naive $pattern $text

 

pattern="ABCDABD"

text="ABCABCDABABCDABCDABDE"

./kmp $pattern $text

./naive $pattern $text

 

pattern="ABCDABD"

text="ABccDAFDSA"

./kmp $pattern $text

./naive $pattern $text

3. Naïve VS. KMP

It can be proved that Naive implementation has the worst case O(mn), while KMP has the worst case O(m+n). Here we add a counting variable to each implementation and run a test case to demonstrate that it’s indeed the case and our KMP implementation runs in O(m+n) time.

The modified code for naive implementation is as below,

#include <stdio.h>

#include <string.h>

 

int main(int argc, char **argv) {

    char *pattern = argv[1];

    char *text = argv[2];

    int i = 0, j = 0;

    int m, n;

    int cmpcnt = 0;

 

    m = strlen(pattern);

    n = strlen(text);

    //printf("%d:%d\n", m, n);

 

    for (; i < n - m + 1; ++i) {

        for (j = 0; j < m; ++j) {

            ++cmpcnt;

//            printf("%d:%d:%d\n", i, j, cmpcnt);

            if (pattern[j] != text[i + j]) {

                break;

            }

        }

        if (j == m) {

            //found a match

            printf("naive: there's a match starting at %d: %s\n", i, &(text[i]));

            //break;

        }

    }

    if (i == n - m + 1) {

        //printf("no match\n");

    }

    printf ("naive (m = %d;  n = %d): No. of comparisons performed: %d\n", m , n, cmpcnt);

    return 0;

}

The modified KMP implementation is as below,

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

 

int cmpcnt = 0;

 

void precompute(char *pattern, int *pre) {

    int plen = strlen(pattern);

    int i;

    memset(pre, 0, sizeof(int)*plen);

    for (i = 1; i < plen; ++i) {

        pre[i] = pre[i-1];

        ++cmpcnt;

        while (pattern[i] != pattern[pre[i]]) {

            if (pre[i] == 0) {

                pre[i] = -1;

                break;

            } else {

                pre[i] = pre[pre[i]-1];

            }

            ++cmpcnt;

        }

        pre[i] = pre[i] + 1;

    }

}

 

int main(int argc, char **argv) {

    char *pattern = argv[1];

    char *text = argv[2];

    int *pre;

    int m = strlen(pattern);

    int n = strlen(text);

    int i, j, overlap;

    pre = (int*) malloc((m+1)*sizeof(int));

    precompute(pattern, pre);

    for (i = 0, j = 0; i < n; ) {

        ++cmpcnt;

        while (text[i+j] == pattern[j]) {

            ++j;

            if (j == m) {

               printf("found a match at pos: %d\n", i);

               break;

            }

            ++cmpcnt;            

        }

        if (j > 0) {

            i += (j - pre[j-1] > 1) ? (j - pre[j-1]):1;

            j = pre[j-1];

        } else {

            ++i;

        }

    }

    printf("kmp (m=%d; n=%d): total No. of comparisons performed: %d\n", m, n, cmpcnt);

    return 0;

}

The test script is as below,

#!/bin/sh

pattern="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"

text="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"

./kmp $pattern $text

./naive $pattern $text

Now the results,

found a match at pos: 786
kmp (m=34; n=820): total No. of comparisons performed: 1671
naive: there’s a match starting at 786: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
naive (m = 34;  n = 820): No. of comparisons performed: 26758

This post doesn’t discuss much of the pre-processing of KMP, I’ve written a separate post for it. You can find it here.

References:
1. Introduction to Algorithms, 3rd Edition.
2. Wikipedia, Knuth-Morris-Pratt algorithm: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
3. The Knuth-Morris-Pratt Algorithm in my own words: http://jboxer.com/blog/2009/12/13/the-knuth-morris-pratt-algorithm-in-my-own-words/

 

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Set your Twitter account name in your settings to use the TwitterBar Section.