Suffix Array Part 3 — Longest Common Substring (LCS)

Suffix array can be used to find longest common substring (LCS) among multiple strings. This post illustrate the algorithm by an example of finding LCS among two strings.

Algorithm

The idea is that LCS for two strings is the longest common prefix of some suffices of the two string. Suppose we have two strings abc and bcd, then the suffix of the two strings will be {abc, bc, c} and {bcd, cd, d} respectively. If we sort them according to alphabetical order, it will be rearranged as below,

Figure 1. Sorted Suffices of Both abc and bcd

It is not difficult to tell that the LCS is bc, which is longest common prefix of the two suffices from different strings.

This observation leads to the algorithm below,

1. combine two input strings into one single string
2. compute the suffix array
3. find the longest common prefix of two neighboring suffix that start with suffix of different input strings

Using the same example abc and bcd, we combine the strings to get a single string abcbcd, then the algorithm can be carried out as figure below,

Figure 2. LCS Algorithm using Suffix Array

There are two prefix of the neighboring suffices that start with suffix of different input strings, bc of {bcbcd, bcd} and c of {cbcd, cd}. The longer one is bc. Therefore the LCS of abc and bcd is bc.

Implementation

Below is a sample implementation of the algorithm described above in C,

/**

find longest common substring of two input texts using suffix array

*/

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

 

int pstrcmp(const void *a, const void *b) {

    //printf("pstrcmp: %s %sn", (const char*)*(char **)a, (const char*)*(char **)b);

    return strcmp((const char *)*(char **)a, (const char *)*(char **)b);

}

 

int lcp(char *a, char *b) {

    int len1, len2;

    int i;

    len1 = strlen(a);

    len2 = strlen(b);

    for (i = 0; (i < len1) && (i < len2); ++i) {

        if (a[i] != b[i]) break;

    }

    return i;

}

 

int main(int argc, char **argv) {

    int i;

    char **ap;      //suffix pointers array

    int len1, len2;

    char *cstr;

    int lcslen = 0, lcplen, lcssufpos = -1;

    len1 = strlen(argv[1]);

    len2 = strlen(argv[2]);

    ap = (char**)malloc((len1+len2)*sizeof(char*));

    cstr = (char*)malloc((len1+len2)*sizeof(char));

    cstr = strcat(argv[1], argv[2]);

    for (i = 0; i < len1+len2; ++i) {

        ap[i] = &(cstr[i]);

    }

    for (i = 0; i < len1+len2; ++i) {

        printf("%sn", ap[i]);

    }

    printf("sort the sufficesn");

    qsort(ap, len1+len2, sizeof(char *), pstrcmp);

    for (i = 0; i < len1+len2; ++i) {

        printf("%sn", ap[i]);

    }

    printf("finding the lcsn");

    for (i = 0; i < len1 + len2 - 1; ++i) {

        if ((ap[i] - cstr >= len1) && (ap[i+1] - cstr >= len1)) {

            //both starts with suffix of second string

            continue;

        } else if ((ap[i] - cstr < len1) && (ap[i+1] - cstr < len1)) {

            //both starts with suffix of first string

            continue;

        } else {

            lcplen = lcp(ap[i], ap[i+1]);

            if (lcplen > lcslen) {

                lcslen = lcplen;

                lcssufpos = i;

            }

        }

    }

    if (lcssufpos == -1) {

        printf("no lcsn");

    } else {

        printf("%.*sn", lcslen, ap[lcssufpos]);

    }

}

Compile the code using command below,

gcc -o suflcs suflcs.c

Then a sample execution,

Figure 3. Execution of LCS Program

Note that the algorithm can be extended to more than two input strings, say K strings. Then the problem becomes finding the longest common prefix of neighboring K suffices which start at suffix of K input strings.

References:
1. http://algs4.cs.princeton.edu/63suffix/LCS.java.html

0 thoughts on “Suffix Array Part 3 — Longest Common Substring (LCS)”

  1. I was wondering if you have any example to handle prefix and suffix substrings. In other words, concatenating two strings by the overlapping suffix of the first string with the prefix of the second, i.e. the LCS of the suffix and prefix. Dynamic programming may be needed, but I am not sure. Hope you could elaborate this, if possible. Thanks!

  2. One example would be XABCDAABCD and BCDEF ==>XABCDAABCDEF
    One suffix of XABCDAABCD is: BCD
    One prefix of BCDEF is BCD, so that merged string is XABCDAABCDEF

Leave a Reply

Your email address will not be published. Required fields are marked *