This is part 2 of the post “suffix array”, which covers the string matching/searching problem that can be solved by suffix array. You may want to read part 1 first.
Once we have the suffix array, it’s easy to check if another string appears in the input string/text or not. If the string appears in the input text, if will be the prefix of one or more its suffix. Since the suffix array are sorted suffix of the input string, binary search can be applied.
Below is an implementation that finds all occurrence of a string in the input text,
The program accepts two input parameters, the input text and the string to search for.
Compile the code using the command below,
gcc -o sufsearch sufsearch.c
Below is a screenshot of the execution,
Figure 1. Searching Strings using Suffix Array
Suppose the length of the string is m, the input text is n. Note that every comparison of the string with suffix of the input text consumes O(m) and O(logn) comparisons is needed to perform the binary search. Therefore, the run time for the algorithm above is O(mlogn), excluding the cost of building the suffix array.
There’re are better algorithms that use the longest common prefix to help the search process. The run time is O(m + logn). Interested users can refer to reference 6 for details.
Note that the input text can be preprocessed to build the suffix array, and the search target string can be given later. This differs from KMP algorithm that the pattern must be given before the computation can begin. This difference makes them suitable for different applications.
1. Suffix Array, Wikipedia. http://en.wikipedia.org/wiki/Suffix_array
2. Brief Introduction to Suffix Array: http://sary.sourceforge.net/docs/suffix-array.html
3. Algorithms, 4th edition website: http://algs4.cs.princeton.edu/63suffix/
4. Programming Perls. 2nd Edition.
5. libdivsufsort, http://code.google.com/p/libdivsufsort/
6. Suffix Arrays: a new method for On-Line String Searches: http://delivery.acm.org/10.1145/330000/320218/p319-manber.pdf?ip=188.8.131.52&acc=ACTIVE%20SERVICE&CFID=67974067&CFTOKEN=52233823&__acm__=1330495809_b1bcb53c53d3d7a276f870cb1e0fdf89